Statistical Office of the European Communities PRACTICAL GUIDE TO DATA VALIDATION EUROSTAT



TABLE OF CONTENTS

1. Introduction
2. Data editing
   2.1 Literature review
   2.2 Main general procedures adopted in Member States
       Foreign Trade; Industrial Output; Commercial Companies Survey; Employment Survey; Private Sector Statistics on Earnings; Survey of Business Owners; Building Permits Survey
   2.3 Main general procedures adopted in Eurostat
       Harmonization of national data; Corrections using data from the same Member State; Corrections using data from other Member States; Foreign Trade; Transport Statistics; Labour Force Survey; Eurofarm
   2.4 Guidelines for data editing
       Stages of data editing: Micro data (Error detection; Error correction); Country data (Error detection; Error correction); Aggregate (Eurostat) data
       Concluding remarks
3. Missing data and imputation
   3.1 Literature review
       Single imputation methods: Explicit modelling (Mean imputation; Regression imputation); Implicit modelling (Hot deck imputation; Substitution; Cold deck imputation); Composite methods
       Multiple imputation methods

   3.2 Main general procedures adopted in Member States
       Foreign Trade; Industrial Output; Commercial Companies Survey; Employment Survey; Annual Survey of Hours and Earnings; Survey of Business Owners; Building Permits Survey; Housing Rents Survey; Basic Monthly Survey
   3.3 Main general procedures adopted in Eurostat
       Community Innovation Survey; Continuing Vocational Training Survey; European Community Household Panel
   3.4 Guidelines for data imputation
       Stages of imputation of missing data: Micro data; Country data; Aggregate (Eurostat) data
       Concluding remarks
4. Advanced validation
   4.1 Literature review
       Strategies for handling outliers; Testing for discordancy (Exploratory data analysis; Statistical testing for outliers: Single outlier tests; Multiple outlier tests; Multivariate data); Methods of accommodation (Estimation of location; Estimation of dispersion); Time series analysis
   4.2 Main general procedures adopted in Member States
       Foreign Trade; Consumer Price Index
   4.3 Main general procedures adopted in Eurostat
       Community Innovation Survey
   4.4 Guidelines for advanced validation
       Stages of advanced validation: Micro data (Advanced detection of problems; Error correction); Country data; Aggregate (Eurostat) data
       Concluding remarks
References

1. INTRODUCTION

A main goal of any statistical organization is the dissemination of high-quality information, and this is particularly true of Eurostat. Quality implies that the data available to users have the ability to satisfy their needs and requirements concerning statistical information; it is defined in a multidimensional way involving six criteria: relevance, accuracy, timeliness and punctuality, accessibility and clarity, comparability, and coherence. Broadly speaking, data validation may be defined as supporting all the other steps of the data production process in order to improve the quality of statistical information. In the Handbook on improving quality by analysis of process variables (LEG on Quality project by ONS UK, Statistics Sweden, the National Statistical Service of Greece, and INE PT) it is described as the method of detecting errors resulting from data collection. In short, it is designed to check the plausibility of the data and to correct possible errors, and it is one of the most complex operations in the life cycle of statistical data, including steps and procedures of two main categories: checks (or edits) and transformations (or imputations). Its three main components are the following:
- Data editing: the application of checks that identify missing, invalid or inconsistent entries or that point to data records that are potentially in error.
- Missing data and imputation: the analysis of imputation and reweighting methods used to correct for missing data caused by non-response. Non-response can be total, when there is no information on a given respondent (unit non-response), or partial, when only part of the information on the respondent is missing (item non-response). Imputation is a procedure used to estimate and replace missing or inconsistent (unusable) data items in order to provide a complete data set.
- Advanced validation: advanced statistical methods can be used to improve data quality.
Many of them are related to outlier detection, since the conclusions and inferences obtained from a data set contaminated by outliers may be seriously biased. Before Eurostat dissemination, data validation has to be performed at different stages depending on who is processing the data:
- The first stage is at the end of the collection phase and concerns micro data. Member States are responsible for it, since they conduct the surveys.
- The second stage concerns country data, i.e., the micro-data country aggregates sent by Member States to Eurostat. Validation at this stage has to be performed by Eurostat.
- The third and last stage concerns aggregate (Eurostat) data before their dissemination; it is also performed by Eurostat.
Validation should be performed according to a set of common rules together with specific rules depending on the stage and on the data aggregation level. In this document, some general and common guidance is provided for each stage. More detailed rules and procedures can only be provided when looking at a specific survey, since each one has its own particular characteristics and problems. A thorough set of validation guidelines can therefore only be defined for a specific statistical project. Nevertheless, this document intends to discuss the most important issues that arise concerning the validation of any statistical data set, describing its main problems and how to handle them. It lists as thoroughly as possible the different aspects that need to be analyzed for error diagnostics and checking, the most adequate methods and procedures for that purpose and

finally possible ways to correct the errors found. It should be seen as an introduction to data validation and as a source of references for further reading by any statistician or staff member of a statistical organization working on this matter. Being the general starting point for data validation, this document may be applied and adapted to any particular statistical project or data set, and may also be used as a building block for specific handbooks defining a set of rules and procedures common to Member States and Eurostat. In short, this document should be regarded as general guidelines on the approach to data validation, to be followed by subsequent rules and procedures specifically designed for each statistical project and shared by Member States and Eurostat, whose responsibilities also have to be clearly defined. In fact, the ultimate purpose should be the set-up of Current Best Methods (the description of the best methods available for a specific process) in validation for Member States and Eurostat, leading to efficiency gains and to an improvement in data quality as mentioned above. To this end, the introduction of new processes or of process changes, the adoption of new solutions and methods and the promotion of know-how and information exchange are sought. Therefore, the rules, procedures and methods should be discussed and recommendations provided that are not only based on strong statistical methodology but are also commonly used and widely tested in practice. The structure of this document is the following: the next sections discuss the three validation components mentioned above in that order, listing the main problems that may arise, providing some guidance for their detection and correction and indicating who should run validation at each stage. Some examples of validation procedures in surveys conducted in Member States, the USA and Canada are also provided.
They are only a few illustrative examples of the main rules and procedures used.

2. DATA EDITING

In data editing, the data have to be checked for correctness, consistency and completeness in terms of number and content, because several errors can arise in the collection process, such as:
- The failure to identify some population units, or the inclusion of units outside the scope (under- and over-coverage).
- Difficulties in defining and classifying statistical units.
- Differences in the interpretation of questions.
- Errors in recording or coding the data obtained.
- Other errors of collection, response, coverage, processing, and estimation for missing or misreported data.

The purpose of any checks is to ensure a higher level of data quality. It is also important to reduce the time required for the data editing process, and the following procedures can help:
- Electronic data processing: data should be checked and corrected as early as when they are provided by the respondents. Therefore, supplying data by electronic means should be encouraged (electronic questionnaires and electronic data interchange).
- Application of statistical methods: faulty, incomplete, and missing data can be corrected by queries with the respondents, but errors can also be corrected through the application of statistical models, largely keeping the data structure and still meeting the requirements in terms of accuracy and timeliness of the output.
- Continuous improvement of data editing procedures: for repeated statistics, data editing settings should be adjusted to meet changing requirements, and knowledge from previous editing of statistical data should be taken into account to improve questionnaires and make data editing more efficient.
- Omitting the editing and/or correction of data whenever the change would have only a negligible impact on the estimates or aggregates.
2.1 Literature review

Although there is a large number of papers on data editing in the literature, the seminal paper by Fellegi and Holt (1976) is still the main reference. These authors introduced the normal form of edits as a systematic approach to automatic editing (and imputation) based on set theory. Following these authors, the logical edits for qualitative variables are based on combinations of code values in different fields that are not acceptable. Therefore, any edit can be broken down into a series of statements of the form "a specified combination of code values is not permissible". The subset of the code space such that any record in it fails an edit is called the normal form of edits. Any complex edit statement can be broken down into a series of edits, each having the normal form. Edit specifications contain essentially two types of statements:
- Simple validation edits, specifying the set of permissible code values for a given field in a record, any other value being an error. These can be converted into the normal form very easily and automatically.
- More complex consistency edits, involving a finite set of codes. These are typically of the form that whenever a record has certain combinations of code values in some fields, it should have some other combinations of code values in some other fields. The edit statement is then that if a record does not respect this condition on the intersection of

combinations of code values, the record fails the edit. This statement can also be converted into the normal form. Hence, whether the edits are given in a form defining edit failures explicitly or in a form describing conditions that must be satisfied, they can be converted into a series of edits in the normal form, each specifying conditions of edit failure. The normal form of edits was originally designed for qualitative variables, but it can be extended to quantitative variables, even though for the latter it is not its natural form: the edits are expressed as equalities or inequalities, and a record that does not respect them for all the quantitative variables fails the edit. A record which passes all the stated edits is said to be a clean record, not in need of any correction. Conversely, a record which fails any of the edits is in need of some correction. The advantage of this methodology is that it eliminates the necessity for a separate set of specifications for data corrections. The need for corrections is automatically deduced from the edits themselves, which ensures that the corrections are always consistent with the edits. Another important aspect is that the corrections required for a record to satisfy all edits change the fewest possible data items (fields), so that the maximum amount of original data is kept unchanged, subject to the edit constraints. The methods and procedures described and discussed next, as well as the proposed guidelines on data editing and correction, fit into this model of the normal form of edits, as will become clear.

2.2 Some general procedures applied in Member States

Data editing procedures depend on the specific data they concern. Therefore, as illustrative examples, we describe some of the main procedures applied by national statistical institutes.
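As an illustration of the normal form of edits described in section 2.1, checking records against such edits can be sketched as follows. This is a minimal sketch: the field names, code domains and edits are invented for the example, not taken from any actual survey.

```python
# Minimal sketch of Fellegi-Holt "normal form" edits for qualitative data.
# Each edit lists, for some fields, a set of code values; a record FAILS the
# edit when its code lies in the listed set for every field involved.
# Field names, domains and code values below are illustrative assumptions.

def to_normal_form(field, permissible, domain):
    """Express 'field must take a permissible code' as a normal-form edit:
    the failure region is the complement of the permissible set."""
    return {field: domain - permissible}

def fails_edit(record, edit):
    """True if the record's codes fall inside the edit's failure region."""
    return all(record.get(field) in values for field, values in edit.items())

def failed_edits(record, edits):
    """Indices of the edits the record fails; an empty list means a clean record."""
    return [i for i, edit in enumerate(edits) if fails_edit(record, edit)]

marital_domain = {"single", "married", "widowed", "unknown"}

edits = [
    # Simple validation edit: only the three real statuses are permissible.
    to_normal_form("marital_status", {"single", "married", "widowed"}, marital_domain),
    # Consistency edit: a person aged 0-14 cannot be married or widowed.
    {"age_group": {"0-14"}, "marital_status": {"married", "widowed"}},
]

print(failed_edits({"age_group": "0-14", "marital_status": "married"}, edits))   # [1]
print(failed_edits({"age_group": "15-64", "marital_status": "single"}, edits))  # []
```

Note that both the simple validation edit and the consistency edit end up in the same representation, which is exactly the point of the normal form: one checking routine serves every edit.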
Error detection usually implies contact with the respondents, leading to the correction of the errors found.

Foreign Trade
- Some responses can only be accepted if they belong to a given list of categories (nomenclatures). The admissibility of the response is therefore checked against that list (for example, delivery conditions, transaction nature or means of transport can only be accepted if they assume a category of the corresponding list).
- The combination of the values of some variables has to respect a set of rules; otherwise, the value of one or several of those variables is incorrect.
- Detection of large differences between the invoice and the statistical values for those respondents who have to provide both values.
- Detection of large differences between the current period and historical data.
- Detection of non-admissible invoice or statistical values, net weights, supplementary units or prices. The detection is based on the computation, from historical data, of admissibility intervals for these variables at a highly disaggregated level.
- Detection of large differences between the response and the values provided by other sources, e.g., VAT data.

Industrial Output
- Detection of large differences (quantities, values, prices, etc.) between the response of the current period t and the values in past periods (t-1) and (t-2). For infra-annual data, the differences between the response of the current period and the response of the same period

in the previous year are also checked. For example, for monthly data the differences between the values at time t and (t-12) are checked; for quarterly data, the differences are between the values at time t and (t-4), etc.
- Detection of large differences between the response and those provided by similar respondents, namely companies of the same industrial branch and/or in the same region, for the same variables (quantities, values, prices, etc.).

Commercial Companies Survey
- Automatic checking of the main activity code.
- Coherence of the companies' responses, mainly their balance sheets. Correction of small errors is carried out automatically.
- Coherence with the previous period is also checked.

Employment Survey
- Error detection: the respondents are surveyed twice in the same period, and the detection of large differences between the two responses leads to the deletion of the first one, i.e., the second response is considered correct and the first is considered wrong.
- Error assessment: a global error measure may be computed from the comparison between the first and the second responses for every respondent. For any given characteristic with k categories C1, C2, ..., Ck, responses can be classified in the following table, where nij is the number of respondents classified in category Ci in the second response and in category Cj in the first response:

                   1st response
  2nd response   C1    C2   ...   Cj   ...   Ck
  C1             n11   n12  ...   n1j  ...   n1k
  C2             n21   n22  ...   n2j  ...   n2k
  ...
  Ci             ni1   ni2  ...   nij  ...   nik
  ...
  Ck             nk1   nk2  ...   nkj  ...   nkk

If there are no errors among the n respondents surveyed, only the elements in the main diagonal will be non-zero. The global quality index is computed as

  QI = (Σi nii / n) × 100%

where the sum runs over i = 1, ..., k. If both responses agree for every respondent, QI = 100, and QI = 0 if they disagree for every respondent. This indicator is a global measure of the quality of the data in the entire survey, i.e., for every characteristic in the survey.
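The QI index is straightforward to compute from the k×k agreement table; the following sketch uses made-up figures for a characteristic with three categories.

```python
# Sketch of the global quality index QI for a reinterview survey, computed
# from the k-by-k table n[i][j] = number of respondents put in category i at
# the second response and category j at the first. The figures are made up.

def quality_index(n):
    """QI = (sum of diagonal agreements / total respondents) * 100."""
    total = sum(sum(row) for row in n)
    agree = sum(n[i][i] for i in range(len(n)))
    return 100.0 * agree / total

# Three categories; off-diagonal entries are classification disagreements.
table = [
    [40, 2, 1],
    [3, 30, 0],
    [1, 1, 22],
]
print(quality_index(table))  # 92 agreements out of 100 respondents -> 92.0
```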
It is also computed for every variable in the survey.

Private Sector Statistics on Earnings
- Automated checking of different items concerning salaries and occupation, namely the number of employees, salary item averages, and changes in salary item averages relative to the previous year.

- Every item (such as the basic monthly salary) is subject to specific checking routines in order to detect errors such as negative salaries, values under the minimum salary, unusually low or high salaries or other benefits, and unusually low or high growth rates.
- Data are also examined at different levels of aggregation: total level, industry level and company level. If errors are found, data are analysed and corrected at the micro level.
- Minimum and maximum values for each salary item are checked (and corrected if wrong).

Survey of Business Owners
- Data errors are detected and corrected through an automated data edit designed to review the data for reasonableness and consistency.
- Quality control techniques are used to verify that operating procedures were carried out as specified.

Building Permits Survey
- Most reporting and data entry errors are corrected through computerized input and complex data review procedures.
- Strict quality control procedures are applied to ensure that collection, coding and data processing are as accurate as possible.
- Checks are also performed on totals and on the magnitude of the data.
- Comparisons to assess the quality and consistency of the data series: the data and trends from the survey are periodically compared with data on housing starts from other sources, with other public and private survey data for the non-residential sector, and with data published by some municipalities on the number of building permits issued.

2.3 Some general procedures applied in Eurostat

Eurostat checks the internal and external consistency of each data set received from Member States (country data). The main checks and corrections concerning several statistical projects, made by Eurostat after discussion with the Member State involved, are as follows:
- Ex post harmonization of national data to EU norms.
- Data format checking.
- Classification of data according to the appropriate nomenclature.
- Rules on relationships between variables (consistency).
- Non-negativity constraints and checks of mirror flows.
- Plausibility checks of the data.
- Balance checks, such as the difference between credits and debits.
- Aggregation of items and general consistency when breaking down information (e.g., geographical or activity breakdowns).
- Checking of the time evolution.
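A few of the checks in the list above can be sketched as simple predicates. The record layout and the rounding tolerance are illustrative assumptions, not part of any actual Eurostat specification.

```python
# Sketch of a few country-data consistency checks of the kind listed above:
# balance vs. credits/debits, non-negativity, and aggregate vs. breakdown.

TOL = 0.5  # illustrative tolerance for rounding differences in reported figures

def check_balance(credits, debits, balance):
    """The reported balance must equal credits minus debits."""
    return abs((credits - debits) - balance) <= TOL

def check_non_negative(values):
    """Flows such as trade values or output cannot be negative."""
    return all(v >= 0 for v in values)

def check_aggregate(total, parts):
    """An aggregated item must equal the sum of its sub-items."""
    return abs(total - sum(parts)) <= TOL

print(check_balance(120.0, 45.0, 75.0))            # True
print(check_aggregate(100.0, [40.0, 35.0, 30.0]))  # False: parts sum to 105
```

In practice each failed predicate would be logged together with the country, period and series concerned, so that the query can be discussed with the Member State involved.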

More precisely, different kinds of corrections can be envisaged.

Harmonization of national data
It is necessary to ensure the comparability and consistency of national data. Statistical tables for each Member State can then be compiled and published based on the common Eurostat classification. To this end, Eurostat checks that the instructions to fill in the questionnaire have been followed by the reporting countries. When relevant differences from the definitions are detected, Eurostat reallocates national statistics according to the common classification. This involves the following verifications:
- On the country and economic zone, to ensure that the contents of each country and economic zone have been filled in the same way.
- On the economic activity, to check whether all the items (and sub-items) have been aggregated in the same way by Member States.

Corrections (deterministic imputation) using data from the same Member State

Corrections with direct data:
- Correction of a variable using the difference between two others, such as net flows from credit and debit flows, or flows for an individual item from the flows of two other aggregated items.
- Correction of a variable using the sum of other variables, such as flows for an aggregated item from given individual items.
- Correction of a variable using others, such as flows for an aggregated partner zone from the flows of other partner zone(s).
- Correction of a variable by computing net amounts, such as the flows of insurance services from the available gross flows, i.e. by deducting gross claims received and gross claims paid from gross flows.

Corrections (imputation using estimators) with weighted structure:
- Correction of flows for a given partner zone and a given year using an average proportion involving another partner zone and other years.
- Correction of flows for a given item and a given year using an average proportion involving another item and other years.
- Correction of flows for a given item and a given partner zone using an average proportion involving another item and another partner zone.
- Correction of flows for a given item using a proportion involving two other items.

Corrections (deterministic imputation) using data from other Member States

Corrections with direct data:
- Correction of flows for the intra-EU partner zone using the available bilateral flows of the main EU partners.

Corrections (imputation using estimators) with weighted structure:
- Correction of flows for a given item and a given year using an average proportion involving a mixed item, other EU Member States and several years.
- Correction of flows for the extra-EU partner zone using an average proportion involving intra-EU partner(s), (intra-EU + extra-EU) partner(s) and other EU Member States.
- Correction of flows for a given partner zone and a given year using an average proportion involving another partner zone, other EU Member States and another year.
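The "weighted structure" corrections above all follow the same pattern: a missing or erroneous flow is estimated from the average share it represented in some reference aggregate over other years or zones. A minimal sketch, with invented flows and an invented reference aggregate:

```python
# Sketch of a weighted-structure correction: a missing flow for a given item
# and partner zone in year t is estimated from the average share that flow
# represented in a reference aggregate over previous years. Data are made up.

def average_share(item_flows, reference_flows):
    """Mean of item/reference ratios over past years where both are observed."""
    ratios = [i / r for i, r in zip(item_flows, reference_flows) if r]
    return sum(ratios) / len(ratios)

# Flows for the item in three past years and the corresponding reference totals.
past_item = [20.0, 22.0, 24.0]
past_ref = [100.0, 110.0, 120.0]

share = average_share(past_item, past_ref)  # average share of about 0.20
ref_current = 130.0                         # reference aggregate observed in year t
estimate = share * ref_current
print(round(estimate, 1))
```

Whether the reference is another partner zone, another item or another Member State changes only where `past_ref` comes from; the proportion mechanics are the same.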

We next present examples of surveys where validation is performed by Eurostat.

Foreign Trade
The data sets received by Eurostat are checked according to the same rules as those applied by Member States, such as the following examples:
- Checking for invalid nomenclature codes, i.e., some variables have to assume values from a given list (nomenclature).
- Checking for invalid combinations of values in different variables.
- Detection of non-admissible values, i.e., checking whether a variable is within a certain interval range.

Transport Statistics
Transport statistics are available for the maritime, air, road and rail transport modes. Some of the main checks are the following:
- Checking the variables' attributes, such as data format, length, type and nomenclature codes.
- Detection of non-admissible values.
- Checking for invalid combinations and relationships of values in different variables.

Labour Force Survey
The Labour Force Unit collects data on employment in the Member States. The main checks are as follows:
- Checking the variables' attributes, such as data format, length, type and nomenclature codes.
- Comparison of variables to detect possible inconsistencies.

Eurofarm
Eurofarm is a system aimed at processing and storing statistics on the structure of agricultural holdings that are derived from surveys carried out by Member States. Its main checks are the following:
- Checking the variables' attributes, such as data format, length, type and nomenclature codes.
- Checking for non-response.
- Detection of non-admissible values.
- Comparison of variables to detect possible inconsistencies.

2.4 Guidance on data editing

Stages of data editing
Before dissemination, data checking and editing may have to be performed at the three different validation stages mentioned in the introduction, depending on who is processing the data and the phase of the production process. The first stage for error checking and correction is the collection stage and concerns micro data.
In general, Member States (MS) are responsible for it, since they conduct the surveys, even when Eurostat receives this type of data. The second stage concerns country data, i.e., the micro-data country aggregates sent by Member States to Eurostat. Data checking at this stage has to be made by

Eurostat and, if errors are detected, the data set could be sent back to the country involved for correction. If sending it back is not possible, Eurostat has to make the necessary adjustments and estimations. The third stage concerns aggregated (Eurostat) data before their dissemination, and a last check has to be run by Eurostat, since some inconsistencies or errors in the data may be found only at this stage. This requires further corrections by Eurostat. Since data editing and correction depend on the specific data, we propose several procedures that can be generally applied at each stage. The actual application should choose the appropriate procedures.

Micro data
Validation checks on micro data should be run by Member States, i.e., when they send their data sets to Eurostat, these sets should already have been scrutinized and be error-free. This also applies to those situations where Eurostat receives the micro data, because MS conduct the surveys and are therefore closer to the respondents and can detect and correct errors more efficiently. In fact, as will be discussed later, error correction very often requires new contacts with the respondents, which can be handled much more quickly and better by national statistical agencies. As mentioned above, it is important to reduce the time required for the data checking and editing process, and to this end automated data processing, the application of statistical methods and the continuous improvement of data editing procedures should be pursued.

Error detection
Since checking and editing depend on the specific data concerned, we next propose some procedures that can be generally applied and adapted to any particular survey:
1. Checking of the data format: the data must have a predefined format (data attributes, number of fields or records, etc.). Examples: foreign trade, industrial statistics, employment survey.
2. Checking of the data sender, particularly for electronic submission. Example: foreign trade statistics (Intrastat).
3. Checking for non-responses: in many surveys, several respondents are known, especially the largest or most important ones. If their responses are not received, it usually means that they failed to respond, and this may have a significant impact on the final data. Thus, checking for missing responses is very important. Examples: in foreign trade, industrial statistics or the building permits survey, the most important respondents (companies in the former two cases and municipalities in the latter) are perfectly known by the national statistical organizations, and if they fail to send their information, the impact on the final data may be very strong.
4. Detection of non-admissible responses:
- Checking of the response category of qualitative variables, since responses on this type of variable have to assume a category of a given list (nomenclature). Therefore, only responses belonging to that list can be accepted. Examples: delivery conditions, transaction nature, means of transport, gender, occupation, main activity sector such as industrial branch.
- Quantitative variables whose values cannot be outside a given range. Examples: salaries, income, sales, output, exports, imports, prices, weights, numbers, age, etc., have to be positive.
- Quantitative variables whose values have to be within a given interval. These admissibility intervals have to be computed from historical data at a highly disaggregated level. Examples: unit values or prices, unit weights, the height of a building, the age of a person, the number of hours worked, income, etc., have to be inside a given interval of admissible values; salaries cannot be lower than the minimum salary, etc.
5. Detection of large differences between the invoice and the statistical values for those respondents who have to provide both values, or between the response and VAT data. Examples: foreign trade, industrial statistics.
6. Detection of large differences between current and past values (growth rates). In particular, the value at time t (current value) should be compared with the values at times (t-1) and (t-2), for example, and, for infra-annual data, with the corresponding period of the previous year, i.e., time (t-12) for monthly data, (t-4) for quarterly data and so forth.
7. Detection of large differences between the response and those provided by similar respondents. Examples: companies of the same industrial branch and/or in the same region, for the same variables (quantities, values, prices, etc.).
8. Detection of incoherencies in the responses from the same respondent, and error assessment, since there are usually relationships and restrictions among several variables. Examples: exports or imports and total output of the same company (these variables have to be coherent); coherence in a company's balance sheets; age and marital status (for instance, a two-year-old person who is a widow). When the respondents are surveyed more than once, coherence between the responses has to be checked (usually, this is the purpose of surveying the same respondent more than once). Large differences between the two responses require corrections, and a global error measure such as the QI statistic in the Employment Survey mentioned above can be computed (error assessment). Low values of this indicator mean significant incoherencies requiring error correction.
9. Outlier detection: the last four items are related to outlier detection, which will be discussed in section 4.

The number and variety of data editing and checking procedures is very large, since they depend on the specific data and country, thus requiring that the general procedures described above be adapted. Some categories and reference (admissible) values or intervals, however, are common to the different countries.

Error correction
When errors are detected in the micro data, they have to be corrected, which should be done by Member States, even in those cases where Eurostat receives these disaggregated data. Like error detection, correction procedures depend on the particular data and disaggregation level. Therefore, we discuss the main procedures that can be generally adopted:
- Generation of the list of errors as a starting point for the correction process. The errors may have attributes such as severity and size of impact; a score function (such as that of Latouche and Berthelot) can be used to assign the importance.
- Correction of coding, classification or typing errors and of other data attributes such as the format.
- Correction of those variables whose values can be obtained from other variables of the same respondent. Example: unit prices can be computed from the total value and the corresponding quantity.

- Contact with the respondents: most of the errors have to be resolved through contact with the respondents. Moreover, the values questioned are often correct and end up being confirmed, which can only be done by the respondents themselves, requiring such contact.
- Imputation of missing or erroneous data: if the contact is not possible, is too expensive, or its outcome is not received on time, the values requiring correction have to be discarded from the database, thus originating non-responses. These values will then have to be imputed with methods such as those discussed in section 3.

These last two procedures are the main reasons why Member States should be in charge of the validation of micro data, i.e., they should run validation at this stage even when Eurostat receives these data. In fact, if validation were performed by Eurostat, it would have to return the error list to the country involved for correction, which is an important loss of efficiency and may jeopardize the deadlines for dissemination. Therefore, it is very important that validation is run by MS at this stage. Note that the editing and imputation procedures should be as uniform as possible among all data sources.

Country data
The country data received by Eurostat should already have been validated at the micro level by the national statistical organizations. Nevertheless, some errors or problems can only be detected when data from the different countries are combined, compared or analysed, such as bilateral flows in foreign trade.
When these errors are detected and the problem is significant, the correction should be made by Eurostat, consulting the country involved whenever possible.

Error detection

As for micro data, checking and editing depend on the specific data concerned. We therefore propose some general procedures for error detection by Eurostat that can be applied and adapted to any particular survey:

1. Checking the data format: the data must have a predefined format.

2. Checking for incomplete data: checking whether the data are complete or whether there are missing data. The most extreme situation is when a Member State does not send its data set at all. Other examples of partially missing data are when the country total is received but not the regional breakdown or, in foreign trade, when the country total is received but not some or all of the bilateral flows.

3. Checking the classification of variables: this classification has to follow the appropriate nomenclatures.

4. Detection of different definitions in national statistics: the data sets supplied by Member States can only be compiled and published by Eurostat if they are based on the same classification (or one that can be mapped 1:1 or n:1), in order to ensure the comparability and consistency of national data. If divergences are found, they have to be corrected.

5. Changes in the definitions and classifications used: when the definitions and classifications adopted are changed (concepts, methodologies, surveyed population, data processing), the data will show the differences.

6. Detection of non-admissible values: the value of some variables has to be within a given range. For example, age, salaries, foreign trade flows, output or price indices cannot be negative; indices (with base 100) cannot take values that make no sense, such as decimal fractions or values in the order of tens of thousands.

7. Detection of incoherencies among variables: there are often relationships and restrictions among variables that have to be satisfied. When they are not, the incoherence found has to be corrected. A very simple example, among many others, is that the balance has to equal the difference between credits and debits.

8. Detection of large differences between the country's current and past values (growth rates): the current value should be compared with the previous values and, for infra-annual data, with the corresponding period of the previous year. Such differences are usually caused by errors.

9. Search for breaks in the series, i.e., large jumps or differences in the data from one period to the next: these differences are probably caused by an error or by a change in the definitions and classifications adopted.

10. Large changes in the series length: if the number of observations in a data series supplied by a Member State changes substantially, the reason has to be checked, because the change may be caused by an error, by changes in data processing, by retropolation of the series, etc.

11. Aggregated items correspond to the sum of sub-items: when the country provides the breakdown of a given item, the total has to equal the sum of the parts. Similarly, when a country provides different breakdowns of the same data, such as company turnover by region and by activity, the totals of the two breakdowns have to be the same.

12. Cross-checking with other sources: the data from a given country should be checked for coherence with other data from the same country or with data from another country. If differences are found, they have to be investigated and corrected.
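Several of these checks — non-admissible values (check 6), large period-on-period differences (check 8) and aggregate-equals-sum-of-parts (check 11) — lend themselves to a simple automated sketch. The thresholds, field names and figures below are illustrative assumptions, not Eurostat rules.

```python
# Hypothetical sketch of checks 6, 8 and 11 on a country's time series.

def detect_errors(series, breakdown, max_growth=0.5):
    """Run simple validation checks and return a list of (period, message)."""
    errors = []
    # Check 6: non-admissible values (e.g. trade flows cannot be negative).
    for period, value in series.items():
        if value < 0:
            errors.append((period, "non-admissible negative value"))
    # Check 8: large differences between current and past values.
    periods = sorted(series)
    for prev, curr in zip(periods, periods[1:]):
        if series[prev] > 0:
            growth = abs(series[curr] - series[prev]) / series[prev]
            if growth > max_growth:
                errors.append((curr, "suspicious growth rate"))
    # Check 11: the total must equal the sum of the sub-items.
    last = periods[-1]
    if abs(series[last] - sum(breakdown.values())) > 1e-6:
        errors.append((last, "total differs from sum of sub-items"))
    return errors

series = {"2020": 100.0, "2021": 104.0, "2022": 210.0}   # 2022 doubles: flagged
breakdown = {"regionA": 120.0, "regionB": 80.0}          # sums to 200, not 210
flagged = detect_errors(series, breakdown)
```

Each flagged item then has to be investigated with the country involved before correction.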
For example, industrial and foreign trade statistics from the same country can be compared; in foreign trade statistics, the bilateral flows reported by a country should be checked against the corresponding bilateral flows reported by its partners (mirror statistics).

Error correction

When errors are detected in country data, they have to be corrected by Eurostat, possibly after discussion with the national statistical organization involved. Like error detection, correction procedures depend on the particular data, so we discuss the main procedures that can be generally adopted, bearing in mind the corrections performed at Eurostat described above:

- Harmonization of national data: if significant discrepancies arise in the statistics of a given country because of relevant differences in definitions (concepts and classifications), Eurostat has to check whether the instructions for filling in the questionnaire have been followed by the reporting countries and ask the country to recompute the national statistics according to the common definition or classification. This involves checks on the country and economic zone, to ensure that the contents of each country and economic zone have been filled in the same way, and on the economic activity, to check whether all the items (and sub-items) have been aggregated in the same way by Member States. Moreover, the statistical agency of the country involved has to correct the problem for the future, i.e., it has to stop using its own definitions and classifications and start using those set up by Eurostat.

- Correction of the data format and variable classification: this may require considerable programming and computational effort for large data sets, and can thus be time-consuming. If

the classification of the data is wrong, the data have to be regrouped based on the correspondence between the two classifications (nomenclatures) used.

- Changes in the definitions and classifications used: a warning has to be issued about those changes and when they occurred. Retropolation of the series should be computed based on the new definitions, if possible.

- Imputation of incomplete data: when part or the whole of a data set is missing, Eurostat should first try to get the Member State to send the missing data. If this is not possible in a timely manner, the situation is equivalent to a non-response (total or partial) and Eurostat has to impute it (with the methods discussed in section 3). Simply flagging the data as non-available is inadequate and should be avoided.

- Correction of non-admissible values: when this type of error occurs, it may be possible to determine the correct values by using other variables in the same or in other data sets. If this is not possible, the non-admissible values have to be imputed with the methods discussed in section 3.

- Correction of large differences relative to the country's past values: Eurostat should first try to get the Member State involved to correct or confirm the values leading to such differences in a timely manner. If this is not possible, the correction has to be made by Eurostat. This issue is related to the outlier detection and correction discussed in section 4. Nevertheless, some errors can be corrected by Eurostat with methods like the following, which are very straightforward and easy to apply:

- Corrections using data from the same Member State.
Examples: correction of a variable using the difference of two others (such as net flows from the positive and negative flows) or the sum of others (such as the flows of an aggregated item from the individual items); correction of a variable using net amounts, i.e., by comparing the available net amounts with the result of computing them from the difference of the variables involved (and likewise for sums); correction of a variable using others (such as the flows for an aggregated partner zone from the flows of the other partner zones).

- Corrections using data from other Member States. Examples: correction of the intra-EU flows of a Member State using the available bilateral flows of its partners; correction of the extra-EU flows of a Member State using published data from other sources (such as the OECD, IMF or UN) with extra-EU bilateral flows to or from that Member State.

These simple procedures can also be applied to the previous two items, namely the imputation of missing data and the correction of non-admissible values.

- Correction of incoherencies among variables: when incoherencies are found, they have to be corrected. It is sometimes possible to correct them by using other variables from the same country, such as computing the balance from the difference between credits and debits (as in the first set of examples above). In other situations, data from another country have to be used (as in the second set of examples above). When such corrections are not possible, Eurostat has to impute the values of the incoherent variable(s) using the methods of section 3.

- Series breaks: if they are caused by an error, it has to be corrected. If they are caused by other factors, such as changes in the definitions or classifications used, these changes have to be flagged or the data have to be recomputed with the previous parameters.
If this is not possible in time, it is preferable to impute the values after the break(s) with the methods of section 3 and correct them later.

- Series length: if it changes because of an error, Eurostat should return to the old series. Otherwise, the change should be flagged.
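As an illustration of the corrections using data from other Member States listed above, the following toy sketch fills a Member State's missing intra-EU bilateral flow with the mirror flow reported by its partner. The figures and dictionary layout are invented; in practice mirror flows differ (e.g. cif/fob valuation, timing), so this is only a first approximation, not Eurostat's actual procedure.

```python
# exports[reporter][partner] = reported export flow; None means missing.
exports = {
    "FR": {"DE": 250.0, "IT": None},   # FR's flow to IT is missing
    "IT": {},                          # IT reports no exports here
}
imports = {
    "IT": {"FR": 148.0},               # IT reports imports from FR
}

def fill_from_mirror(reporter, partner):
    """Impute a missing export flow from the partner's reported import."""
    if exports[reporter].get(partner) is None:
        mirror = imports.get(partner, {}).get(reporter)
        if mirror is not None:
            exports[reporter][partner] = mirror
    return exports[reporter][partner]

fill_from_mirror("FR", "IT")   # FR's missing flow to IT is filled from IT's side
```

The same lookup can be run in the other direction, or against external sources such as OECD or UN trade data for extra-EU flows.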

- Correction of incoherencies in the aggregation of data: if aggregated variables do not correspond to the sum of their parts in the breakdown, the former, i.e., the aggregate, has to be corrected. If two different breakdowns of the same data do not have the same total, the parts of each breakdown have to be checked and corrected, and it is possible that some of them have to be imputed (section 3).

- Correction of incoherencies with other sources: if differences between alternative sources are found for the data of a given country, they have to be corrected by using the most reliable source. Sometimes the highest value is chosen from the alternatives. For example, in foreign trade statistics, when the bilateral flows between two Member States do not agree, the highest value should be used and the appropriate corrections made to the total flows of the partner that reported the smaller value (mirror statistics).

Aggregate (Eurostat) data

The data sets received from the Member States have to be scrutinized and error-free before their dissemination, and the two previous editing and correction stages should be sufficient for this purpose. However, some problems or inconsistencies may become apparent only when aggregate (Eurostat) data are computed, such as growth rates, European aggregates, or bilateral flows with other geographical zones or economic entities. Moreover, the aggregate values computed for different geographical zones have to correspond to the aggregation (sum) of the countries involved. Another issue, particularly important for dissemination purposes, is that the figures published by Eurostat have to be compatible with national statistics. When such problems are detected, their cause has to be identified and corrected at the country level, since simply discarding the data received from a country (or several countries) is not an adequate solution: it provides no information on that country (or countries) and prevents the computation of Eurostat aggregates.
Consequently, that solution should not be considered an option, and we are back at the previous stage of editing and correction, which means that the same methods described above for country data apply here. This is the final stage where these methods can be applied and, if no correction is possible in time for dissemination, imputation should be performed (section 3). It is preferable to use imputed data (assuming that the imputation method used is appropriate) than a wrong value or no value at all. After this final stage of corrections is complete, the data are ready for dissemination.

Concluding remarks

Error detection and correction in Eurostat statistical data may be performed at each of the three stages of the production process: at the micro (collection) level, at the country level and at the aggregate (Eurostat) level. Moreover, it should be performed at the earliest stage possible. Ideally, each stage should seek the complete detection and correction of any errors, leaving as few problems as possible to be solved later, because the earlier the detection, the more accurate the correction can be. This simplifies the task of the following stages, achieving higher quality and speeding up the process of data production and dissemination. Member States are responsible for the first stage and Eurostat for the other two. Nevertheless, the latter should play an important role in the coordination and harmonization of the editing procedures used by the former. The checks and corrections applied depend on the stage and on the data set under scrutiny. The more efficient the detection and correction procedures are, the higher the quality of the data and the better the inferences drawn from them. Quality assessment can be made by comparing the corrected values with the corresponding revised data obtained later. To this end, accuracy measures such as the mean squared error or the QI statistic in the employment

survey may be calculated. It is also important to keep a record of the errors detected, their sources and the corrections required, in order to avoid such errors and to improve and speed up the corrections in the future.
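The quality assessment just mentioned can be sketched as follows, comparing corrected values against later revised values with the mean squared error. The figures are invented for illustration.

```python
# Hypothetical sketch: accuracy of corrections measured by the MSE
# between the disseminated (corrected) values and later revisions.

def mean_squared_error(corrected, revised):
    """Average squared difference between corrected and revised values."""
    assert len(corrected) == len(revised)
    return sum((c - r) ** 2 for c, r in zip(corrected, revised)) / len(corrected)

corrected = [101.0, 98.5, 103.2]   # values after editing/imputation
revised   = [100.0, 99.0, 103.0]   # revised data received later
mse = mean_squared_error(corrected, revised)
```

A persistently large MSE for a given source would point to detection or correction procedures that need revisiting.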

3. MISSING DATA AND IMPUTATION

Missing data caused by non-response are a source of error in any data set and require correction. To this end, imputation methods can be used to fill those gaps and provide a complete data set. Non-response errors are often the major source of error in surveys and can lead to serious problems in statistical analysis. It is usual to distinguish missing data caused by unit non-response (total non-response) from missing data caused by item non-response (partial non-response). The former is usually dealt with by reweighting, whereas the latter is usually corrected by imputation.

3.1 Literature review

The literature on the imputation of missing data is vast and covering it thoroughly is far beyond the scope of this document. Nevertheless, we briefly discuss the main and most commonly used methods, including those used in Eurostat and in Member States. The main references in this field are Lehtonen and Pahkinen (2004), Little and Rubin (2002), which we follow closely, and Rubin (2004). Time series models can also be used, and they are in fact a valid, useful and easy-to-implement approach to this problem. However, since they constitute another class of methods with a different perspective, totally based on historical data, we do not consider them here. There are two main classes of methods: single imputation methods, where one value is imputed for each missing item, and multiple imputation methods, where more than one value is imputed to allow the assessment of imputation uncertainty. Each method has advantages and disadvantages, but discussing them is beyond the scope of this document; such a discussion can be found in the references mentioned above or in Eurostat Internal Document (2000).
We start by describing the former.

Single imputation methods

There are two generic approaches to the single imputation of missing data based on the observed values, explicit and implicit modelling, briefly described next.

Explicit modelling

Imputation is based on a formal statistical model and hence the assumptions are explicit. The methods included here are discussed next.

Mean imputation

Unconditional mean imputation: the missing values are replaced (estimated) by the mean of the observed (i.e., respondent) values.

Conditional mean imputation: respondents and non-respondents are first classified into classes (strata) based on the observed variables, and the missing values are replaced by the mean of the respondents in the same class. To avoid the effect of outliers, the median may be used instead of the mean. For categorical data, the mode is used for the imputation.
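The contrast between the two mean-imputation variants can be sketched with invented data: classes are defined by one observed categorical variable, and missing items (None) receive the class mean rather than the overall mean.

```python
# Illustrative sketch: unconditional vs conditional mean imputation.

from statistics import mean

data = [  # (class, value); None = item non-response
    ("A", 10.0), ("A", 12.0), ("A", None),
    ("B", 30.0), ("B", None), ("B", 34.0),
]

# Unconditional mean imputation: one overall respondent mean.
overall = mean(v for _, v in data if v is not None)

# Conditional mean imputation: respondent mean within each class.
def class_mean(cls):
    return mean(v for c, v in data if c == cls and v is not None)

imputed = [(c, v if v is not None else class_mean(c)) for c, v in data]
# The "A" gap receives the class-A mean and the "B" gap the class-B mean,
# instead of the single overall mean that unconditional imputation inserts.
```

Replacing `mean` with `median` (for outlier-prone data) or `mode` (for categorical data) follows the same pattern, as noted above.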

Regression imputation

Deterministic regression imputation: this method replaces the missing values by predicted values from a regression of the missing item on the items observed for the unit. Consider X_1, ..., X_{k-1} fully observed and X_k observed for the first r observations and missing for the last n - r observations. Regression imputation computes the regression of X_k on X_1, ..., X_{k-1} based on the r complete cases and then fills in the missing values as predictions from the regression. Suppose case i has X_ik missing and X_i1, ..., X_{i,k-1} observed. The missing value is imputed using the fitted regression equation

    X̂_ik = β̂_0 + β̂_1 X_i1 + ... + β̂_{k-1} X_{i,k-1}    (3.1)

where β̂_0 is the intercept (which may be zero, leading to a regression through the origin) and β̂_1, ..., β̂_{k-1} are the regression coefficients of X_1, ..., X_{k-1} in the regression of X_k on X_1, ..., X_{k-1} based on the r complete cases (estimated parameters or predicted values of a variable are denoted by a ^). Note that if the observed variables are dummies for a categorical variable, the predictions from regression (3.1) are respondent means within the classes defined by that variable, and this method reduces to conditional mean imputation. The above regression equation has no residual (stochastic) term, and therefore this method is called deterministic regression imputation.

Stochastic regression imputation: a similar approach to the previous one, but a random residual term is added to the right-hand side of the regression equation. Consequently, instead of imputing the mean (3.1), we impute a draw:

    X̂_ik = β̂_0 + β̂_1 X_i1 + ... + β̂_{k-1} X_{i,k-1} + U_ik    (3.2)

where U_ik is a normal random variable with mean zero and variance σ̂², the residual variance from the regression of X_k on X_1, ..., X_{k-1} based on the r complete cases. The addition of this random normal term makes the imputation a draw from the predictive distribution of the missing values, rather than the mean.
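A minimal sketch of both variants, with one fully observed predictor and invented data (this is not Eurostat's implementation, and the residual variance is estimated in the simplest way, dividing by r):

```python
# Deterministic (eq. 3.1) and stochastic (eq. 3.2) regression imputation
# via least squares on the r complete cases.

import numpy as np

rng = np.random.default_rng(0)
# X1 fully observed; Xk observed for the first r cases, missing afterwards.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Xk = np.array([2.1, 3.9, 6.2, 8.0, np.nan, np.nan])
observed = ~np.isnan(Xk)

# Fit Xk = b0 + b1*X1 on the complete cases (design matrix with intercept).
A = np.column_stack([np.ones(observed.sum()), X1[observed]])
(b0, b1), *_ = np.linalg.lstsq(A, Xk[observed], rcond=None)
residuals = Xk[observed] - (b0 + b1 * X1[observed])
sigma2 = (residuals ** 2).sum() / observed.sum()   # simple residual variance

missing = ~observed
deterministic = b0 + b1 * X1[missing]                       # eq. (3.1)
stochastic = deterministic + rng.normal(0, np.sqrt(sigma2), # eq. (3.2)
                                        missing.sum())
```

The stochastic draws differ from run to run (unless the generator is seeded), which is precisely what makes them draws from the predictive distribution rather than its mean.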
If the observed variables are dummies for a categorical variable, the predictions from regression (3.2) are conditional draws (instead of the conditional means of regression (3.1)).

Implicit modelling

The focus is on an algorithm, which implies an underlying model. Assumptions are implicit, but it is necessary to check whether they are reasonable.

Hot deck imputation

This is a common method in survey practice. Missing data are replaced by values drawn from similar respondents, called donors, and there are several donor sampling schemes. Suppose that a sample of n out of N units is selected and that r out of the n sampled values of a variable X are recorded. The mean of X may then be estimated as the mean of the responding and the imputed units:

    X̄_HD = [r X̄_R + (n − r) X̄_NR] / n    (3.3)

where X̄_R is the mean of the respondent units and

    X̄_NR = (1 / (n − r)) Σ_{i=1}^{r} H_i X_i    (3.4)

with H_i the number of times the respondent value X_i is used as a donor (so that Σ_{i=1}^{r} H_i = n − r).
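A toy sketch of hot deck imputation with random donor draws, using invented data; it also verifies that the combined estimate (3.3) equals the plain mean of the completed sample.

```python
# Hot deck imputation: missing values receive donor values drawn (with
# replacement) from the respondents of the same sample.

import random

random.seed(1)
sample = [4.0, 6.0, 8.0, None, None]          # n = 5 units, r = 3 respondents
respondents = [x for x in sample if x is not None]
r, n = len(respondents), len(sample)

# Draw a donor for every missing value.
draws = [random.choice(respondents) for x in sample if x is None]
completed = respondents + draws

x_bar_r = sum(respondents) / r                 # mean of respondents
x_bar_nr = sum(draws) / (n - r)                # mean of imputed units
x_bar_hd = (r * x_bar_r + (n - r) * x_bar_nr) / n   # eq. (3.3)

# The combined estimate equals the plain mean of the completed sample.
assert abs(x_bar_hd - sum(completed) / n) < 1e-12
```

Practical schemes restrict donors to respondents in the same imputation class (similar units), in the spirit of conditional mean imputation.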


More information

Programming Period 2014-2020 Monitoring and Evaluation of European Cohesion Policy European Social Fund

Programming Period 2014-2020 Monitoring and Evaluation of European Cohesion Policy European Social Fund Programming Period 2014-2020 Monitoring and Evaluation of European Cohesion Policy European Social Fund Guidance document Annex D - Practical guidance on data collection and validation May 2016 (Based

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

Seminar on Registers in Statistics - methodology and quality 21-23 May, 2007 Helsinki

Seminar on Registers in Statistics - methodology and quality 21-23 May, 2007 Helsinki Seminar on Registers in Statistics - methodology and quality 21-23 May, 2007 Helsinki Administrative Data in Statistics Canada s Business Surveys: The Present and the Future Wesley Yung, Eric Rancourt

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

More information

Generic Statistical Business Process Model

Generic Statistical Business Process Model Joint UNECE/Eurostat/OECD Work Session on Statistical Metadata (METIS) Generic Statistical Business Process Model Version 4.0 April 2009 Prepared by the UNECE Secretariat 1 I. Background 1. The Joint UNECE

More information

COUNTRY PRACTICE IN ENERGY STATISTICS

COUNTRY PRACTICE IN ENERGY STATISTICS COUNTRY PRACTICE IN ENERGY STATISTICS Topic/Statistics: EP 8-01 Institution/Organization: Czech Statistical Office (CzSO) Country: Czech Republic Date: March 2012 CONTENTS Abstract... 3 1. General information...

More information

Integrated Data Collection System on business surveys in Statistics Portugal

Integrated Data Collection System on business surveys in Statistics Portugal Integrated Data Collection System on business surveys in Statistics Portugal Paulo SARAIVA DOS SANTOS Director, Data Collection Department, Statistics Portugal Carlos VALENTE IS advisor, Data Collection

More information

STATISTICAL DATA EDITING Quality Measures. Section 2.2

STATISTICAL DATA EDITING Quality Measures. Section 2.2 STATISTICAL DATA EDITING Quality Measures 95 Section. IMPACT OF THE EDIT AND IMPUTATION PROCESSES ON DATA QUALITY AND EXAMPLES OF EVALUATION STUDIES Foreword Natalie Shlomo, Central Bureau of Statistics,

More information

Labour Market Flows: February 2016 (Experimental Statistics)

Labour Market Flows: February 2016 (Experimental Statistics) Article Labour Market Flows: February 2016 (Experimental Statistics) These estimates of labour market flows are experimental statistics which have been produced as an aid to understanding the movements

More information

Sampling solutions to the problem of undercoverage in CATI household surveys due to the use of fixed telephone list

Sampling solutions to the problem of undercoverage in CATI household surveys due to the use of fixed telephone list Sampling solutions to the problem of undercoverage in CATI household surveys due to the use of fixed telephone list Claudia De Vitiis, Paolo Righi 1 Abstract: The undercoverage of the fixed line telephone

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

GSBPM. Generic Statistical Business Process Model. (Version 5.0, December 2013)

GSBPM. Generic Statistical Business Process Model. (Version 5.0, December 2013) Generic Statistical Business Process Model GSBPM (Version 5.0, December 2013) About this document This document provides a description of the GSBPM and how it relates to other key standards for statistical

More information

4. Continuous Random Variables, the Pareto and Normal Distributions

4. Continuous Random Variables, the Pareto and Normal Distributions 4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random

More information

This has been categorized into two. A. Master data control and set up B. Utilizing master data the correct way C. Master data Reports

This has been categorized into two. A. Master data control and set up B. Utilizing master data the correct way C. Master data Reports Master Data Management (MDM) is the technology and tool, the processes and personnel required to create and maintain consistent and accurate lists and records predefined as master data. Master data is

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

E-Commerce and ICT Activity, 2012. 95% of businesses had broadband Internet and 82% had a website.

E-Commerce and ICT Activity, 2012. 95% of businesses had broadband Internet and 82% had a website. Statistical Bulletin E-Commerce and ICT Activity, 2012 Coverage: UK Date: 04 December 2013 Geographical Area: UK Theme: Business and Energy Key points E-commerce sales represented 18% of business turnover

More information

Quality Assurance and Quality Control in Surveys

Quality Assurance and Quality Control in Surveys Quality Assurance and Quality Control in Surveys Lars Lyberg Statistics Sweden and Stockholm University PSR Conference on Survey Quality April 17, 2009 Email: lars.lyberg@scb.se The survey process revise

More information

Discussion. Seppo Laaksonen 1. 1. Introduction

Discussion. Seppo Laaksonen 1. 1. Introduction Journal of Official Statistics, Vol. 23, No. 4, 2007, pp. 467 475 Discussion Seppo Laaksonen 1 1. Introduction Bjørnstad s article is a welcome contribution to the discussion on multiple imputation (MI)

More information

Technical Advice Note: Retail Impact Assessments

Technical Advice Note: Retail Impact Assessments Technical Advice Note: Retail Impact Assessments 1 A GUIDE FOR RETAIL IMPACT ASSESSMENTS INTRODUCTION This Technical Advice Note (TAN) has been prepared to assist applicants seeking planning permission

More information

Quality Control of Web-Scraped and Transaction Data (Scanner Data)

Quality Control of Web-Scraped and Transaction Data (Scanner Data) Quality Control of Web-Scraped and Transaction Data (Scanner Data) Ingolf Boettcher 1 1 Statistics Austria, Vienna, Austria; ingolf.boettcher@statistik.gv.at Abstract New data sources such as web-scraped

More information

Approaches for Analyzing Survey Data: a Discussion

Approaches for Analyzing Survey Data: a Discussion Approaches for Analyzing Survey Data: a Discussion David Binder 1, Georgia Roberts 1 Statistics Canada 1 Abstract In recent years, an increasing number of researchers have been able to access survey microdata

More information

Fixed-Effect Versus Random-Effects Models

Fixed-Effect Versus Random-Effects Models CHAPTER 13 Fixed-Effect Versus Random-Effects Models Introduction Definition of a summary effect Estimating the summary effect Extreme effect size in a large study or a small study Confidence interval

More information

ANNUAL QUALITY REPORT

ANNUAL QUALITY REPORT ANNUAL QUALITY REPORT FOR THE SURVEY ANNUAL STATISTICAL SURVEY ON THE QUANTITY OF WASTE AT WASTE LANDFILL SITES (KO-U) FOR 2013 Prepared by: Mojca Žitnik, Marko Polh, Department for Environment and Energy

More information

Glossary of Terms Ability Accommodation Adjusted validity/reliability coefficient Alternate forms Analysis of work Assessment Battery Bias

Glossary of Terms Ability Accommodation Adjusted validity/reliability coefficient Alternate forms Analysis of work Assessment Battery Bias Glossary of Terms Ability A defined domain of cognitive, perceptual, psychomotor, or physical functioning. Accommodation A change in the content, format, and/or administration of a selection procedure

More information

MULTIPLE REGRESSION WITH CATEGORICAL DATA

MULTIPLE REGRESSION WITH CATEGORICAL DATA DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

More information

ANOMALIES IN FORM 5500 FILINGS: LESSONS FROM SUPPLEMENTAL DATA FOR GROUP HEALTH PLAN FUNDING

ANOMALIES IN FORM 5500 FILINGS: LESSONS FROM SUPPLEMENTAL DATA FOR GROUP HEALTH PLAN FUNDING ANOMALIES IN FORM 5500 FILINGS: LESSONS FROM SUPPLEMENTAL DATA FOR GROUP HEALTH PLAN FUNDING Final Report December 14, 2012 Michael J. Brien, PhD Deloitte Financial Advisory Services LLP 202-378-5096 michaelbrien@deloitte.com

More information

Training Course on the Production of ICT Statistics on Households and Businesses. PART A: Statistics on Businesses and on the ICT sector

Training Course on the Production of ICT Statistics on Households and Businesses. PART A: Statistics on Businesses and on the ICT sector Training Course on the Production of ICT Statistics on Households and Businesses PART A: Statistics on Businesses and on the ICT sector MODULE B3: Designing an ICT business survey After completing this

More information

Market Analysis The Nature and Scale of OTC Equity Trading in Europe April 2011

Market Analysis The Nature and Scale of OTC Equity Trading in Europe April 2011 Association for Financial Markets in Europe The Nature and Scale of Equity Trading in Europe April 2011 Executive Summary It is often reported that the proportion of European equities trading that is over-the-counter

More information

Big Data uses cases and implementation pilots at the OECD

Big Data uses cases and implementation pilots at the OECD Distr. GENERAL Working Paper 28 February 2014 ENGLISH ONLY UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE (ECE) CONFERENCE OF EUROPEAN STATISTICIANS ORGANISATION FOR ECONOMIC COOPERATION AND DEVELOPMENT

More information

Schools Value-added Information System Technical Manual

Schools Value-added Information System Technical Manual Schools Value-added Information System Technical Manual Quality Assurance & School-based Support Division Education Bureau 2015 Contents Unit 1 Overview... 1 Unit 2 The Concept of VA... 2 Unit 3 Control

More information

The Elasticity of Taxable Income: A Non-Technical Summary

The Elasticity of Taxable Income: A Non-Technical Summary The Elasticity of Taxable Income: A Non-Technical Summary John Creedy The University of Melbourne Abstract This paper provides a non-technical summary of the concept of the elasticity of taxable income,

More information

Lessons Learned International Evaluation

Lessons Learned International Evaluation 2012 Reusing lessons i-eval THINK Piece, No. 1 i-eval THINK Piece, No. 1 i-eval THINK Piece, No. 3 Lessons Learned Rating Lessons Systems Learned in International Evaluation Utilizing lessons learned from

More information

Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random

Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random [Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sage-ereference.com/survey/article_n298.html] Missing Data An important indicator

More information

Statistics Canada s National Household Survey: State of knowledge for Quebec users

Statistics Canada s National Household Survey: State of knowledge for Quebec users Statistics Canada s National Household Survey: State of knowledge for Quebec users Information note December 2, 2013 INSTITUT DE LA STATISTIQUE DU QUÉBEC Statistics Canada s National Household Survey:

More information

Improvements of the Census Operation of Japan by Using Information Technology

Improvements of the Census Operation of Japan by Using Information Technology Paper to be presented at the 22nd Population Census Conference March 7-9, 2005, Seattle, Washington, USA Improvements of the Census Operation of Japan by Using Information Technology Statistics Bureau

More information

TIPS DATA QUALITY STANDARDS ABOUT TIPS

TIPS DATA QUALITY STANDARDS ABOUT TIPS 2009, NUMBER 12 2 ND EDITION PERFORMANCE MONITORING & EVALUATION TIPS DATA QUALITY STANDARDS ABOUT TIPS These TIPS provide practical advice and suggestions to USAID managers on issues related to performance

More information

APPENDIX N. Data Validation Using Data Descriptors

APPENDIX N. Data Validation Using Data Descriptors APPENDIX N Data Validation Using Data Descriptors Data validation is often defined by six data descriptors: 1) reports to decision maker 2) documentation 3) data sources 4) analytical method and detection

More information

Statistical Bulletin. Annual Survey of Hours and Earnings, 2014 Provisional Results. Key points

Statistical Bulletin. Annual Survey of Hours and Earnings, 2014 Provisional Results. Key points Statistical Bulletin Annual Survey of Hours and Earnings, 2014 Provisional Results Coverage: UK Date: 19 November 2014 Geographical Areas: Country, European (NUTS), Local Authority and County, Parliamentary

More information

HANDBOOK ON PRICE AND VOLUME MEASURES IN NATIONAL ACCOUNTS

HANDBOOK ON PRICE AND VOLUME MEASURES IN NATIONAL ACCOUNTS HANDBOOK ON PRICE AND VOLUME MEASURES IN NATIONAL ACCOUNTS Version prepared for the seminar on Price and Volume Measures, 14-16 March 2001, Statistics Netherlands, Voorburg Handbook on Price and Volume

More information

Retirement routes and economic incentives to retire: a cross-country estimation approach Martin Rasmussen

Retirement routes and economic incentives to retire: a cross-country estimation approach Martin Rasmussen Retirement routes and economic incentives to retire: a cross-country estimation approach Martin Rasmussen Welfare systems and policies Working Paper 1:2005 Working Paper Socialforskningsinstituttet The

More information

Distribution Training Guide. D110 Sales Order Management: Basic

Distribution Training Guide. D110 Sales Order Management: Basic Distribution Training Guide D110 Sales Order Management: Basic Certification Course Prerequisites The combined D110 Sales Order Management certification course consists of a hands- on guide that will walk

More information

Government of Ireland 2013. Material compiled and presented by the Central Statistics Office.

Government of Ireland 2013. Material compiled and presented by the Central Statistics Office. Government of Ireland 2013 Material compiled and presented by the Central Statistics Office. Reproduction is authorised, except for commercial purposes, provided the source is acknowledged. Print ISSN

More information

GUIDELINES FOR CLEANING AND HARMONIZATION OF GENERATIONS AND GENDER SURVEY DATA. Andrej Kveder Alexandra Galico

GUIDELINES FOR CLEANING AND HARMONIZATION OF GENERATIONS AND GENDER SURVEY DATA. Andrej Kveder Alexandra Galico GUIDELINES FOR CLEANING AND HARMONIZATION OF GENERATIONS AND GENDER SURVEY DATA Andrej Kveder Alexandra Galico Table of contents TABLE OF CONTENTS...2 1 INTRODUCTION...3 2 DATA PROCESSING...3 2.1 PRE-EDITING...4

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

EXTERNAL DEBT AND LIABILITIES OF INDUSTRIAL COUNTRIES. Mark Rider. Research Discussion Paper 9405. November 1994. Economic Research Department

EXTERNAL DEBT AND LIABILITIES OF INDUSTRIAL COUNTRIES. Mark Rider. Research Discussion Paper 9405. November 1994. Economic Research Department EXTERNAL DEBT AND LIABILITIES OF INDUSTRIAL COUNTRIES Mark Rider Research Discussion Paper 9405 November 1994 Economic Research Department Reserve Bank of Australia I would like to thank Sally Banguis

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Short-Term Forecasting in Retail Energy Markets

Short-Term Forecasting in Retail Energy Markets Itron White Paper Energy Forecasting Short-Term Forecasting in Retail Energy Markets Frank A. Monforte, Ph.D Director, Itron Forecasting 2006, Itron Inc. All rights reserved. 1 Introduction 4 Forecasting

More information

Data Management Procedures

Data Management Procedures Data Management Procedures Introduction... 166 Data management at the National Centre... 167 Data cleaning by the international contractor... 170 Final review of the data... 172 Next steps in preparing

More information

Meeting of the Group of Experts on Business Registers. Luxemburg, 21 23 September 2015

Meeting of the Group of Experts on Business Registers. Luxemburg, 21 23 September 2015 Meeting of the Group of Experts on Business Registers Luxemburg, 21 23 September 2015 Rico Konen Statistics Netherlands Session No.1 The Dutch Satellite of Self-Employment (SOS) Producing entrepreneurship

More information

IAB Evaluation Study of Methods Used to Assess the Effectiveness of Advertising on the Internet

IAB Evaluation Study of Methods Used to Assess the Effectiveness of Advertising on the Internet IAB Evaluation Study of Methods Used to Assess the Effectiveness of Advertising on the Internet ARF Research Quality Council Paul J. Lavrakas, Ph.D. November 15, 2010 IAB Study of IAE The effectiveness

More information

Standard Quality Profiles for Longitudinal Studies

Standard Quality Profiles for Longitudinal Studies Standard Quality Profiles for Longitudinal Studies Peter Lynn, ULSC DRAFT Last update 28-9-2001 Standard Quality Profiles for Longitudinal Studies Contents 1. Introduction...1 2. Standard Contents of a

More information

AUDIT PROCEDURES RECEIVABLE AND SALES

AUDIT PROCEDURES RECEIVABLE AND SALES 184 AUDIT PROCEDURES RECEIVABLE AND SALES Ștefan Zuca Abstract The overall objective of the audit of accounts receivable and sales is to determine if they are fairly presented in the context of the financial

More information

Insurance and Pension Funding Industry, Except Compulsory Social Services Review

Insurance and Pension Funding Industry, Except Compulsory Social Services Review Methodology of the Monthly Index of Services Insurance and Pension Funding Industry, Except Compulsory Social Services Review Introduction At the launch of the experimental Index of Services (IoS) in December

More information

Producing official statistics via voluntary surveys the National Household Survey in Canada. Marc. Hamel*

Producing official statistics via voluntary surveys the National Household Survey in Canada. Marc. Hamel* Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session STS034) p.1762 Producing official statistics via voluntary surveys the National Household Survey in Canada Marc. Hamel*

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

The Life-Cycle Motive and Money Demand: Further Evidence. Abstract

The Life-Cycle Motive and Money Demand: Further Evidence. Abstract The Life-Cycle Motive and Money Demand: Further Evidence Jan Tin Commerce Department Abstract This study takes a closer look at the relationship between money demand and the life-cycle motive using panel

More information