A system to monitor the quality of automated coding of textual answers to open questions




Research in Official Statistics - Number 2/2001

A system to monitor the quality of automated coding of textual answers to open questions

Stefania Macchia* and Marcello D'Orazio**
Italian National Statistical Institute (ISTAT), Methodological Studies Department
Via C. Balbo, 16. 00184, Rome, ITALY

Keywords: ACTR; coding precision; Labour Force Survey; stratified sampling.

Abstract

The Italian National Statistical Institute (ISTAT) carried out some tests of automated coding of textual answers regarding Occupation, Education Level, Industry, etc., using the ACTR (Automated Coding by Text Recognition) system. The good results obtained led ISTAT to perform a further analysis on a large sample of textual data, in order to define a standardised procedure for reference when using ACTR instead of manual coding during a survey. The analysis shown in this paper aims at building a system to monitor the quality of the results of automated coding, and at verifying the improvements which can be achieved by using the results of the monitoring activity to integrate the automated coding environment.

1. Introduction

Coding written-in answers to the open questions of statistical surveys typically requires dealing with the problem of their variability, which depends on the cultural background of the respondents, on their ways of speaking and, finally, on how interested they are in cooperating. The written answers are often generic or ambiguous, since respondents are not expert in the classifications and so they respond without thinking that their answers have to be coded. The same may happen with interviewers (even if trained in advance), who are not always used to obtaining answers suitable to be easily coded. Automated coding can help in solving the specific problems of cost, time and quality connected with the coding activity. In fact, manual coding implies high costs due to hiring, training and supervising coding personnel.
It requires a long time, especially for complex questions such as Industry or Occupation, and an attempt to reduce time can negatively affect the quality of results. Finally, manual coding does not ensure any standardisation of the process (it is not certain that two different people will assign the same correct code to the same textual description). The process is strongly influenced by factors related to knowledge of the classifications and to the skill and conscientiousness of the coding clerks.

* E-mail: macchia@istat.it; tel. +39 06 4673 2157; fax +39 06 4788 8069.
** E-mail: madorazi@istat.it; tel. +39 06 4673 2278; fax +39 06 4788 8069.

For these reasons various countries, for instance France, the United States, Canada and the United Kingdom, have developed and are successfully using automated coding systems. In France, Lorigny (1988) developed QUID (QUestionnaires d'IDentification), which was used in a number of socio-economic surveys. Later, the SICORE system (Système Informatique de COdage des Réponses aux Enquêtes) was designed to code different variables (Rivière, 1994). In the United States, automated coding has been deeply studied and investigated. The first papers appeared in the 1970s (O'Reagan, 1972; Corbett, 1972); other interesting papers are those of Hellerman (1982) and Appel and Hellermann (1983). For the 1990 Census, the U.S. Bureau of the Census developed the Automated Industry and Occupation Coding System (AIOCS); this system was also adopted in the Current Population Survey and in the Survey of Income and Program Participation (Lyberg and Dean, 1992). During the 1990s, the Bureau of the Census continued research by considering other possible techniques and systems for the automated coding of Industry and Occupation (Creecy et al., 1990 and 1992; Gillman and Appel, 1994 and 1999). Still in the U.S., the Center for Health Statistics developed CLIO (Classification of Industry and Occupation), a system derived directly from the one used to code cause of death (Harris and Chamblee, 1994). Research by Statistics Canada led to the release of the ACTR (Automated Coding by Text Recognition) system, which went into production in 1986 (Wenzowski, 1988). Currently, an updated release of ACTR is used to code different variables for the Census of Population and the Labour Force Survey. The United Kingdom currently uses PDC (Precision Data Coder), a language-specific software package initially designed for Industry coding.
However, to code the different textual variables observed in the UK's current Census of Population, it was decided to adopt a more generalised software program such as ACTR. Considering all these experiences, in 1998 ISTAT decided to test an automated coding system. Instead of developing new software, it was decided to use the third release of the ACTR system supplied by Statistics Canada. ACTR was chosen since it is language-independent and seemed easily adaptable to the Italian language, unlike other systems that are language-specific, such as, for instance, PDC. Moreover, as already mentioned, ACTR is a generalised system, so it can be used for more than one coding application. In addition, it had already been successfully used by other National Statistical Institutes (Tourigny and Moloney, 1995).

2. The ACTR system

ACTR's philosophy is based on methods originally developed at the U.S. Bureau of the Census (Hellerman, 1982), but it uses matching algorithms developed at Statistics Canada (Wenzowski, 1988). The coding activity follows a quite complex phase of text standardisation, called parsing, which provides fourteen different functions, such as character mapping, deletion of trivial words, definition of synonyms and removal of suffixes (these functions are completely managed by the users). The parsing aims to remove grammatical and syntactical differences, so as to make equal two different descriptions with the same semantic content. The parsed answer to be coded is then compared with the parsed descriptions of the dictionary, the so-called reference file. If this search returns a perfect match, called a direct match, a unique code is assigned; otherwise the software uses an algorithm to find the best suitable partial (or fuzzy) matches, giving an indirect match. In practice, in the latter case the software takes out of the reference file all the descriptions that have at least one parsed word in common with the one given by the respondent and assigns them a score. This score, standardised between 0 and 10 (10 corresponds to a perfect match), is computed as a function of the weight given to each single word in common, which is inversely related to its frequency of occurrence in the dictionary. The system orders the descriptions selected from the reference file by decreasing score (S_1 ≥ S_2 ≥ … ≥ S_n) and compares them with three user-defined thresholds: the lower limit (S_min), the upper limit (S_max) and the minimum score difference (ΔS). If S_1 ≥ S_max and S_1 − S_2 ≥ ΔS, the description with score S_1 is said to be a unique winner and a unique code is assigned. If the first two (or more) descriptions have scores greater than or equal to S_max (S_1 ≥ S_max and S_2 ≥ S_max) but their difference is less than the minimum score difference (S_1 − S_2 < ΔS), the system returns both as winners (multiple winners). The same happens if S_min ≤ S_2 ≤ S_1 < S_max; notice that in this case the similarity between the description to be coded and those selected from the reference file is lower than in the previous case. Finally, there are no winners if all the scores are less than S_min (S_1 < S_min), and the system returns a "failed" message. For unique winners no human intervention is required, while all the other cases need to be evaluated by expert coders, to choose which of the multiple winners is the right one or whether to code the failed matches at all. The following example, concerning Occupation, clarifies how the indirect match works.
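Reduced to its essentials, the winner-selection rule just described can be sketched in code. This is an illustrative sketch, not ACTR's implementation: S_max = 8.0 and ΔS = 0.2 are the values quoted in the worked example below, while S_min = 4.0 and the candidate scores are invented for the demonstration.

```python
# Sketch of ACTR's winner-selection rule: given the candidate scores
# (0..10) and the three user-defined thresholds, classify the outcome
# as "unique", "multiple" or "failed".
# Assumed values: s_min = 4.0 is invented; s_max = 8.0 and delta_s = 0.2
# are the ones quoted in the text.

def match_outcome(scores, s_min=4.0, s_max=8.0, delta_s=0.2):
    """Classify an indirect match from its candidate scores."""
    if not scores:
        return "failed"
    s = sorted(scores, reverse=True)          # S_1 >= S_2 >= ... >= S_n
    s1 = s[0]
    s2 = s[1] if len(s) > 1 else 0.0
    if s1 < s_min:
        return "failed"                       # no winners at all
    if s1 >= s_max and s1 - s2 >= delta_s:
        return "unique"                       # unique winner, code assigned
    return "multiple"                         # multiple winners -> expert coders

print(match_outcome([9.33, 6.1]))   # unique (as in the example below)
print(match_outcome([8.5, 8.4]))    # multiple: difference below delta_s
print(match_outcome([3.2]))         # failed: below s_min
```

Only the "unique" outcome bypasses human review, which is why the quality-monitoring work in Section 5 focuses on uniquely coded texts.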
The description "esercente di art. di abbigliamento di vario genere (esclusi i pellami)" ["trader of clothing articles of various kinds (leather goods excluded)"], after the parsing process we defined, becomes "abbigliament commerciant" ["clothes dealer", suffixes removed] and matches the sentence of the reference file (actually in use): "esercente di negozio di abbigliamento" ["shop trader of clothes"]. In practice, the parsing first operates on strings, eliminating certain clauses ("esclusi i pellami"), deleting non-informative strings ("di vario genere"), replacing strings with synonyms and so on. It then operates on words, replacing words with synonyms ("esercente" becomes "commerciante"), deleting non-informative words ("di", "i") and removing the suffixes from all the words that do not have to be treated as exceptions. As the two sentences are similar but not identical, there is an indirect match with a score of 9.33; this score is greater than the threshold S_max = 8.0 and, given that S_1 − S_2 ≥ ΔS = 0.2, a unique code is assigned to the starting description.

Unfortunately, the indirect matching mechanism can produce errors. For example, consider the description "addetto ai servizi ausiliari" ["assigned to auxiliary services"]; it would match (with the current reference file) "addetto ai servizi ausiliari del reattore" ["assigned to the auxiliary services of the reactor"] and, having a high score, it would be uniquely coded. But, as can be seen, the original description does not refer to any reactor; instead, it should match the code corresponding to the description "personale inserviente negli uffici" ["office attendant"]. Hence, when an automatic coding system is in production, the quality of its results has to be monitored, and the coding errors have to be used to update the application environment so as to prevent further errors of the same kind.

3. The construction of the automatic coding environment

Using ACTR requires a training phase, which involves building the environment of the coding system. The first step of the training requires the construction of the coding dictionaries (lists of texts with the corresponding codes); afterwards the system has to be adapted to the language and to each classification; and finally it has to be tested. Building the coding dictionaries (reference files) is the heaviest activity, as their quality and size deeply affect the performance of automated coding. Basically, it involves: (i) re-elaborating the textual descriptions used in the classification manuals in order to make them simple, analytical and unambiguous; and (ii) integrating the classification dictionaries with information based on expert knowledge, with descriptions coming from other related official classifications and with empirical response patterns taken from previous surveys (in order to reproduce the respondents' natural language as closely as possible). The already-mentioned parsing functions, which are managed through parsing files, allow users to adapt the system to the language and to the classification. The implementation of these parsing files is very easy and does not require the user to be a computer expert.
As far as the adaptation to Italian is concerned, in all the applications we built we decided to define as irrelevant the articles, conjunctions and prepositions, and we removed the suffixes which determine singular and plural. Only for Occupation was it necessary to remove the gender suffixes too. On the other hand, the definition of synonyms, both at string and at word level, is a job that requires more effort, since the classifications are complex and answers can vary in their wording. To give an idea of the figures involved: well over 2,000 synonyms were necessary for Occupation, whereas just 287 were defined for Education Level. Up to now, we have trained the system to work with three variables: Occupation, Industry and Education Level. Each variable shows a different level of complexity, due to the complexity of the corresponding classification and to the expected variability in the wording of answers (both these aspects influence the results of automated coding, as confirmed by the experiences of other countries).
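The parsing steps described above (stop-word deletion, synonym replacement, suffix removal) can be sketched in miniature. The word lists below are toy stand-ins, not ACTR's parsing files, and the stemmer is deliberately crude:

```python
# Minimal sketch of the parsing described above, with assumed toy word
# lists; ACTR's real parsing is driven by fourteen user-managed parsing
# files, not hard-coded tables like these.

STOPWORDS = {"di", "i", "il", "la", "e", "ai", "del"}   # articles/prepositions
SYNONYMS = {"esercente": "commerciante"}                # word-level synonyms
SUFFIXES = ("amento", "mento", "ante", "o", "a", "i", "e")  # toy suffix list

def parse(text):
    """Normalise a free-text answer before dictionary matching."""
    words = text.lower().split()
    words = [SYNONYMS.get(w, w) for w in words if w not in STOPWORDS]
    stemmed = []
    for w in words:
        for suf in SUFFIXES:
            if w.endswith(suf) and len(w) > len(suf) + 2:
                w = w[: -len(suf)]
                break
        stemmed.append(w)
    return sorted(stemmed)   # word order is irrelevant for matching

# Two different wordings reduce to the same parsed form,
# ['abbigli', 'commerci', 'negozi'], and so would match directly:
print(parse("esercente di negozio di abbigliamento"))
print(parse("commerciante di abbigliamento negozio"))
```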

The benchmark files we used to train the system for the three variables mentioned were a sample of 9,000 households drawn to perform a quality survey on the 1991 Population Census, and a sample drawn from the Intermediate Census of Industry and Services ("Short Form" survey). To train ACTR, we ran it repeatedly on these samples, each time selecting the empirical answers to be added to the dictionaries and, at the same time, improving the parsing process until the highest possible number of correct unique matches was reached. The rates of matching (answer phrase: single code) obtained at the end of the runs were: 72.5% for Occupation; 86.6% for Education Level; and 54.5% and 73.0% for Industry on the first and second samples respectively (this difference is due to the households' difficulty in answering this question). Hence they were in line with the results obtained by other countries (Lyberg and Dean, 1992).

4. First results of automated coding

After training, the system needs to be tested in order to verify whether the application environment, built using small samples, is suitable for data-sets of a bigger size. For this purpose the quality of automated coding has to be measured in terms of recall, i.e. the percentage of codes automatically assigned, as well as in terms of precision, i.e. the percentage of correct codes automatically assigned. Table 1 shows the results obtained in terms of recall on data collected in the 1994 Health Survey, the 1998 Labour Force Survey (four quarters collected and already manually coded), the 1999 Labour Force Pilot Survey and the 1998 Intermediate Census of Industry and Services ("Long Form" survey). These results are consistent with those obtained during the system training.

                                       Occupation              Industry
Source of texts                  No. texts   Recall     No. texts   Recall
Health Survey                       33,735    72.3%
1998 Labour Force Survey           356,231    71.9%
1999 Labour Force Pilot Survey       1,307    67.6%         1,252    44.6%
"Long Form Survey"                                         37,161    63.0%

Table 1 - Some results on recall of automatic coding

As far as precision is concerned, with the aid of expert coders who analysed all the automatically assigned codes, it was possible to obtain the results shown in the following table.

                                        Occupation                   Industry
Source of texts                 Uniquely coded  Precision   Uniquely coded  Precision
Health Survey                           24,404      97.0%
1999 Labour Force Pilot Survey             884      99.0%              558      86.0%

Table 2 - Precision of automatic coding (1)

(1) Evaluation of precision for the "Long Form" survey is still in progress.

It was not possible to do the same for the Labour Force Survey, due to its large number of texts (256,748 texts coded as unique); here the precision can be evaluated only on a sample basis. Hence the need to build a system to monitor the quality of automatic coding, which can drive the extraction of the sample of texts to be submitted to the expert coders (Section 5).

5. Monitoring and enhancing the quality of automatic coding of large amounts of texts

We analysed the textual answers to the 1998 Labour Force Survey (four quarters collected) with the purpose of: (i) thoroughly evaluating the performance of the automatic coding; (ii) building up a quality monitoring system; and (iii) carrying out further training of the coding environment, whose main purpose is to enrich the dictionary with new texts. As a first step, we quantified how many different texts existed in the original file and defined some frequency classes, so as to evaluate the performance of the system class by class. To identify the different texts, we performed a kind of raw standardisation with only a few parsing functions, so as to delete from the descriptions the articles, conjunctions, prepositions and suffixes (in practice, all the elements that determine the gender of words, the singular/plural, etc.). As can be seen in Table 3, the initial 356,231 texts were reduced to 59,562 different ways of describing the occupation. On the other hand, 74% of these descriptions occurred only once in the original file, thus proving a high variance in the wording of answers, when compared with the 6,319 official elementary definitions derived from just 599 occupations listed in the classification manual.

Original texts: 356,207

Occurrence class     Different texts
Total                59,562 (100.00%)
1                    43,349  (73.78%)
2                     7,344  (12.33%)
3-10                  6,404  (10.75%)
11-50                 1,783   (2.99%)
51-1,000                640   (1.07%)
1,001-10,000             41   (0.07%)

Table 3 - Distribution of different texts by classes of occurrence
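The frequency-class analysis above can be sketched as follows: count how often each (already standardised) text occurs and bucket the distinct texts into occurrence classes. The input data and class edges here are invented for illustration:

```python
# Sketch of the occurrence-class analysis; texts are assumed to be the
# output of the raw standardisation step, and the sample data are invented.

from collections import Counter
import bisect

def occurrence_classes(texts, edges=(1, 2, 10, 50, 1000, 10000)):
    """Return (distinct-text counts, {class upper bound: n. distinct texts})."""
    counts = Counter(texts)                   # distinct text -> n. occurrences
    buckets = {e: 0 for e in edges}
    for freq in counts.values():
        idx = min(bisect.bisect_left(edges, freq), len(edges) - 1)
        buckets[edges[idx]] += 1              # e.g. freq 12 -> class "11-50"
    return counts, buckets

texts = ["operaio"] * 12 + ["impiegato"] * 3 + ["magliaia", "tornitore"]
counts, buckets = occurrence_classes(texts)
print(len(counts), buckets)   # 4 distinct texts: two once, one in 3-10, one in 11-50
```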

5.1. Evaluating the performance of the automatic coding environment

A primary indication of the performance of the automatic coding environment is obtained by comparing its recall on the original data-set (the one with all the non-parsed texts) with its recall on the file of different texts. Obviously, the system recall on this latter file is lower, as can be seen in the table below.

ACTR output   No. texts   Recall (%)
Unique           19,404         32.5
Multiple         20,537         34.5
Failed           19,620         33.0
Total            59,561        100.0

Table 4 - ACTR results on different texts: recall

Recall grows as the frequency class becomes higher (Table 5). In particular, for different texts occurring only once, ACTR assigned a unique code in 27% of cases, while for texts occurring more than 100 times this rate goes beyond 79%. This means that the current reference file already includes most of the occupation descriptions that occur frequently in everyday speech.

ACTR              1             2-10          11-100       101-1,000    1,001-10,000
output        No.      %     No.      %     No.     %     No.     %     No.     %
Unique     11,786   27.2   5,869   42.7   1,437  69.0    273   79.6     39   95.1
Multiple   15,735   36.3   4,303   31.3     431  20.8     66   19.2      2    4.9
Failed     15,828   36.5   3,576   26.0     212  10.2      4    1.2      0    0.0
Total      43,349  100.0  13,748  100.0   2,080 100.0    343  100.0     41  100.0

Table 5 - ACTR results by frequency class of different texts: recall
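The recall rates in Table 5 can be re-derived directly from the counts, as a quick consistency check (the percentages agree with the printed ones to within rounding):

```python
# Re-derive the per-class recall of Table 5: recall is the share of unique
# matches among all different texts in the frequency class.

TABLE5 = {                      # class: (unique, multiple, failed)
    "1":            (11_786, 15_735, 15_828),
    "2-10":         (5_869, 4_303, 3_576),
    "11-100":       (1_437, 431, 212),
    "101-1,000":    (273, 66, 4),
    "1,001-10,000": (39, 2, 0),
}

for cls, (u, m, f) in TABLE5.items():
    total = u + m + f
    print(f"{cls:>12}: unique {100 * u / total:.1f}% of {total}")
```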

5.2. Lack of standardisation of the manual coding process

The quality of automated coding can be further evaluated by comparing it with the level of standardisation of the manual coding process. As the Labour Force data had previously been manually coded, we could quantify the different codes assigned by the manual coders to the same text (Table 6).

Texts frequency     Different codes assigned to equal texts
class               Max     Mean   Median   Mode
2                     2     1.27        2      1
3-5                   5     1.84        3      1
6-10                 10     2.68        3      1
11-50                33     4.65        4      2
51-100               42    10.05        8      4
101-1,000           119    18.65       14      7
1,001-10,000        389    67.46       51     33

Table 6 - Lack of standardisation in manual coding

The results in this table show how low the level of standardisation of manual coding is. The discrepancy between the codes assigned by different operators can usually be ascribed to different interpretations of the response text, to different knowledge of the classification and to misunderstandings. On the other hand, there is surely a percentage of texts (which we could not quantify) to which operators assigned different codes in view of some other information taken from other correlated questions in the questionnaire (for instance, Industry).

5.3. The system to monitor the quality

Given the characteristics of ACTR, the sample of n different texts to be checked has to be drawn from those uniquely coded with a score less than 10 (N = 13,821). In fact, a text coded with a score of 10, corresponding to a direct match, has a correct code (unless there are some mistakes in the reference file). We decided to use a stratified random sampling design to draw the sample. In practice, the texts were first stratified according to their frequency of occurrence; then, within each stratum, a simple random sample (without replacement) of texts was selected. The strata coincided with the previously defined classes of occurrence, with the exception of the 1,001-10,000 class, given that all its 41 different texts had a coding score equal to 10, i.e. they were all correctly coded. In deciding the sample size, it is possible to choose between two different strategies: (i) to compute the overall sample size and then allocate it between the strata; or (ii) to compute

the sample size independently for each stratum, according to the precision of the estimates required in each of them.

With the first strategy, the overall optimal sample size can be computed approximately by using Neyman allocation (see, e.g., Cochran, 1977, p. 105). In this circumstance it is important to decide a priori how the sample should be allocated between the strata. For example, with proportional allocation the sample is allocated according to the relative size of each stratum, W_h = N_h / N. However, for the problem at hand, a better approach could be to allocate the sample according to the relative sum of the frequencies of the different texts in each class, so as to sample more of the different texts with higher importance. The advantage of this procedure is that the overall sample size is computed directly, given an allocation criterion. The disadvantage is that for some stratum the optimal sample size may be greater than the entire stratum size (N_h); in this case, one has to revise the allocation following Cochran (1977, p. 104).

The alternative strategy avoids this last problem; it involves deciding the optimal sample size independently from stratum to stratum, using each time the expression (see Cochran, 1977, pp. 75-76):

    n_h = n_0h / (1 + n_0h / N_h),   with   n_0h = z²_{1−α/2} · π̃_h (1 − π̃_h) / d_h²,   h = 1, 2, …, L.

In this expression, π̃_h is the hypothesised precision of automated coding for texts belonging to class h; d_h represents the margin of error allowed in estimating the unknown precision π_h of automated coding; and z_{1−α/2} is the percentile of the standardised normal distribution such that Pr(|π̂_h − π_h| ≤ d_h) = 1 − α. The overall sample size is then obtained by summing the optimal sample sizes so obtained: n = n_1 + n_2 + … + n_L. The problem with this procedure is that n may easily explode if some of the n_h values are too large.

We used this latter strategy to decide the size of the sample of texts to submit to the expert coders. An equal precision rate of automated coding, π̃_h = 0.75, was hypothesised in each class, while the margin of error d_h was progressively reduced (fourth column of Table 7) in the higher classes of occurrence of different texts; this guaranteed estimates with higher precision for the heaviest different texts. The approximate sample sizes computed for the various classes with α = 0.05 (z_0.975 = 1.96) can be found in Table 7.

Classes of     Different      Hypothesised       Margin of     Optimal sample   Sampling fraction
occurrence     texts (N_h)    precision (π̃_h)    error (d_h)   size (n_h)       (f_h = n_h/N_h)
1                  10,007          75.0%           ±5.0%              148             1.48%
2                   1,756          75.0%           ±5.0%              138             7.86%
3-5                 1,187          75.0%           ±4.5%              160            13.48%
6-10                  473          75.0%           ±3.0%              222            46.93%
11-50                 349          75.0%           ±2.5%              221            63.32%
51-100                 33          75.0%           ±1.0%               33           100.00%
101-1,000              16          75.0%           ±1.0%               16           100.00%
Tot.               13,821                                             938             6.79%

Table 7 - Optimal sample sizes in the strata

The sample of 938 texts was then submitted to the expert coders, in order to evaluate whether ACTR had assigned correct codes. In this way it was possible to estimate the precision for each class of occurrence and hence for all the 13,821 different texts. The estimates, computed using the theory of stratified random sampling (cf. Cochran, 1977, pp. 90-96), can be found in Table 8, together with the corresponding values needed to derive the 95% confidence intervals (last column of the table).

Classes of     Different   Sample   Sampling       Estimated       Estimated
occurrence     texts       size     fraction (%)   precision (%)   margin of error
1                 10,007      148       1.48           74.32          ±6.99
2                  1,756      138       7.86           81.88          ±6.17
3-5                1,187      160      13.48           78.13          ±5.96
6-10                 473      222      46.93           73.42          ±4.23
11-50                349      221      63.32           80.09          ±3.19
51-100                33       33     100.00           87.88
101-1,000             16       16     100.00           81.25
Tot.              13,821      938       6.79           75.77          ±5.19

Table 8 - Estimated precision of automatic coding of different texts

As can be seen, we estimated that 75.77% of the 13,821 different texts were correctly coded by ACTR. The true precision lies approximately between 70.58% (75.77 − 5.19) and 80.95% (75.77 + 5.19) with a probability of 0.95. The precision tends to be higher (over 80%) for the last classes. Notice that for the last two classes we do not have an estimate but the true precision, as all the texts (rather than a sample) were checked. Here the coding precision is over 80%, which further proves that the system works well with the more frequent descriptions.
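The per-stratum sample-size expression above can be sketched as a short script: the normal-approximation size n_0h, followed by the finite-population correction. The inputs below mirror the first row of Table 7; note that the published sample sizes also reflect choices not fully reported in the text, so this illustrates the formula rather than reproducing the table exactly.

```python
# Sketch of the per-stratum sample-size rule:
#   n_0h = z^2 * pi~(1 - pi~) / d^2,  then  n_h = n_0h / (1 + n_0h / N_h).
# Inputs mirror the first stratum of Table 7 (N_h = 10,007, hypothesised
# precision 75%, margin of error +/- 5 points); this is an illustration of
# the formula, not a reproduction of the published sizes.

import math
from statistics import NormalDist

def stratum_sample_size(N_h, pi_tilde, d_h, alpha=0.05):
    """Sample size with normal approximation and finite-population correction."""
    z = NormalDist().inv_cdf(1 - alpha / 2)          # z_{0.975} ~ 1.96
    n0 = z ** 2 * pi_tilde * (1 - pi_tilde) / d_h ** 2
    return math.ceil(n0 / (1 + n0 / N_h))

print(stratum_sample_size(10_007, 0.75, 0.05))
```

As the text notes, halving d_h roughly quadruples n_0h, which is why the total n can explode when tight margins are requested in every stratum.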

Researc in Official Statistics Number 2/2001 If we consider also te 6,083 ( 19, 904 13, 821) different texts coded wit a score of 10 (all correctly coded), te overall estimated precision goes up to 83.17% of 19,904 different texts. Te estimated precision of automated coding wen applied to original texts can be easily derived from tat of te different texts, by considering te occurrences of tese latter ones (Table 9). In practice eac different text can be viewed as a cluster of original texts and te teory of cluster sampling allows us to derive te estimates of precision and te corresponding 95%-confidence intervals reported in Table below. Classes of occurrences Different texts Original Texts Estimated precision (%) Estimated margin of error 1 10,007 10,007 74.32 ±7.01 2 1,756 3,512 81.88 ±6.19 3-5 1,187 4,337 78.34 ±6.55 6-10 473 3,492 73.40 ±4.52 11-50 349 7,320 86.29 ±5.08 51-100 33 2,214 87.49 101-1,000 16 3,731 81,96 Tot. 13,821 34,613 79.70 ±2.57 Table 9 Estimated precision of automatic coding of original texts It is estimated tat te 79,7% (27,586 texts) of te 34,613 original texts uniquely coded wit a score less tan 10 were coded correctly. Te true precision lies between 77.13% ( 79. 7 2. 57 ) and 82.26% ( 79. 7 2. 57 ), wit an approximate confidence of 0.95. Here too, if we consider te 222,135 original texts uniquely coded wit a score equal to 10, it comes out tat 249,721 of te 256,748 original texts uniquely coded ad a correct code (i.e. 97.26%). Tis last estimate is in line wit te one obtained for te Healt survey (see Table 2). Tus, wit a small but well designed sample (in tis case 6.79% of single texts) it was possible to evaluate te precision of automated coding results wit a ig confidence. 5.4. 
5.4. First results of the further training of the coding environment

The further training phase consists in adding new texts to the dictionary and updating the coding environment: (i) to prevent the coding errors found from recurring on further texts; and (ii) to increase future recall rates. To prevent further coding errors, the sample of different texts to which ACTR did not assign a correct code, as determined by expert coders, needs to be analysed. In order to increase the future recall rate, the texts for which ACTR was not successful in assigning a single code need to be examined, including coded texts having enough informative content to be assigned a unique code (i.e. those which are not too generic, or

S. Macchia and M. D'Orazio: A system to monitor the quality of automated coding of textual answers to open questions

which describe concepts that can be directly linked with single codes). In this regard, it is convenient to examine the more frequent texts first, while the analysis of texts belonging to lower frequency classes, given their minor importance, can be restricted to a sample. We analysed the failed matches returned by ACTR when coding the file of different texts occurring more than 10 times. By analysing only 216 different texts (212 belonging to the 11-100 class of occurrences and 4 to the 101-1,000 one), we added 299 new texts to the reference file and 46 synonyms (at both the string and the word level). The recall rates obtained on the original-text data set after this further training activity are shown in Table 10.

ACTR output   No. texts     %
Unique          269,485    75.6
Multiple         58,848    16.6
Failed           27,898     7.8
Total           356,231   100.0

Table 10: ACTR results on the original Labour Force survey sample after the further training: recall

As can be seen, the recall rate rises from 71.9% to 75.6%, and it would likely be even higher if we had also analysed the multiple matches. Finally, we verified whether the update of the coding environment for Occupation, achieved by analysing Labour Force descriptions, could also improve other coding applications performed on data from other surveys. For this purpose, we automatically coded the Health survey texts again, as this was the next biggest file at our disposal after the Labour Force one. As shown in Table 11, the recall rate grows from 72.3% to 75.1%, confirming that the outcomes of each coding application represent precious feedback for updating the coding environment and offer the chance of achieving higher recall rates.

ACTR output   No. texts     %
Unique           25,337    75.1
Multiple          5,827    17.3
Failed            2,571     7.6
Total            33,735   100.0

Table 11: ACTR results on the Health survey sample after the further training: recall
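The recall figures in Tables 10 and 11 are simply the share of texts to which ACTR assigned a unique code. From Table 10, for instance:

```python
# Recall = share of texts assigned a unique code by ACTR.
# Counts from Table 10 (Labour Force original texts, after further training).
table10 = {"unique": 269485, "multiple": 58848, "failed": 27898}

total = sum(table10.values())
recall = table10["unique"] / total
print(f"total={total}, recall={100 * recall:.1f}%")  # total=356231, recall=75.6%
```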

6. Conclusions

As mentioned in the previous sections, ISTAT invested much work and time in introducing the automatic coding of written-in answers to open questions regarding Occupation, Education Level and Industry by means of the ACTR system. Most of the work involved building the reference files and the corresponding parsing files for both Occupation and Industry. The first results obtained in this direction (Section 4) were encouraging, especially when compared with those of manual coding, and led us to further improve the automatic coding environment, using all available sources of textual descriptions to enhance the reference files and to refine the parsing step. Alongside this activity, we thought it necessary to introduce an evaluation procedure to quantify the quality of the ACTR output (Section 5). This procedure was kept as general as possible in order to get a reliable idea, even if on a sample basis, of how well ACTR codes texts that do not exactly correspond to descriptions in the reference file. We performed this evaluation step on a large number of texts regarding Occupation (Section 5.3), and the results obtained were particularly satisfactory (overall coding precision was estimated at about 97%). All the work invested in ACTR training and the good results obtained in the testing/evaluation phase convinced us that it can successfully be adopted for use in different surveys, even to code descriptions as complex as those of Occupation and Industry, giving more consistent results than those of manual coders. Moreover, these results seem to be achievable at a lower cost, and the gains are likely to increase with the amount of descriptions collected. In any case, the application of ACTR should constantly be monitored in all its phases. Despite the advantages, it has to be kept in mind that the application of ACTR still presents a problem in the cases where the system fails to assign a unique code. Different solutions are available.
One, for example, could be to try to code by making use of additional information derived from related questions on the same form. This could be achieved automatically or with the intervention of expert coders, perhaps aided by an assisted coding system. If this does not work, and if the number of unsolved cases is not high, it may be necessary to consider re-contacting the respondents. Therefore, further investigation is needed in order to choose the strategy that performs best in terms of the number of cases solved, costs and quality of the final results.

