ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models


Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, Ken Goldberg
UC Berkeley, Columbia University

ABSTRACT
Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. However, many data cleaning workflows can introduce subtle biases into the training process due to violations of independence assumptions. We propose ActiveClean, a progressive cleaning approach where the model is updated incrementally instead of re-trained, and which can guarantee accuracy on partially cleaned data. ActiveClean supports a popular class of models called convex loss models (e.g., linear regression and SVMs). ActiveClean also leverages the structure of a user's model to prioritize cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets (UCI Adult, UCI EEG, MNIST, Dollars for Docs, and WorldBank) with both real and synthetic errors. Our results suggest that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.

1. INTRODUCTION
Machine learning on large and growing datasets is a key data management challenge with significant interest in both industry and academia [1, 5, 1, 2]. Despite a number of breakthroughs in reducing training time, predictive modeling can still be a tedious and time-consuming task for an analyst. Data often arrive dirty, including missing, incorrect, or inconsistent attributes, and analysts widely report that data cleaning and other forms of pre-processing account for up to 80% of their effort [3, 22]. While data cleaning is an extensively studied problem, the predictive modeling setting poses a number of new challenges: (1) high dimensionality can amplify even a small amount of erroneous records [36], (2) the complexity can make it difficult to trace the consequences of an error, and (3) there are often subtle technical conditions (e.g., independent and identically distributed data) that can be violated by data cleaning. Consequently, techniques that have been designed for traditional SQL analytics may be inefficient or even unreliable. In this paper, we study the relationship between data cleaning and model training workflows and explore how to apply existing data cleaning approaches with provable guarantees.

One of the main bottlenecks in data cleaning is the human effort in determining which data are dirty and then developing rules or software to correct the problems. For some types of dirty data, such as inconsistent values, model training may seemingly succeed, albeit with potentially subtle inaccuracies in the model. For example, battery-powered sensors can transmit unreliable measurements when battery levels are low [21]. Similarly, data entered by humans can be susceptible to a variety of inconsistencies (e.g., typos) and unintentional cognitive biases [23]. Such problems are often addressed in a time-consuming loop where the analyst trains a model, inspects the model and its predictions, cleans some data, and re-trains. This iterative process is the de facto standard, but without appropriate care it can lead to several serious statistical issues. Due to the well-known Simpson's paradox, models trained on a mix of dirty and clean data can have very misleading results even in simple scenarios (Figure 1).
Furthermore, if the candidate dirty records are not identified with a known sampling distribution, the statistical independence assumptions of most training methods are violated. The violations of these assumptions can introduce confounding biases. To this end, we designed ActiveClean, which trains predictive models while allowing for iterative data cleaning and has accuracy guarantees. ActiveClean automates the dirty-data identification process and the model update process, thereby abstracting these two error-prone steps away from the analyst. ActiveClean is inspired by the recent success of progressive data cleaning, where a user can gradually clean more data until the desired accuracy is reached [6, 34, 3, 17, 26, 38, 37]. We focus on a popular class of models called convex loss models (which includes, e.g., linear regression and SVMs) and show that the Simpson's paradox problem can be avoided using iterative maintenance of a model rather than re-training. This process leverages the convex structure of the model rather than treating it like a black box, and we apply convergence arguments from convex optimization theory. We propose several novel optimizations that leverage information from the model to guide data cleaning towards the records most likely to be dirty and most likely to affect the results.

To summarize the contributions:

Correctness (Section 5). We show how to update a dirty model given newly cleaned data. This update converges monotonically in expectation. For a batch size b and T iterations, it converges with rate O(1/(bT)).

Efficiency (Section 6). We derive a theoretically optimal sampling distribution that minimizes the update error and an approximation to estimate the theoretical optimum.

Detection and Estimation (Section 7). We show how

ActiveClean can be integrated with data detection to guide data cleaning towards records expected to be dirty. The experiments evaluate these components on four datasets with real and synthetic corruption (Section 8). Results suggest that for a fixed cleaning budget, ActiveClean returns more accurate models than uniform sampling and Active Learning when systematic corruption is sparse.

2. BACKGROUND AND PROBLEM SETUP
This section formalizes the iterative data cleaning and training process and highlights an example application.

2.1 Predictive Modeling
The user provides a relation R and wishes to train a model using the data in R. This work focuses on a class of well-analyzed predictive analytics problems: ones that can be expressed as the minimization of convex loss functions. Convex loss minimization problems are amenable to a variety of incremental optimization methodologies with provable guarantees (see Friedman, Hastie, and Tibshirani [15] for an introduction). Examples include generalized linear models (including linear and logistic regression) and support vector machines; in fact, means and medians are also special cases. We assume that the user provides a featurizer F(·) that maps every record r ∈ R to a feature vector x and label y. For labeled training examples {(x_i, y_i)}_{i=1}^N, the problem is to find a vector of model parameters θ by minimizing a loss function φ over all training examples:

    θ* = argmin_θ Σ_{i=1}^N φ(x_i, y_i, θ)

where φ is a convex function in θ. For example, in a linear regression, φ is:

    φ(x_i, y_i, θ) = ‖θᵀx_i − y_i‖²₂

Typically, a regularization term r(θ) is added to this problem. r(θ) penalizes high or low values of feature weights in θ to avoid overfitting to noise in the training examples:

    θ* = argmin_θ Σ_{i=1}^N φ(x_i, y_i, θ) + r(θ)    (1)

In this work, without loss of generality, we include the regularization as part of the loss function, i.e., φ(x_i, y_i, θ) includes r(θ).

2.2 Data Cleaning
We consider corruption that affects the attribute values of records. This does not cover errors that simultaneously affect multiple records, such as record duplication, or that affect the structure, such as schema transformation. Examples of supported cleaning operations include batch-resolving common inconsistencies (e.g., merging "U.S.A." and "United States"), filtering outliers (e.g., removing records with values > 1e6), and standardizing attribute semantics (e.g., "1.2 miles" and "1.93 km"). We are particularly interested in those errors that are difficult or time-consuming to clean, and that require the analyst to examine an erroneous record and determine the appropriate action, possibly leveraging knowledge of the current best model. We represent this operation as Clean(·), which can be applied to a record r (or a set of records) to recover the clean record r′ = Clean(r). Formally, we treat Clean(·) as an expensive user-defined function composed of deterministic schema-preserving map and filter operations applied to a subset of rows in the relation. A relation is defined as clean if R_clean = Clean(R_clean). Therefore, for every r′ ∈ R_clean there exists a unique r ∈ R in the dirty data. The map-and-filter cleaning model is not a fundamental restriction of ActiveClean, and Appendix A discusses a compatible "set of records" cleaning model.
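To make this setup concrete, here is a minimal sketch (ours, not the paper's implementation) of a featurizer F(·) and a regularized hinge loss trained by plain gradient descent. The record format and the featurization rule are hypothetical placeholders:

```python
import numpy as np

def featurize(record):
    # Hypothetical F(.): all but the last field become the feature
    # vector x; the sign of the last field becomes the binary label y.
    x = np.asarray(record[:-1], dtype=float)
    y = 1.0 if record[-1] > 0 else -1.0
    return x, y

def svm_gradient(theta, x, y, lam=0.1):
    # Gradient of the hinge loss, with the L2 regularizer r(theta)
    # folded into phi as in Equation (1).
    g = 2 * lam * theta
    if y * x.dot(theta) < 1:
        g = g - y * x
    return g

def train(records, steps=500, gamma=0.1):
    # Batch gradient descent on the convex loss over all examples.
    data = [featurize(r) for r in records]
    theta = np.zeros(len(data[0][0]))
    for _ in range(steps):
        grad = sum(svm_gradient(theta, x, y) for x, y in data) / len(data)
        theta = theta - gamma * grad
    return theta
```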
2.3 Iteration
As an example of how Clean(·) fits into an iterative analysis process, consider an analyst training a regression and identifying outliers. When she examines one of the outliers, she realizes that the base data (prior to featurization) has a formatting inconsistency that leads to incorrect parsing of the numerical values. She applies a batch fix (i.e., Clean(·)) to all of the outliers with the same error, and re-trains the model. This iterative process can be described by the following pseudocode loop:

1. Init(iter)
2. current_model = Train(R)
3. For each t in {1, ..., iter}:
   (a) dirty_sample = Identify(R, current_model)
   (b) clean_sample = Clean(dirty_sample)
   (c) current_model = Update(clean_sample, R)
4. Output: current_model

While we have already discussed Train(·) and Clean(·), the analyst still has to define the primitives Identify(·) and Update(·). For Identify(·), given the current best model, the analyst must specify some criteria to select a set of records to examine. And in Update(·), the analyst must decide how to update the model given newly cleaned data. It turns out that these primitives are not trivial to implement, since the straightforward solutions can actually lead to divergence of the trained models.

2.4 Challenges
Correctness: Let us assume that the analyst has implemented an Identify(·) function that returns k candidate dirty records. The straightforward application of data cleaning is to repair the corruption in place, and re-train the model after each repair. Suppose k ≪ N records are cleaned, but all of the remaining dirty records are retained in the dataset. Figure 1 highlights the dangers of this approach on a very simple dirty dataset and a linear regression model, i.e., the best-fit line for two variables. One of the variables is systematically corrupted with a translation in the x-axis (Figure 1a). The dirty data are marked in red and the clean data in blue, and they are shown with their respective best-fit lines. After cleaning only two of the data points (Figure 1b), the resulting best-fit line points in the opposite direction of the true model. Aggregates over mixtures of different populations of data can result in spurious relationships due to the well-known phenomenon called Simpson's paradox [32]. Simpson's paradox is by no means a corner case, and it has affected the validity of a number of high-profile studies [35], even in the simple case of taking an average over a dataset. Predictive models are high-dimensional generalizations of these aggregates without closed-form techniques to compensate for these

biases. Thus, training models on a mixture of dirty and clean data can lead to unreliable results, where artificial trends introduced by the mixture can be confused with the effects of data cleaning.

Figure 1: (a) Systematic corruption in one variable can lead to a shifted model. (b) Mixed dirty and clean data results in a less accurate model than no cleaning. (c) Small samples of only clean data can result in similarly inaccurate models.

An alternative is to avoid the dirty data altogether instead of mixing the two populations, and restrict the model re-training to only data that are known to be clean. This approach is similar to SampleClean [33], which was proposed to approximate the results of aggregate queries by applying them to a clean sample of data. However, high-dimensional models are highly sensitive to sample size. Figure 1c illustrates that, even in two dimensions, models trained from small samples can be as incorrect as the mixing solution described before.

Efficiency: Conversely, hypothetically assume that the analyst has implemented a correct Update(·) primitive and implements Identify(·) with a technique such as Active Learning to select records to clean [37, 38, 16]. Active learning is a technique to carefully select the set of examples from which to learn the most accurate model. However, these selection criteria are designed for stationary data distributions, an assumption which is not true in this setting. As more data are cleaned, the data distribution changes. Data which may look unimportant in the dirty data might be very valuable to clean in reality, and thus any prioritization has to predict a record's value with respect to an anticipated clean model.

2.5 The Need For Automation
ActiveClean is a framework that implements the Identify(·) and Update(·) primitives for the analyst. By automating the iterative process, ActiveClean ensures reliable models with convergence guarantees. The analyst first initializes ActiveClean with a dirty model. ActiveClean carefully selects small batches of data to clean based on data that are likely to be dirty and likely to affect the model. The analyst applies data cleaning to these batches, and ActiveClean updates the model with an incremental optimization technique.

Machine learning has been applied in prior work to improve the efficiency of data cleaning [37, 38, 16]. Human input, either for cleaning or for validation of automated cleaning, is often expensive and impractical for large datasets. A model can learn rules from a small set of examples cleaned (or validated) by a human, and active learning is a technique to carefully select the set of examples from which to learn the most accurate model. This model can be used to extrapolate repairs to not-yet-cleaned data, and the goal of these approaches is to provide the cleanest possible dataset independent of the subsequent analytics or query processing. These approaches, while very effective, suffer from composability problems when placed inside cleaning-and-training loops. To summarize, ActiveClean considers data cleaning during model training, while these techniques consider model training for data cleaning. One of the primary contributions of this work is an incremental model update algorithm with correctness guarantees for mixtures of data.

2.6 Use Case: Dollars for Docs [2]
ProPublica collected a dataset of corporate donations to doctors to analyze conflicts of interest. They reported that some doctors received over $500,000 in travel, meals, and consultation expenses [4]. ProPublica laboriously curated and cleaned a dataset from the Centers for Medicare and Medicaid Services that listed nearly 250,000 research donations, and aggregated these donations by physician, drug, and pharmaceutical company.
We collected the raw unaggregated data and explored whether suspect donations could be predicted with a model. This problem is typical of analysis scenarios based on observational data seen in finance, insurance, medicine, and investigative journalism. The dataset has the following schema:

Contribution(pi_specialty, drug_name, device_name, corporation, amount, dispute, status)

pi_specialty is a textual attribute describing the specialty of the doctor receiving the donation. drug_name is the branded name of the drug in the research study (null if not a drug). device_name is the branded name of the device in the study (null if not a device). corporation is the name of the pharmaceutical providing the donation. amount is a numerical attribute representing the donation amount. dispute is a Boolean attribute describing whether the research was disputed. status is a string label describing whether the donation was allowed under the declared research protocol. The goal is to predict disallowed donations.

However, this dataset is very dirty, and the systematic nature of the data corruption can result in an inaccurate model. On the ProPublica website [2], they list numerous types of data problems that had to be cleaned before publishing the data (see Appendix I). For example, the most significant donations were made by large companies whose names were also more often inconsistently represented in the data, e.g., "Pfizer Inc.", "Pfizer Incorporated", "Pfizer". In such scenarios, the effect of systematic error can be serious: duplicate representations could artificially reduce the correlation between these entities and suspected contributions. Nearly 40,000 of the 250,000 records had either naming inconsistencies or other inconsistencies in labeling the allowed or disallowed status. Without data cleaning, the detection rate using a Support Vector Machine was 66%. Applying the data cleaning to the entire dataset improved this rate to 97% on the clean data (Section 8.6.1), and the experiments describe how ActiveClean can achieve an 80% detection rate for less than 1.6% of the records cleaned.
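As a hedged illustration of this use case (not ProPublica's or the paper's code), the prediction task might be set up with a bag-of-words featurization of the textual attributes and a linear SVM; the two sample rows below are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

rows = [
    {"pi_specialty": "Cardiology", "drug_name": "DrugA", "device_name": "",
     "corporation": "Pfizer Inc.", "amount": 1200.0, "dispute": False,
     "status": "allowed"},
    {"pi_specialty": "Oncology", "drug_name": "DrugB", "device_name": "",
     "corporation": "Pfizer Incorporated", "amount": 50000.0, "dispute": True,
     "status": "disallowed"},
]

# Concatenate the textual attributes into one document per record.
docs = [" ".join([r["pi_specialty"], r["drug_name"], r["device_name"],
                  r["corporation"]]) for r in rows]
labels = [r["status"] == "disallowed" for r in rows]

X = CountVectorizer().fit_transform(docs)
clf = LinearSVC().fit(X, labels)
# Inconsistent names ("Pfizer Inc." vs. "Pfizer Incorporated") become
# distinct features, which is exactly how systematic dirtiness can
# dilute the corporation/status correlation.
```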

3. PROBLEM FORMALIZATION
This section formalizes the problems addressed in the paper.

3.1 Notation and Setup
The user provides a relation R, a cleaner C(·), a featurizer F(·), and a convex loss problem defined by the loss φ(·). A total of k records will be cleaned in batches of size b, so there will be k/b iterations. We use the following notation to represent the relevant intermediate states:

Dirty Model: θ^(d) is the model trained on R (without cleaning) with the featurizer F(·) and loss φ(·). This serves as an initialization to ActiveClean.
Dirty Records: R_dirty ⊆ R is the subset of records that are still dirty. As more data are cleaned, R_dirty → {}.
Clean Records: R_clean ⊆ R is the subset of records that are clean, i.e., the complement of R_dirty.
Samples: S is a sample (possibly non-uniform but with known probabilities) of the records in R_dirty. The clean sample is denoted by S_clean = C(S).
Clean Model: θ^(c) is the optimal clean model, i.e., the model trained on a fully cleaned relation.
Current Model: θ^(t) is the current best model at iteration t ∈ {1, ..., k/b}, and θ^(0) = θ^(d).

There are two metrics that we will use to measure the performance of ActiveClean:
Model Error. The model error is defined as ‖θ^(t) − θ^(c)‖.
Testing Error. Let T(θ^(t)) be the out-of-sample testing error when the current best model is applied to the clean data, and T(θ^(c)) be the test error when the clean model is applied to the clean data. The testing error is defined as T(θ^(t)) − T(θ^(c)).

3.2 Problem 1. Correct Update Problem
Given newly cleaned data S_clean and the current best model θ^(t), the model update problem is to calculate θ^(t+1). θ^(t+1) will have some error with respect to the true model θ^(c), which we denote as:

    error(θ^(t+1)) = ‖θ^(t+1) − θ^(c)‖

Since a sample of data are cleaned, it is only meaningful to talk about expected errors. We call the update algorithm "reliable" if the expected error is upper bounded by a monotonically decreasing function μ of the amount of cleaned data:

    E(error(θ_new)) = O(μ(|S_clean|))

Intuitively, "reliable" means that more cleaning should imply more accuracy. The Correct Update Problem is to reliably update the model θ^(t) with a sample of cleaned data.

3.3 Problem 2. Efficiency Problem
The efficiency problem is to select S_clean such that the expected error E(error(θ^(t))) is minimized. ActiveClean uses previously cleaned data to estimate the value of data cleaning on new records. Then it draws a sample of records S ⊆ R_dirty. This is a non-uniform sample where each record r has a sampling probability p(r) based on the estimates. We derive the optimal sampling distribution for the SGD updates, and show how the theoretical optimum can be approximated. The Efficiency Problem is to select a sampling distribution p(·) over all records such that the expected error w.r.t. the model trained on fully clean data is minimized.

4. ARCHITECTURE
This section presents the ActiveClean architecture.

4.1 Overview
Figure 2 illustrates the ActiveClean architecture. The dotted boxes describe optional components that the user can provide to improve the efficiency of the system.

Figure 2: ActiveClean allows users to train predictive models while progressively cleaning data. The framework adaptively selects the best data to clean and can optionally (denoted with dotted lines) integrate with predefined detection rules and estimation algorithms for improved convergence.

4.1.1 Required User Input
Model: The user provides a predictive model (e.g., SVM) specified as a convex loss optimization problem φ(·) and a featurizer F(·) that maps a record to its feature vector x and label y.
Cleaning Function: The user provides a function C(·) (implemented via software or crowdsourcing) that maps dirty records to clean records as per our definition in Section 2.2.
Batches: Data are cleaned in batches of size b, and the user can change these settings if she desires more or less frequent model updates. The choice of b does affect the convergence rate; Section 5 discusses the efficiency and convergence trade-offs of different values of b. We empirically find that a batch size of 50 performs well across different datasets and use that as a default. A cleaning budget k can be used as a stopping criterion once C(·) has been called k times, and so the number of iterations of ActiveClean is T = k/b. Alternatively, the user can clean data until the model is of sufficient accuracy to make a decision. Taken together, the required inputs might be packaged as in the sketch below.
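A minimal sketch, using hypothetical names and a Python dataclass; nothing here is ActiveClean's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ActiveCleanConfig:
    featurize: Callable   # F(.): record -> (x, y)
    gradient: Callable    # gradient of phi(x, y, theta), regularizer included
    clean: Callable       # C(.): dirty record -> clean record
    batch_size: int = 50  # b: records cleaned per iteration
    budget: int = 1000    # k: total calls to C(.) before stopping
```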
4.1.2 Basic Data Flow
The system first trains the model φ(·) on the dirty dataset to find an initial model θ^(d) that the system will subsequently improve. The sampler selects a sample of size b records from the dataset and passes the sample to the cleaner, which executes C(·) for each sample record and outputs their cleaned versions. The updater uses the cleaned sample to update the weights of the model, thus moving the model closer to the true cleaned model (in expectation). Finally, the system either terminates due to a stopping condition (e.g., C(·) has been called a maximum number of times k, or training error convergence), or passes control to the sampler for the next iteration.

4.1.3 Optimizations
In many cases, such as missing values, errors can be efficiently detected. A user-provided Detector can be used to identify records that are more likely to be dirty, and thus improves the likelihood that the next sample will contain true dirty records. Furthermore, the Estimator uses previously cleaned data to estimate the effect that cleaning a given record will have on the model. These components can be used separately (if only one is supplied) or together to focus the system's cleaning efforts on records that will most improve the model. Section 7 describes several instantiations of these components for different data cleaning problems. Our experiments show that these optimizations can improve model accuracy by up to 2.5x (Section 8.3.2).

4.2 Example
The following example illustrates how a user would apply ActiveClean to address the use case in Section 2.6:

EXAMPLE 1. The analyst chooses to use an SVM model, and manually cleans records by hand (the C(·)). ActiveClean initially selects a sample of 50 records (the default) to show the analyst. She identifies a subset of 15 records that are dirty, fixes them by normalizing the drug and corporation names with the help of a search engine, and corrects the labels with typographical or incorrect values. The system then uses the cleaned records to update the current best model and select the next sample of 50. The analyst can stop at any time and use the improved model to predict donation likelihoods.

5. UPDATES WITH CORRECTNESS
This section describes an algorithm for reliable model updates. The updater assumes that it is given a sample of data S_dirty from R_dirty, where S_dirty has a known sampling probability p(·). Sections 6 and 7 show how to optimize p(·), and the analysis in this section applies for any sampling distribution p(·) > 0.

5.1 Geometric Derivation
The update algorithm intuitively follows from the convex geometry of the problem. Consider the problem in one dimension (i.e., the parameter θ is a scalar value), so the goal is to find the minimum point θ* of a curve ℓ(θ). The consequence of dirty data is that the wrong loss function is optimized. Figure 3A illustrates the consequence of the optimization. The red dotted line shows the loss function on the dirty data. Optimizing this loss function finds θ^(d) at its minimum point (red star). However, the true loss function (w.r.t. the clean data) is in blue; thus the optimal value on the dirty data is in fact a suboptimal point on the clean curve (red circle).

Figure 3: (A) A model trained on dirty data can be thought of as a sub-optimal point w.r.t. the clean data. (B) The gradient gives us the direction to move the suboptimal model to approach the true optimum.

The optimal clean model θ^(c) is visualized as a yellow star. The first question is in which direction to update θ^(d) (i.e., left or right). For this class of models, given a suboptimal point, the direction to the global optimum is the gradient of the loss function. The gradient is a d-dimensional vector function of the current model θ^(d) and the clean data. Therefore, ActiveClean needs to update θ^(d) some distance γ (Figure 3B):

    θ_new ← θ^(d) − γ·∇φ(θ^(d))

At the optimal point, the magnitude of the gradient will be zero. So, intuitively, this approach iteratively moves the model downhill (transparent red circle), correcting the dirty model until the desired accuracy is reached. However, the gradient depends on all of the clean data, which is not available, and ActiveClean will have to approximate the gradient from a sample of newly cleaned data. The main intuition is that if the gradient steps are on average correct, the model still moves downhill, albeit with a reduced convergence rate proportional to the inaccuracy of the sample-based estimate.

To derive a sample-based update rule, the most important property is that sums commute with derivatives and gradients. The convex loss class of models are sums of losses, so given the current best model θ, the true gradient g*(θ) is:

    g*(θ) = ∇φ(θ) = (1/N) Σ_{i=1}^N ∇φ(x_i^(c), y_i^(c), θ)

ActiveClean needs to estimate g*(θ) from a sample S, which is drawn from the dirty data R_dirty.
Therefore, the sum has two components: the gradient on the already-clean data g_C, which can be computed without cleaning, and g_S, the gradient estimate from a sample of dirty data to be cleaned:

    g(θ) = (|R_clean|/|R|)·g_C(θ) + (|R_dirty|/|R|)·g_S(θ)

g_C can be calculated by applying the gradient to all of the already-cleaned records:

    g_C(θ) = (1/|R_clean|) Σ_{i∈R_clean} ∇φ(x_i^(c), y_i^(c), θ)

g_S can be estimated from a sample by taking the gradient w.r.t. each record, and re-weighting the average by the records' respective sampling probabilities. Before taking the gradient, the cleaning function C(·) is applied to each sampled record. Therefore, let S be a sample of data, where each i ∈ S is drawn with probability p(i):

    g_S(θ) = (1/|S|) Σ_{i∈S} (1/p(i)) ∇φ(x_i^(c), y_i^(c), θ)

Then, at each iteration t, the update becomes:

    θ^(t+1) ← θ^(t) − γ·g(θ^(t))
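The following is a short sketch of this combined gradient, assuming NumPy and a gradient function like the svm_gradient sketch from Section 2.1. Here probs holds each sampled record's probability under the sampling distribution over R_dirty:

```python
import numpy as np

def combined_gradient(theta, clean_data, cleaned_sample, probs,
                      n_clean, n_dirty, grad):
    n = n_clean + n_dirty
    # Full gradient over the already-clean records (g_C); zero if none yet.
    if n_clean:
        g_c = sum(grad(theta, x, y) for x, y in clean_data) / n_clean
    else:
        g_c = np.zeros_like(theta)
    # Importance-weighted estimate from the freshly cleaned sample (g_S).
    # Since probs sums to 1 over the n_dirty records, dividing by
    # (n_dirty * p) rescales each draw to an unbiased mean gradient
    # over the dirty records.
    g_s = sum(grad(theta, x, y) / (n_dirty * p)
              for (x, y), p in zip(cleaned_sample, probs)) / len(cleaned_sample)
    return (n_clean / n) * g_c + (n_dirty / n) * g_s
```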

5.2 Model Update Algorithm
To summarize, the algorithm is initialized with θ^(0) = θ^(d), which is the dirty model. There are three user-set parameters: the budget k, the batch size b, and the step size γ. In the following section, we provide references from the convex optimization literature that allow the user to appropriately select these values. At each iteration t = {1, ..., T}, the cleaning is applied to a batch of b records selected from the set of candidate dirty records R_dirty. Then, an average gradient is estimated from the cleaned batch and the model is updated. Iterations continue until k = T·b records are cleaned.

1. Calculate the gradient over the sample of newly cleaned data and call the result g_S(θ^(t)).
2. Calculate the average gradient over all of the already-clean data in R_clean = R − R_dirty, and call the result g_C(θ^(t)).
3. Apply the following update rule:

    θ^(t+1) ← θ^(t) − γ·((|R_dirty|/|R|)·g_S(θ^(t)) + (|R_clean|/|R|)·g_C(θ^(t)))

5.3 Analysis with Stochastic Gradient Descent
The update algorithm can be formalized as a member of a class of very well studied algorithms called Stochastic Gradient Descent (SGD). SGD provides a theoretical framework to understand and analyze the update rule and bound the error. Mini-batch SGD is an algorithm for finding the optimal value given the convex loss and data. In mini-batch SGD, random subsets of data are selected at each iteration and the average gradient is computed for every batch. One key difference with traditional SGD models is that ActiveClean applies a full gradient step on the already-clean data and averages it with a stochastic gradient step (i.e., calculated from a sample) on the dirty data. Therefore, ActiveClean iterations can take multiple passes over the clean data but at most a single cleaning pass over the dirty data. The update algorithm can be thought of as a variant of SGD that lazily materializes the clean value: as data are sampled at each iteration, data are cleaned when needed by the optimization. It is well known that even for an arbitrary initialization, SGD makes significant progress in less than one epoch (a pass through the entire dataset) [9]. In practice, the dirty model can be much more accurate than an arbitrary initialization, as corruption may only affect a few features; combined with the full gradient step on the clean data, the updates converge very quickly.

Setting the step size γ: There is extensive literature in machine learning on choosing the step size γ appropriately. γ can be set either to a constant or decayed over time. Many machine learning frameworks (e.g., MLlib, scikit-learn, Vowpal Wabbit) automatically set learning rates or provide different learning-rate scheduling frameworks. In the experiments, we use a technique called inverse scaling, where there is a parameter γ_0 = 0.1, and at each iteration t it decays to γ_t = γ_0/√t.

Setting the batch size b: The batch size should be set by the user to have the desired properties. Larger batches will take longer to clean and will make more progress towards the clean model, but will have less frequent model updates. On the other hand, smaller batches are cleaned faster and have more frequent model updates. There are diminishing returns O(1/b) to increasing the batch size. In the experiments, we use a batch size of 50, which converges fast but allows for frequent model updates. If a data cleaning technique requires a larger batch size than 50, i.e., data cleaning is fast enough that the iteration overhead is significant compared to cleaning 50 records, ActiveClean can apply the updates in smaller batches. For example, the batch size set by the user might be b = 1,000, but the model updates after every 50 records are cleaned. We can disassociate the batching requirements of SGD from the batching requirements of the data cleaning technique.

5.3.1 Convergence Conditions and Properties
Convergence properties of batch SGD formulations have been well studied [11]. Essentially, if the gradient estimate is unbiased and the step size is appropriately chosen, the algorithm is guaranteed to converge. In Appendix B, we show that the gradient estimate from ActiveClean is indeed unbiased and that our choice of step size is one that is established to converge. The convergence rates of SGD are also well analyzed [11, 8, 39].
The analysis gives a bound on the error of intermediate models and the expected number of steps before achieving a model within a certain error. For a general convex loss, a batch size b, and T iterations, the convergence rate is bounded by O(σ²/(bT)), where σ² is the variance in the estimate of the gradient at each iteration:

    σ² = E(‖g − g*‖²)

where g* is the gradient computed over the full data if it were fully cleaned. This property of SGD allows us to bound the model error with a monotonically decreasing function of the number of records cleaned, thus satisfying the reliability condition in the problem statement. If the loss is non-convex, the update procedure will converge towards a local minimum rather than the global minimum (see Appendix C).

5.4 Example
This example describes an application of the update algorithm.

EXAMPLE 2. Recall that the analyst has a dirty SVM model θ^(d) trained on the dirty data. She decides that she has a budget of cleaning 100 records, and decides to clean the 100 records in batches of 10 (set based on how fast she can clean the data, and how often she wants to see an updated result). All of the data is initially treated as dirty, with R_dirty = R and R_clean = ∅. The gradient of a basic SVM is given by the following function:

    ∇φ(x, y, θ) = { −y·x  if y·xᵀθ ≤ 1
                  { 0     if y·xᵀθ > 1

For each iteration t, a sample of 10 records S is drawn from R_dirty. ActiveClean then applies the cleaning function to the sample:

    {(x_i^(c), y_i^(c))} = {C(i) : i ∈ S}

Using these values, ActiveClean estimates the gradient on the newly cleaned data:

    g_S(θ) = (1/|S|) Σ_{i∈S} (1/p(i)) ∇φ(x_i^(c), y_i^(c), θ)

ActiveClean also applies the gradient to the already-clean data (initially non-existent):

    g_C(θ) = (1/|R_clean|) Σ_{i∈R_clean} ∇φ(x_i^(c), y_i^(c), θ)

Then, it calculates the update rule:

    θ^(t+1) ← θ^(t) − γ·((|R_dirty|/|R|)·g_S(θ^(t)) + (|R_clean|/|R|)·g_C(θ^(t)))

Finally, R_dirty ← R_dirty − S, R_clean ← R_clean + S, and continue to the next iteration.
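Putting the pieces together, here is a sketch of the full loop from Example 2, reusing svm_gradient and combined_gradient from the earlier sketches. clean_fn stands in for the analyst's C(·), and uniform sampling is used here as a placeholder for the non-uniform distributions of Sections 6 and 7:

```python
import numpy as np

def active_clean(dirty, clean_fn, featurize, theta, batch=10, budget=100,
                 gamma=0.1):
    r_dirty = list(dirty)     # candidate dirty records
    r_clean = []              # accumulated cleaned (x, y) pairs
    for _ in range(budget // batch):
        if len(r_dirty) < batch:
            break
        # Uniform p(.) for simplicity; with non-uniform p(.) the weights
        # in combined_gradient keep the estimate (approximately) unbiased.
        probs = np.full(len(r_dirty), 1.0 / len(r_dirty))
        idx = np.random.choice(len(r_dirty), size=batch, replace=False,
                               p=probs)
        sample = [featurize(clean_fn(r_dirty[i])) for i in idx]
        g = combined_gradient(theta, r_clean, sample, probs[idx],
                              len(r_clean), len(r_dirty), svm_gradient)
        theta = theta - gamma * g
        r_clean.extend(sample)
        idx_set = set(int(i) for i in idx)
        r_dirty = [r for i, r in enumerate(r_dirty) if i not in idx_set]
    return theta
```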

6. EFFICIENCY WITH SAMPLING
The updater receives a sample with probabilities p(·). For any distribution where p(·) > 0, we can preserve correctness. ActiveClean uses a sampling algorithm that selects the most valuable records to clean with higher probability.

6.1 Oracle Sampling Problem
Recall that the convergence rate of an SGD algorithm is bounded by σ², which is the variance of the gradient. Intuitively, the variance measures how accurately the gradient is estimated from a uniform sample. Other sampling distributions, while preserving the sample expected value, may have a lower variance. Thus, the oracle sampling problem is defined as a search over sampling distributions to find the minimum-variance sampling distribution.

DEFINITION 1 (ORACLE SAMPLING PROBLEM). Given a set of candidate dirty data R_dirty, for each r ∈ R_dirty find sampling probabilities p(r) such that over all samples S of size k it minimizes:

    E(‖g_S − g*‖²)

It can be shown that the optimal distribution over records in R_dirty sets probabilities proportional to:

    p_i ∝ ‖∇φ(x_i^(c), y_i^(c), θ^(t))‖

This is an established result; for thoroughness, we provide a proof in the appendix (Section D). Intuitively, records with higher gradients should be sampled with higher probability, as they affect the update more significantly. However, ActiveClean cannot exclude records with lower gradients, as that would induce a bias that hurts convergence. The problem is that the optimal distribution leads to a chicken-and-egg situation: the optimal sampling distribution requires knowing (x_i^(c), y_i^(c)); however, cleaning is required to know those values.

6.2 Dirty Gradient Solution
Such an oracle does not exist, and one solution is to use the gradient w.r.t. the dirty data:

    p_i ∝ ‖∇φ(x_i^(d), y_i^(d), θ^(t))‖

It turns out that this solution works reasonably well in practice on our experimental datasets and has been studied in Machine Learning as the Expected Gradient Length heuristic [31]. The contribution of this work is integrating this heuristic with statistically correct updates. Intuitively, however, approximating the oracle as closely as possible can result in improved prioritization. The subsequent section describes two components, the detector and the estimator, that can be used to improve the convergence rate. Our experiments suggest up to a 2x improvement in convergence when using these optional optimizations (Section 8.3.2).
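A sketch of this heuristic, assuming the svm_gradient helper from earlier; the small epsilon is our addition to keep every p(i) > 0, as correctness requires:

```python
import numpy as np

def sampling_distribution(theta, dirty_data, grad, eps=1e-6):
    # Norm of the gradient computed on the still-dirty values; eps keeps
    # every p(i) strictly positive (Section 6 requires p(.) > 0).
    norms = np.array([np.linalg.norm(grad(theta, x, y))
                      for x, y in dirty_data]) + eps
    return norms / norms.sum()
```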
7. OPTIMIZATIONS
In this section, we describe two approaches to optimization, the Detector and the Estimator, that improve the efficiency of the cleaning process. Both approaches are designed to increase the likelihood that the Sampler will pick dirty records that, once cleaned, most move the model towards the true clean model. The Detector is intended to learn the characteristics that distinguish dirty records from clean records, while the Estimator is designed to estimate the amount by which cleaning a given dirty record will move the model towards the true optimal model.

7.1 The Detector
The detector returns two important aspects of a record: (1) whether the record is dirty, and (2) if it is dirty, what is wrong with the record. The sampler can use (1) to select a subset of dirty records to sample at each batch, and the estimator can use (2) to estimate the value of data cleaning based on other records with the same corruption. ActiveClean supports two types of detectors: a priori and adaptive. The former assumes that we know, prior to running ActiveClean, the set of dirty records and how they are dirty, while the latter adaptively learns characteristics of the dirty data as part of running ActiveClean.

7.1.1 A Priori Detector
For many types of dirtiness, such as missing attribute values and constraint violations, it is possible to efficiently enumerate a set of corrupted records and determine how the records are corrupted.

DEFINITION 2 (A PRIORI DETECTION). Let r be a record in R. An a priori detector is a detector that returns a Boolean indicating whether the record is dirty and a set of columns e_r that are dirty:

    D(r) = ({0, 1}, e_r)

From the set of columns that are dirty, we find the corresponding features that are dirty, f_r, and labels that are dirty, l_r.

Here is an example of this definition using a data cleaning methodology proposed in the literature.

Constraint-based Repair: One model for detecting errors involves declaring constraints on the database.
Detection. Let Σ be a set of constraints on the relation R. In the detection step, the detector selects a subset of records R_dirty ⊆ R that violate at least one constraint. The set e_r is the set of columns of each record which have a constraint violation.

EXAMPLE 3. An example of a constraint on the running example dataset is that the status of a contribution can only be "allowed" or "disallowed". Any other value for status is an error.

7.2 Adaptive Detection
A priori detection is not possible in all cases. The detector also supports adaptive detection, where detection is learned from previously cleaned data. Note that this "learning" is distinct from the "learning" at the end of the pipeline. The challenge in formulating this problem is that the detector needs to describe how the data is dirty (e.g., e_r in the a priori case). The detector achieves this by categorizing the corruption into u classes. These classes are corruption categories that do not necessarily align with features, but every record is classified with at most one category. When using adaptive detection, the repair step has to clean the data and report to which of the u classes the corrupted record belongs. When an example (x, y) is cleaned, the repair step labels it with one of the 1, 2, ..., u + 1 classes (including one for "not dirty"). It is possible that u increases at each iteration as more types of dirtiness are discovered. In many real-world datasets, data errors have locality, where similar records tend to be similarly corrupted. There are usually a small number of error classes even if a large number of records are corrupted.

One approach for adaptive detection is using a statistical classifier. This approach is particularly suited to a small number of data error classes, each of which contains many erroneous records. This problem can be addressed by any classifier, and we use an all-versus-one SVM in our experiments; a minimal sketch of such a detector follows.
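This is a hedged sketch (the class interface is hypothetical) assuming scikit-learn's LinearSVC and pre-featurized inputs:

```python
from sklearn.svm import LinearSVC

class AdaptiveDetector:
    def __init__(self):
        self.clf = LinearSVC()
        self.X, self.y = [], []
        self.fitted = False

    def record_outcome(self, x, corruption_class):
        # Called after a record is cleaned; corruption_class is 0 for
        # "not dirty" or one of the u error classes, and u may grow.
        self.X.append(x)
        self.y.append(corruption_class)
        if len(set(self.y)) > 1:   # need at least two classes to fit
            self.clf.fit(self.X, self.y)
            self.fitted = True

    def detect(self, x):
        # Returns (is_dirty, predicted corruption class); treats every
        # record as dirty until the classifier has been trained.
        if not self.fitted:
            return True, 0
        c = int(self.clf.predict([x])[0])
        return c != 0, c
```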

Another approach could be to adaptively learn predicates that define each of the error classes. For example, if records with certain attributes are corrupted, a pattern tableau can be assigned to each class to select a set of possibly corrupted records. This approach is better suited than a statistical approach when there are a large number of error classes or a scarcity of errors. However, it relies on errors being well aligned with certain attribute values.

DEFINITION 3 (ADAPTIVE CASE). Select the set of records for which κ gives a positive error classification (i.e., one of the u error classes). After each sample of data is cleaned, the classifier κ is retrained. So the result is:

    D(r) = ({0, 1}, {1, ..., u + 1})

Adaptive Detection With OpenRefine:

EXAMPLE 4. OpenRefine is a spreadsheet-based tool that allows users to explore and transform data. However, it is limited to cleaning data that can fit in memory on a single computer. Since the cleaning operations are coupled with data exploration, ActiveClean does not know what is dirty in advance (the analyst may discover new errors as she cleans). Suppose the analyst wants to use OpenRefine to clean the running example dataset with ActiveClean. She takes a sample of data from the entire dataset and uses the tool to discover errors. For example, she finds that some drugs are incorrectly classified as both drugs and devices. She then removes the device attribute for all records that have the drug name in question. As she fixes the records, she tags each one with a category tag of the corruption class it belongs to.

7.3 The Estimator
To get around the problem with oracle sampling, the estimator estimates the cleaned value using previously cleaned data. The estimator also takes advantage of the detector from the previous section. There are a number of different approaches, such as regression, that could be used to estimate the cleaned value given the dirty values. However, there is a problem of scarcity, where errors may affect only a small number of records. As a result, a regression approach would have to learn a multivariate function with only a few examples; thus, high-dimensional regression is ill-suited for the estimator. Conversely, one could try a very simple estimator that just calculates an average change and adds this change to all of the gradients. This estimator can be highly inaccurate, as it also applies the change to records that are known to be clean. ActiveClean leverages the detector for an estimator between these two extremes. The estimator calculates average changes feature-by-feature and selectively corrects the gradient when a feature is known to be corrupted based on the detector. It also applies a linearization that leads to improved estimates when the sample size is small. We evaluate the linearization in Section 8.5 against alternatives and find that it provides more accurate estimates for a small number of samples cleaned. The result is a biased estimator; when the number of cleaned samples is large, the alternative techniques are comparable or even slightly better due to the bias.
Estimation For A Priori Detection. If most of the features are correct, it would seem like the gradient is only incorrect in one or two of its components. The problem is that the gradient ∇φ(·) can be a very nonlinear function of the features that couples features together. For example, the gradient for linear regression is:

    ∇φ(x, y, θ) = (θᵀx − y)x

It is not possible to isolate the effect of a change of one feature on the gradient. Even if only one of the features is corrupted, all of the gradient components will be incorrect. To address this problem, the gradient can be approximated in a way that the effects of dirty features on the gradient are decoupled. Recall that, in the a priori detection problem, associated with each r ∈ R_dirty is a pair f_r, l_r identifying the set of corrupted features and labels. This property can be used to construct a coarse estimate of the clean value. The main idea is to calculate average changes for each feature; then, given an uncleaned (but dirty) record, add these average changes to correct the gradient. To formalize the intuition, instead of computing the actual gradient with respect to the true clean values, compute the conditional expectation given that the set of features and labels f_r, l_r is corrupted:

    p_i ∝ ‖E(∇φ(x_i^(c), y_i^(c), θ^(t)) | f_r, l_r)‖

Corrupted features are defined such that:

    i ∉ f_r  ⟹  x^(c)[i] − x^(d)[i] = 0
    i ∉ l_r  ⟹  y^(c)[i] − y^(d)[i] = 0

The needed approximation represents a linearization of the errors, and the resulting approximation will be of the form:

    p(r) ∝ ‖∇φ(x, y, θ^(t)) + M_x·Δ̄_rx + M_y·Δ̄_ry‖

where M_x, M_y are matrices and Δ̄_rx and Δ̄_ry are vectors with one component for each feature and label, where each value is the average change for those features that are corrupted and 0 otherwise. Essentially, it is the gradient with respect to the dirty data plus a linear correction factor. In the appendix, we present a derivation using a Taylor series expansion and the M_x and M_y matrices for common convex losses (Appendices E and F). The appendix also describes how to maintain Δ̄_rx and Δ̄_ry as cleaning progresses.

Estimation For Adaptive Case. A similar procedure holds in the adaptive setting; however, it requires reformulation. Here, ActiveClean uses the u corruption classes provided by the detector. Instead of conditioning on the features that are corrupted, the estimator conditions on the classes. So for each error class u, it computes a Δ̄_ux and a Δ̄_uy: the average change in the features given that class, and the average change in the labels given that class.

    p(r_u) ∝ ‖∇φ(x, y, θ^(t)) + M_x·Δ̄_ux + M_y·Δ̄_uy‖
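A sketch of this linearized estimate for the hinge loss, under the assumptions of Example 5 below (errors in features only, so the M_y term is dropped); avg_change plays the role of Δ̄x, and the diagonal M_x matches the SVM matrix given in Example 5:

```python
import numpy as np

def estimated_weight(theta, x, y, corrupted_features, avg_change, grad):
    d = len(x)
    # M_x for the hinge loss: -y on the diagonal on the margin-violating
    # side, zero otherwise. Label errors (M_y) are ignored for brevity.
    if y * x.dot(theta) < 1:
        m_x = -y * np.eye(d)
    else:
        m_x = np.zeros((d, d))
    # Zero out average changes for features not flagged as corrupted.
    delta_x = np.zeros(d)
    delta_x[corrupted_features] = avg_change[corrupted_features]
    return np.linalg.norm(grad(theta, x, y) + m_x.dot(delta_x))
```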

Example
Here is an example of using the optimization to select a sample of data for cleaning.

EXAMPLE 5. Consider using ActiveClean with an a priori detector. Let us assume that there are no errors in the labels and only errors in the features. Then, each training example will have a set of corrupted features (e.g., {1, 2, 6}, {1, 2, 15}). Suppose that the cleaner has just cleaned the records r_1 and r_2, represented as tuples with their corrupted feature sets: (r_1, {1, 2, 3}), (r_2, {1, 2, 6}). For each feature i, ActiveClean maintains the average change between the dirty and clean values in a vector Δ̄x[i], computed over those records corrupted on that feature. Then, given a new record (r_3, {1, 2, 3, 6}), Δ̄_r3x is the vector Δ̄x where component i is set to 0 if feature i is not corrupted. Suppose the data analyst is using an SVM; then the M_x matrix is as follows:

    M_x[i, i] = { −y  if y·xᵀθ ≤ 1
                { 0   if y·xᵀθ > 1

Thus, we calculate a sampling weight for record r_3:

    p(r_3) ∝ ‖∇φ(x, y, θ^(t)) + M_x·Δ̄_r3x‖

To turn the result into a probability distribution, ActiveClean normalizes over all dirty records.

8. EXPERIMENTS
First, the experiments evaluate how various types of corrupted data benefit from data cleaning. Next, the experiments explore different prioritization and model update schemes for progressive data cleaning. Finally, ActiveClean is evaluated end-to-end in a number of real-world data cleaning scenarios.

8.1 Experimental Setup and Notation
The main metric for evaluation is a relative measure of the trained model against the model if all of the data were cleaned.
Relative Model Error. Let θ be the model trained on the dirty data, and let θ′ be the model trained on the same data if it were cleaned. Then the relative model error is defined as ‖θ − θ′‖/‖θ′‖.

8.1.1 Scenarios
Income Classification (Adult): In this dataset of 45,552 records, the task is to predict the income bracket (binary) from 12 numerical and categorical covariates with an SVM classifier.
Seizure Classification (EEG): In this dataset, the task is to predict the onset of a seizure (binary) from 15 numerical covariates with a thresholded Linear Regression. There are 14,980 data points in this dataset. This classification task is inherently hard, with an accuracy on completely clean data of only 65%.
Handwriting Recognition (MNIST): In this dataset, the task is to classify 60,000 images of handwritten digits into 10 categories with a one-to-all multiclass SVM classifier. The unique part of this dataset is that the featurized data consist of a 784-dimensional vector which includes edge detectors and raw image patches.
Dollars For Docs: The dataset has 240,089 records with 5 textual attributes and one numerical attribute. The dataset is featurized with a bag-of-words featurization model for the textual attributes, which results in a 2,021-dimensional feature vector, and a binary SVM is used to classify the status of the medical donations.
World Bank: The dataset has 193 records of country name, population, and various macro-economic statistics. The values are listed with the date at which they were acquired. This allowed us to determine that records from smaller and less populous countries were more likely to be out-of-date.

8.1.2 Compared Algorithms
Here are the alternative methodologies evaluated in the experiments:
Robust Logistic Regression [14]. Feng et al. proposed a variant of logistic regression that is robust to outliers. We chose this algorithm because it is a robust extension of the convex regularized loss model, leading to a better apples-to-apples comparison between the techniques. (See details in Appendix H.1.)
Discarding Dirty Data. As a baseline, dirty data are discarded.
SampleClean (SC) [33]. SampleClean takes a sample of data, applies data cleaning, and then trains a model to completion on the sample.
Active Learning (AL) [18].
To fairly evaluate Active Learning, we first apply our gradient update to ensure correctness. Within each iteration, examples are prioritized by distance to the decision boundary (called Uncertainty Sampling in [31]). However, we do not include our optimizations such as detection and estimation.
ActiveClean Oracle (AC+O). In ActiveClean Oracle, instead of an estimation and detection step, the true clean value is used, to evaluate the theoretical ideal performance of ActiveClean.

8.2 Does Data Cleaning Matter?
The first experiment evaluates the benefits of data cleaning on two of the example datasets (EEG and Adult). Our goal is to understand which types of data corruption are amenable to data cleaning and which are better suited for robust statistical techniques. The experiment compares four schemes: (1) full data cleaning, (2) a baseline of no cleaning, (3) discarding the dirty data, and (4) robust logistic regression. We corrupted 5% of the training examples in each dataset in two different ways:
Random Corruption: Simulated high-magnitude random outliers. 5% of the examples are selected at random, and a random feature is replaced with 3 times the highest feature value.
Systematic Corruption: Simulated innocuous-looking (but still incorrect) systematic corruption. The model is trained on the clean data, and the three most important features (highest weighted) are identified. The examples are sorted by each of these features, and the top examples are corrupted with the mean value for that feature (5% corruption in all). It is important to note that examples can have multiple corrupted features. A sketch of this corruption procedure appears below.
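For concreteness, here is our reconstruction of the systematic corruption (assumed details: NumPy arrays and a weight vector taken from the cleanly trained model):

```python
import numpy as np

def systematic_corrupt(X, weights, frac=0.05):
    # Overwrite the top examples on each of the three highest-weighted
    # features with that feature's mean, so the errors look innocuous
    # but are systematically biased.
    X = X.copy()
    n_per_feature = int(frac * len(X) / 3)
    top_features = np.argsort(-np.abs(weights))[:3]
    for f in top_features:
        victims = np.argsort(-X[:, f])[:n_per_feature]  # highest values of f
        X[victims, f] = X[:, f].mean()
    return X
```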

Figure 4: (a) Robust techniques and discarding data work when corrupted data are random and look atypical. (b) Data cleaning can provide reliable performance in both the systematically corrupted setting and the randomly corrupted setting.

Figure 4 shows the test accuracy for models trained on both types of data with the different techniques. The robust method performs well on the random high-magnitude outliers, with only a 2.0% reduction in clean test accuracy for EEG and a 2.5% reduction for Adult. In the random setting, discarding dirty data also performs relatively well. However, the robust method falters on the systematic corruption, with a 9.1% reduction in clean test accuracy for EEG and a 10.5% reduction for Adult. The problem is that, without cleaning, there is no way to know whether the corruption is random or systematic, and hence when to trust a robust method. While data cleaning requires more effort, it provides benefits in both settings. In the remaining experiments, unless otherwise noted, the experiments use systematic corruption.
Summary: A 5% systematic corruption can introduce a 10% reduction in test accuracy even when using a robust method.

8.3 ActiveClean: A Priori Detection
The next set of experiments evaluates different approaches to cleaning a sample of data compared to ActiveClean using a priori detection. A priori detection assumes that all of the corrupted records are known in advance but their clean values are unknown.

8.3.1 Active Learning and SampleClean
The next experiment evaluates the samples-to-error tradeoff between four alternative algorithms: ActiveClean (AC), SampleClean, Active Learning, and ActiveClean+Oracle (AC+O). Figure 5 shows the model error and test accuracy as a function of the number of cleaned records. In terms of model error, ActiveClean gives its largest benefits for small sample sizes. For 500 cleaned records of the Adult dataset, ActiveClean has 6.1x less error than SampleClean and 2.1x less error than Active Learning. For 500 cleaned records of the EEG dataset, ActiveClean has 9.6x less error than SampleClean and 2.4x less error than Active Learning. Both Active Learning and ActiveClean benefit from the initialization with the dirty model, as they do not retrain their models from scratch, and ActiveClean improves on this performance with detection and error estimation. Active Learning has no notion of dirty and clean data, and therefore prioritizes with respect to the dirty data. These gains in model error also correlate well with improvements in test error (defined as the test accuracy difference w.r.t. cleaning all data). The test error converges more quickly than the model error, emphasizing the benefits of progressive data cleaning, since it is not necessary to clean all the data to get a model with essentially the same performance as the clean model. For example, to achieve a test error of 1% on the Adult dataset, ActiveClean cleans 500 fewer records than Active Learning.
Summary: ActiveClean with a priori detection returns results that are more than 6x more accurate than SampleClean and 2x more accurate than Active Learning for cleaning 500 records.

Figure 5: The relative model error as a function of the number of examples cleaned. ActiveClean converges with a smaller sample size to the true result in comparison to Active Learning and SampleClean.

Figure 6: -D denotes no detection, and -D-I denotes no detection and no importance sampling. Both optimizations significantly help ActiveClean outperform SampleClean and Active Learning.

8.3.2 Source of Improvements
The next experiment compares the performance of ActiveClean with and without the various optimizations at the 500 records cleaned point.
ActiveClean without detection is denoted as AC-D (that is, at each iteration we sample from the entire dirty data), and ActiveClean without detection and without importance sampling is denoted as AC-D-I. Figure 6 plots the relative error of the alternatives and of ActiveClean with and without the optimizations. Without detection (AC-D), ActiveClean is still more accurate than Active Learning. Removing the importance sampling as well, ActiveClean is slightly worse than Active Learning on the Adult dataset but is comparable on the EEG dataset.
Summary: Both a priori detection and non-uniform sampling significantly contribute to the gains over Active Learning.

8.3.3 Mixing Dirty and Clean Data
Training a model on mixed data is an unreliable methodology lacking the same guarantees as Active Learning or SampleClean, even in the simplest of cases. For thoroughness, the next experiments include the model error as a function of records cleaned in comparison to ActiveClean. Figure 7 plots the same curves as the previous experiment, comparing ActiveClean, Active Learning, and two mixed-data algorithms. PC randomly samples data, cleans, and writes back the cleaned data. PC+D randomly samples data using the dirty-data detector, cleans, and writes back the cleaned data. For these errors, PC and PC+D give reasonable results

(not always guaranteed), but ActiveClean converges faster. ActiveClean tunes the weighting when averaging dirty and clean data into the gradient.

Figure 7: The relative model error as a function of the number of examples cleaned. ActiveClean converges with a smaller sample size to the true result in comparison to partial cleaning (PC, PC+D).

Summary: ActiveClean converges faster than mixing dirty and clean data, since it reweights data based on the fractions that are dirty and clean. Partial cleaning is not guaranteed to give sensible results.

8.3.4 Corruption Rate
The next experiment explores how much of the performance is due to the initialization with the dirty model (i.e., SampleClean trains a model "from scratch"). Figure 8 varies the systematic corruption rate and plots the number of records cleaned to achieve 1% relative error for SampleClean and ActiveClean. SampleClean does not use the dirty data, and thus its error is essentially governed by the Central Limit Theorem. SampleClean outperforms ActiveClean only when corruptions are very severe (45% in Adult and nearly 60% in EEG). When the initialization with the dirty model is inaccurate, ActiveClean does not perform as well.
Summary: SampleClean is beneficial in comparison to ActiveClean when corruption rates exceed 45%.

8.4 ActiveClean: Adaptive Detection
This experiment explores how the results of the previous experiment change when using an adaptive detector instead of the a priori detector. Recall that, in the systematic corruption, 3 of the most informative features were corrupted; thus we group these problems into 9 classes. We use an all-versus-one SVM to learn the categorization.

8.4.1 Basic Performance
Figure 9 overlays the convergence plots of the previous experiments with a curve (denoted by AC+C) that represents ActiveClean using a classifier instead of the a priori detection. Initially, ActiveClean is comparable to Active Learning; however, as the classifier becomes more effective, the detection improves the performance. Over both datasets, at the 500 records point on the curve, adaptive ActiveClean has a 30% higher model error compared to a priori ActiveClean. At the 1,000 records point on the curve, adaptive ActiveClean has about a 10% higher error.
Summary: For 500 records cleaned, adaptive ActiveClean has a 30% higher model error compared to a priori ActiveClean, but still outperforms Active Learning and SampleClean.

8.4.2 Classifiable Errors

Figure 8: ActiveClean performs well until the corruption is so severe that the dirty model is not a good initialization. The error of SampleClean does not depend on the corruption rate, so it is a vertical line.

Figure 9: Even with a classifier, ActiveClean converges faster than Active Learning and SampleClean.

The adaptive case depends on being able to predict corrupted records. For example, random corruption not correlated with any other data features may be hard to learn. As corruption becomes more random, the classifier becomes increasingly erroneous. The next experiment explores making the systematic corruption more random. Instead of selecting the highest-valued records for the most valuable features, we corrupt random records with probability p. We compare these results to AC-D, where we do not have a detector at all, at one vertical slice of the previous plot (cleaning 1,000 records). Figure 10a plots the relative error reduction using a classifier. When the corruption is about 50% random, there is a break-even point where no detection is better.
The classifier is imperfect and misclassifies some dirty data points as clean.
Summary: When errors are increasingly random (50% or more random) and cannot be accurately classified, adaptive detection provides no benefit over no detection.

Figure 10: (a) Data corruptions that are less random are easier to classify, and lead to more significant reductions in relative model error. (b) The Taylor series approximation gives more accurate estimates when the amount of cleaned data is small.

8.5 Estimation
The next experiment compares estimation techniques: (1) "linear regression" trains a linear regression model that predicts the clean gradient as a function of the dirty gradient, (2) "average gradient" does not use the detection to inform

8.5 Estimation

The next experiment compares estimation techniques: (1) "linear regression" trains a linear regression model that predicts the clean gradient as a function of the dirty gradient, (2) "average gradient", which does not use the detection to inform how to apply the estimate, (3) "average feature change", which uses detection but no linearization, and (4) the Taylor series linear approximation. Figure 10b measures how accurately each estimation technique estimates the gradient as a function of the number of cleaned records on the EEG dataset. Estimation error is measured using the relative L2 error with the true gradient. The proposed Taylor series approximation gives more accurate estimates for small cleaning sizes. Linear regression and the average feature change technique do eventually perform comparably, but only after cleaning much more data.

Summary: Linearized gradient estimates are more accurate when estimated from small samples.

8.6 Real World Scenarios

The next set of experiments evaluates ActiveClean in three real-world scenarios: one demonstrating the a priori case and the other two the adaptive detection case.

A Priori: Constraint Cleaning

The first scenario explores the Dollars for Docs dataset published by ProPublica, described throughout the paper. To run this experiment, the entire dataset was cleaned up front, and we simulated sampling from the dirty data and cleaning by looking up the value in the cleaned data (see Appendix I for constraints, errors, and cleaning methodology). Figure 11a shows that ActiveClean converges faster than Active Learning and SampleClean. To achieve a 4% relative error (i.e., a 75% error reduction from the dirty model), ActiveClean cleans 40,000 fewer records than Active Learning. Also, for 10,000 records cleaned, ActiveClean has nearly an order of magnitude smaller error than SampleClean.

Figure 11: (a) The relative model error as a function of the number of cleaned records. (b) The true positive rate as a function of the number of cleaned records.

Figure 11b shows the detection rate (the fraction of disallowed research contributions identified) of the classifier as a function of the number of records cleaned. On the dirty data, we can only correctly classify 66% of the suspected examples (88% overall accuracy due to a class imbalance). On the cleaned data, this classifier is nearly perfect, with a 97% true positive rate (98% overall accuracy). ActiveClean converges to the cleaned accuracy faster than the alternatives, with a classifier of 92% true positive rate for only 10,000 records cleaned.

Summary: To achieve an 80% detection rate, ActiveClean cleans nearly 10x fewer records than Active Learning.

Adaptive: Replacing Corrupted Data

The next experiment explores the MNIST handwritten digit recognition dataset with a MATLAB image processing pipeline. In this scenario, the analyst must inspect a potentially corrupted image and replace it with a higher quality one. The MNIST dataset consists of 64x64 grayscale images. There are two types of simulated corruptions (sketched in the code below): (1) 5x5 block removal, where a random 5x5 block is removed from the image by setting its pixel values to 0, and (2) Fuzzy, where a 4x4 moving average patch is applied over the entire image. These corruptions are applied to a random 5% of the images, and mimic the random (Fuzzy) vs. systematic (5x5 removal) corruption studied in the previous experiments. The adaptive detector uses a 10-class classifier (one for each digit) to detect the corruption. Figure 12 shows that ActiveClean makes more progress towards the clean model with a smaller number of examples cleaned.

Figure 12: In a real adaptive detection scenario with the MNIST dataset, ActiveClean outperforms Active Learning and SampleClean.
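For concreteness, the following numpy sketch (ours; the paper generated its corruptions in MATLAB, see Appendix J) imitates the two corruptions:

```python
import numpy as np

def block_removal(img, k=5, rng=None):
    """Remove a random k x k block by setting its pixel values to 0."""
    rng = rng or np.random.default_rng(0)
    out = img.copy()
    r = rng.integers(0, img.shape[0] - k + 1)
    c = rng.integers(0, img.shape[1] - k + 1)
    out[r:r + k, c:c + k] = 0
    return out

def fuzzy(img, k=4):
    """Apply a k x k moving-average patch over the entire image."""
    pad = np.pad(img, k // 2, mode="edge")
    out = np.empty(img.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = pad[i:i + k, j:j + k].mean()
    return out

img = np.random.default_rng(1).random((64, 64))   # stand-in grayscale image
blocked, fuzzed = block_removal(img), fuzzy(img)
```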
To achieve a 2% error for the block removal, ActiveClean can inspect 2,200 fewer images than Active Learning and 2,750 fewer images than SampleClean. For the fuzzy images, both Active Learning and ActiveClean reach 2% error after cleaning fewer than 100 images, while SampleClean requires 1,750.

Summary: In the MNIST dataset, ActiveClean significantly reduces (by more than 2x) the number of images to clean to train a model with 2% error.

Adaptive: Regression

In the prior two experiments, we explored classification problems. In this experiment, we consider the case when the convex model represents a linear regression model. Regression models allow us to visualize what is happening when we apply ActiveClean. In Figure 13, we illustrate regression model training on a small dataset of 193 countries collected from the World Bank. Each country has an associated population and total dollar value of imports. We are interested in examining the relationship between these variables. However, for some countries, the import values are out-of-date in the World Bank dataset. Up-to-date values are usually available on national statistics websites and can be determined with some web searching. It turns out that smaller countries were more likely to have out-of-date statistics in the World Bank dataset, and as a result, the trend line is misleading in the dirty data. We applied ActiveClean after verifying 30 out of the 193 countries (marked in yellow), and found that we could achieve a highly accurate approximation of the full result.

Figure 13: World Bank Data. We apply ActiveClean to learn an accurate model predicting population from import values. The data has a systematic bias where small countries have out-of-date import values.

Summary: ActiveClean is accurate even in regression analytics.
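As a toy illustration (synthetic numbers of our own, not the World Bank data), a few systematically understated values among the smaller entries are enough to tilt a least-squares trend line:

```python
import numpy as np

rng = np.random.default_rng(1)
pop = rng.lognormal(mean=2.0, sigma=1.0, size=193)      # stand-in for population
imports = 3.0 * pop + rng.normal(scale=0.5, size=193)   # true linear relationship

dirty = imports.copy()
small = pop < np.median(pop)        # smaller countries have out-of-date values
dirty[small] *= 0.3                 # systematically understated imports

slope_dirty = np.polyfit(pop, dirty, 1)[0]
slope_clean = np.polyfit(pop, imports, 1)[0]
print(slope_dirty, slope_clean)     # the dirty trend line is noticeably off
```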

9. RELATED WORK

Data Cleaning: When data cleaning is expensive, it is desirable to apply it progressively, where analysts can inspect early results with only k ≪ N records cleaned. Progressive data cleaning is a well-studied problem, especially in the context of entity resolution [6, 34, 30, 17]. Prior work has focused on the problem of designing data structures and algorithms to apply data cleaning progressively, which is challenging because many data cleaning algorithms require information from the entire relation. Over the last 5 years, a number of new results have expanded the scope and practicality of progressive data cleaning [26, 38, 37]. ActiveClean studies the problem of prioritizing progressive cleaning by leveraging information about a user's subsequent use for the data. Certain records, if cleaned, may be more likely to affect the downstream analysis.

There are a number of other works that use machine learning to improve the efficiency and/or reliability of data cleaning [38, 37, 16]. For example, Yakout et al. train a model that evaluates the likelihood of a proposed replacement value [37]. Another application of machine learning is value imputation, where a missing value is predicted based on those records without missing values. Machine learning is also increasingly applied to make automated repairs more reliable with human validation [38]. Human input is often expensive and impractical to apply to entire large datasets. Machine learning can extrapolate rules from a small set of examples cleaned by a human (or humans) to uncleaned data [16, 38]. This approach can be coupled with active learning [27] to learn an accurate model with the fewest possible number of examples. While, in spirit, ActiveClean is similar to these approaches, it addresses a very different problem of data cleaning before user-specified modeling. The key new challenge in this problem is ensuring the correctness of the user's model after partial data cleaning.

SampleClean [33] applies data cleaning to a sample of data, and estimates the results of aggregate queries. Sampling has also been applied to estimate the number of duplicates in a relation [19]. Similarly, Bergman et al. explore the problem of query-oriented data cleaning [7], where, given a query, they clean data relevant to that query. Existing work does not explore cleaning driven by the downstream machine learning "queries" studied in this work. Deshpande et al. studied data acquisition in sensor networks [12]. They explored value-of-information-based prioritization of data acquisition for estimating aggregate queries of sensor readings. Similarly, Jeffery et al. [21] explored similar prioritization based on value of information. We see this work as pushing prioritization further down the pipeline to the end analytics. Finally, incremental optimization methods like SGD have a connection to incremental materialized view maintenance, as the argument for incremental maintenance over recomputation is similar (i.e., relatively sparse updates). Krishnan et al. explored how samples of materialized views can be maintained, similar to how models are updated with a sample of clean data in this work [24].

Stochastic Optimization and Active Learning: Zhao and Zhang recently proposed using importance sampling in conjunction with stochastic gradient descent [39]. The ideas applied in ActiveClean are well rooted in the Machine Learning and Optimization literature, and we apply these ideas to the data cleaning problem. This line of work builds on prior results in linear algebra that show that some matrix columns are more informative than others [13], and Active Learning, which shows that some labels are more informative than others [31]. Active Learning largely studies the problem of label acquisition [31], and recently the links between Active Learning and Stochastic Optimization have been studied [18].
We use the work in Guillory et al. [18] to evaluate a state-of-the-art Active Learning technique against ActiveClean.

Transfer Learning and Bias Mitigation: ActiveClean has a strong link to a field called Transfer Learning and Domain Adaptation [29]. The basic idea of Transfer Learning is that suppose a model is trained on a dataset D but tested on a dataset D′. Much of the complexity and contribution of ActiveClean comes from efficiently tuning such a process for expensive data cleaning applications; such costs are not studied in Transfer Learning. In robotics, Mahler et al. explored a calibration problem in which data was systematically corrupted [25] and proposed a rule-based technique for cleaning data. Other problems in bias mitigation (e.g., Krishnan et al. [23]) have the same structure: systematically corrupted data that is feeding into a model. In this work, we try to generalize these principles given a general dirty dataset, convex model, and data cleaning procedure.

Secure Learning: ActiveClean is also related to work in adversarial learning [28], where the goal is to make models robust to adversarial data manipulation. This line of work has extensively studied methodologies for making models private to external queries and robust to malicious labels [36], but the data cleaning problem explores more general corruptions than just malicious labels. One widely applied technique in this field is reject-on-negative-impact, which essentially discards data that reduces the loss function; this will not work when we do not have access to the true loss function (only the "dirty loss").

10. DISCUSSION AND FUTURE WORK

The experimental results suggest the following conclusions about ActiveClean: (1) when the data corruption rate is relatively small (e.g., 5%), ActiveClean cleans fewer records than Active Learning or SampleClean to achieve the same model accuracy; (2) all of the optimizations in ActiveClean (importance sampling, detection, and estimation) lead to significantly more accurate models at small sample sizes; (3) only when corruption rates are very severe (e.g., 50%) does SampleClean outperform ActiveClean; and (4) two real-world scenarios demonstrate similar accuracy improvements, where ActiveClean returns significantly more accurate models than SampleClean or Active Learning for the same number of records cleaned.

There are also a few additional points for discussion. ActiveClean provides guarantees for training error on models trained with progressive data cleaning; however, there are no such guarantees on test error.

This work focuses on the problem where an analyst has a large amount of dirty data and would like to explore data cleaning and predictive models on this dataset. By providing the analyst more accurate model estimates, the value of different data cleaning techniques can be judged without having to clean the entire dataset. However, the exploratory analysis problem is distinct from the model deployment problem (i.e., serving predictions to users from the model), which we hope to explore in more detail in future work. It implicitly assumes that when the model is deployed, it will be applied in a setting where the test data is also clean. Training on clean data and testing on dirty data defeats the purpose of data cleaning and can lead to unreliable predictions.

As the experiments clearly show, ActiveClean is not strictly better than Active Learning or SampleClean. ActiveClean is optimized for a specific design point of sparse errors and small sample sizes, and the empirical results suggest it returns more accurate models in this setting. As sample sizes and error rates increase, the benefits of ActiveClean are reduced. Another consideration for future work is automatically selecting alternative techniques when ActiveClean is expected to perform poorly.

Beyond these limitations, there are several exciting new avenues for future work. The data cleaning models explored in this work can be extended to handle non-uniform costs, where different errors have a different cleaning cost. Next, the empirical success of Deep Learning has led to increasing industry and research adoption of non-convex losses in many tasks that were traditionally served by convex models. In future work, we hope to explore how we can integrate with such frameworks.

11. CONCLUSION

The growing popularity of predictive models in data analytics adds additional challenges in managing dirty data. Progressive data cleaning in this setting is susceptible to errors due to mixing dirty and clean data, sensitivity to sample size, and the sparsity of errors. The key insight of ActiveClean is that an important class of predictive models, called convex loss models (e.g., linear regression and SVMs), can be simultaneously trained and cleaned. Consequently, there are provable guarantees on the convergence and error bounds of ActiveClean. ActiveClean also includes numerous optimizations, such as: using the information from the model to inform data cleaning on samples, dirty data detection to avoid sampling clean data, and batching updates. The experimental results are promising, as they suggest that these optimizations can significantly reduce data cleaning costs when errors are sparse and cleaning budgets are small. Techniques such as Active Learning and SampleClean are not optimized for the sparse low-budget setting, and ActiveClean achieves models of similar accuracy for significantly fewer records cleaned.

This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc., Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware.

12. REFERENCES

[1] Berkeley data analytics stack.
[2] Dollars for docs.
[3] For big-data scientists, janitor work is key hurdle to insights.
[4] A pharma payment a day keeps docs finances okay. a-pharma-payment-a-day-keeps-docs-fnances-ok.
[5] A. Alexandrov, R. Bergmann, S. Ewen, J. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinländer, M. J. Sax, S. Schelter, M. Höger, K. Tzoumas, and D. Warneke. The stratosphere platform for big data analytics.
VLDB J., 23(6), 2014.
[6] Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to relational entity resolution. PVLDB, 7(11), 2014.
[7] M. Bergman, T. Milo, S. Novgorodov, and W. C. Tan. Query-oriented data cleaning with oracles. In SIGMOD Conference, 2015.
[8] D. P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. CoRR, abs/1507.01030, 2015.
[9] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade - Second Edition, 2012.
[10] A. Crotty, A. Galakatos, and T. Kraska. Tupleware: Distributed machine learning on small clusters. IEEE Data Eng. Bull., 37(3), 2014.
[11] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. JMLR, 13, 2012.
[12] A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004.
[13] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. JMLR, 13, 2012.
[14] J. Feng, H. Xu, S. Mannor, and S. Yan. Robust logistic regression and classification. In NIPS, 2014.
[15] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics. Springer, Berlin, 2001.
[16] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In SIGMOD, 2014.
[17] A. Gruenheid, X. L. Dong, and D. Srivastava. Incremental record linkage. PVLDB, 7(9), 2014.
[18] A. Guillory, E. Chastain, and J. Bilmes. Active learning as non-convex optimization. In International Conference on Artificial Intelligence and Statistics, 2009.
[19] A. Heise, G. Kasneci, and F. Naumann. Estimating the number and sizes of fuzzy-duplicate clusters. In CIKM Conference, 2014.
[20] Google Inc. TensorFlow.
[21] S. R. Jeffery, G. Alonso, M. J. Franklin, W. Hong, and J. Widom. Declarative support for sensor data cleaning. In Pervasive Computing, 2006.
[22] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer. Enterprise data analysis and visualization: An interview study. IEEE Trans. Vis. Comput. Graph., 18(12), 2012.
[23] S. Krishnan, J. Patel, M. J. Franklin, and K. Goldberg. A methodology for learning, analyzing, and mitigating social influence bias in recommender systems. In RecSys, 2014.
[24] S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, and T. Kraska. Stale view cleaning: Getting fresh answers from stale materialized views. PVLDB, 8(12), 2015.
[25] J. Mahler, S. Krishnan, M. Laskey, S. Sen, A. Murali, B. Kehoe, S. Patil, J. Wang, M. Franklin, P. Abbeel, and K. Y. Goldberg. Learning accurate kinematic control of cable-driven surgical robots using data cleaning and gaussian process regression. In CASE, 2014.
[26] C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD Conference, 2010.
[27] B. Mozafari, P. Sarkar, M. J. Franklin, M. I. Jordan, and S. Madden. Scaling up crowd-sourcing to very large datasets: A case for active learning. PVLDB, 8(2), 2014.
[28] B. Nelson, B. I. P. Rubinstein, L. Huang, A. D. Joseph, S. J. Lee, S. Rao, and J. D. Tygar. Query strategies for evading convex-inducing classifiers. JMLR, 13, 2012.
[29] S. J. Pan and Q. Yang. A survey on transfer learning. TKDE, 22(10), 2010.
[30] T. Papenbrock, A. Heise, and F. Naumann. Progressive duplicate detection. IEEE Trans. Knowl. Data Eng., 27(5), 2015.

[31] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52:11, 2010.
[32] E. H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B (Methodological), 1951.
[33] J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In SIGMOD Conference, 2014.
[34] S. E. Whang and H. Garcia-Molina. Incremental entity resolution on rules and data. VLDB J., 23(1), 2014.
[35] C. Woolston. Gender-disparity study faces attack. gender-dsparty-study-faces-attack.
[36] H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli. Is feature selection secure against training data poisoning? In ICML, 2015.
[37] M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD Conference, 2013.
[38] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5), 2011.
[39] P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In ICML, 2015.

APPENDIX

A. SET-OF-RECORDS CLEANING MODEL

In the paper, we formalized the analyst-specified data cleaning as follows. We take the sample of the records S_dirty and apply data cleaning C(·). C is applied to a record and produces the clean record:

S_clean = {C(r) : r ∈ S_dirty}

The record-by-record cleaning model is a formalization of the costs of data cleaning where each record has the same cost to clean and this cost does not change throughout the entire cleaning session. There are, however, some cases when cleaning the first record of a certain type of corruption is expensive but all subsequent records are cheaper.

EXAMPLE 6. In most spell checking systems, when a misspelling is identified, the system gives an option to fix all instances of that misspelling.

EXAMPLE 7. When an inconsistent value is identified, all other records with the same inconsistency can be efficiently fixed.

This model of data cleaning can fit into our framework, and we formalize it as the "Set-of-Records" model as opposed to the "Record-by-Record" model. In this model, the cleaning function C(·) is not restricted to updating only the records in the sample. C(·) takes the entire dirty sample as an argument (that is, the cleaning is a function of the sample) and the dirty data, and updates the entire dirty data:

R_dirty = C(S_dirty, R_dirty)

We require that for every record s ∈ S_dirty, that record is completely cleaned after applying C(·), giving us S_clean. Records outside of S_dirty may be cleaned on a subset of dirty attributes by C(·). After each iteration, we re-run the detector, and move any r ∈ R_dirty that are clean to R_clean. Such a model allows us to capture data cleaning operations such as in Example 6 and Example 7.
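As a rough sketch of the Set-of-Records model (the function name and data are illustrative, not from the paper's implementation), a single analyst decision can be propagated to every record sharing the same inconsistency, in the spirit of Examples 6 and 7:

```python
def propagate_fixes(r_dirty, fix_rules):
    """Set-of-Records cleaning: one analyst decision (e.g., merging the
    spellings of a corporation) is applied to every matching record."""
    return [{k: fix_rules.get(v, v) for k, v in rec.items()} for rec in r_dirty]

fixes = {"Pfizer Inc.": "Pfizer", "Pfizer Incorporated": "Pfizer"}
records = [{"corporation": "Pfizer Inc.", "amount": 100},
           {"corporation": "Pfizer Incorporated", "amount": 250}]
print(propagate_fixes(records, fixes))
```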
B. STOCHASTIC GRADIENT DESCENT

Stochastic Gradient Descent converges for a suitably chosen step size if the sample gradients are unbiased estimates of the full gradient. The first problem is to choose the weights α and β (to average already clean and newly cleaned data) such that the estimate of the gradient is unbiased. The batch S_dirty is drawn only from R_dirty. Since the sizes of R_dirty and its complement are known, it follows that the gradient over the already clean data g_C and the recently cleaned data g_S can be combined as follows:

g(θ^(t)) = (|R_dirty| · g_S + |R_clean| · g_C) / |R|

Therefore, α = |R_clean| / |R| and β = |R_dirty| / |R|.

LEMMA 1. The gradient estimate g(θ) is unbiased if g_S is an unbiased estimate of (1 / |R_dirty|) · Σ_{i ∈ R_dirty} ∇φ_i(θ).

PROOF SKETCH. By linearity of expectation, E(g(θ)) = (|R_dirty| · E(g_S) + |R_clean| · g_C) / |R|. Substituting the unbiased estimate E(g_S) = (1 / |R_dirty|) · Σ_{i ∈ R_dirty} ∇φ_i(θ) makes the right-hand side exactly the average gradient over all of R, so E(g(θ)) equals the full gradient.

The error bound discussed in Proposition 2 can be tightened for a class of models called strongly convex (see [8] for a definition).

PROPOSITION 1. For a strongly convex loss, a batch size b, and T iterations, the convergence rate is bounded by O(σ² / (bT)).
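The following is a minimal Python sketch of this reweighted update, assuming `grad` computes a per-record gradient and that S_dirty is an already-cleaned, uniformly drawn batch from R_dirty; importance weights and the detector are omitted for brevity:

```python
import numpy as np

def combined_step(theta, grad, R_clean, R_dirty, S_dirty, lr=0.1):
    """One reweighted update: combine the gradient over already-clean data
    (g_C) with the gradient over the newly cleaned sample (g_S), weighted by
    alpha = |R_clean|/|R| and beta = |R_dirty|/|R| so the estimate is unbiased."""
    n_c, n_d = len(R_clean), len(R_dirty)
    n = n_c + n_d
    g_C = np.mean([grad(theta, r) for r in R_clean], axis=0) if n_c else 0.0
    g_S = np.mean([grad(theta, r) for r in S_dirty], axis=0)
    g = (n_c / n) * g_C + (n_d / n) * g_S
    return theta - lr * g
```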

C. NON-CONVEX LOSSES

We acknowledge that there is an increasing popularity of non-convex losses in the Neural Network and Deep Learning literature. However, even for these losses, gradient descent techniques still apply. Instead of converging to a global optimum, they converge to a locally optimal value. Likewise, ActiveClean will converge to the closest locally optimal value to the dirty model. Because of this, it is harder to reason about the results. Different initializations will lead to different local optima, and this introduces a complex dependence on the initialization with the dirty model. This problem is not fundamental to ActiveClean; any gradient technique suffers this challenge for general non-convex losses, and we hope to explore this more in the future.

D. IMPORTANCE SAMPLING

This lemma describes the optimal distribution over a set of scalars:

LEMMA 2. Given a set of real numbers A = {a_1, ..., a_n}, let Â be a sample with replacement of A of size k. If μ̂ is the mean of Â, the sampling distribution that minimizes the variance of μ̂, i.e., the expected square error, is p(a_i) ∝ |a_i|.

Lemma 2 shows that when estimating a mean of numbers with sampling, the distribution with optimal variance is sampling proportionally to the values. The variance of this estimate is given by:

Var(μ̂) = E(μ̂²) − E(μ̂)²

Since the estimate is unbiased, we can replace E(μ̂) with the average of A:

Var(μ̂) = E(μ̂²) − Ā²

Since Ā is deterministic, we can remove that term during minimization. Furthermore, we can write E(μ̂²) as:

E(μ̂²) = (1/n²) · Σ_i a_i² / p_i

Then, we can solve the following optimization problem (removing the proportionality constant 1/n²) over the set of weights P = {p(a_i)}:

min_P Σ_{i=1}^{n} a_i² / p_i   subject to: p_i > 0, Σ_i p_i = 1

Applying Lagrange multipliers, an equivalent unconstrained optimization problem is:

min_{P > 0, λ > 0} Σ_{i=1}^{n} a_i² / p_i + λ · (Σ_i p_i − 1)

If we take the derivatives with respect to p_i and set them equal to zero:

−a_i² / p_i² + λ = 0

If we take the derivative with respect to λ and set it equal to zero:

Σ_i p_i − 1 = 0

Solving the system of equations, we get:

p_i = |a_i| / Σ_j |a_j|
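A small numerical check of Lemma 2 (our own illustration, not from the paper): sampling proportionally to |a_i|, with each draw reweighted by 1/(n·p_i) to keep the estimator unbiased, yields a lower-variance mean estimate than uniform sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
p_opt = np.abs(a) / np.abs(a).sum()        # Lemma 2: p(a_i) proportional to |a_i|
p_uni = np.full(len(a), 1.0 / len(a))

def variance_of_estimate(p, k=50, trials=2000):
    idx = rng.choice(len(a), size=(trials, k), p=p)
    # Each draw is reweighted by 1/(n * p_i) so the mean estimate stays unbiased.
    estimates = (a[idx] / (len(a) * p[idx])).mean(axis=1)
    return estimates.var()

print(variance_of_estimate(p_opt), variance_of_estimate(p_uni))
```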

E. LINEARIZATION

If d is the dirty value and c is the clean value, the Taylor series approximation for a function f is given as follows:

f(c) = f(d) + f′(d) · (d − c) + …

Ignoring the higher-order terms, the linear term f′(d) · (d − c) is a linear function in each feature and label. We only have to know the change in each feature to estimate the change in value. In our case, the function f is the gradient ∇φ. So, the resulting linearization is:

∇φ(x^(c), y^(c), θ) ≈ ∇φ(x, y, θ) + (∂/∂X) ∇φ(x, y, θ) · (x − x^(c)) + (∂/∂Y) ∇φ(x, y, θ) · (y − y^(c))

When we take the expected value:

E(∇φ(x_clean, y_clean, θ)) ≈ ∇φ(x, y, θ) + (∂/∂X) ∇φ(x, y, θ) · E(Δx) + (∂/∂Y) ∇φ(x, y, θ) · E(Δy)

It follows that:

E(∇φ(x_clean, y_clean, θ)) ≈ ∇φ(x, y, θ) + M_x · E(Δx) + M_y · E(Δy)

where M_x = (∂/∂X) ∇φ and M_y = (∂/∂Y) ∇φ. Recall that the feature space is d-dimensional and the label space is l-dimensional. Then, M_x is a d × d matrix, and M_y is a d × l matrix. Both of these matrices are computed for each record. Δx is a d-dimensional vector where each component represents a change in that feature, and Δy is an l-dimensional vector that represents the change in each of the labels.

This linearization allows ActiveClean to maintain per-feature (or per-label) average changes and use these changes to center the optimal sampling distribution around the expected clean value. To estimate E(Δx) and E(Δy), consider the following for a single feature i. If we average all j = {1, ..., K} records cleaned that have an error for that feature, weighted by their sampling probability:

Δ̄x[i] = (1 / (N·K)) · Σ_{j=1}^{K} (x_j^(d)[i] − x_j^(c)[i]) · (1 / p(j))

Similarly, for a label i:

Δ̄y[i] = (1 / (N·K)) · Σ_{j=1}^{K} (y_j^(d)[i] − y_j^(c)[i]) · (1 / p(j))

Each Δ̄x and Δ̄y represents an average change in a single feature. A single vector can represent the necessary changes to apply to a record r. For a record r, let f_r and l_r be its sets of corrupted features and labels. Then, each record r has a d-dimensional vector Δ_rx, which is constructed as follows:

Δ_rx[i] = 0 if i ∉ f_r, and Δ̄x[i] if i ∈ f_r

Each record r also has an l-dimensional vector Δ_ry, which is constructed as follows:

Δ_ry[i] = 0 if i ∉ l_r, and Δ̄y[i] if i ∈ l_r

Finally, the result is:

p_r ∝ ‖∇φ(x, y, θ^(t)) + M_x · Δ_rx + M_y · Δ_ry‖

F. EXAMPLE M_X, M_Y

Linear Regression:

∇φ(x, y, θ) = (θᵀx − y) · x

For a record r, suppose we have a feature vector x. If we take the partial derivatives with respect to x, M_x is:

M_x[i, i] = 2θ[i]x[i] + Σ_{j≠i} θ[j]x[j] − y
M_x[i, j] = θ[j]x[i]

Similarly, M_y is:

M_y[i, 1] = −x[i]

Logistic Regression:

∇φ(x, y, θ) = (h(θᵀx) − y) · x

where we can rewrite this as:

h(z) = 1 / (1 + e^(−z)),  h_θ(x) = h(θᵀx),  ∇φ(x, y, θ) = (h_θ(x) − y) · x

In component form,

g = ∇φ(x, y, θ),  g[i] = h_θ(x) · x[i] − y · x[i]

Therefore,

M_x[i, i] = h_θ(x) · (1 − h_θ(x)) · θ[i] · x[i] + h_θ(x) − y
M_x[i, j] = h_θ(x) · (1 − h_θ(x)) · θ[j] · x[i]
M_y[i, 1] = −x[i]

SVM:

∇φ(x, y, θ) = −y · x if y · xᵀθ ≤ 1, and 0 if y · xᵀθ > 1

Therefore,

M_x[i, i] = −y if y · xᵀθ ≤ 1, and 0 if y · xᵀθ > 1
M_x[i, j] = 0
M_y[i, 1] = −x[i] if y · xᵀθ ≤ 1, and 0 if y · xᵀθ > 1
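As an illustrative sketch (ours, following the Δ convention of Appendix E), the linearized clean-gradient estimate for the linear regression case can be computed as follows, where `dx` and `dy` are the average feature and label changes:

```python
import numpy as np

def linearized_clean_gradient(x, y, theta, dx, dy):
    """Estimate the clean gradient for the squared loss (theta^T x - y) x from
    the dirty record plus the average changes dx, dy (Appendix E convention)."""
    d = len(x)
    g = (theta @ x - y) * x                 # gradient on the dirty record
    Mx = np.outer(x, theta)                 # M_x[i, j] = theta[j] * x[i]
    # Diagonal: 2*theta[i]*x[i] + sum over j != i of theta[j]*x[j] - y,
    # which simplifies to theta[i]*x[i] + theta @ x - y.
    Mx[np.diag_indices(d)] = theta * x + (theta @ x) - y
    My = -x.reshape(d, 1)                   # M_y[i, 1] = -x[i]
    return g + Mx @ dx + (My @ np.atleast_1d(dy))
```

The sampling priority p_r for a record is then proportional to the norm of this estimated gradient.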

G. AGGREGATE QUERIES AS CONVEX LOSSES

G.1 AVG and SUM queries

avg and sum queries are a special case of the convex loss minimization discussed in the paper. If we define the following loss, it is easy to verify that the optimal θ is the mean μ:

φ = (x − θ)²

With the appropriate scaling, it can support avg and sum queries with and without predicates. Taking the gradient of that loss:

∇φ = −2(x − θ)

It is also easy to verify that the bound on errors is O(E((x − μ)²) / (bT)), which is essentially the CLT. The importance sampling results are intuitive as well. Applying the linearization:

M_x = −2

The importance sampling prioritizes points that it expects to be far away from the mean.

G.2 MEDIAN

Similarly, we can analyze the median query. If we define the following loss, it is easy to verify that the optimal θ is the median m:

φ = |x − θ|

Taking the gradient of that loss:

∇φ = 1 if x < m, and −1 if x > m

Applying the linearization:

M_x = 0

The intuitive result is that a robust query like a median does not need to consider estimation, as the query result is robust to small changes.

H. EXPERIMENTAL COMPARISON

H.1 Robust Logistic Regression

We use the algorithm from Feng et al. for robust logistic regression.

1. Input: contaminated training samples {(x_1, y_1), ..., (x_n, y_n)}, an upper bound on the number of outliers n_1, the number of inliers n_2, and the sample dimension p.
2. Initialization: set T = 4 log p/n + log n/n.
3. Remove samples (x_i, y_i) whose magnitude satisfies ‖x_i‖ ≥ T.
4. Solve the regularized logistic regression problem.

I. DOLLARS FOR DOCS SETUP

The Dollars for Docs dataset has the following schema:

Contribution(pi_specialty, drug_name, device_name, corporation, amount, dispute, status)

To flag suspect donations, we used the status attribute. When the status was "covered", it was an allowed contribution under the researcher's declared protocol. When the status was "non-covered", it was a disallowed contribution under the researcher's declared protocol. The rest of the textual attributes were featurized with a bag-of-words model, and the numerical amount and dispute attributes were treated as numbers.

We cleaned the entire Dollars for Docs dataset upfront to be able to evaluate how different budgeted data cleaning strategies compare to cleaning the full data. To clean the dataset, we loaded the entire data, 240,089 records, into Microsoft Excel. We identified four broad classes of errors:

- Corporations are inconsistently represented: "Pfizer", "Pfizer Inc.", "Pfizer Incorporated".
- Drugs are inconsistently represented: "TAXOTERE DOCETAXEL -PROSTATE CANCER" and "TAXOTERE".
- Labels of covered and not covered are not consistent: "No", "Yes", "N", "This study is not supported", "None", "Combination".
- The research subject must be a drug OR a medical device and not both: "BIO FLU QPAN H7N9AS3 Vaccine" and "BIO FLU QPAN H7N9AS3 Device".

To fix these errors, we sorted by each column, merged values that looked similar, and removed inconsistencies as in the status labels. When there were ambiguities, we referred to the drug company's website and whitepapers. When possible, we used batch data transformations, like find-and-replace (i.e., the Set-of-Records model). In all, a substantial number of records had some error, and full data cleaning required about 2 days of effort.

Once cleaned, in our experiment, we encoded the 4 problems as data quality constraints. To fix the constraints, we looked up the clean value in the dataset that we cleaned up front.

- Rule 1: Matching dependency on corporation (weighted Jaccard similarity > 0.8).
- Rule 2: Matching dependency on drug (weighted Jaccard similarity > 0.8).
- Rule 3: The label must either be "covered" or "not covered".
- Rule 4: Either drug or medical device should be null.

J. MNIST SETUP

We include a visualization of the errors that we generated for the MNIST experiment. We generated these errors in MATLAB by taking the grayscale version of the image (a matrix) and corrupting it with block removal and fuzzying.

Figure 14: We experiment with two forms of corruption in the MNIST image datasets: 5x5 block removal and making the images fuzzy. Image (a) shows an uncorrupted "9", image (b) shows one corrupted with block removal, and image (c) shows one that is corrupted with fuzziness.


More information

An Empirical Study of Search Engine Advertising Effectiveness

An Empirical Study of Search Engine Advertising Effectiveness An Emprcal Study of Search Engne Advertsng Effectveness Sanjog Msra, Smon School of Busness Unversty of Rochester Edeal Pnker, Smon School of Busness Unversty of Rochester Alan Rmm-Kaufman, Rmm-Kaufman

More information

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Dropout: A Simple Way to Prevent Neural Networks from Overfitting Journal of Machne Learnng Research 15 (2014) 1929-1958 Submtted 11/13; Publshed 6/14 Dropout: A Smple Way to Prevent Neural Networks from Overfttng Ntsh Srvastava Geoffrey Hnton Alex Krzhevsky Ilya Sutskever

More information

Traffic State Estimation in the Traffic Management Center of Berlin

Traffic State Estimation in the Traffic Management Center of Berlin Traffc State Estmaton n the Traffc Management Center of Berln Authors: Peter Vortsch, PTV AG, Stumpfstrasse, D-763 Karlsruhe, Germany phone ++49/72/965/35, emal [email protected] Peter Möhl, PTV AG,

More information

IMPACT ANALYSIS OF A CELLULAR PHONE

IMPACT ANALYSIS OF A CELLULAR PHONE 4 th ASA & μeta Internatonal Conference IMPACT AALYSIS OF A CELLULAR PHOE We Lu, 2 Hongy L Bejng FEAonlne Engneerng Co.,Ltd. Bejng, Chna ABSTRACT Drop test smulaton plays an mportant role n nvestgatng

More information

Quantization Effects in Digital Filters

Quantization Effects in Digital Filters Quantzaton Effects n Dgtal Flters Dstrbuton of Truncaton Errors In two's complement representaton an exact number would have nfntely many bts (n general). When we lmt the number of bts to some fnte value

More information

ECE544NA Final Project: Robust Machine Learning Hardware via Classifier Ensemble

ECE544NA Final Project: Robust Machine Learning Hardware via Classifier Ensemble 1 ECE544NA Fnal Project: Robust Machne Learnng Hardware va Classfer Ensemble Sa Zhang, [email protected] Dept. of Electr. & Comput. Eng., Unv. of Illnos at Urbana-Champagn, Urbana, IL, USA Abstract In

More information

Conversion between the vector and raster data structures using Fuzzy Geographical Entities

Conversion between the vector and raster data structures using Fuzzy Geographical Entities Converson between the vector and raster data structures usng Fuzzy Geographcal Enttes Cdála Fonte Department of Mathematcs Faculty of Scences and Technology Unversty of Combra, Apartado 38, 3 454 Combra,

More information

An artificial Neural Network approach to monitor and diagnose multi-attribute quality control processes. S. T. A. Niaki*

An artificial Neural Network approach to monitor and diagnose multi-attribute quality control processes. S. T. A. Niaki* Journal of Industral Engneerng Internatonal July 008, Vol. 4, No. 7, 04 Islamc Azad Unversty, South Tehran Branch An artfcal Neural Network approach to montor and dagnose multattrbute qualty control processes

More information

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12 14 The Ch-squared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed

More information

The Current Employment Statistics (CES) survey,

The Current Employment Statistics (CES) survey, Busness Brths and Deaths Impact of busness brths and deaths n the payroll survey The CES probablty-based sample redesgn accounts for most busness brth employment through the mputaton of busness deaths,

More information

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression Novel Methodology of Workng Captal Management for Large Publc Constructons by Usng Fuzzy S-curve Regresson Cheng-Wu Chen, Morrs H. L. Wang and Tng-Ya Hseh Department of Cvl Engneerng, Natonal Central Unversty,

More information

FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES

FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES Zuzanna BRO EK-MUCHA, Grzegorz ZADORA, 2 Insttute of Forensc Research, Cracow, Poland 2 Faculty of Chemstry, Jagellonan

More information

Gender Classification for Real-Time Audience Analysis System

Gender Classification for Real-Time Audience Analysis System Gender Classfcaton for Real-Tme Audence Analyss System Vladmr Khryashchev, Lev Shmaglt, Andrey Shemyakov, Anton Lebedev Yaroslavl State Unversty Yaroslavl, Russa [email protected], [email protected], [email protected],

More information

Abstract. 260 Business Intelligence Journal July IDENTIFICATION OF DEMAND THROUGH STATISTICAL DISTRIBUTION MODELING FOR IMPROVED DEMAND FORECASTING

Abstract. 260 Business Intelligence Journal July IDENTIFICATION OF DEMAND THROUGH STATISTICAL DISTRIBUTION MODELING FOR IMPROVED DEMAND FORECASTING 260 Busness Intellgence Journal July IDENTIFICATION OF DEMAND THROUGH STATISTICAL DISTRIBUTION MODELING FOR IMPROVED DEMAND FORECASTING Murphy Choy Mchelle L.F. Cheong School of Informaton Systems, Sngapore

More information

行 政 院 國 家 科 學 委 員 會 補 助 專 題 研 究 計 畫 成 果 報 告 期 中 進 度 報 告

行 政 院 國 家 科 學 委 員 會 補 助 專 題 研 究 計 畫 成 果 報 告 期 中 進 度 報 告 行 政 院 國 家 科 學 委 員 會 補 助 專 題 研 究 計 畫 成 果 報 告 期 中 進 度 報 告 畫 類 別 : 個 別 型 計 畫 半 導 體 產 業 大 型 廠 房 之 設 施 規 劃 計 畫 編 號 :NSC 96-2628-E-009-026-MY3 執 行 期 間 : 2007 年 8 月 1 日 至 2010 年 7 月 31 日 計 畫 主 持 人 : 巫 木 誠 共 同

More information