
Journal of Artificial Intelligence Research 8 (1998) 67-91. Submitted 7/97; published 1998.

Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

Andrew Moore, Mary Soon Lee
School of Computer Science and Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
awm@cs.cmu.edu, mslee@cs.cmu.edu

Abstract

This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves. We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADtree methods against traditional direct counting approaches. We also discuss the possible uses of ADtrees in other machine learning methods, and discuss the merits of ADtrees in comparison with alternative representations such as kd-trees, R-trees and Frequent Sets.

1. Caching Sufficient Statistics

Computational efficiency is an important concern for machine learning algorithms, especially when applied to large datasets (Fayyad, Mannila, & Piatetsky-Shapiro, 1997; Fayyad & Uthurusamy, 1996) or in real-time scenarios.
In earlier work we showed how kd-trees with multiresolution cached regression matrix statistics can enable very fast locally weighted and instance-based regression (Moore, Schneider, & Deng, 1997). In this paper, we attempt to accelerate predictions for symbolic attributes using a kind of kd-tree that splits on all dimensions at all nodes. Many machine learning algorithms operating on datasets of symbolic attributes need to do frequent counting. This work is also applicable to Online Analytical Processing (OLAP) applications in data mining, where operations on large datasets such as multidimensional database access, DataCube operations (Harinarayan, Rajaraman, & Ullman, 1996), and association rule learning (Agrawal, Mannila, Srikant, Toivonen, & Verkamo, 1996) could be accelerated by fast counting.

Let us begin by establishing some notation. We are given a data set with R records and M attributes. The attributes are called a1, a2, ..., aM. The value of attribute a_i in the

(c) 1998 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

kth record is a small integer lying in the range {1, 2, ..., n_i}, where n_i is called the arity of attribute a_i. Figure 1 gives an example.

Figure 1: A simple dataset used as an example. It has R = 6 records and M = 3 attributes, with arities n_1 = 2, n_2 = 4, n_3 = 2.

1.1 Queries

A query is a set of (attribute = value) pairs in which the left hand sides of the pairs form a subset of {a1 ... aM} arranged in increasing order of index. Four examples of queries for our dataset are

(a1 = 1); (a2 = 2, a3 = 1); (); (a1 = 2, a2 = 4, a3 = 1)   (1)

Notice that the total number of possible queries is ∏_{i=1}^{M} (n_i + 1). This is because each attribute can either appear in the query with one of the n_i values it may take, or it may be omitted (which is equivalent to giving it the a_i = * "don't care" value).

1.2 Counts

The count of a query, denoted by C(Query), is simply the number of records in the dataset matching all the (attribute = value) pairs in the query. For our example dataset, C() = 6: the empty query matches every record. The count of any other query is read from Figure 1 in the same way.

1.3 Contingency Tables

Each subset of attributes, a_i(1) ... a_i(n), has an associated contingency table denoted by ct(a_i(1) ... a_i(n)). This is a table with a row for each of the possible sets of values for a_i(1) ... a_i(n). The row corresponding to a_i(1) = v_1 ... a_i(n) = v_n records the count C(a_i(1) = v_1 ... a_i(n) = v_n). Our example dataset has 3 attributes and so 2^3 = 8 contingency tables exist, depicted in Figure 2.
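As a concrete baseline, counts and contingency tables can be computed by direct ("linear") counting in a few lines of Python. This is our own illustration, not the paper's code, and the six-record dataset below is invented for the example (it only mimics the shape of Figure 1: arities 2, 4 and 2):

```python
from itertools import product

def count(records, query):
    """C(Query): number of records matching every (attribute_index, value) pair."""
    return sum(all(r[j] == v for j, v in query) for r in records)

def contingency_table(records, attrs, arities):
    """ct(attrs): map from each value assignment of attrs to its count."""
    ct = {vals: 0 for vals in product(*(range(1, arities[j] + 1) for j in attrs))}
    for r in records:
        ct[tuple(r[j] for j in attrs)] += 1
    return ct

# An invented 6-record dataset with three attributes of arities 2, 4, 2.
DATA = [(1, 1, 1), (1, 2, 1), (2, 2, 2), (1, 4, 2), (2, 3, 1), (1, 1, 2)]

assert count(DATA, []) == 6            # the empty query matches every record
assert count(DATA, [(0, 1)]) == 4      # C(a1 = 1) for this invented data
ct = contingency_table(DATA, (0, 2), (2, 4, 2))
assert sum(ct.values()) == 6 and ct[(1, 1)] == 2
```

Direct counting costs a full scan of the R records per query, which is exactly the cost the ADtree below is designed to avoid.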

Figure 2: The eight possible contingency tables for the dataset of Figure 1.

A conditional contingency table, written

ct(a_i(1) ... a_i(n) | a_j(1) = u_1, ..., a_j(p) = u_p)   (2)

is the contingency table for the subset of records in the dataset that match the query to the right of the | symbol. For example, ct(a1, a2 | a3 = 1) tabulates the counts of all (a1, a2) value pairs among only those records with a3 = 1.

Contingency tables are used in a variety of machine learning applications, including building the probability tables for Bayes nets and evaluating candidate conjunctive rules in rule learning algorithms (Quinlan, 1990; Clark & Niblett, 1989). It would thus be desirable to be able to perform such counting efficiently. If we are prepared to pay a one-time cost for building a caching data structure, then it is easy to suggest a mechanism for doing counting in constant time: for each possible query, precompute its count. The total number of counts stored in memory for such a data structure would be ∏_{i=1}^{M} (n_i + 1), which even for our humble dataset of Figure 1 is 45, as revealed by Figure 2. For a real dataset with more than ten attributes of medium arity, or fifteen binary attributes, this is far too large to fit in main memory. We would like to retain the speed of precomputed contingency tables without incurring an intractable memory demand. That is the subject of this paper.

2. Cache Reduction I: The Dense ADtree for Caching Sufficient Statistics

First we will describe the ADtree, the data structure we will use to represent the set of all possible counts. The SE-tree (Rymon, 1992) is a similar data structure. Our initial simplified description is an obvious tree representation that does not yield any immediate memory savings, but will later provide several opportunities

for cutting off zero counts and redundant counts. This structure is shown in Figure 3. An ADtree node (shown as a rectangle) has child nodes called "Vary nodes" (shown as ovals). Each ADnode represents a query and stores the number of records that match the query (in its C = # field). The Vary a_j child of an ADnode has one child for each of the n_j values of attribute a_j. The kth such child represents the same query as Vary a_j's parent, with the additional constraint that a_j = k.

Figure 3: The top ADnodes of an ADtree, described in the text.

Notes regarding this structure:

- Although drawn on the diagram, the description of the query (e.g., a1 = 1, a2 = *, ..., aM = * on the leftmost ADnode of the second level) is not explicitly recorded in the ADnode. The contents of an ADnode are simply a count and a set of pointers to the Vary a_j children. The contents of a Vary a_j node are a set of pointers to ADnodes.

- The cost of looking up a count is proportional to the number of instantiated variables in the query. For example, to look up a count such as C(a7 = v1, a_i = v2, a_j = v3), with 7 < i < j, we would follow the path Vary a7 → a7 = v1 → Vary a_i → a_i = v2 → Vary a_j → a_j = v3 in the tree. Then the count is obtained from the resulting node.

- Notice that if a node ADN has Vary a_i as its parent, then ADN's children are Vary a_{i+1}, Vary a_{i+2}, ..., Vary a_M. It is not necessary to store Vary nodes with indices below i + 1 because that information can be obtained from another path in the tree.
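The dense ADtree described above can be sketched as follows (a hypothetical Python rendering with our own names, over an invented six-record dataset). One ADnode exists per possible query, so the node count equals ∏(n_i + 1) = 45 here, and a lookup touches one Vary node per instantiated attribute:

```python
DATA = [(1, 1, 1), (1, 2, 1), (2, 2, 2), (1, 4, 2), (2, 3, 1), (1, 1, 2)]
ARITY = (2, 4, 2)

class ADNode:
    """Dense ADtree node: a count plus Vary nodes for every later attribute."""
    def __init__(self, rows, first_attr):
        self.count = len(rows)
        # vary[j] maps each value k of attribute j to the child ADNode for a_j = k
        self.vary = {
            j: {k: ADNode([r for r in rows if DATA[r][j] == k], j + 1)
                for k in range(1, ARITY[j] + 1)}
            for j in range(first_attr, len(ARITY))
        }

def lookup(root, query):
    """Count of a query given as ((attr, value), ...) with increasing attrs."""
    node = root
    for j, k in query:
        node = node.vary[j][k]
    return node.count

def size(node):
    return 1 + sum(size(c) for d in node.vary.values() for c in d.values())

root = ADNode(list(range(len(DATA))), 0)
assert size(root) == 45                  # = (2+1)(4+1)(2+1): one ADnode per query
assert lookup(root, ()) == 6
assert lookup(root, ((0, 1), (2, 2))) == 2
```

The `size` check makes the memory problem of this section concrete: the dense tree stores exactly one node per possible query, which is why the sparse reductions below are needed.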

2.1 Cutting off nodes with counts of zero

As described, the tree is not sparse and will contain exactly ∏_{i=1}^{M} (n_i + 1) nodes. Sparseness is easily achieved by storing NULL instead of a node for any query that matches zero records. All of the specializations of such a query will also have a count of zero, and they will not appear anywhere in the tree. For some datasets this can greatly reduce the number of counts that need to be stored: the dataset in Figure 1, which previously needed 45 numbers to represent all contingency tables, now needs far fewer.

3. Cache Reduction II: The Sparse ADtree

It is easy to devise datasets for which there is no benefit in failing to store counts of zero. Suppose we have M binary attributes and 2^M records in which the kth record is the bits of the binary representation of k. Then no query has a count of zero and the tree contains 3^M nodes. To reduce the tree size despite this, we will take advantage of the observation that very many of the counts stored in the above tree are redundant. Each Vary a_j node in the above ADtree stores n_j subtrees: one subtree for each value of a_j. Instead, we will find the most common of the values of a_j (call it MCV) and store NULL in place of the MCVth subtree. The remaining n_j - 1 subtrees will be represented as before. An example for a simple dataset is given in Figure 4. Each Vary a_j node now records which of its values is most common in an MCV field. Appendix B describes the straightforward algorithm for building such an ADtree.

As we will see in Section 4, it is still possible to build full exact contingency tables (or give counts for specific queries) in time that is only slightly longer than for the full ADtree of Section 2. But first let us examine the memory consequences of this representation. Appendix A shows that for binary attributes, given M attributes and R records, the number of nodes needed to store the tree is bounded above by 2^M in the worst case (and much less if R ≪ 2^M).
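The adversarial dataset just described is easy to construct and check; a small Python sketch of our own (M = 6 for speed) verifies that every one of the 3^M possible queries matches at least one record, so the zero-count cutoff of the previous section saves nothing:

```python
from itertools import product

# M binary attributes, 2**M records; record k holds the bits of k (values 1/2).
M = 6
records = [tuple(1 + ((k >> (M - 1 - j)) & 1) for j in range(M))
           for k in range(2 ** M)]

def count(query):
    return sum(all(r[j] == v for j, v in query) for r in records)

# Enumerate all 3**M queries: each attribute is 1, 2, or "don't care" (None).
for spec in product((None, 1, 2), repeat=M):
    q = [(j, v) for j, v in enumerate(spec) if v is not None]
    assert count(q) > 0    # no query has a count of zero
```

A query instantiating t of the M attributes matches exactly 2^(M-t) records here, which is why every count is positive.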
In contrast, the amount of memory needed by the dense tree of Section 2 is 3^M in the worst case. Notice in Figure 4 that the MCV value is context dependent: depending on the constraints on parent nodes, the same attribute's MCV may differ in different parts of the tree. This context dependency can provide dramatic savings if (as is frequently the case) there are correlations among the attributes. This is discussed further in Appendix A.

4. Computing Contingency Tables from the Sparse ADtree

Given an ADtree, we wish to be able to quickly construct contingency tables for any arbitrary set of attributes {a_i(1), ..., a_i(n)}. Notice that a conditional contingency table ct(a_i(1) ... a_i(n) | Query) can be built recursively. We first build

ct(a_i(2) ... a_i(n) | a_i(1) = 1, Query)
ct(a_i(2) ... a_i(n) | a_i(1) = 2, Query)
...
ct(a_i(2) ... a_i(n) | a_i(1) = n_i(1), Query)

Figure 4: A sparse ADtree built for the dataset shown at the bottom right. The most common value of a1 is marked as the MCV, and so the corresponding subtree of the Vary a1 child of the root node is NULL. At each of the deeper Vary nodes the most common child is likewise NULL (which child is most common depends on the context).

For example, to build ct(a1, a2) using the dataset in Figure 1, we can build ct(a2 | a1 = 1) and ct(a2 | a1 = 2) and combine them as in Figure 5.

Figure 5: An example (using numbers from Figure 1) of how contingency tables can be combined recursively to form larger contingency tables.

When building a conditional contingency table from an ADtree, we will not need to explicitly specify the query condition. Instead, we will supply an ADnode of the ADtree, which implicitly provides equivalent information. The algorithm is:

MakeContab({a_i(1), ..., a_i(n)}, ADN)
  Let VN := the Vary a_i(1) subnode of ADN
  Let MCV := VN.MCV
  For k := 1, 2, ..., n_i(1)
    If k ≠ MCV
      Let ADN_k := the a_i(1) = k subnode of VN
      CT_k := MakeContab({a_i(2), ..., a_i(n)}, ADN_k)
  CT_MCV := (calculated as explained below)
  Return the concatenation of CT_1 ... CT_{n_i(1)}

The base case of this recursion occurs when the first argument is empty, in which case we return a one-element contingency table containing the count associated with the current ADnode, ADN.

There is an omission in the algorithm. In the iteration over k ∈ {1, 2, ..., n_i(1)} we are unable to compute the conditional contingency table CT_MCV because the a_i(1) = MCV subtree is deliberately missing, as per Section 3. What can we do instead? We can take advantage of the following property of contingency tables:

ct(a_i(2) ... a_i(n) | Query) = Σ_{k=1}^{n_i(1)} ct(a_i(2) ... a_i(n) | a_i(1) = k, Query)   (3)

The value ct(a_i(2) ... a_i(n) | Query) can be computed from within our algorithm by calling

MakeContab({a_i(2), ..., a_i(n)}, ADN)   (4)

and so the missing conditional contingency table in the algorithm can be computed by the following row-wise subtraction:

CT_MCV := MakeContab({a_i(2), ..., a_i(n)}, ADN) - Σ_{k ≠ MCV} CT_k   (5)

Frequent Sets (Agrawal et al., 1996), which are traditionally used for learning association rules, can also be used for computing counts. A recent paper (Mannila & Toivonen, 1996), which also employs a similar subtraction trick, calculates counts from Frequent Sets. In Section 8 we will discuss the strengths and weaknesses of Frequent Sets in comparison with ADtrees.

4.1 Complexity of building a contingency table

What is the cost of computing a contingency table? Let us consider the theoretical worst-case cost of computing a contingency table for n attributes each of arity k. Note that this cost is unrealistically pessimistic (except when k = 2), because most contingency tables are sparse, as discussed later. The assumption that all attributes have the same arity, k, is made to simplify the calculation of the worst-case cost, but is not needed by the code.
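MakeContab and the subtraction trick of Equation 5 can be sketched in Python. This is our own minimal rendering, not the paper's code: it builds a sparse ADtree (MCV subtrees and zero-count subtrees both stored as None) over an invented six-record dataset and verifies the resulting contingency tables against direct counting:

```python
from itertools import product

DATA = [(1, 1, 1), (1, 2, 1), (2, 2, 2), (1, 4, 2), (2, 3, 1), (1, 1, 2)]
ARITY = (2, 4, 2)

def build(rows, first_attr):
    """Sparse ADtree: the most-common-value subtree of every Vary node is None."""
    node = {'count': len(rows), 'vary': {}}
    for j in range(first_attr, len(ARITY)):
        groups = {k: [r for r in rows if DATA[r][j] == k]
                  for k in range(1, ARITY[j] + 1)}
        mcv = max(groups, key=lambda k: len(groups[k]))
        node['vary'][j] = {
            'mcv': mcv,
            'kids': {k: (None if k == mcv or not g else build(g, j + 1))
                     for k, g in groups.items()}}
    return node

def make_contab(attrs, adn):
    """ct(attrs | adn's query), as a dict from value tuples to counts."""
    if adn is None:        # zero-count cutoff: an all-zero table
        return {v: 0 for v in product(*(range(1, ARITY[j] + 1) for j in attrs))}
    if not attrs:
        return {(): adn['count']}
    j, rest = attrs[0], attrs[1:]
    vn = adn['vary'][j]
    tabs = {k: make_contab(rest, vn['kids'][k])
            for k in range(1, ARITY[j] + 1) if k != vn['mcv']}
    whole = make_contab(rest, adn)                       # as in Equation 4
    tabs[vn['mcv']] = {v: whole[v] - sum(t[v] for t in tabs.values())
                       for v in whole}                   # subtraction, Equation 5
    return {(k,) + v: c for k, t in sorted(tabs.items()) for v, c in t.items()}

root = build(list(range(len(DATA))), 0)
for attrs in ((0,), (1,), (0, 2), (0, 1, 2)):
    ct = make_contab(attrs, root)
    for vals, c in ct.items():
        assert c == sum(all(r[j] == v for j, v in zip(attrs, vals)) for r in DATA)
```

The final loop is the key check: every table the sparse tree reconstructs by subtraction agrees with a brute-force scan of the data.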

A contingency table for n attributes has k^n entries. Write C(n) for the cost of computing such a contingency table. In the top-level call of MakeContab there are k calls to build contingency tables from n - 1 attributes: k - 1 of these calls build ct(a_i(2) ... a_i(n) | a_i(1) = j, Query) for every j in {1, 2, ..., k} except the MCV, and the final call builds ct(a_i(2) ... a_i(n) | Query). Then there will be k - 1 subtractions of contingency tables, which will each require k^{n-1} numeric subtractions. So we have

C(0) = 1   (6)
C(n) = kC(n-1) + (k-1)k^{n-1} if n > 0   (7)

The solution to this recurrence relation is C(n) = (k + n(k-1))k^{n-1}; this cost is loglinear in the size of the contingency table. By comparison, if we used no cached data structure, but simply counted through the dataset in order to build a contingency table, we would need O(nR + k^n) operations, where R is the number of records in the dataset. We are thus cheaper than the standard counting method if k^n ≪ R. We are interested in large datasets in which R may be more than 100,000. In such a case our method will present several orders of magnitude speedup for, say, a contingency table of eight binary attributes. Notice that this cost is independent of M, the total number of attributes in the dataset, and only depends upon the (almost always much smaller) number of attributes n requested for the contingency table.

4.2 Sparse representation of contingency tables

In practice, we do not represent contingency tables as multidimensional arrays, but rather as tree structures. This gives both the slow counting approach and the ADtree approach a substantial computational advantage in cases where the contingency table is sparse, i.e. has many zero entries. Figure 6 shows such a sparse contingency table representation. This can mean average-case behavior is much faster than worst case for contingency tables with large numbers of attributes or high-arity attributes. Indeed, our experiments in Section 7 show costs rising much more slowly than O(nk^{n-1}) as n increases.
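The closed form for the recurrence of Equations 6 and 7 can be checked numerically; a quick sketch:

```python
# Check that C(n) = (k + n(k-1)) * k**(n-1) solves the recurrence
# C(0) = 1, C(n) = k*C(n-1) + (k-1)*k**(n-1).
def cost_recursive(n, k):
    return 1 if n == 0 else k * cost_recursive(n - 1, k) + (k - 1) * k ** (n - 1)

def cost_closed(n, k):
    return 1 if n == 0 else (k + n * (k - 1)) * k ** (n - 1)

for k in (2, 3, 5):
    for n in range(0, 8):
        assert cost_recursive(n, k) == cost_closed(n, k)
```

Since k^n is the table size, (k + n(k-1))k^{n-1} is indeed loglinear in that size for fixed k.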
Note too that when using a sparse representation, the worst case for MakeContab is now O(min(nR, nk^{n-1})), because R is the maximum possible number of non-zero contingency table entries.

5. Cache Reduction III: Leaf-Lists

We now introduce a scheme for further reducing memory use. It is not worth building the ADtree data structure for a small number of records. For example, suppose we have 15 records and 40 binary attributes. Then the analysis in Appendix A shows us that in the worst case the ADtree might require over a thousand nodes. But computing contingency tables using the resulting ADtree would, with so few records, be no faster than the conventional counting approach, which would merely require us to retain the dataset in memory.

Aside from concluding that ADtrees are not useful for very small datasets, this also leads to a final method for saving memory in large ADtrees. Any ADtree node with fewer than R_min records does not expand its subtree. Instead it maintains a list of pointers into the original dataset, explicitly listing those records that match the current ADnode. Such a list of pointers is called a leaf-list. Figure 7 gives an example.
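A leaf-list cutoff can be sketched as follows. This is our own simplified illustration: it omits the MCV trick and simply stops expanding any node covering fewer than R_MIN records, storing row numbers instead; a lookup that reaches a leaf-list finishes by scanning those rows:

```python
R_MIN = 3
DATA = [(1, 1, 1), (1, 2, 1), (2, 2, 2), (1, 4, 2), (2, 3, 1), (1, 1, 2)]
ARITY = (2, 4, 2)

def build(rows, first_attr):
    node = {'count': len(rows)}
    if len(rows) < R_MIN:
        node['leaf_rows'] = rows           # pointers into DATA; no subtree
        return node
    node['vary'] = {
        j: {k: build([r for r in rows if DATA[r][j] == k], j + 1)
            for k in range(1, ARITY[j] + 1)}
        for j in range(first_attr, len(ARITY))
    }
    return node

def lookup(node, query):
    """query: ((attr, value), ...) with increasing attribute indices."""
    for i, (j, v) in enumerate(query):
        if 'leaf_rows' in node:            # finish by scanning the listed rows
            return sum(all(DATA[r][jj] == vv for jj, vv in query[i:])
                       for r in node['leaf_rows'])
        node = node['vary'][j][v]
    return node['count']

root = build(list(range(len(DATA))), 0)
assert lookup(root, ()) == 6
assert lookup(root, ((0, 1), (1, 1))) == 2
```

Note the consequence described below: the dataset (`DATA`) itself must stay in memory, because leaf-list lookups dereference row pointers into it.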

Figure 6: The right-hand figure is the sparse representation of the contingency table on the left.

The use of leaf-lists has one minor and two major consequences. The minor consequence is the need to include a straightforward change in the contingency table generating algorithm to handle leaf-list nodes. This minor alteration is not described here. The first major consequence is that now the dataset itself must be retained in main memory so that algorithms that inspect leaf-lists can access the rows of data pointed to in those leaf-lists. The second major consequence is that the ADtree may require much less memory. This is documented in Section 7, and worst-case bounds are provided in Appendix A.

6. Using ADtrees for Machine Learning

As we will see in Section 7, the ADtree structure can substantially speed up the computation of contingency tables for large real datasets. How can machine learning and statistical algorithms take advantage of this? Here we provide three examples: feature selection, Bayes net scoring, and rule learning. But it seems likely that many other algorithms can also benefit, for example stepwise logistic regression, GMDH (Madala & Ivakhnenko, 1994), and text classification. Even decision tree (Quinlan, 1983; Breiman, Friedman, Olshen, & Stone, 1984) learning may benefit.[1] In future work we will also examine ways to speed up nearest neighbor and other memory-based queries using ADtrees.

[1] This depends on whether the cost of initially building the ADtree can be amortized over many runs of the decision tree algorithm. Repeated runs of decision tree building can occur if one is using the wrapper model of feature selection (John, Kohavi, & Pfleger, 1994), or if one is using a more intensive search over tree structures than the traditional greedy search (Quinlan, 1983; Breiman et al., 1984).

Figure 7: An ADtree built using leaf-lists with R_min = 4. Any node matching 3 or fewer records is not expanded, but simply records a set of pointers into the dataset (shown on the right).

6.1 Datasets

The experiments used the datasets in Table 1. Each dataset was supplied to us with all continuous attributes already discretized into ranges.

6.2 Using ADtrees for Feature Selection

Given M attributes, of which one is an output that we wish to predict, it is often interesting to ask "which subset of n attributes (n < M) is the best predictor of the output on the same distribution of datapoints that are reflected in this dataset?" (Kohavi, 1995). There are many ways of scoring a set of features, but a particularly simple one is information gain (Cover & Thomas, 1991). Let a_out be the attribute we wish to predict and let a_i(1) ... a_i(n) be the set of attributes used as inputs. Let X be the set of possible assignments of values to (a_i(1), ..., a_i(n)) and write Assign_k ∈ X as the kth such assignment. Then

InfoGain = Σ_{v=1}^{n_out} f(C(a_out = v) / R) - Σ_{k=1}^{|X|} [ (C(Assign_k) / R) Σ_{v=1}^{n_out} f(C(a_out = v, Assign_k) / C(Assign_k)) ]   (8)

where R is the number of records in the entire dataset and

f(x) = -x log x   (9)

The counts needed in the above computation can be read directly from ct(a_out, a_i(1) ... a_i(n)). Searching for the best subset of attributes is then simply a question of searching among all attribute-sets of size n (n specified by the user). This is a simple example designed to test

Table 1: Datasets used in the experiments.

ADULT1 (R = 15,060, M = 15): The small "Adult Income" dataset placed in the UCI repository by Ron Kohavi (Kohavi, 1996). Contains census data related to job, wealth, and nationality. Attribute arities range from 2 to 41. In the UCI repository this is called the Test Set. Rows with missing values were removed.
ADULT2 (R = 30,162, M = 15): The same kinds of records as above but with different data. The Training Set.
ADULT3 (R = 45,222, M = 15): ADULT1 and ADULT2 concatenated.
CENSUS1 (M = 13): A larger dataset based on a different census, also provided by Ron Kohavi.
CENSUS2 (M = 15): The same data as CENSUS1, but with the addition of two extra, high-arity attributes.
BIRTH (R = 9,672, M = 97): Records concerning a very wide number of readings and factors recorded at various stages during pregnancy. Most attributes are binary, and 70 of the attributes are very sparse, with over 95% of the values being FALSE.
SYNTH (R = 10K-500K, M = 24): Synthetic datasets of entirely binary attributes generated using the Bayes net in Figure 8.

our counting methods: any practical feature selector would need to penalize the number of rows in the contingency table (else high-arity attributes would tend to win).

6.3 Using ADtrees for Bayes Net Structure Discovery

There are many possible Bayes net learning tasks, all of which entail counting, and hence might be speeded up by ADtrees. In this paper we present experimental results for the particular example of scoring the structure of a Bayes net to decide how well it matches the data. We will use maximum likelihood scoring with a penalty for the number of parameters. We first compute the probability table associated with each node. Write Parents(j) for the parent attributes of node j and write X_j as the set of possible assignments of values to Parents(j). The maximum likelihood estimate for

P(a_j = v | X_j)   (10)

is

C(a_j = v, X_j) / C(X_j)   (11)

and all such estimates for node j's probability tables can be read from ct(a_j, Parents(j)).
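Reading a node's maximum-likelihood probability table out of ct(a_j, Parents(j)), as in Equation 11, can be sketched like this (hypothetical names; the counts below are invented for illustration):

```python
from collections import defaultdict

def cpt_from_ct(ct_joint):
    """ct_joint maps (v, parent_assignment) -> count, i.e. ct(a_j, Parents(j)).
    Returns the ML estimates P(a_j = v | parent_assignment), keyed the same way."""
    parent_counts = defaultdict(int)
    for (v, asgn), c in ct_joint.items():
        parent_counts[asgn] += c               # C(X_j), the denominator
    return {(v, asgn): c / parent_counts[asgn]
            for (v, asgn), c in ct_joint.items() if parent_counts[asgn] > 0}

# Invented counts for a binary node a_j with one binary parent.
ct = {(1, (1,)): 3, (2, (1,)): 1, (1, (2,)): 0, (2, (2,)): 2}
cpt = cpt_from_ct(ct)
assert cpt[(1, (1,))] == 0.75 and cpt[(2, (2,))] == 1.0
```

Because the only inputs are the rows of one contingency table, this step is exactly the kind of computation an ADtree accelerates.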
The next step in scoring a structure is to decide the likelihood of the data given the probability tables we computed, and to penalize the number of parameters in our network (without the penalty, the likelihood would increase every time a link was added to the

Figure 8: A Bayes net that generated our SYNTH datasets. There are three kinds of nodes. The nodes marked with triangles are generated with P(a_i = 1) = 0.8, P(a_i = 2) = 0.2. The square nodes are deterministic: a square node takes value 1 if the sum of its four parents is even, else it takes value 2. The circle nodes are probabilistic functions of their single parent, defined by P(a_i = 1 | Parent = 1) = 1.0 and P(a_i = 1 | Parent = 2) = 0.4. This provides a dataset with fairly sparse values and with many interdependencies.

network). The penalized log-likelihood score (Friedman & Yakhini, 1996) is

-N_params log(R)/2 + R Σ_{j=1}^{M} Σ_{Asgn ∈ X_j} Σ_{v=1}^{n_j} P(a_j = v ∧ Asgn) log P(a_j = v | Asgn)   (12)

where N_params is the total number of probability table entries in the network. We search among structures to find the best score. In these experiments we use random-restart stochastic hill climbing in which the operations are random addition or removal of a network link, or randomly swapping a pair of nodes. The latter operation is necessary to allow the search algorithm to choose the best ordering of nodes in the Bayes net. Stochastic searches such as this are a popular method for finding Bayes net structures (Friedman & Yakhini, 1996). Only the probability tables of the affected nodes are recomputed on each step. Figure 9 shows the Bayes net structure returned by our Bayes net structure finder after 10,000 iterations of hill climbing.

6.4 Using ADtrees for Rule Finding

Given an output attribute a_out and a distinguished value v_out, rule finders search among conjunctive queries of the form

Assign = (a_i(1) = v_1, ..., a_i(n) = v_n)   (13)

Figure 9: Output from the Bayes structure finder running on an ADULT dataset. For each attribute the output lists its parents in the learned network, score (the contribution to the sum in Equation 12 due to the specified attribute) and np (the number of entries in the probability table for the specified attribute).

to find the query that maximizes the estimated value

P(a_out = v_out | Assign) = C(a_out = v_out, Assign) / C(Assign)   (14)

To avoid rules without significant support, we also insist that C(Assign) (the number of records matching the query) must be above some threshold S_min. In these experiments we implement a brute-force search that looks through all possible queries that involve a user-specified number of attributes, n. We build each ct(a_out, a_i(1) ... a_i(n)) in turn (there are M-choose-n such tables), and then look through the rows of each table for all queries using the a_i(1) ... a_i(n) that have greater than the minimum support S_min. We return a priority queue of the highest scoring rules. For instance, on an ADULT dataset, the best rule found for predicting "class" from 4 attributes had the form: workclass = Private, education-num above a threshold, marital-status = Married-civ-spouse, capital-loss above a threshold ⟹ class ≥ 50k.

7. Experimental Results

Let us first examine the memory required by an ADtree on our datasets. Table 2 shows us, for example, that an ADULT dataset produced an ADtree with some 95,000 nodes; the tree required a few megabytes of memory.
Among the three ADULT datasets, the size of the tree varied approximately linearly with the number of records. Unless otherwise specified, in all the experiments in this section the ADULT datasets used no leaf-lists, and the BIRTH and SYNTHETIC datasets used leaf-lists of size R_min = 16 by default. The BIRTH dataset, with its large number of sparse attributes, required a modest 8 megabytes to store the tree, many orders of magnitude below the worst-case bounds. Among the synthetic datasets, the tree size increased sublinearly with the dataset size. This indicates

Table 2: The size of ADtrees for the various datasets (CENSUS1, CENSUS2, ADULT1-3, BIRTH, and the SYN datasets). M is the number of attributes. R is the number of records. Nodes is the number of nodes in the ADtree. Megabytes is the amount of memory needed to store the tree. Build Time is the number of seconds needed to build the tree (to the nearest second).

that as the dataset gets larger, novel records (which may cause new nodes to appear in the tree) become less frequent.

Table 3 shows the costs of performing 10,000 iterations of Bayes net structure searching. All experiments were performed on a 200MHz Pentium Pro machine with 192 megabytes of main memory. Recall that each Bayes net iteration involves one random change to the network and so requires the recomputation of one contingency table (the exception is the first iteration, in which all nodes must be computed). This means that the time to run 10,000 iterations is essentially the time to compute 10,000 contingency tables. Among the ADULT datasets, the advantage of the ADtree over conventional counting ranges upward from a factor of 9. Unsurprisingly, the computational costs for the ADULT datasets increase sublinearly with dataset size for the ADtree but linearly for conventional counting. The computational advantages and the sublinear behavior are much more pronounced for the synthetic data.

Next, Table 4 examines the effect of leaf-lists on the ADULT and BIRTH datasets. For the ADULT dataset, the byte size of the tree decreases by a factor of 5 as the leaf-list size is increased to 64, while the computational cost of running the Bayes search increases only modestly, indicating a worthwhile tradeoff if memory is scarce.

The Bayes net scoring results involved the average cost of computing contingency tables of many different sizes. The results in Tables 5 and 6 make the savings for fixed-size attribute sets easier to discern. These tables give results for the feature selection and rule finding algorithms, respectively. The biggest savings come from small attribute sets.
Computational savings for sets of size one or two are, however, not particularly interesting, since all such counts could be cached by straightforward methods without needing any tricks. In all cases, however, we do see large savings, especially for the BIRTH data. Datasets with larger numbers of rows would, of course, reveal larger savings.

Table 3: The time (in seconds) to perform 10,000 hill-climbing iterations searching for the best Bayes net structure. ADtree Time is the time when using the ADtree and Regular Time is the time taken when using the conventional probability table scoring method of counting through the dataset. Speedup Factor is the number of times by which the ADtree method is faster than the conventional method. The ADtree times do not include the time for building the ADtree in the first place (given in Table 2). A typical use of ADtrees will build the tree only once and then be able to use it for many data analysis operations, and so its building cost can be amortized. In any case, even including the tree building cost would have only a minor impact on the results.

Table 4: Investigating the effect of the R_min parameter on the ADULT dataset and the BIRTH dataset. MB is the memory used by the ADtree. Nodes is the number of nodes in the ADtree. Build Secs is the time to build the ADtree. Search Secs is the time needed to perform 10,000 iterations of the Bayes net structure search.

Table 5: The time taken to search among all attribute sets of a given size (Number Attributes) for the set that gives the best information gain in predicting the output attribute. The times, in seconds, are the average evaluation times per attribute-set.

Table 6: The time taken to search among all rules of a given size (Number Attributes) for the highest scoring rules for predicting the output attribute. The times, in seconds, are the average evaluation time per rule.

8. Alternative Data Structures

8.1 Why not use a kd-tree?

kd-trees can be used for accelerating learning algorithms (Omohundro, 1987; Moore et al., 1997). The primary difference is that a kd-tree node splits on only one attribute instead of all attributes. This results in much less memory (linear in the number of records), but counting can be expensive. Suppose, for example, that level one of the tree splits on a1, level two splits on a2, etc. Then, in the case of binary variables, if we have a query involving only attributes a10 and higher, we have to explore all paths in the tree down to level 10. With datasets of fewer than 2^10 records this may be no cheaper than performing a linear search through the records. Another possibility, R-trees (Guttman, 1984; Roussopoulos & Leifker, 1985), store databases of M-dimensional geometric objects. However, in this context, they offer no advantages over kd-trees.

8.2 Why not use a Frequent Set finder?

Frequent Set finders (Agrawal et al., 1996) are typically used with very large databases of millions of records containing very sparse binary attributes. Efficient algorithms exist for finding all subsets of attributes that co-occur with value TRUE in more than a fixed number (chosen by the user, and called the support) of records. Recent research (Mannila & Toivonen, 1996) suggests that such Frequent Sets can be used to perform efficient counting. In the case where support = 1, all such Frequent Sets are gathered and, if the counts of each Frequent Set are retained, this is equivalent to producing an ADtree in which, instead of performing node cutoff for the most common value, the cutoff always occurs for the value FALSE. The use of Frequent Sets in this way would thus be very similar to the use of ADtrees, with one advantage and one disadvantage. The advantage is that efficient algorithms have been developed for building Frequent Sets from a small number of sequential passes through the data. The ADtree requires random access to the dataset while it is being built, and for its leaf-lists.
This is impractical if the dataset is too large to reside in main memory and is accessed through database queries. The disadvantage of Frequent Sets in comparison with ADtrees is that, under some circumstances, the former may require much more memory. Assume the value 2 is rarer than the value 1 throughout all attributes in the dataset, and assume reasonably that we thus choose to find all Frequent Sets of 2s. Unnecessarily many sets will be produced if there are correlations. In the extreme case, imagine a dataset in which 30% of the values are 2, 70% are 1, and attributes are perfectly correlated: all values in each record are identical. Then, with M attributes there would be 2^M - 1 Frequent Sets of 2s. In contrast, the ADtree would only contain 2M + 1 nodes. This is an extreme example, but datasets with much weaker inter-attribute correlations can similarly benefit from using an ADtree. Leaf-lists are another technique to reduce the size of ADtrees further. They could also be used for the Frequent Set representation.
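The arithmetic of this extreme case can be checked with a brute-force sketch (illustrative code, not from the paper; the 30/70 dataset and the support threshold are made-up parameters):

```python
from itertools import combinations

def frequent_sets_of_2s(records, support=1):
    """Brute-force enumeration of attribute subsets that take the
    value 2 simultaneously in at least `support` records."""
    M = len(records[0])
    result = []
    for size in range(1, M + 1):
        for attrs in combinations(range(M), size):
            count = sum(all(r[a] == 2 for a in attrs) for r in records)
            if count >= support:
                result.append(attrs)
    return result

M = 10
# Perfectly correlated attributes: 30% of records are all-2s, 70% all-1s.
records = [[2] * M] * 3 + [[1] * M] * 7

print(len(frequent_sets_of_2s(records)))  # 2^M - 1 = 1023 Frequent Sets
print(2 * M + 1)                          # 21 ADtree nodes
```

Every non-empty attribute subset co-occurs with 2s in such data, so the Frequent Set count grows exponentially in M while the ADtree stays linear.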

8.3 Why not use hash tables?

If we knew that only a small set of contingency tables would ever be requested, instead of all possible contingency tables, then an ADtree would be unnecessary. It would be better to remember this small set of contingency tables explicitly. Then, some kind of tree structure could be used to index the contingency tables. But a hash table would be equally time efficient and require less space. A hash table coding of individual counts in the contingency tables would similarly allow us to use space proportional only to the number of non-zero entries in the stored tables. But for representing sufficient statistics to permit a fast solution to any contingency table request, the ADtree structure remains more memory efficient than the hash-table approach (or any method that stores all non-zero counts) because of the memory reductions when we exploit the ignoring of most common values.

9. Discussion

9.1 What about numeric attributes?

The ADtree representation is designed entirely for symbolic attributes. When faced with numeric attributes, the simplest solution is to discretize them into a fixed finite set of values which are then treated as symbols, but this is of little help if the user requests counts for queries involving inequalities on numeric attributes. In future work we will evaluate the use of structures combining elements from multiresolution kd-trees of real attributes (Moore et al., 1997) with ADtrees.

9.2 Algorithm-specific counting tricks

Many algorithms that count using the conventional "linear" method have algorithm-specific ways of accelerating their performance. For example, a Bayes net structure finder may try to remember all the contingency tables it has tried previously in case it needs to re-evaluate them. When it deletes a link, it can deduce the new contingency table from the old one without needing a linear count. In such cases, the most appropriate use of the ADtree may be as a lazy caching mechanism. At birth, the ADtree consists only of the root node.
Whenever the structure finder needs a contingency table that cannot be deduced from the current ADtree structure, the appropriate nodes of the ADtree are expanded. The ADtree then takes on the role of the algorithm-specific caching methods, while (in general) using up much less memory than if all contingency tables were remembered.

9.3 Hard to update incrementally

Although the tree can be built cheaply (see the experimental results in Section 7), and although it can be built lazily, the ADtree cannot be updated cheaply with a new record. This is because one new record may match up to 2^M nodes in the tree in the worst case.

9.4 Scaling up

The ADtree representation can be useful for datasets of the rough size and shape used in this paper. On the first datasets we have looked at, the ones described in this paper, we have

shown empirically that the sizes of the ADtrees are tractable given real, noisy data. This included one dataset with 97 attributes. It is the extent to which the attributes are skewed in their values and correlated with each other that enables the ADtree to avoid approaching its worst-case bounds. The main technical contribution of this paper is the trick that allows us to prune off most-common-values. Without it, skewedness and correlation would hardly help at all. The empirical contribution of this paper has been to show that the actual sizes of the ADtrees produced from real data are vastly smaller than the sizes we would get from the worst-case bounds in Appendix A. But despite these savings, ADtrees cannot yet represent all the sufficient statistics for huge datasets with many hundreds of non-sparse and poorly correlated attributes. What should we do if our dataset or our ADtree cannot fit into main memory? In the latter case, we could simply increase the size of leaf-lists, trading off decreased memory against increased time to build contingency tables. But if that is inadequate, at least three possibilities remain. First, we could build approximate ADtrees that do not store any information for nodes that match fewer than a threshold number of records. Then approximate contingency tables (complete with error bounds) can be produced (Mannila & Toivonen, 1996). A second possibility is to exploit secondary storage and store deep, rarely visited nodes of the ADtree on disk. This would doubtless best be achieved by integrating the machine learning algorithms with current database management tools, a topic of considerable interest in the data mining community (Fayyad et al., 1997). A third possibility, which restricts the size of contingency tables we may ask for, is to refuse to store counts for queries with more than some threshold number of attributes.

9.5 What about the cost of building the tree?

In practice, ADtrees could be used in two ways:

• One-off.
When a traditional algorithm is required we build the ADtree, run the fast version of the algorithm, discard the ADtree, and return the results.

• Amortized. When a new dataset becomes available, a new ADtree is built for it. The tree is then shipped and re-used by anyone who wishes to do real-time counting queries, multivariate graphs and charts, or any machine learning algorithms on any subset of the attributes. The cost of the initial tree building is then amortized over all the times it is used. In database terminology, the process is known as materializing (Harinarayan et al., 1996) and has been suggested as desirable for data mining by several researchers (John & Lent, 1997; Mannila & Toivonen, 1996).

The one-off option is only useful if the cost of building the ADtree plus the cost of running the ADtree-based algorithm is less than the cost of the original counting-based algorithm. For the intensive machine learning methods studied here, this condition is safely satisfied. But what if we decided to use a less intensive, greedier Bayes net structure finder? (Without pruning, on all of our datasets we ran out of memory on a 192 megabyte machine before we had built even 1% of the tree, and it is easy to show that the BIRTH dataset would have needed to store more than 10^20 nodes.)
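The one-off condition above is a simple break-even inequality; this sketch (with hypothetical costs in seconds, not measurements from the paper) shows how the intensity of the analysis decides it:

```python
def one_off_worthwhile(build_time, adtree_time_per_query,
                       regular_time_per_query, n_queries):
    """True if building the ADtree once and answering all queries with it
    is cheaper than answering them by conventional linear counting."""
    return (build_time + n_queries * adtree_time_per_query
            < n_queries * regular_time_per_query)

# An intensive search (10,000 counting queries) repays the build cost;
# a short greedy search (100 queries) does not.
print(one_off_worthwhile(60.0, 0.001, 0.1, 10_000))  # True
print(one_off_worthwhile(60.0, 0.001, 0.1, 100))     # False
```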

Table 7: Computational economics of building ADtrees and using them to search for Bayes net structures using the experiments of Section 7. [Rows for the CENSUS, ADULT, BIRTH and synthetic SYN datasets; the three columns give the speedup ignoring build-time at 10,000 iterations, the speedup allowing for build-time at 10,000 iterations, and the speedup allowing for build-time at 100 iterations. The numeric entries are not recoverable from this transcription.]

Table 7 shows that if we only run for 100 iterations instead of 10,000 (see footnote 4), and if we account for the one-off ADtree building cost, then the relative speedup of using ADtrees declines greatly. To conclude: if the data analysis is intense then there is benefit to using ADtrees even if they are used in a one-off fashion. If the ADtree is used for multiple purposes then its build-time is amortized and the resulting relative efficiency gains over traditional counting are the same for both exhaustive searches and non-exhaustive searches. Algorithms that use non-exhaustive searches include hill-climbing Bayes net learners, greedy rule learners such as CN2 (Clark & Niblett, 1989) and decision tree learners (Quinlan, 1983; Breiman et al., 1984).

4. Unsurprisingly, the resulting Bayes nets have highly inferior structure.

Acknowledgements

This work was sponsored by a National Science Foundation Career Award to Andrew Moore. The authors thank Justin Boyan, Scott Davies, Nir Friedman, and Jeff Schneider for their suggestions, and Ron Kohavi for providing the census datasets.

Appendix A: Memory Costs

In this appendix we examine the size of the tree. For simplicity, we restrict attention to the case of binary attributes.

The worst-case number of nodes in an ADtree

Given a dataset with M attributes and R records, the worst case for the ADtree will occur if all 2^M possible records exist in the dataset. Then, for every subset of attributes there exists exactly one node in the ADtree. For example, consider the attribute set {a_i(1), ..., a_i(n)}, where i(1) < i(2) < ... < i(n). Suppose there is a node in the tree corresponding to the

query {a_i(1) = v_1, ..., a_i(n) = v_n} for some values v_1, ..., v_n. From the definition of an ADtree, and remembering we are only considering the case of binary attributes, we can state:

• v_1 is the least common value of a_i(1).
• v_2 is the least common value of a_i(2) among those records that match (a_i(1) = v_1).
• ...
• v_{k+1} is the least common value of a_i(k+1) among those records that match (a_i(1) = v_1, ..., a_i(k) = v_k).

So there is at most one such node. Moreover, since our worst-case assumption is that all possible records exist in the database, we see that the ADtree will indeed contain this node. Thus, the worst-case number of nodes is the same as the number of possible subsets of attributes: 2^M.

The worst-case number of nodes in an ADtree with a reasonable number of rows

It is frequently the case that a dataset has R << 2^M. With fewer records, there is a much lower worst-case bound on the ADtree size. A node at the kth level of the tree corresponds to a query involving k attributes (counting the root node as level 0). Such a node can match at most R 2^{-k} records, because each of the node's ancestors up the tree has pruned off at least half the records by choosing to expand only the least common value of the attribute introduced by that ancestor. Thus, there can be no tree nodes at level ⌊log2 R⌋ + 1 of the tree, because such nodes would have to match fewer than R 2^{-⌊log2 R⌋ - 1} < 1 records. They would thus match no records, making them NULL. The nodes in an ADtree must all exist at level ⌊log2 R⌋ or higher. The number of nodes at level k is at most \binom{M}{k}, because every node at level k involves an attribute set of size k and because (given binary attributes) for every attribute set there is at most one node in the ADtree. Thus the total number of nodes in the tree, summing over the levels, is less than

    \sum_{k=0}^{\lfloor \log_2 R \rfloor} \binom{M}{k}
bounded above by O(M^{\lfloor \log_2 R \rfloor} / (\lfloor \log_2 R \rfloor - 1)!).

The number of nodes if we assume skewed independent attribute values

Imagine that all values of all attributes in the dataset are independent random binary variables, taking value 2 with probability p and taking value 1 with probability 1 - p. Then the further p is from 0.5, the smaller we can expect the ADtree to be. This is because, on average, the less common value of a Vary node will match a fraction min(p, 1 - p) of its parent's records. And, on average, the number of records matched at the kth level of the tree will be R (min(p, 1 - p))^k. Thus, the maximum level in the tree at which we may find a node matching one or more records is approximately ⌊(log2 R)/(-log2 q)⌋, where q = min(p, 1 - p). And so the total number of nodes in the tree is approximately

    \sum_{k=0}^{\lfloor (\log_2 R)/(-\log_2 q) \rfloor} \binom{M}{k}    (5)

bounded above by

    O(M^{\lfloor (\log_2 R)/(-\log_2 q) \rfloor} / (\lfloor (\log_2 R)/(-\log_2 q) \rfloor - 1)!)    (6)
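The gap between this bound and the 2^M all-subsets worst case is easy to evaluate numerically (a sketch of the formula above; the example M and R are arbitrary):

```python
from math import comb, floor, log2

def adtree_node_bound(M, R):
    """Worst-case ADtree node count over M binary attributes and R records:
    sum of C(M, k) for k = 0 .. floor(log2 R)."""
    return sum(comb(M, k) for k in range(floor(log2(R)) + 1))

M, R = 100, 1000                         # floor(log2(1000)) = 9
print(adtree_node_bound(M, R) < 2 ** M)  # True: far below the 2^M worst case
```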

Since the exponent is reduced by a factor of log2(1/q), skewedness among the attributes thus brings enormous savings in memory.

The number of nodes if we assume correlated attribute values

The ADtree benefits from correlations among attributes in much the same way that it benefits from skewedness. For example, suppose that each record was generated by the simple Bayes net in Figure 10, where the random variable B is hidden (not included in the record). Then for i ≠ j, P(a_i ≠ a_j) = 2p(1 - p). If ADN is any node in the resulting ADtree then the number of records matching any other node two levels below ADN in the tree will be a fraction 2p(1 - p) of the number of records matching ADN. From this we can see that the number of nodes in the tree is approximately

    \sum_{k=0}^{\lfloor (\log_2 R)/(-\log_2 q) \rfloor} \binom{M}{k}

bounded above by

    O(M^{\lfloor (\log_2 R)/(-\log_2 q) \rfloor} / (\lfloor (\log_2 R)/(-\log_2 q) \rfloor - 1)!)    (7)

where q = sqrt(2p(1 - p)). Correlation among the attributes can thus also bring enormous savings in memory even if (as is the case in our example) the marginal distribution of individual attributes is uniform.

Figure 10: A Bayes net that generates correlated boolean attributes a_1, ..., a_M. The hidden cause B has P(B) = 0.5, and each attribute has P(a_i = 2 | B) = 1 - p and P(a_i = 2 | ~B) = p.

The number of nodes for the dense ADtree of Section 2

The dense ADtrees do not cut off the tree for the most common value of a Vary node. The worst-case ADtree will occur if all 2^M possible records exist in the dataset. Then the dense ADtree will require 3^M nodes, because every possible query (with each attribute taking values 1, 2 or *) will have a count in the tree. The number of nodes at the kth level of the dense ADtree can be 2^k \binom{M}{k} in the worst case.

The number of nodes when using Leaf-lists

Leaf-lists were described in Section 5. If a tree is built using a maximum leaf-list size of R_min, then any node in the ADtree matching fewer than R_min records is a leaf node. This means that Formulae 5, 6 and 7 can be re-used, replacing R with R/R_min. It is important to remember, however, that the leaf nodes must now contain room for R_min numbers instead of a single count.
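The effect of skew and correlation on the deepest occupied level can be tabulated directly (a sketch based on the formulas above, with arbitrary R and p; q = min(p, 1 - p) for independent attributes and q = sqrt(2p(1 - p)) for the hidden-cause model of Figure 10):

```python
from math import floor, log2, sqrt

def max_level(R, q):
    """Deepest tree level that can still match at least one record:
    floor((log2 R) / (-log2 q))."""
    return floor(log2(R) / -log2(q))

R = 1_000_000
print(max_level(R, 0.5))   # 19: uniform, independent attributes
for p in (0.1, 0.01):      # correlated model, still with uniform marginals
    print(p, max_level(R, sqrt(2 * p * (1 - p))))
```

Even though each attribute's marginal stays at 0.5 in the correlated model, a small p drives the maximum level, and hence the node-count bound, sharply down.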

Appendix B: Building the ADtree

We define the function MakeADTree(a_i, RecordNums), where RecordNums is a subset of {1, 2, ..., R} (R is the total number of records in the dataset) and where 1 ≤ i ≤ M. This makes an ADtree from the rows specified in RecordNums in which all ADnodes represent queries in which only attributes a_i and higher are used.

MakeADTree(a_i, RecordNums)
  Make a new ADnode called ADN.
  ADN.COUNT := |RecordNums|
  For j := i, i + 1, ..., M
    a_j'th Vary node of ADN := MakeVaryNode(a_j, RecordNums)

MakeADTree uses the function MakeVaryNode, which we now define:

MakeVaryNode(a_i, RecordNums)
  Make a new Vary node called VN.
  For k := 1, 2, ..., n_i
    Let Childnums_k := {}
  For each j in RecordNums
    Let v_ij := value of attribute a_i in record j
    Add j to the set Childnums_{v_ij}
  Let VN.MCV := argmax_k |Childnums_k|
  For k := 1, 2, ..., n_i
    If |Childnums_k| = 0 or if k = MCV
      Set the a_i = k subtree of VN to NULL
    Else
      Set the a_i = k subtree of VN to MakeADTree(a_{i+1}, Childnums_k)

To build the entire tree, we must call MakeADTree(a_1, {1, ..., R}). Assuming binary attributes, the cost of building a tree from R records and M attributes is bounded above by

    \sum_{k=0}^{\lfloor \log_2 R \rfloor} R \, 2^{-k} \binom{M}{k}    (8)

References

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data Mining. AAAI Press.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.

Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261-284.

Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. John Wiley & Sons.

Fayyad, U., Mannila, H., & Piatetsky-Shapiro, G. (1997). Data Mining and Knowledge Discovery. Kluwer Academic Publishers. A new journal.

Fayyad, U., & Uthurusamy, R. (1996). Special issue on Data Mining. Communications of the ACM, 39 (11).

Friedman, N., & Yakhini, Z. (1996). On the sample complexity of learning Bayesian networks. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann.

Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. In Proceedings of the Third ACM SIGACT-SIGMOD Symposium on Principles of Database Systems. Association for Computing Machinery.

Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1996). Implementing data cubes efficiently. In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems: PODS 1996, pp. 205-216. Association for Computing Machinery.

John, G. H., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Cohen, W. W., & Hirsh, H. (Eds.), Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann.

John, G. H., & Lent, B. (1997). SIPping from the data firehose. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI Press.

Kohavi, R. (1995). The power of decision tables. In Lavrač, N., & Wrobel, S. (Eds.), Machine Learning: ECML-95: 8th European Conference on Machine Learning, Heraklion, Crete, Greece. Springer Verlag.

Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In E. Simoudis, J. Han, & U. Fayyad (Eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press.

Madala, H. R., & Ivakhnenko, A. G. (1994). Inductive Learning Algorithms for Complex Systems Modeling. CRC Press Inc., Boca Raton.

Mannila, H., & Toivonen, H. (1996). Multiple uses of frequent sets and condensed representations. In E. Simoudis, J. Han, & U. Fayyad (Eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press.
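For readers who prefer running code, the pseudocode of Appendix B translates into the following Python sketch (an illustration only, keeping the paper's conventions: attribute values run from 1 to n_i, NULL subtrees become None, and leaf-lists are omitted):

```python
class ADNode:
    def __init__(self, count):
        self.count = count
        self.vary = {}           # attribute index -> VaryNode

class VaryNode:
    def __init__(self):
        self.mcv = None          # most common value; its subtree is None
        self.children = {}       # value -> ADNode, or None

def make_adtree(i, record_nums, data, arities):
    """MakeADTree(a_i, RecordNums): build an ADnode over attributes i and higher."""
    adn = ADNode(len(record_nums))
    for j in range(i, len(arities)):
        adn.vary[j] = make_vary_node(j, record_nums, data, arities)
    return adn

def make_vary_node(i, record_nums, data, arities):
    """MakeVaryNode(a_i, RecordNums): split the records on attribute i."""
    vn = VaryNode()
    childnums = {k: [] for k in range(1, arities[i] + 1)}
    for r in record_nums:
        childnums[data[r][i]].append(r)
    vn.mcv = max(childnums, key=lambda k: len(childnums[k]))
    for k, rows in childnums.items():
        if not rows or k == vn.mcv:
            vn.children[k] = None        # NULL: zero count or most common value
        else:
            vn.children[k] = make_adtree(i + 1, rows, data, arities)
    return vn

# Four records over two binary attributes (values 1 and 2).
data = [[1, 1], [1, 2], [2, 1], [1, 1]]
root = make_adtree(0, list(range(len(data))), data, arities=[2, 2])
print(root.count)                      # 4
print(root.vary[0].mcv)                # 1 is most common for the first attribute
print(root.vary[0].children[2].count)  # 1 record has value 2 there
```

Note how the most-common-value subtree is never allocated: its counts can always be deduced from the parent's count and the siblings, which is the memory-saving trick the paper relies on.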


More information

LINEAR TRANSFORMATIONS AND THEIR REPRESENTING MATRICES

LINEAR TRANSFORMATIONS AND THEIR REPRESENTING MATRICES LINEAR TRANSFORMATIONS AND THEIR REPRESENTING MATRICES DAVID WEBB CONTENTS Liner trnsformtions 2 The representing mtrix of liner trnsformtion 3 3 An ppliction: reflections in the plne 6 4 The lgebr of

More information

6.2 Volumes of Revolution: The Disk Method

6.2 Volumes of Revolution: The Disk Method mth ppliction: volumes of revolution, prt ii Volumes of Revolution: The Disk Method One of the simplest pplictions of integrtion (Theorem ) nd the ccumultion process is to determine so-clled volumes of

More information

Health insurance exchanges What to expect in 2014

Health insurance exchanges What to expect in 2014 Helth insurnce exchnges Wht to expect in 2014 33096CAEENABC 02/13 The bsics of exchnges As prt of the Affordble Cre Act (ACA or helth cre reform lw), strting in 2014 ALL Americns must hve minimum mount

More information

A.7.1 Trigonometric interpretation of dot product... 324. A.7.2 Geometric interpretation of dot product... 324

A.7.1 Trigonometric interpretation of dot product... 324. A.7.2 Geometric interpretation of dot product... 324 A P P E N D I X A Vectors CONTENTS A.1 Scling vector................................................ 321 A.2 Unit or Direction vectors...................................... 321 A.3 Vector ddition.................................................

More information

piecewise Liner SLAs and Performance Timetagment

piecewise Liner SLAs and Performance Timetagment i: Incrementl Cost bsed Scheduling under Piecewise Liner SLAs Yun Chi NEC Lbortories Americ 18 N. Wolfe Rd., SW3 35 Cupertino, CA 9514, USA ychi@sv.nec lbs.com Hyun Jin Moon NEC Lbortories Americ 18 N.

More information

g(y(a), y(b)) = o, B a y(a)+b b y(b)=c, Boundary Value Problems Lecture Notes to Accompany

g(y(a), y(b)) = o, B a y(a)+b b y(b)=c, Boundary Value Problems Lecture Notes to Accompany Lecture Notes to Accompny Scientific Computing An Introductory Survey Second Edition by Michel T Heth Boundry Vlue Problems Side conditions prescribing solution or derivtive vlues t specified points required

More information

How To Study The Effects Of Music Composition On Children

How To Study The Effects Of Music Composition On Children C-crcs Cognitive - Counselling Reserch & Conference Services (eissn: 2301-2358) Volume I Effects of Music Composition Intervention on Elementry School Children b M. Hogenes, B. Vn Oers, R. F. W. Diekstr,

More information

ClearPeaks Customer Care Guide. Business as Usual (BaU) Services Peace of mind for your BI Investment

ClearPeaks Customer Care Guide. Business as Usual (BaU) Services Peace of mind for your BI Investment ClerPeks Customer Cre Guide Business s Usul (BU) Services Pece of mind for your BI Investment ClerPeks Customer Cre Business s Usul Services Tble of Contents 1. Overview...3 Benefits of Choosing ClerPeks

More information

Appendix D: Completing the Square and the Quadratic Formula. In Appendix A, two special cases of expanding brackets were considered:

Appendix D: Completing the Square and the Quadratic Formula. In Appendix A, two special cases of expanding brackets were considered: Appendi D: Completing the Squre nd the Qudrtic Formul Fctoring qudrtic epressions such s: + 6 + 8 ws one of the topics introduced in Appendi C. Fctoring qudrtic epressions is useful skill tht cn help you

More information

Algebra Review. How well do you remember your algebra?

Algebra Review. How well do you remember your algebra? Algebr Review How well do you remember your lgebr? 1 The Order of Opertions Wht do we men when we write + 4? If we multiply we get 6 nd dding 4 gives 10. But, if we dd + 4 = 7 first, then multiply by then

More information

Week 7 - Perfect Competition and Monopoly

Week 7 - Perfect Competition and Monopoly Week 7 - Perfect Competition nd Monopoly Our im here is to compre the industry-wide response to chnges in demnd nd costs by monopolized industry nd by perfectly competitive one. We distinguish between

More information

Learning to Search Better than Your Teacher

Learning to Search Better than Your Teacher Ki-Wei Chng University of Illinois t Urbn Chmpign, IL Akshy Krishnmurthy Crnegie Mellon University, Pittsburgh, PA Alekh Agrwl Microsoft Reserch, New York, NY Hl Dumé III University of Mrylnd, College

More information

Space Vector Pulse Width Modulation Based Induction Motor with V/F Control

Space Vector Pulse Width Modulation Based Induction Motor with V/F Control Interntionl Journl of Science nd Reserch (IJSR) Spce Vector Pulse Width Modultion Bsed Induction Motor with V/F Control Vikrmrjn Jmbulingm Electricl nd Electronics Engineering, VIT University, Indi Abstrct:

More information

An Undergraduate Curriculum Evaluation with the Analytic Hierarchy Process

An Undergraduate Curriculum Evaluation with the Analytic Hierarchy Process An Undergrdute Curriculum Evlution with the Anlytic Hierrchy Process Les Frir Jessic O. Mtson Jck E. Mtson Deprtment of Industril Engineering P.O. Box 870288 University of Albm Tuscloos, AL. 35487 Abstrct

More information

How fast can we sort? Sorting. Decision-tree model. Decision-tree for insertion sort Sort a 1, a 2, a 3. CS 3343 -- Spring 2009

How fast can we sort? Sorting. Decision-tree model. Decision-tree for insertion sort Sort a 1, a 2, a 3. CS 3343 -- Spring 2009 CS 4 -- Spring 2009 Sorting Crol Wenk Slides courtesy of Chrles Leiserson with smll chnges by Crol Wenk CS 4 Anlysis of Algorithms 1 How fst cn we sort? All the sorting lgorithms we hve seen so fr re comprison

More information

AntiSpyware Enterprise Module 8.5

AntiSpyware Enterprise Module 8.5 AntiSpywre Enterprise Module 8.5 Product Guide Aout the AntiSpywre Enterprise Module The McAfee AntiSpywre Enterprise Module 8.5 is n dd-on to the VirusScn Enterprise 8.5i product tht extends its ility

More information

Learner-oriented distance education supporting service system model and applied research

Learner-oriented distance education supporting service system model and applied research SHS Web of Conferences 24, 02001 (2016) DOI: 10.1051/ shsconf/20162402001 C Owned by the uthors, published by EDP Sciences, 2016 Lerner-oriented distnce eduction supporting service system model nd pplied

More information

Binary Representation of Numbers Autar Kaw

Binary Representation of Numbers Autar Kaw Binry Representtion of Numbers Autr Kw After reding this chpter, you should be ble to: 1. convert bse- rel number to its binry representtion,. convert binry number to n equivlent bse- number. In everydy

More information

9.3. The Scalar Product. Introduction. Prerequisites. Learning Outcomes

9.3. The Scalar Product. Introduction. Prerequisites. Learning Outcomes The Sclr Product 9.3 Introduction There re two kinds of multipliction involving vectors. The first is known s the sclr product or dot product. This is so-clled becuse when the sclr product of two vectors

More information

Modeling POMDPs for Generating and Simulating Stock Investment Policies

Modeling POMDPs for Generating and Simulating Stock Investment Policies Modeling POMDPs for Generting nd Simulting Stock Investment Policies Augusto Cesr Espíndol Bff UNIRIO - Dep. Informátic Aplicd Av. Psteur, 458 - Térreo Rio de Jneiro - Brzil ugusto.bff@uniriotec.br Angelo

More information

1.00/1.001 Introduction to Computers and Engineering Problem Solving Fall 2011 - Final Exam

1.00/1.001 Introduction to Computers and Engineering Problem Solving Fall 2011 - Final Exam 1./1.1 Introduction to Computers nd Engineering Problem Solving Fll 211 - Finl Exm Nme: MIT Emil: TA: Section: You hve 3 hours to complete this exm. In ll questions, you should ssume tht ll necessry pckges

More information

ENHANCING CUSTOMER EXPERIENCE THROUGH BUSINESS PROCESS IMPROVEMENT: AN APPLICATION OF THE ENHANCED CUSTOMER EXPERIENCE FRAMEWORK (ECEF)

ENHANCING CUSTOMER EXPERIENCE THROUGH BUSINESS PROCESS IMPROVEMENT: AN APPLICATION OF THE ENHANCED CUSTOMER EXPERIENCE FRAMEWORK (ECEF) ENHNCING CUSTOMER EXPERIENCE THROUGH BUSINESS PROCESS IMPROVEMENT: N PPLICTION OF THE ENHNCED CUSTOMER EXPERIENCE FRMEWORK (ECEF) G.J. Both 1, P.S. Kruger 2 & M. de Vries 3 Deprtment of Industril nd Systems

More information

4.11 Inner Product Spaces

4.11 Inner Product Spaces 314 CHAPTER 4 Vector Spces 9. A mtrix of the form 0 0 b c 0 d 0 0 e 0 f g 0 h 0 cnnot be invertible. 10. A mtrix of the form bc d e f ghi such tht e bd = 0 cnnot be invertible. 4.11 Inner Product Spces

More information

CHAPTER 11 Numerical Differentiation and Integration

CHAPTER 11 Numerical Differentiation and Integration CHAPTER 11 Numericl Differentition nd Integrtion Differentition nd integrtion re bsic mthemticl opertions with wide rnge of pplictions in mny res of science. It is therefore importnt to hve good methods

More information

belief Propgtion Lgorithm in Nd Pent Penta

belief Propgtion Lgorithm in Nd Pent Penta IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 3, MAY/JUNE 2012 375 Itertive Trust nd Reputtion Mngement Using Belief Propgtion Ermn Aydy, Student Member, IEEE, nd Frmrz Feri, Senior

More information

Lump-Sum Distributions at Job Change, p. 2

Lump-Sum Distributions at Job Change, p. 2 Jnury 2009 Vol. 30, No. 1 Lump-Sum Distributions t Job Chnge, p. 2 E X E C U T I V E S U M M A R Y Lump-Sum Distributions t Job Chnge GROWING NUMBER OF WORKERS FACED WITH ASSET DECISIONS AT JOB CHANGE:

More information

Small Businesses Decisions to Offer Health Insurance to Employees

Small Businesses Decisions to Offer Health Insurance to Employees Smll Businesses Decisions to Offer Helth Insurnce to Employees Ctherine McLughlin nd Adm Swinurn, June 2014 Employer-sponsored helth insurnce (ESI) is the dominnt source of coverge for nonelderly dults

More information

Distributions. (corresponding to the cumulative distribution function for the discrete case).

Distributions. (corresponding to the cumulative distribution function for the discrete case). Distributions Recll tht n integrble function f : R [,] such tht R f()d = is clled probbility density function (pdf). The distribution function for the pdf is given by F() = (corresponding to the cumultive

More information

Network Configuration Independence Mechanism

Network Configuration Independence Mechanism 3GPP TSG SA WG3 Security S3#19 S3-010323 3-6 July, 2001 Newbury, UK Source: Title: Document for: AT&T Wireless Network Configurtion Independence Mechnism Approvl 1 Introduction During the lst S3 meeting

More information

Unit 29: Inference for Two-Way Tables

Unit 29: Inference for Two-Way Tables Unit 29: Inference for Two-Wy Tbles Prerequisites Unit 13, Two-Wy Tbles is prerequisite for this unit. In ddition, students need some bckground in significnce tests, which ws introduced in Unit 25. Additionl

More information

JaERM Software-as-a-Solution Package

JaERM Software-as-a-Solution Package JERM Softwre-s--Solution Pckge Enterprise Risk Mngement ( ERM ) Public listed compnies nd orgnistions providing finncil services re required by Monetry Authority of Singpore ( MAS ) nd/or Singpore Stock

More information

Health insurance marketplace What to expect in 2014

Health insurance marketplace What to expect in 2014 Helth insurnce mrketplce Wht to expect in 2014 33096VAEENBVA 06/13 The bsics of the mrketplce As prt of the Affordble Cre Act (ACA or helth cre reform lw), strting in 2014 ALL Americns must hve minimum

More information

Small Business Cloud Services

Small Business Cloud Services Smll Business Cloud Services Summry. We re thick in the midst of historic se-chnge in computing. Like the emergence of personl computers, grphicl user interfces, nd mobile devices, the cloud is lredy profoundly

More information

VoIP for the Small Business

VoIP for the Small Business Reducing your telecommunictions costs Reserch firm IDC 1 hs estimted tht VoIP system cn reduce telephony-relted expenses by 30%. Voice over Internet Protocol (VoIP) hs become vible solution for even the

More information

Portfolio approach to information technology security resource allocation decisions

Portfolio approach to information technology security resource allocation decisions Portfolio pproch to informtion technology security resource lloction decisions Shivrj Knungo Deprtment of Decision Sciences The George Wshington University Wshington DC 20052 knungo@gwu.edu Abstrct This

More information

FDIC Study of Bank Overdraft Programs

FDIC Study of Bank Overdraft Programs FDIC Study of Bnk Overdrft Progrms Federl Deposit Insurnce Corportion November 2008 Executive Summry In 2006, the Federl Deposit Insurnce Corportion (FDIC) initited two-prt study to gther empiricl dt on

More information

According to Webster s, the

According to Webster s, the dt modeling Universl Dt Models nd P tterns By Len Silversn According Webster s, term universl cn be defined s generlly pplicble s well s pplying whole. There re some very common ptterns tht cn be generlly

More information

Performance Prediction of Distributed Load Balancing on Multicomputer Systems

Performance Prediction of Distributed Load Balancing on Multicomputer Systems Performnce Prediction of Distributed Lod Blncing on Multicomputer Systems Ishfq Ahmd *, Arif Ghfoor+, nd Kishn Mehrotr * * School of Computer nd Informtion Science, Syrcuse University, Syrcuse, NY 13244

More information

On the Robustness of Most Probable Explanations

On the Robustness of Most Probable Explanations On the Robustness of Most Probble Explntions Hei Chn School of Electricl Engineering nd Computer Science Oregon Stte University Corvllis, OR 97330 chnhe@eecs.oregonstte.edu Adnn Drwiche Computer Science

More information

Integration by Substitution

Integration by Substitution Integrtion by Substitution Dr. Philippe B. Lvl Kennesw Stte University August, 8 Abstrct This hndout contins mteril on very importnt integrtion method clled integrtion by substitution. Substitution is

More information

Or more simply put, when adding or subtracting quantities, their uncertainties add.

Or more simply put, when adding or subtracting quantities, their uncertainties add. Propgtion of Uncertint through Mthemticl Opertions Since the untit of interest in n eperiment is rrel otined mesuring tht untit directl, we must understnd how error propgtes when mthemticl opertions re

More information

Concept Formation Using Graph Grammars

Concept Formation Using Graph Grammars Concept Formtion Using Grph Grmmrs Istvn Jonyer, Lwrence B. Holder nd Dine J. Cook Deprtment of Computer Science nd Engineering University of Texs t Arlington Box 19015 (416 Ytes St.), Arlington, TX 76019-0015

More information

Introducing Kashef for Application Monitoring

Introducing Kashef for Application Monitoring WextWise 2010 Introducing Kshef for Appliction The Cse for Rel-time monitoring of dtcenter helth is criticl IT process serving vriety of needs. Avilbility requirements of 6 nd 7 nines of tody SOA oriented

More information

Section 7-4 Translation of Axes

Section 7-4 Translation of Axes 62 7 ADDITIONAL TOPICS IN ANALYTIC GEOMETRY Section 7-4 Trnsltion of Aes Trnsltion of Aes Stndrd Equtions of Trnslted Conics Grphing Equtions of the Form A 2 C 2 D E F 0 Finding Equtions of Conics In the

More information

VoIP for the Small Business

VoIP for the Small Business Reducing your telecommunictions costs Reserch firm IDC 1 hs estimted tht VoIP system cn reduce telephony-relted expenses by 30%. Voice over Internet Protocol (VoIP) hs become vible solution for even the

More information

Solving BAMO Problems

Solving BAMO Problems Solving BAMO Problems Tom Dvis tomrdvis@erthlink.net http://www.geometer.org/mthcircles Februry 20, 2000 Abstrct Strtegies for solving problems in the BAMO contest (the By Are Mthemticl Olympid). Only

More information

2. Transaction Cost Economics

2. Transaction Cost Economics 3 2. Trnsction Cost Economics Trnsctions Trnsctions Cn Cn Be Be Internl Internl or or Externl Externl n n Orgniztion Orgniztion Trnsctions Trnsctions occur occur whenever whenever good good or or service

More information

persons withdrawing from addiction is given by summarizing over individuals with different ages and numbers of years of addiction remaining:

persons withdrawing from addiction is given by summarizing over individuals with different ages and numbers of years of addiction remaining: COST- BENEFIT ANALYSIS OF NARCOTIC ADDICTION TREATMENT PROGRAMS with Specil Reference to Age Irving Leveson,l New York City Plnning Commission Introduction Efforts to del with consequences of poverty,

More information

VoIP for the Small Business

VoIP for the Small Business Reducing your telecommunictions costs Reserch firm IDC 1 hs estimted tht VoIP system cn reduce telephony-relted expenses by 30%. Voice over Internet Protocol (VoIP) hs become vible solution for even the

More information

I calculate the unemployment rate as (In Labor Force Employed)/In Labor Force

I calculate the unemployment rate as (In Labor Force Employed)/In Labor Force Introduction to the Prctice of Sttistics Fifth Edition Moore, McCbe Section 4.5 Homework Answers to 98, 99, 100,102, 103,105, 107, 109,110, 111, 112, 113 Working. In the lnguge of government sttistics,

More information