Load Balancing of Parallelized Information Filters

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XXX, NO. XX, XXXXXXX 2001

Neil C. Rowe, Member, IEEE Computer Society, and Amr Zaky, Member, IEEE Computer Society

Abstract: We investigate the data-parallel implementation of a set of "information filters" used to rule out uninteresting data from a database or data stream. We develop an analytic model for the costs and advantages of load rebalancing the parallel filtering processes, as well as a quick heuristic for its desirability. Our model uses binomial models of the filter processes and fits key parameters to the results of extensive simulations. Experiments confirm our model. Rebalancing should pay off whenever processor communications costs are high. Further experiments showed it can also pay off even with low communications costs for 16-64 processors and 1-10 data items per processor; then, imbalances can increase processing time by up to 52 percent in representative cases, and rebalancing actions can amount to 78 percent of the number of filtering actions, so our quick predictive model can be valuable. Results also show that our proposed heuristic rebalancing criterion gives close to optimal balancing. We also extend our model to handle variations in filter processing time per data item.

Index Terms: Information filtering, data parallelism, load balancing, information retrieval, conjunctions, optimality, and Monte Carlo methods.

1 INTRODUCTION

Suppose we have a query which is a conjunction of restrictions on data we wish to retrieve from a big database. These restrictions can be considered as "information filters" [1], or as programs that take some set of data items as input and return some small subset of "interesting" items as output. For instance, in retrieving pictures of buildings under construction, we can look in an index first for the pictures that show buildings and then exclude those that do not show construction, applying two information filters in succession. Signature methods [4] are an important subcase of information filtering; these check against hashes of information in a complex data structure. We would like to filter in the fastest way.
Rowe [8] and Jarke and Koch [5] discuss the potentially dramatic advantages of reordering the conjuncts and of inserting redundant conjuncts. But those results are primarily for sequential machines. Efficiency issues are becoming increasingly important with multimedia libraries [9]. Multimedia data can be considerably more costly to retrieve than text data; for instance, a long paper requires 10K bytes while a single television picture requires 1,000K, so data transfer is slower and slower storage media are often required. Content analysis of multimedia data can be very slow; recognizing a shape in a picture, for instance, requires complex data structures and many passes through the picture. So, it helps considerably to summarize multimedia data and index it. Information filters can then check the indexes quickly. Parallel information filtering then could speed things up considerably more. For instance, to find pictures of buildings under construction when we have 100 processors, we could have each of the 100 look in a different part of the database, or, alternatively, we could have 50 look for buildings in the database while 50 looked for things under construction.

The authors are with Code CS/Rp, Department of Computer Science, Naval Postgraduate School, Monterey, CA 93943. E-mail: {rowe, zaky}@cs.nps.navy.mil. Manuscript received 19 Sept. 1997; revised 25 May 2000; accepted 29 Aug. 2000. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number 105711.

While optimization of parallel information filtering is a scheduling problem, general scheduling methods such as [11] are too general to be efficient for it. Also not directly helpful are experiments with expensive massively-parallel machines [10] and with thousands of filters [7], since the resources required for those approaches are beyond the capabilities of most computer facilities and workstations, precluding the real-time application of the methods there.
Instead, we will assume fewer than 100 filters and easily available multiprocessor software, and we will exploit the problem-specific feature that each filter in a filtering sequence has fewer data items to work on. Section 2 of this paper proves, by a theorem, that we can narrow our investigation to data parallelism. Section 3 introduces the statistical metrics we will need to assess the benefits and costs of load balancing and develops quick approximations of them. Section 4 discusses the obvious cases for load balancing in data-parallel filtering and then develops a simple and quick criterion for when it is desirable. Section 5 extends these results into an algorithm for deciding where to balance loads in a long sequence of filters. Section 6 reports on a wide range of simulation experiments we conducted; Section 7 reports on more focused experiments about an IRIS multiprocessor. All experiments supported our analytic models. Section 8 extends the results of Sections 3 and 4 to the case in which filter cost per data item can vary.

2 ASSUMPTIONS

Suppose data items must pass through $m$ information filters; that is, a data item must separately pass the test administered by each filter if that data item is to be part of the query answer. Assume that the event of a data item passing a test is independent of the event of any other data item passing any test. Assume that each filter $i$ has an average cost of execution per data item of $c_i$ and an a priori probability of passing a random data item of $p_i$, where $0 < p_i < 1$ to avoid considering trivial cases. Let $p_{i,j}$ be the probability of passing filters $i$ through $j$. Assume further that times are independent of probabilities, or that the time to test whether an item passes a filter is independent of the success or failure of the test or any other test on that data item; this is true of testing of uniform-size data by hash table lookups in signature files, for instance.
If the filters are applied in sequence, the expected total time per data item of passing filters 1 through $m$ will be:

$$c_{1,m} = c_1 + c_2 p_{1,1} + c_3 p_{1,2} + \cdots + c_m p_{1,m-1}. \tag{1}$$

The time and probability parameters can be estimated either from past statistics of the filters on similar problems, or by applying the filters to a small random sample of the database.

Now, suppose we have $R$ identical processors available on a single-user multiprocessor. We can show that it never makes sense to execute different information filters in parallel given the reasonable assumption that initialization time is proportional to the number of processors. That is because data parallelism, where all processors work on the same filter at the same time, is proven to be better.

Data-Parallelism Theorem. Suppose the execution time of two information filters in parallel (functional parallelism) for a set of data items is $kR + \max(c_1/r_1,\; c_2/r_2)$, where $k \ge 0$ is a constant, $R$ is the number of available identical processors, $r_i$ is the number of processors allocated to filter $i$, and $c_i$ is the total execution time cost for filter $i$. Then, functional parallelism is always slower than data parallelism for those processors.

Proof. The time for data parallelism (using all $R$ processors for one filter, then all $R$ for the other filter) is $kR + c_1/R + p_1 c_2/R$ if filter 1 goes first, and $kR + c_2/R + p_2 c_1/R$ if filter 2 goes first, where $p_i$ is the probability of success of filter $i$. The time for functional parallelism is $kR + \max(c_1/r_1,\; c_2/(R - r_1))$, assuming negligible time to intersect the results of the two filters. The minimum of the latter with respect to $r_1$ occurs with perfect load balancing (if possible) of the two processor groups, or when $c_1/r_1 = c_2/(R - r_1)$, or when $r_1 = R c_1/(c_1 + c_2)$. Then, the time becomes $kR + (c_1 + c_2)/R = kR + c_1/R + c_2/R$. Comparing terms with the formula for data parallelism with filter 1 first, we see that $kR = kR$, $c_1/R = c_1/R$, and $c_2/R > p_1 c_2/R$. Comparing terms with the formula for data parallelism with filter 2 first, we see that $kR = kR$, $c_2/R = c_2/R$, and $c_1/R > p_2 c_1/R$. Hence, functional parallelism is always slower than data parallelism even with our conservative assumptions and no matter what $R$, $c_1$, $c_2$, and $k$ are (and even if $k = 0$). $\square$

1041-4347/01/$10.00 © 2001 IEEE

Thus, data parallelism is the best way to exploit $R$ identical processors for filtering. For maximum advantage, we should assign data items randomly to processors to prevent related items from ending up on the same processor and affecting its rate of filtering success. We should also initially assign equal numbers of data items to each processor if processing time per data item is either constant or unpredictable. Data items should remain with their processors for the second and subsequent filtering actions if necessary, so no data needs to be transferred between processors, and a processor that finishes early on a filter can start early on the next. But, if the number of remaining data items per processor can vary significantly over processors after a sequence of filtering actions, data transfers might significantly reduce total processing time.

General load-balancing algorithms for data-parallel computations (like [6] using run-time decision-theoretic analysis and [3] using run-time monitors of processor performance) could be used, but information filtering is a special case for which special techniques should work better.
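The expected-cost formula (1) and the two execution-time expressions compared in the Data-Parallelism Theorem can be checked numerically. The following is a minimal sketch, assuming illustrative function names that do not come from the paper and treating the processor split $r_1$ as a real number, as the proof does.

```python
from itertools import accumulate
from operator import mul


def expected_cost_per_item(costs, probs):
    """Equation (1): c_{1,m} = c_1 + c_2 p_{1,1} + ... + c_m p_{1,m-1}."""
    # reach[i] is the probability that an item survives every filter before filter i+1.
    reach = [1.0] + list(accumulate(probs, mul))[:-1]
    return sum(c * r for c, r in zip(costs, reach))


def data_parallel_time(c1, c2, p1, R, k=0.0):
    """All R processors apply filter 1, then all R apply filter 2 to the survivors."""
    return k * R + c1 / R + p1 * c2 / R


def functional_parallel_time(c1, c2, R, k=0.0):
    """Functional parallelism at its best split r1 = R c1 / (c1 + c2)."""
    r1 = R * c1 / (c1 + c2)
    return k * R + max(c1 / r1, c2 / (R - r1))


if __name__ == "__main__":
    print(expected_cost_per_item([10, 30, 8], [0.1, 0.05, 0.5]))
    print(data_parallel_time(100, 200, 0.1, 16), functional_parallel_time(100, 200, 16))
```

Here `c1` and `c2` are total costs over the whole data set, as in the theorem, while `expected_cost_per_item` works per data item as in (1).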
Filtering is a chain of simple linearly-dependent actions, and we can randomly assign data to processors. This means that associated random variables will fit well to classic distributions like the binomial and normal, unlike many applications of data parallelism.

3 ESTIMATING THE DEGREE OF LOAD IMBALANCE

We need to model the expected load imbalance after applying filters on multiple identical data-parallel processors, or the degree to which some processors will have more data items to work on than others even when they start with the same number. The event of a random data item passing filters 1 through $i$ is a binomial process. For $n$ data items initially supplied to each processor (for $Rn = N$ total data items), the expected mean number of data items resulting from filter $i$ on a processor will be $n p_{1,i}$ and the standard deviation about that mean will be $\sigma_i = \sqrt{n p_{1,i} (1 - p_{1,i})}$. The $p_{1,i}$ decreases monotonically with $i$, so the mean number of data items remaining will decrease, as will the standard deviation once $p_{1,i}$ falls below 0.5; but, generally, filter time $c_i$ increases in a good sequence order [8] (quick filters should generally be first to reduce work faster), so it is not easy to guess the best places to rebalance.

We assume for now that filtering time is the same for every data item, as with common filtering methods like lookups in hash tables. Thus, an extra $k$ data items, for a filter that requires time of $c$ per data item, mean an extra $kc$ in processing time. But Section 8 will provide some results when filtering time can vary.

Two statistics which are helpful for analyzing imbalance are the excess number of data items supplied to the most overburdened of $R$ processors (which measures the benefit of rebalancing), and the minimum number of data items that must be transferred among $R$ processors to rebalance them (which measures the cost of rebalancing).
Mathematically, these are, respectively, the difference between the maximum and mean of $R$ random variables drawn on a binomial distribution (what we will call "maxdev") and the average of the absolute values of the deviations of the same $R$ variables about their mean (what we will call "absdev") times $R$. "Maxdev" is usually the $L_\infty$ norm used in numerical analysis [2] (since our distributions usually skew toward the maximum) and "absdev" is the $L_1$ norm. Published analyses of these norms usually assume a normal distribution, not a good approximation for a binomial distribution in this application where both the number of data items and the probability of filter success can be very small.

So, we pragmatically used "Monte Carlo" techniques (experiments with random numbers) and regression on the results of the experiments to get approximate formulas relating "maxdev" and "absdev" to the standard deviation of the binomial distribution since, intuitively, both should be proportional to the standard deviation. We generated sets of 8,192 random variables on binomial distributions with probabilities $p_{1,i} = 0.01 \cdot 2^k$ for $k = -4$ to $6$ and with universe sizes $n = 2^j$ for $j = 2$ to $13$. Then, we computed "maxdev" and "absdev" taking those variable values as the number of data items on $R$ processors, $R = 2^l$ for $l = 2$ to $9$. Since magnitudes of these parameters vary widely, it makes sense to take their logarithms before doing a regression. The best models found corresponded to the equations:

$$\mathrm{maxdev}_i = 0.9756\, \sigma_i^{0.8322} R^{0.2353} n^{0.0307}, \tag{2}$$

$$\mathrm{absdev}_i = 0.7039\, \sigma_i^{1.1007}, \tag{3}$$

where $\sigma_i$ is the observed standard deviation of the number of items remaining after filter $i$ has been applied; its theoretical value is $\sqrt{n_b\, p_{b,i} (1 - p_{b,i})}$, where $b$ is the number of the most recent previous filter before which the data items were balanced. The coefficient of determination (the fraction of variance explained by the fit) was 0.983 for "maxdev" and 0.981 for "absdev." We investigated the addition of a number of other terms to these models, but none added anything statistically significant.
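Equations (2) and (3) are simple enough to evaluate directly and to compare against a small simulation. The sketch below is illustrative only: the function names are assumptions, and the simulation uses far fewer trials than the 8,192-variable sets described above, so it gives only a rough check of the fitted formulas.

```python
import math
import random


def sigma_binomial(n, p):
    """Standard deviation of the number of survivors out of n items."""
    return math.sqrt(n * p * (1.0 - p))


def maxdev_fit(sigma, R, n):
    """Equation (2): expected excess on the most overburdened processor."""
    return 0.9756 * sigma**0.8322 * R**0.2353 * n**0.0307


def absdev_fit(sigma):
    """Equation (3): expected absolute deviation per processor."""
    return 0.7039 * sigma**1.1007


def simulate(n, p, R, trials=200, seed=0):
    """Monte Carlo estimate of (maxdev, absdev) for R processors with n items each."""
    rng = random.Random(seed)
    maxdev_sum = absdev_sum = 0.0
    for _ in range(trials):
        # Each processor keeps an item with probability p, independently.
        counts = [sum(rng.random() < p for _ in range(n)) for _ in range(R)]
        mean = sum(counts) / R
        maxdev_sum += max(counts) - mean
        absdev_sum += sum(abs(c - mean) for c in counts) / R
    return maxdev_sum / trials, absdev_sum / trials


if __name__ == "__main__":
    s = sigma_binomial(500, 0.1)
    print("fitted:   ", maxdev_fit(s, 16, 500), absdev_fit(s))
    print("simulated:", simulate(500, 0.1, 16))
```

The fitted and simulated values should be of the same order; exact agreement is not expected because the regression was fitted across a wide range of $n$, $R$, and probabilities.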
Note the closeness of the exponents of $\sigma_i$ to 1, which says these statistics are closely connected to the standard deviation.

As an example, suppose we have 8,000 data items, 16 identical processors, and a filter with success probability $p_1$ (probability of a random data item passing the filter) of 0.1. Then, 500 data items should be initially assigned to each processor. An expected number of 50 data items per processor should remain after applying the filter, with a standard deviation over processors in the number of data items remaining of

$$\sigma_1 = \sqrt{500 \cdot 0.1 \cdot 0.9} = 6.708.$$

Then, our formulas say

$$\mathrm{maxdev}_1 = 0.9756 \cdot 6.708^{0.8322} \cdot 16^{0.2353} \cdot 500^{0.0307} = 11.050$$

and

$$\mathrm{absdev}_1 = 0.7039 \cdot 6.708^{1.1007} = 5.719.$$

This says that, on the average, the most overburdened of the 16 processors after the first filter will have 11.05 more data items than the average processor; and the average processor will need to send or receive 5.719 data items in order to rebalance after the first filter.

4 EVALUATING THE BENEFITS AND COSTS OF THE LOAD BALANCING OF FILTER PROCESSORS

One way to rebalance is incremental: Have a "control" processor or shared memory keep lists of data items that need to be filtered, one for each filter plus a final-result list. When a filter processor becomes idle, it asks the control processor to send it some item from the list for filter $i$ for some $i$, and the control processor deletes the item from the list. The filter processor applies filter $i$ to the item. If the item passes, the processor returns it to the control processor to be added to the list for filter $i+1$. Initially, all lists are empty except the list for filter 1, which should contain all data items; work terminates when all lists are empty except possibly the final-result list.

This method is simple and prevents any significant imbalance. Unfortunately, it entails considerable message traffic and resource contention for the control processor since each filtering action on each data item requires messages to and from the control processor. This is intolerable in the many important applications where communication is costly, as for filters on different Internet sites or for remote environmental sensors distributed over a terrain. This can also be intolerable with lower communication costs and a large number of data items or processors, especially if filtering requires as much information about the data item as with the multimedia data items that are our chief application interest. A fix might be to store on each filtering processor all the information about every data item, so that only item pointers need be transferred to and from the control processor. But that could entail impossible memory requirements in each processor, as for millions of multimedia data items.

In general, some load balancing will usually be desirable whenever communications costs between filters are large and some filters are costly. That does not mean, however, balancing before every filter. The imbalance may not be large enough, nor the expected cost improvement. And judicious balancing can still be desirable even with low communication costs. So, we need to analyze a sequence of filters carefully to find good balancing opportunities.
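The message traffic of the incremental scheme described at the start of this section can be made concrete with a small single-threaded simulation. This is a sketch under stated assumptions: the function name is hypothetical, filters are modeled as Python predicates, and the counting convention (one message to deliver an item, one more to return an item that passes) is an assumption chosen for illustration.

```python
from collections import deque


def incremental_filtering(items, filters):
    """Simulate the control-processor scheme; return (final results, message count).

    lists[i] holds the items still waiting for filter i (0-based).
    Assumes at least one filter.
    """
    m = len(filters)
    lists = [deque() for _ in range(m)]
    lists[0].extend(items)          # initially only the first list is nonempty
    final, messages = [], 0
    while any(lists):               # stop when every per-filter list is empty
        # An idle filter processor is handed an item from some nonempty list.
        i = next(idx for idx, pending in enumerate(lists) if pending)
        item = lists[i].popleft()
        messages += 1               # control processor -> filter processor
        if filters[i](item):
            messages += 1           # passing item goes back to the control processor
            (lists[i + 1] if i + 1 < m else final).append(item)
    return final, messages


if __name__ == "__main__":
    result, count = incremental_filtering(
        range(100), [lambda x: x % 2 == 0, lambda x: x % 3 == 0]
    )
    print(len(result), "items pass;", count, "messages")
```

Every filtering action costs at least one message, so the traffic grows with the total number of filter applications rather than with the degree of imbalance, which is the drawback noted above.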
We explore the following efficient rebalancing method using some ideas from Section 6.2 of [11]. Assume that each process, when completing filter $i$, reports its number of data items remaining to a control processor. (Centralized control is more efficient than rebalancing by messages between pairs of processors since obtaining perfect balance requires information from all processors.) When the control processor receives messages from all $R$ processors for filter $i$, and if the difference between the number of data items on the busiest and least-busy processors is large enough, it sends orders to the processors to transfer certain numbers of data items. These transfers need not slow processing much if processors have already begun filter $i+1$ because the transferred items can be done last when executing filter $i+1$.

Perfect balance is impossible when the number of data items is not a multiple of $R$. But, if each processor timestamps information sent to the central processor, compensation can be done later by giving earlier finishing processors additional items to work on. For instance, with three processors and 28 data items, one processor must initially take 10 items instead of nine. Suppose the next filter passes six of the 10, four of the nine, and four of the other nine. If each filter takes the same amount of time per data item, transfer one of the six to each of the other two processors and this should compensate.

Now, let $t_s$ be the time for a processor to report its set size; let $t_b$ be the time for a processor to receive, decipher, and initialize a rebalancing order from the control processor; let $t_t$ be the time per data item for a processor to transmit or receive a single data item to or from another processor; and let $c_{i+1,m}$ be the expected time per data item of executing the remaining filters $i+1$ through $m$, using the notation of (1). Also, let $m$ be the number of filters, $n$ the initial number of data items on each processor, and $R$ the number of processors.
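The control-processor step of the centralized scheme above (collect the reported counts, then issue transfer orders) might be sketched as follows. The function name and the order format are assumptions for illustration. The sketch balances item counts only; it does not implement the timestamp-based compensation described above, so on the three-processor example it moves a single item rather than two.

```python
def plan_transfers(counts):
    """Return (source, destination, k) orders that balance the reported counts.

    After the orders are applied, processor loads differ by at most one item.
    """
    R = len(counts)
    base, extra = divmod(sum(counts), R)
    # Let the currently largest processors keep the leftover items,
    # which keeps the number of transferred items small.
    order = sorted(range(R), key=lambda i: counts[i], reverse=True)
    target = [base] * R
    for i in order[:extra]:
        target[i] += 1
    surplus = [[i, counts[i] - target[i]] for i in range(R) if counts[i] > target[i]]
    deficit = [[i, target[i] - counts[i]] for i in range(R) if counts[i] < target[i]]
    orders = []
    # Total surplus equals total deficit, so both lists empty out together.
    while surplus and deficit:
        k = min(surplus[-1][1], deficit[-1][1])
        orders.append((surplus[-1][0], deficit[-1][0], k))
        surplus[-1][1] -= k
        deficit[-1][1] -= k
        if surplus[-1][1] == 0:
            surplus.pop()
        if deficit[-1][1] == 0:
            deficit.pop()
    return orders


if __name__ == "__main__":
    print(plan_transfers([6, 4, 4]))
    print(plan_transfers([61, 40, 52, 47]))
```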
Perfect load rebalancing just after filter $i$ will have a time advantage of:

$$D_i = \left[c_{i+1,m}\, \mathrm{maxdev}_i\right] - \left[t_s + t_b + t_t R\, \mathrm{absdev}_i\right]. \tag{4}$$

(This conservatively assumes the worst case of a bus architecture, where all processors examine all $R\,\mathrm{absdev}_i$ transferred data items; but, with a star processor architecture for instance, where every processor has a unique connection to every other processor, the most unbalanced processor determines rebalancing time, and $R\,\mathrm{absdev}_i$ can be replaced by $\mathrm{maxdev}_i$.)

As an example, take the one in the last section where we have 8,000 data items, 16 processors, 500 items initially assigned to each processor, and a filter with success probability $p_1 = 0.1$. There we found $\mathrm{maxdev}_1 = 11.050$ and $\mathrm{absdev}_1 = 5.719$. Assume the time cost of the remaining filters per data item is $c_{2,m} = 10$ and $t_s = t_b = t_t = 1$. Then,

$$D_1 = 10 \cdot 11.05 - (1 + 1 + 1 \cdot 16 \cdot 5.719) = 16.996,$$

which says rebalancing is advantageous.

In general, we can substitute (2) and (3) into (4) to obtain:

$$D_i = c_{i+1,m}\left[0.9756\, \sigma_i^{0.8322} R^{0.2353} n^{0.0307}\right] - \left[t_s + t_b + R\, t_t \left[0.7039\, \sigma_i^{1.1007}\right]\right], \tag{5}$$

where $\sigma_i$ is the standard deviation of the number of data items per processor after filter $i$. Rebalancing will only be desirable if this is positive, or if:

$$c_{i+1,m} > 1.0250\,(t_s + t_b)\left[\sigma_i^{-0.8322} R^{-0.2353} n^{-0.0307}\right] + 0.7215\, t_t \left[\sigma_i^{0.2685} R^{0.7647} n^{-0.0307}\right]. \tag{6}$$

These formulae hold for any number of filters $m$, even if $m < R$ for $R$ the number of processors. They also hold for a total number of data items less than $R$. However, in the latter case, the speedup obtained by parallelism for filter $i$ is limited to the number of data items as input to $i$, although some partial compensation can be obtained with the timestamp approach mentioned above.

5 AN ALGORITHM FOR LOAD REBALANCING

More than one rebalancing may help in a sequence of filters. But rebalancing reduces the need for subsequent rebalances in the filter sequence.
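The time advantage of (5) and the break-even condition of (6) from Section 4 can be computed in a few lines. This is a minimal sketch with assumed function names; `breakeven_cost` divides the rebalancing overhead by maxdev, which is algebraically the right-hand side of (6).

```python
import math


def time_advantage(c_rest, sigma, R, n, t_s, t_b, t_t):
    """Equation (5): predicted time saved by rebalancing just after filter i.

    c_rest is c_{i+1,m}, the expected cost per item of the remaining filters.
    """
    maxdev = 0.9756 * sigma**0.8322 * R**0.2353 * n**0.0307   # equation (2)
    absdev = 0.7039 * sigma**1.1007                            # equation (3)
    return c_rest * maxdev - (t_s + t_b + t_t * R * absdev)


def breakeven_cost(sigma, R, n, t_s, t_b, t_t):
    """Smallest c_{i+1,m} for which rebalancing pays off (criterion (6))."""
    maxdev = 0.9756 * sigma**0.8322 * R**0.2353 * n**0.0307
    absdev = 0.7039 * sigma**1.1007
    return (t_s + t_b + t_t * R * absdev) / maxdev


if __name__ == "__main__":
    sigma_1 = math.sqrt(500 * 0.1 * 0.9)
    print(time_advantage(10, sigma_1, 16, 500, 1, 1, 1))
    print(breakeven_cost(sigma_1, 16, 500, 1, 1, 1))
```

With the values of the Section 4 example, the first result agrees with $D_1 = 16.996$ up to the rounding of maxdev and absdev in the text.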
In particular, if rebalancing is done just after filter $i$, the standard deviation of the number of items remaining after a later filter $j$ on a processor changes from $\sqrt{n p_{1,j} (1 - p_{1,j})}$ to

$$\sqrt{n p_{1,i}\, (p_{1,j}/p_{1,i}) \left(1 - p_{1,j}/p_{1,i}\right)} = \sqrt{n p_{1,j} \left(1 - p_{1,j}/p_{1,i}\right)}.$$

Since often this change has little effect, we can use a heuristic "greedy" algorithm to find the best rebalancing plan for a filter sequence:

1. Compute $D_i$ for every filter $0 < i < m$ from (5).
2. If the maximum unused $D_i$ is nonpositive, stop.
3. Else find $D_j$, the maximum unused $D_i$; mark it as used, and make a note to rebalance after filter $j$.
4. Recompute $D_i$ from (5) for all $i > j$ and go to Step 2.

This algorithm is $O(m^2)$, where $m$ is the number of filters.

As an example, suppose we have three filters in this order: filter 1 which requires 10 time units and has a success probability 0.1, filter 2 which requires 30 and has a success probability 0.05, and filter 3 which requires eight and has a success probability of 0.5. Hence, $c_{2,3} = 30 + 0.05 \cdot 8 = 30.4$ and $c_{3,3} = c_3 = 8$. Assume 8,000 initial data items assigned to 16 processors, so $n_0 = 500$, $n_1 = 50$, and $n_2 = 2.5$. Then:

$$\sigma_1 = \sqrt{500 \cdot 0.1 \cdot 0.9} = 6.708, \qquad \sigma_2 = \sqrt{500 \cdot 0.005 \cdot 0.995} = 1.577.$$

Assume also that $t_s = 0$, $t_b = 0$, and $t_t = 1$. Then:

$$D_1 = 30.4 \cdot 0.9756 \cdot 6.708^{0.8322} \cdot 16^{0.2353} \cdot 500^{0.0307} - 16 \cdot 1 \cdot 0.7039 \cdot 6.708^{1.1007} = 205.9$$

$$D_2 = 8 \cdot 0.9756 \cdot 1.577^{0.8322} \cdot 16^{0.2353} \cdot 500^{0.0307} - 16 \cdot 1 \cdot 0.7039 \cdot 1.577^{1.1007} = 0.079.$$

Hence, the best place to rebalance is after filter 1. If we do that, $\sigma_2$ is changed to $\sqrt{50 \cdot 0.05 \cdot 0.95} = 1.541$ and $D_2$ to:

$$D_2 = 8 \cdot 0.9756 \cdot 1.541^{0.8322} \cdot 16^{0.2353} \cdot 500^{0.0307} - 16 \cdot 1 \cdot 0.7039 \cdot 1.541^{1.1007} = -1.536.$$

So, we should rebalance only after filter 1 and not after filter 2.

The algorithm tries to anticipate the need to balance loads, but rebalancing can also be done at execution time when set sizes are unexpectedly large. To do this, use (4) to compute $D_i$ using the actual processor allocations for "maxdev" and "absdev"; rebalance if $D_i > 0$. This entails a usually negligible overhead of $t_s$ per filter.

TABLE 1. Example Results of the First Experiment

6 EXPERIMENTS ON THE DEGREE OF LOAD IMBALANCE IN SIMULATED FILTERING

We first generated 5,000 random filter sequences (created by the program in [8]) to study average-case performance of our methods. The program assigned to filter $i$ a random $p_i$ on the range 0.001 to 0.999. In one set of experiments, filter times were initially evenly distributed from 0 to 10; in another set, logarithms of times were evenly distributed from 0.01 to 0.99. But, for both, we occasionally increased the time of later filters as necessary to ensure that no filter took less than a preceding filter that it logically entailed. Since optimizing the filter sequence is important for realistic balancing analysis, we found the locally-optimal sequence using the heuristic methods of [8], methods which we showed there to almost always find the globally optimal sequence.
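To experiment with randomly generated sequences of this kind, the greedy plan of Section 5 can be coded compactly. The sketch below rests on several assumptions that are not from the paper: all function names are hypothetical, the sequence is ordered by increasing $c_i/(1 - p_i)$ as a simple stand-in for the heuristic methods of [8], the entailment adjustment of filter times is omitted, and $n$ in (5) is kept at its initial value after a rebalance.

```python
import math
import random


def random_sequence(m, rng):
    """Draw m filters: p_i uniform on [0.001, 0.999], time uniform on [0, 10]."""
    filters = [(rng.uniform(0.0, 10.0), rng.uniform(0.001, 0.999)) for _ in range(m)]
    # Stand-in ordering (an assumption, not the method of [8]).
    filters.sort(key=lambda cp: cp[0] / (1.0 - cp[1]))
    return [c for c, _ in filters], [p for _, p in filters]


def tail_cost(costs, probs, i):
    """c_{i+1,m} in the notation of (1): expected cost per item of filters i+1..m."""
    total, reach = 0.0, 1.0
    for c, p in zip(costs[i:], probs[i:]):
        total += c * reach
        reach *= p
    return total


def advantage(c_rest, sigma, R, n, t_s, t_b, t_t):
    """Equation (5)."""
    maxdev = 0.9756 * sigma**0.8322 * R**0.2353 * n**0.0307
    absdev = 0.7039 * sigma**1.1007
    return c_rest * maxdev - (t_s + t_b + t_t * R * absdev)


def greedy_plan(costs, probs, n, R, t_s, t_b, t_t):
    """Return the filters (1-based) after which the greedy heuristic rebalances."""
    m = len(costs)
    cum = [1.0]                          # cum[i] = p_{1,i}
    for p in probs:
        cum.append(cum[-1] * p)
    last = [0] * m                       # last[i]: latest balance point at or before filter i

    def d(i):
        q = cum[i] / cum[last[i]]        # chance of passing the filters since the last balance
        sigma = math.sqrt(n * cum[last[i]] * q * (1.0 - q))
        return advantage(tail_cost(costs, probs, i), sigma, R, n, t_s, t_b, t_t)

    unused = {i: d(i) for i in range(1, m)}        # Step 1
    plan = []
    while unused:
        j = max(unused, key=unused.get)
        if unused[j] <= 0:                         # Step 2
            break
        del unused[j]                              # Step 3
        plan.append(j)
        for i in unused:                           # Step 4
            if i > j:
                last[i] = max(last[i], j)
                unused[i] = d(i)
    return sorted(plan)


if __name__ == "__main__":
    costs, probs = random_sequence(6, random.Random(42))
    print(greedy_plan(costs, probs, n=64, R=32, t_s=1.0, t_b=1.0, t_t=0.5))
```

Each pass of the loop recomputes at most $m$ values of $D_i$, and there are at most $m - 1$ passes, which matches the $O(m^2)$ bound stated in Section 5.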
We then computed two ratios for this sequence: $r_b$, the ratio of the extra time incurred by never rebalancing the expected imbalances in the filter sequence to $n c_{1,m}$, the time if no imbalances ever occurred; and $r_c$, the ratio of the expected number of data items that must be shifted to always maintain perfect balance to the filter processing load $mn$, the product of the initial number of data items and the number of filters. So, $r_b$ measures the relative benefit of rebalancing and $r_c$ measures the relative cost of rebalancing; both are dimensionless quantities. We can rewrite (4) using them:

$$D = \left[c_{1,m}\, n\, r_b\right] - \left[t_s + t_b + t_t\, r_c\, m\, n\right].$$

Table 1 shows some representative data for $r_b$ and $r_c$, where the four right columns represent geometric means of $r_b$ and $r_c$ over the 5,000 points. We can make three observations from Table 1. First, imbalance can increase processing time by up to 52 percent in these cases, and rebalancing actions can amount to 78 percent of the number of filtering actions, so imbalance analysis can be important. Second, rebalancing is more desirable with a uniform distribution of the logarithms of the times; that suggests that more uneven time distributions will show even more advantage. Third, rebalancing is most helpful when the number of data items is near to or less than the number of processors. Such cases are not trivial, however; important applications occur with small numbers of items and costly filters, as with the natural-language and image-processing filters in [8].

Our experiments can be summarized with regression formulae predicting $r_b$ and $r_c$ from the number of filters $m$, the logarithm of the number of processors $\log R$, and the logarithm of the number of starting data items $\log n$. These formulae permit us to quickly judge when load balancing by the algorithm is worthwhile using $r_b$ and $r_c$. The best models we obtained on 1,152 data points (with coefficients of determination of 0.988 and 0.959, respectively) were polynomials in $m$, $\log R$, and $\log n$. The operators joining the terms are not legible in this copy, so each coefficient is listed by magnitude against its term:

| Term | $r_b$ coefficient magnitude | $r_c$ coefficient magnitude |
|---|---|---|
| constant | 0.0379 | 1.1840 |
| $m$ | 0.2967 | 0.8502 |
| $\log R$ | 0.2304 | 0.4592 |
| $\log n$ | 0.2442 | 0.9494 |
| $m \log R$ | 0.0062 | 0.2184 |
| $m \log n$ | 0.0622 | 0.0185 |
| $\log R \log n$ | 0.0678 | 0.2707 |
| $m^2$ | 0.0059 | 0.1774 |
| $\log^2 R$ | 0.0112 | 0.2316 |
| $\log^2 n$ | 0.0626 | 0.1866 |
| $m^3$ | (no term) | 0.0075 |
| $\log^3 R$ | 0.0003 | 0.0642 |
| $\log^3 n$ | 0.0037 | 0.0131 |
| $m^2 \log R$ | (no term) | 0.0121 |
| $m \log^2 R$ | 0.006 | 0.0123 |
| $m^2 \log n$ | 0.0009 | 0.0081 |
| $m \log^2 n$ | 0.0028 | 0.0047 |
| $\log^2 R \log n$ | 0.0021 | 0.0950 |
| $\log R \log^2 n$ | 0.0047 | 0.0539 |

To test the importance of using the optimal filter sequence, we also ran experiments using the reverse of the optimal filter order and logarithmic times. This did improve the degree of balance since $r_b$ decreased consistently to about 0.2 of its previous value, while $r_c$ only increased to 1.2 of its previous value. But the time of executing the filter sequence increased on average by a factor of 7.0 for 5 filters and 24.9 for 10 filters, which seems a poor bargain.

7 EXPERIMENTS FOR A PARTICULAR PARALLEL MACHINE

We also studied a real multiprocessor, the IRIS 3D/340, to see the effects of modest overheads on parallelism. (Again, many applications do not have such low overheads.)
We did a filtering implementation with a master thread for the control processor which forked threads for the filtering processors. When a thread finished applying a filter to its share of data, it proceeded to the next filter, except if it had received an order directing it to suspend execution. Then, the master thread assessed workloads, rebalanced them, and restarted the threads. Our runs provided a least-squares estimate of $t_s + t_b = 2.65R + 16.5$ microseconds, $R$ being the number of processors; the rebalancing cost factor, $t_t$, was observed to be negligible for this shared-memory machine. (The first fork is much more time-consuming than subsequent forks, but represents constant overhead for all execution plans.)

Real-time filtering on the IRIS encountered performance variations due to uncontrollable operating system behavior and the presence of other users on the machines' network. So, we opted to simulate the machine with parameters derived from the runs. Instead of random filter sequences unlikely to occur in applications, we used filter sequences like those found to be optimal in [8]. So, we assumed the "work-spread" between filters, $c_{i+1}\, p_{1,i+1} / (c_i\, p_{1,i})$ using the notation of Section 2, was approximately constant. Thus, work spread was a fourth input in these experiments besides the three of Table 1. For output, we measured sequence-time speedup, the time when using one processor divided by the time when using all processors. We compared three approaches: 1) no rebalancing, 2) heuristic rebalancing using the algorithm of Section 5, and 3) optimal rebalancing, trying all $2^{m-1}$ ways of rebalancing and choosing the one of minimum time as predicted from (1) and (5).

Fig. 1. An example of our results for 32 processors and four independent filters.

Our experiments showed the work-spread parameter was most important in affecting balancing, and higher spreads caused higher imbalances.
We also found 1) when more than 100 items were assigned to a processor, imbalance never ran more than 10 percent over optimum time; 2) otherwise, rebalancing was unnecessary if work-spread < 1.5; and 3) the need to rebalance was not significantly related to the number of filters.

We also identified upper bounds on the benefit of rebalancing by analyzing filter sequences with optimal rebalancing assuming $t_s = t_b = t_t = 0$. We then focused on regions of the search space where the speed without rebalancing fell more than 10 percent short of the optimum. Confirming Table 1, imbalance was more significant when the number of items per processor was small and when the number of processors was large. Fig. 1 shows an example of our results for 32 processors and four independent filters and displays the ratio of the speedup obtained by parallelizing without rebalancing versus parallelizing with rebalancing using the heuristic algorithm. The plot shows a 100 percent improvement for 32 items per processor and up to 25 percent for 256 items per processor. In these experiments, heuristic rebalancing was never more than 1 percent slower than optimal rebalancing.

To test sensitivity to the cost of overhead, we also ran the heuristic with the same cases and with $t_s + t_b$ set to 100 times the average IRIS values. Fig. 2 shows the ratio of the subsequent heuristic speedup to the original heuristic speedup, illustrating only minor differences. This suggests that the heuristic algorithm would still be suitable with significant communication overheads, as for a cluster of workstations.

Overall, results of the IRIS experiments suggest that our heuristic algorithm is useful for medium scale multiprocessing (16-64 processors). For fewer processors, parallel filtering remains reasonably well-balanced. For more processors, rebalancing seems likely to cost more than its benefits.
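The speedup measure used in this section can be explored with a toy simulation of data-parallel filtering. The sketch below is not the IRIS simulator: the function names are hypothetical, rebalancing is treated as perfect and free ($t_s = t_b = t_t = 0$), processors are assumed never to wait for one another, and filter time per item is constant.

```python
import random


def costs_for_work_spread(c1, probs, w):
    """Costs with constant work-spread w, i.e. c_{i+1} p_{i+1} / c_i = w."""
    costs = [c1]
    for p_next in probs[1:]:
        costs.append(w * costs[-1] / p_next)
    return costs


def simulate_times(n, R, costs, probs, rebalance_after=(), seed=0):
    """Return (one-processor time, parallel time) for one simulated run.

    rebalance_after holds the 1-based filters after which items are
    redistributed evenly at no cost.
    """
    rng = random.Random(seed)
    counts = [n] * R                 # items currently held by each processor
    busy = [0.0] * R                 # work accumulated by each processor
    total_work = 0.0
    for i, (c, p) in enumerate(zip(costs, probs), start=1):
        for r in range(R):
            busy[r] += c * counts[r]
            total_work += c * counts[r]
        # Each item independently passes filter i with probability p.
        counts = [sum(rng.random() < p for _ in range(k)) for k in counts]
        if i in rebalance_after:
            base, extra = divmod(sum(counts), R)
            counts = [base + (1 if r < extra else 0) for r in range(R)]
    return total_work, max(busy)


if __name__ == "__main__":
    probs = [0.5, 0.5, 0.5, 0.5]
    costs = costs_for_work_spread(1.0, probs, 1.5)
    for plan in ((), (1, 2, 3)):
        one, par = simulate_times(32, 32, costs, probs, rebalance_after=plan)
        print(plan, "speedup:", one / par)
```

Sequence-time speedup is the first returned value divided by the second; it can never exceed $R$, and the shortfall from $R$ reflects the imbalance that rebalancing is meant to remove.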

Fig. 2. Ratio of the subsequent heuristic speedup to the original heuristic speedup, which shows only minor differences.

8 EXTENDING THE MODEL TO VARIABLE-TIME FILTERS

So far, we have assumed that filter time is a constant per data item. This is appropriate when filters are implemented by hash lookups, by searches in trees in which all leaves are at the same level, or when the result of a numeric calculation without conditionals is compared to a threshold. But, if filter processing time can vary, processor imbalance increases and, thus, the advantages of rebalancing. This requires a new formula for "maxdev" in (4); the "absdev" formula is unchanged.

Again, we used Monte Carlo methods on sets of now 1,024 random variables to get an approximate formula for "maxdev," now generating a random variable $x$ for the ratio of filter execution time per data item to its expected value. We assumed that $x$ was normally distributed about 1 with standard deviation $s$ and with (assumed rare) truncation at 0. Then, the random variable $y$ for the average of $k$ values of $x$ will have approximately a normal distribution with $\sqrt{k}$ times less standard deviation:

$$y = \frac{x - 1}{\sqrt{k}} + 1 = \frac{x - 1 + \sqrt{k}}{\sqrt{k}}.$$

Our experiments used the $n$, $p_{1,i}$, and $R$ values of Section 3 and took $s = 0.1 \cdot 2^d$ for $d = -4$ to $d = 3$. The best model that we found relating $s$ and the old "maxdev" (from (2)) to the new "maxdev" (by regression on the square of maxdev) was:

$$\mathrm{maxdev} = \mathrm{oldmaxdev} \sqrt{1.1777 + 2.2554 s^2}. \tag{7}$$

Again, we explored a variety of additional factors, but none added anything significant. This model had a coefficient of determination of 0.932: not bad for a fit to 6,144 points representing extrema of a random variable. These results were for $s < 1$; if $s$ continues to increase, the normal distribution assumption no longer always holds, but, in the limit, maxdev should increase proportionately to $s$.

As an example, take the three-filter case of Section 5 with the rebalancing after filter 1. Now, assume that filter 2 is a lookup in a nonuniform index in which data can be at a depth of 1, 2, or 3. Assume with probability 0.25 that processing takes 1 unit of time, probability 0.5 that it takes 2 units, and probability 0.25 that it takes 3 units. Then, the standard deviation of random variable $x$ is

$$s = \sqrt{0.25 \cdot 0.5^2 + 0.5 \cdot 0^2 + 0.25 \cdot 0.5^2} = 0.3535.$$

Hence,

$$\mathrm{oldmaxdev} = 0.9756 \cdot 1.541^{0.8322} \cdot 16^{0.2353} \cdot 500^{0.0307} = 3.249$$

$$\mathrm{maxdev} = 3.249 \sqrt{1.1777 + 2.2554 \cdot 0.3535^2} = 3.924.$$

Substituting in (4):

$$D_2 = 8 \cdot 3.924 - 16 \cdot 1 \cdot 0.7039 \cdot 1.541^{1.1007} = 13.27.$$

Hence, rebalancing after filter 2 has become desirable even with rebalancing after filter 1.

9 CONCLUSIONS

We have shown that data parallelism is the best way to take advantage of multiple processors for an information filtering problem. We have provided the reader with quick practical tools for measuring the degree of imbalance that can occur in data-parallel filtering in (5) or the combination of (4) and (7). For handling large numbers of filters, the algorithm of Section 5 will help. Two kinds of experiments, general ones and those focused on a particular architecture, have confirmed our analysis. Careful load balancing can be important when communications or filter costs are high, and a quick way to evaluate it can be valuable.

ACKNOWLEDGMENTS

This work was sponsored by the Defense Advanced Research Projects Agency as part of the I3 Project under AO 8939, and by the US Naval Postgraduate School under funds provided by the Chief for Naval Operations. Computations were done with Mathematica software.

REFERENCES

[1] N.J. Belkin and W.B. Croft, "Information Filtering and Information Retrieval: Two Sides of the Same Coin?" Comm. ACM, vol. 35, no. 12, pp. 29-38, Dec. 1992.
[2] G. Dahlquist and A. Bjorck, Numerical Methods. Englewood Cliffs, N.J.: Prentice-Hall, 1974.
[3] J. De Keyser and D. Roose, "Load Balancing Data Parallel Programs on Distributed Memory Computers," Parallel Computing, vol. 19, pp. 1199-1219, 1993.
[4] C. Faloutsos, "Signature-Based Text Retrieval Methods: A Survey," Database Eng., pp. 27-34, Mar. 1990.
[5] M. Jarke and J. Koch, "Query Optimization in Database Systems," Computing Surveys, vol. 16, no. 2, pp. 117-157, June 1984.
[6] D. Nicol and P. Reynolds, "Optimal Dynamic Remapping of Data Parallel Computations," IEEE Trans. Computers, vol. 39, no. 2, pp. 206-219, Feb. 1990.
[7] K. Pattipati and M. Dontamsetty, "On a Generalized Test Sequencing Problem," IEEE Trans. Systems, Man, and Cybernetics, vol. 22, no. 2, pp. 392-396, Mar./Apr. 1992.
[8] N.C. Rowe, "Using Local Optimality Criteria for Efficient Information Retrieval with Redundant Information Filters," ACM Trans. Information Systems, vol. 14, no. 2, pp. 138-174, Apr. 1996.
[9] N.C. Rowe, "Precise and Efficient Retrieval of Captioned Images: The MARIE Project," Library Trends, vol. 48, no. 2, pp. 475-495, Fall 1999.
[10] C. Stanfill and B. Kahle, "Parallel Free-Text Search on the Connection Machine System," Comm. ACM, vol. 29, no. 12, pp. 1229-1239, Dec. 1986.
[11] H. Stone, High-Performance Computer Architecture. Reading, Mass.: Addison-Wesley, 1987.