Group Nearest Neighbor Queries




Dimitris Papadias, Qiongmao Shen, Yufei Tao, Kyriakos Mouratidis

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
{dimitris, qmshen, kyriakos}@cs.ust.hk

Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Hong Kong
taoyf@cs.cityu.edu.hk

Abstract

Given two sets of points P and Q, a group nearest neighbor (GNN) query retrieves the point(s) of P with the smallest sum of distances to all points in Q. Consider, for instance, three users at locations q1, q2 and q3 that want to find a meeting point (e.g., a restaurant); the corresponding query returns the data point p that minimizes the sum of Euclidean distances |pq_i| for 1≤i≤3. Assuming that Q fits in memory and P is indexed by an R-tree, we propose several algorithms for finding the group nearest neighbors efficiently. As a second step, we extend our techniques for situations where Q cannot fit in memory, covering both indexed and non-indexed query points. An experimental evaluation identifies the best alternative based on the data and query properties.

1. Introduction

Nearest neighbor (NN) search is one of the oldest problems in computer science. Several algorithms and theoretical performance bounds have been devised for exact and approximate processing in main memory [S91, AMN+98]. Furthermore, the application of NN search to content-based and similarity retrieval has led to the development of numerous cost models [PM97, WSB98, BGRS99, B00] and indexing techniques [SYUK00, YOTJ01] for high-dimensional versions of the problem. In spatial databases most of the work has focused on the point NN query that retrieves the k (≥1) objects from a dataset P that are closest (usually according to Euclidean distance) to a query point q. The existing algorithms (reviewed in Section 2) assume that P is indexed by a spatial access method and utilize some pruning bounds to restrict the search space. Shahabi et al. [SKS02] and Papadias et al. [PZMT03] deal with nearest neighbor queries in spatial network databases, where the distance between two points is defined as the length of the shortest path connecting them in the network.

In addition to conventional (i.e., point) NN queries, recently there has been an increasing interest in alternative forms of spatial and spatio-temporal NN search. Ferhatosmanoglu et al. [FSAA01] discover the NN in a constrained area of the data space. Korn and Muthukrishnan [KM00] discuss reverse nearest neighbor queries, where the goal is to retrieve the data points whose nearest neighbor is a specified query point. Korn et al. [KMS02] study the same problem in the context of data streams. Given a query moving with steady velocity, [SR01, TP02] incrementally maintain the NN (as the query moves), while [BJKS02, TPS02] propose techniques for continuous NN processing, where the goal is to return all results up to a future time. Kollios et al. [KGT99] develop various schemes for answering NN queries on 1D moving objects. An overview of existing NN methods for spatial and spatio-temporal databases can be found in [TP03].

In this paper we discuss group nearest neighbor (GNN) queries, a novel form of NN search. The input of the problem consists of a set P={p1,…,pN} of static data points in multidimensional space and a group of query points Q={q1,…,qn}. The output contains the k (≥1) data point(s) with the smallest sum of distances to all points in Q. The distance between a data point p and Q is defined as

$$ \mathrm{dist}(p,Q) = \sum_{i=1}^{n} |pq_i| $$

where |pq_i| is the Euclidean distance between p and query point q_i. As an example consider a database that manages (static) facilities (i.e., dataset P). The query contains a set of user locations Q={q1,…,qn} and the result returns the facility that minimizes the total travel distance for all users. In addition to its relevance in geographic information systems and mobile computing applications, GNN search is important in several other domains. For instance, in clustering [JMF99] and outlier detection [AY01], the quality of a solution can be evaluated by the distances between the points and their nearest cluster centroid. Furthermore, the operability and speed of very large circuits depends on the relative distance between the various components in them. GNN can be applied to detect abnormalities and guide relocation of components [NO97].

Assuming that Q fits in memory and P is indexed by an R-tree, we first propose three algorithms for solving this problem. Then, we extend our techniques for cases that Q is too large to fit in memory, covering both indexed and non-indexed query points. The rest of the paper is structured as follows. Section 2 outlines the related work on conventional nearest neighbor search and top-k queries. Section 3 describes algorithms for the case that Q fits in memory and Section 4 for the case that Q resides on the disk. Section 5 experimentally evaluates the algorithms and identifies the best one depending on the problem characteristics. Section 6 concludes the paper with directions for future work.

2. Related work

Following most approaches in the relevant literature, we assume 2D data points indexed by an R-tree [G84]. The proposed techniques, however, are applicable to higher dimensions and other data-partition access methods such as A-trees [SYUK00] etc. Figure 2.1 shows an R-tree for point set P={p1,p2,…,p12} assuming a capacity of three entries per node. Points that are close in space (e.g., p1, p2, p3) are clustered in the same leaf node (N3). Nodes are then recursively grouped together with the same principle until the top level, which consists of a single root. Existing algorithms for point NN queries using R-trees follow the branch-and-bound paradigm, utilizing some metrics to prune the search space. The most common such metric is mindist(N,q), which corresponds to the closest possible distance between q and any point in the subtree of node N. Figure 2.1a shows the mindist between point q and nodes N1, N2. Similarly, mindist(N1,N2) is the minimum possible distance between any two points that reside in the sub-trees of nodes N1 and N2.

Figure 2.1: Example of an R-tree and a point NN query. (a) Points and node extents; (b) The corresponding R-tree

The first NN algorithm for R-trees [RKV95] searches the tree in a depth-first (DF) manner. Specifically, starting from the root, it visits the node with the minimum mindist from q (e.g., N1 in Figure 2.1). The process is repeated recursively until the leaf level (node N4), where the first potential nearest neighbor is found (p5). During backtracking to the upper level (node N1), the algorithm only visits entries whose minimum distance is smaller than the distance of the nearest neighbor already retrieved. In the example of Figure 2.1, after discovering p5, DF will backtrack to the root level (without visiting N3), and then follow the path N2, N6 where the actual NN p11 is found.
The DF algorithm is sub-optimal, i.e., it accesses more nodes than necessary. In particular, as proven in [PM97], an optimal algorithm should visit only nodes intersecting the vicinity circle that centers at the query point q and has radius equal to the distance between q and its nearest neighbor. In Figure 2.1a, for instance, an optimal algorithm should visit only nodes R, N1, N2, and N6 (whereas DF also visits N4). The best-first (BF) algorithm of [HS99] achieves the optimal I/O performance by maintaining a heap H with the entries visited so far, sorted by their mindist. As with DF, BF starts from the root, and inserts all the entries into H (together with their mindist), e.g., in Figure 2.1a, H={<N1, mindist(N1,q)>, <N2, mindist(N2,q)>}. Then, at each step, BF visits the node in H with the smallest mindist. Continuing the example, the algorithm retrieves the content of N1 and inserts all its entries in H, after which H={<N2, mindist(N2,q)>, <N4, mindist(N4,q)>, <N3, mindist(N3,q)>}. Similarly, the next two nodes accessed are N2 and N6 (inserted in H after visiting N2), in which p11 is discovered as the current NN. At this time, the algorithm terminates (with p11 as the final result) since the next entry (N4) in H is farther (from q) than p11. Both DF and BF can be easily extended for the retrieval of k>1 nearest neighbors. In addition, BF is also incremental. Namely, it reports the nearest neighbors in ascending order of their distance to the query, so that k does not have to be known in advance (allowing different termination conditions to be used).

The branch-and-bound framework also applies to closest pair queries that find the pair of objects from two datasets, such that their distance is the minimum among all pairs. [HS98, CMTV00] propose various algorithms based on the concepts of DF and BF traversal. The difference from NN is that the algorithms access two index structures (one for each data set) simultaneously. If the mindist of two intermediate nodes N_i and N_j (one from each R-tree) is already greater than the distance of the closest pair of objects found so far, the sub-trees of N_i and N_j cannot contain a closest pair (thus, the pair is pruned).
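The best-first traversal described above can be illustrated with a small sketch. It is not the implementation of [HS99]: the tree is a hypothetical nested-list stand-in for an R-tree (a leaf is a list of point tuples, an inner node is a list of child nodes), MBRs are recomputed on the fly rather than stored, and the names `mindist_point_rect`, `mbr` and `best_first_nn` are introduced here for illustration only.

```python
import heapq
import math
from itertools import count


def mindist_point_rect(q, rect):
    """Smallest distance from point q to rectangle (xlo, ylo, xhi, yhi)."""
    dx = max(rect[0] - q[0], 0.0, q[0] - rect[2])
    dy = max(rect[1] - q[1], 0.0, q[1] - rect[3])
    return math.hypot(dx, dy)


def mbr(node):
    """MBR of a node; recomputed here, whereas a real R-tree stores it."""
    if isinstance(node[0], tuple):          # leaf: list of points
        xs, ys = zip(*node)
        return (min(xs), min(ys), max(xs), max(ys))
    boxes = [mbr(child) for child in node]  # inner node: list of nodes
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def best_first_nn(root, q):
    """Yield (distance, point) pairs in ascending distance from q."""
    tie = count()                           # tie-breaker so entries never compare nodes
    heap = [(0.0, next(tie), root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if isinstance(item, tuple):         # a data point: report it
            yield d, item
            continue
        for child in item:                  # a node: enqueue its entries
            if isinstance(child, tuple):
                key = math.dist(q, child)
            else:
                key = mindist_point_rect(q, mbr(child))
            heapq.heappush(heap, (key, next(tie), child))
```

Because the function is a generator, the caller decides when to stop, which is the incremental property that the later algorithms rely on.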
As shown in the next section, a processing technique for GNN queries applies multiple conventional NN queries (one for each query point) and then combines their results. Some related work on this topic has appeared in the literature of top-k (or ranked) queries over multiple data repositories (see [FLN01, BCG02, F02] for representative papers). As an example, consider that a user wants to find the k images that are most similar to a query image, where similarity is defined according to n features, e.g., color histogram, object arrangement, texture, shape etc. The query is submitted to n retrieval engines that return the best matches for particular features together with their similarity scores, i.e., the first engine will output a set of matches according to color, the second according to arrangement and so on. The problem is to combine the multiple inputs in order to determine the top-k results in terms of their overall similarity. The main idea behind all techniques is to minimize the extent and cost of search performed on each retrieval engine in order to compute the final result.

The threshold algorithm [FLN01] works as follows (assuming retrieval of the single best match): the first query is submitted to the first search engine, which returns the closest image p1 according to the first feature. The similarity between p1 and the query image with respect to the other features is computed. Then, the second query is submitted to the second search engine, which returns p2 (best match according to the second feature). The overall similarity of p2 is also computed, and the best of p1 and p2 becomes the current result. The process is repeated in a round-robin fashion, i.e., after the last search engine is queried, the second match is retrieved with respect to the first feature and so on. The algorithm will terminate when the similarity of the current result is higher than the similarity that can be achieved by any subsequent solution. In the next section we adapt this approach to GNN processing.

3. Algorithms for memory-resident queries

Assuming that the set Q of query points fits in memory and that the data points are indexed by an R-tree, we present three algorithms for processing GNN queries. For each algorithm we first illustrate retrieval of a single nearest neighbor, and then show the extension to k>1. Table 3.1 contains the primary symbols used in our description (some have not appeared yet, but will be clarified shortly).

Symbol                    Description
Q                         set of query points
Q_i                       a group of queries that fits in memory
n (n_i)                   number of queries in Q (Q_i)
M (M_i)                   MBR of Q (Q_i)
q                         centroid of Q
dist(p,Q)                 sum of distances between point p and query points in Q
mindist(N,q)              minimum distance between MBR of node N and centroid q
mindist(p,M)              minimum distance between data point p and query MBR M
Σ_i n_i·mindist(N,M_i)    weighted mindist of node N with respect to all query groups

Table 3.1: Frequently used symbols

3.1 Multiple query method

The multiple query method (MQM) utilizes the main idea of the threshold algorithm, i.e., it performs incremental NN queries for each point in Q and combines their results. For instance, in Figure 3.1 (where Q={q1,q2}), MQM retrieves the first NN of q1 (point p10 with |p10 q1|=2) and computes the distance |p10 q2| (=5). Similarly, it finds the first NN of q2 (point p11 with |p11 q2|=3) and computes |p11 q1| (=3).
The point (p11) with the minimum sum of distances (|p11 q1| + |p11 q2| = 6) to all query points becomes the current GNN of Q. For each query point q_i, MQM stores a threshold t_i, which is the distance of the current NN, i.e., t1 = |p10 q1| = 2 and t2 = |p11 q2| = 3. The total threshold T is defined as the sum of all thresholds (=5). Continuing the example, since T < dist(p11,Q), it is possible that there exists a point in P whose distance to Q is smaller than dist(p11,Q). So MQM retrieves the second NN of q1 (p11, which has already been encountered by q2) and updates the threshold t1 to |p11 q1| (=3). Since T (=6) now equals the summed distance between the best neighbor found so far and the points of Q, MQM terminates with p11 as the final result. In other words, every non-encountered point has distance greater or equal to T (=6), and therefore it cannot be closer to Q (in the global sense) than p11.

Figure 3.1: Example of a GNN query

Figure 3.2 shows the pseudo code for MQM (1NN), where best_dist (initially ∞) is the distance of the best_NN found so far. In order to achieve locality of the node accesses for individual queries, we sort the points in Q according to their Hilbert value; thus, two subsequent queries are likely to correspond to nearby points and access similar R-tree nodes. The algorithm for computing nearest neighbors of query points should be incremental (e.g., best-first search discussed in Section 2) because the termination condition is not known in advance. The extension for the retrieval of k (>1) nearest neighbors is straightforward. The k neighbors with the minimum overall distances are inserted in a list of k pairs <p, dist(p,Q)> (sorted on dist(p,Q)) and best_dist equals the distance of the k-th NN. Then, MQM proceeds in the same way as in Figure 3.2, except that whenever a better neighbor is found, it is inserted in best_NN and the last element of the list is removed.
MQM(Q: group of query points)
/* T: threshold; best_dist: distance of the current NN */
  sort points in Q according to Hilbert value;
  for each query point: t_i = 0;
  T = 0; best_dist = ∞; best_NN = null; // Initialization
  while (T < best_dist)
    get the next nearest neighbor p_j of the next query point q_i;
    t_i = |p_j q_i|; update T;
    if dist(p_j,Q) < best_dist
      best_NN = p_j; // Update current GNN of Q
      best_dist = dist(p_j,Q);
  end of while;
  return best_NN;

Figure 3.2: The MQM algorithm
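A minimal sketch of the MQM loop, under simplifying assumptions that are not part of the paper: the incremental R-tree NN search is replaced by a brute-force sort per query point (`incremental_nn`, a hypothetical helper), the Hilbert ordering of Q is omitted, Q is assumed non-empty, and only the single best neighbor is returned.

```python
import math


def dist_sum(p, Q):
    """dist(p, Q): sum of Euclidean distances from p to every query point."""
    return sum(math.dist(p, q) for q in Q)


def incremental_nn(P, q):
    """Stand-in for an incremental NN search: points of P by ascending |pq|."""
    yield from sorted(P, key=lambda p: math.dist(p, q))


def mqm(P, Q):
    streams = [incremental_nn(P, q) for q in Q]
    t = [0.0] * len(Q)                    # per-query thresholds t_i
    best_nn, best_dist = None, math.inf
    i = 0
    while sum(t) < best_dist:             # T < best_dist
        p = next(streams[i], None)
        if p is None:                     # every point of P has been examined
            break
        t[i] = math.dist(p, Q[i])         # distance of the latest NN of q_i
        d = dist_sum(p, Q)
        if d < best_dist:                 # update current GNN of Q
            best_nn, best_dist = p, d
        i = (i + 1) % len(Q)              # round-robin over query points
    return best_nn, best_dist
```

For example, `mqm([(0, 0), (4, 0), (2, 1), (10, 10)], [(1, 0), (3, 0)])` examines the point (2, 1) from both query streams and stops once the sum of thresholds reaches its summed distance.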

3.2 Single point method

MQM may incur multiple accesses to the same node (and retrieve the same data point, e.g., p11) through different queries. To avoid this problem, the single point method (SPM) processes GNN queries by a single traversal. First, SPM computes the centroid q of Q, which is a point in space with a small value of dist(q,Q) (ideally, q is the point with the minimum dist(q,Q)). The intuition behind this approach is that the nearest neighbor is a point of P "near" q. It remains to derive (i) the computation of q, and (ii) the range around q in which we should look for points of P, before we conclude that no better NN can be found. Towards the first goal, let (x,y) be the coordinates of centroid q and (x_i,y_i) be the coordinates of query point q_i. The centroid q minimizes the distance function:

$$ \mathrm{dist}(q,Q) = \sum_{i=1}^{n} \sqrt{(x-x_i)^2 + (y-y_i)^2} $$

Since the partial derivatives of function dist(q,Q) with respect to its independent variables x and y are zero at the centroid q, we have the following equations:

$$ \frac{\partial\, \mathrm{dist}(q,Q)}{\partial x} = \sum_{i=1}^{n} \frac{x-x_i}{\sqrt{(x-x_i)^2 + (y-y_i)^2}} = 0 $$

$$ \frac{\partial\, \mathrm{dist}(q,Q)}{\partial y} = \sum_{i=1}^{n} \frac{y-y_i}{\sqrt{(x-x_i)^2 + (y-y_i)^2}} = 0 $$

Unfortunately, the above equations cannot be solved into closed form for n>2, or in other words, they must be evaluated numerically, which implies that the centroid is approximate. In our implementation, we use the gradient descent [HYC01] method to quickly obtain a good approximation. Specifically, starting with some arbitrary initial coordinates, e.g. $x = (1/n)\sum_{i=1}^{n} x_i$ and $y = (1/n)\sum_{i=1}^{n} y_i$, the method modifies the coordinates as follows:

$$ x = x - \eta \frac{\partial\, \mathrm{dist}(q,Q)}{\partial x} \quad \text{and} \quad y = y - \eta \frac{\partial\, \mathrm{dist}(q,Q)}{\partial y} $$

where η is a step size. The process is repeated until the distance function dist(q,Q) converges to a minimum value. Although the resulting point q is only an approximation of the ideal centroid, it suffices for the purposes of SPM. Next we show how q can be used to prune the search space based on the following lemma.

Lemma 1: Let Q={q1,…,qn} be a group of query points and q an arbitrary point in space. The following inequality holds for any point p: dist(p,Q) ≥ n·|pq| − dist(q,Q), where |pq| denotes the Euclidean distance between p and q.

Proof: Due to the triangular inequality, for each query point q_i we have that: |pq_i| + |q_i q| ≥ |pq|. By summing up the n inequalities:

$$ \sum_{q_i \in Q} |pq_i| + \sum_{q_i \in Q} |q_i q| \ge n \cdot |pq| \;\Rightarrow\; \mathrm{dist}(p,Q) \ge n \cdot |pq| - \mathrm{dist}(q,Q) $$

Lemma 1 provides a threshold for the termination of SPM. In particular, by applying an incremental point NN query at q, we stop when we find the first point p such that: n·|pq| − dist(q,Q) ≥ dist(best_NN,Q). By Lemma 1, dist(p,Q) ≥ n·|pq| − dist(q,Q) and, therefore, dist(p,Q) ≥ dist(best_NN,Q). The same idea can be used for pruning intermediate nodes, as summarized by the following heuristic.

Heuristic 1: Let q be the centroid of Q and best_dist be the distance of the best GNN found so far. Node N can be pruned if:

$$ \mathrm{mindist}(N,q) \ge \frac{\mathit{best\_dist} + \mathrm{dist}(q,Q)}{n} $$

where mindist(N,q) is the minimum distance between the MBR of N and the centroid q. An example of the heuristic is shown in Figure 3.3, where best_dist = 5+4. Since dist(q,Q) = 1+2, the right part of the inequality equals 6, meaning that both nodes in the figure will be pruned.

Figure 3.3: Pruning of nodes in SPM

Based on the above observations, it is straightforward to implement SPM using the depth-first or best-first paradigms. Figure 3.4 shows the pseudo-code of DF SPM. Starting from the root of the R-tree (for P), entries are sorted in a list according to their mindist from the query centroid q and are visited (recursively) in this order. Once the first entry with mindist(N_j,q) ≥ (best_dist+dist(q,Q))/n has been found, the subsequent ones in the list are pruned. The extension to k (>1) GNN queries is the same as conventional (point) NN algorithms.

SPM(Node: R-tree node, Q: group of query points)
/* q: the centroid of Q */
  if Node is an intermediate node
    sort entries N_j in Node according to mindist(N_j,q) in list;
    repeat
      get_next entry N_j from list;
      if mindist(N_j,q) < (best_dist+dist(q,Q))/n /* Heuristic 1 */
        SPM(N_j,Q); /* recursion */
    until mindist(N_j,q) ≥ (best_dist+dist(q,Q))/n or end of list;
  else if Node is a leaf node
    sort points p_j in Node according to mindist(p_j,q) in list;
    repeat
      get_next entry p_j from list;
      if |p_j q| < (best_dist+dist(q,Q))/n /* Heuristic 1 for points */
        if dist(p_j,Q) < best_dist
          best_NN = p_j; // Update current GNN
          best_dist = dist(p_j,Q);
    until |p_j q| ≥ (best_dist+dist(q,Q))/n or end of list;
  return best_NN;

Figure 3.4: The SPM algorithm
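The two ingredients of SPM, the approximate centroid and the Lemma 1 stopping rule, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the R-tree traversal is replaced by a brute-force scan of P in ascending distance from q, the step size `eta` and iteration count `iters` are arbitrary choices, and the function names are hypothetical. Since Lemma 1 holds for an arbitrary point q, the result stays exact even when the centroid is only roughly approximated.

```python
import math


def dist_sum(p, Q):
    """dist(p, Q): sum of Euclidean distances from p to every query point."""
    return sum(math.dist(p, q) for q in Q)


def approx_centroid(Q, eta=0.1, iters=200):
    """Gradient descent on dist(q, Q), started at the arithmetic mean of Q."""
    x = sum(qx for qx, _ in Q) / len(Q)
    y = sum(qy for _, qy in Q) / len(Q)
    for _ in range(iters):
        gx = gy = 0.0
        for qx, qy in Q:
            d = math.hypot(x - qx, y - qy)
            if d > 0:                      # gradient term undefined at a query point
                gx += (x - qx) / d
                gy += (y - qy) / d
        x, y = x - eta * gx, y - eta * gy
    return (x, y)


def spm(P, Q):
    n = len(Q)
    q = approx_centroid(Q)
    dq = dist_sum(q, Q)
    best_nn, best_dist = None, math.inf
    # Stand-in for an incremental NN query issued at the centroid q.
    for p in sorted(P, key=lambda p: math.dist(p, q)):
        if n * math.dist(p, q) - dq >= best_dist:
            break                          # Lemma 1: no farther point can be better
        d = dist_sum(p, Q)
        if d < best_dist:
            best_nn, best_dist = p, d
    return best_nn, best_dist
```

A better centroid only tightens the bound and lets the scan stop earlier; it does not change which point is returned.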

3.3 Minimum bounding method

Like SPM, the minimum bounding method (MBM) performs a single query, but uses the minimum bounding rectangle M of Q (instead of the centroid q) to prune the search space. Specifically, starting from the root of the R-tree for dataset P, MBM visits only nodes that may contain candidate points. In the sequel, we discuss heuristics for identifying such qualifying nodes.

Heuristic 2: Let M be the MBR of Q, and best_dist be the distance of the best GNN found so far. A node N cannot contain qualifying points, if:

$$ \mathrm{mindist}(N,M) \ge \frac{\mathit{best\_dist}}{n} $$

where mindist(N,M) is the minimum distance between M and N, and n is the cardinality of Q.

Figure 3.5 shows a group of query points Q={q1,q2} and the best_NN with best_dist=5. Since mindist(N1,M) = 3 > best_dist/2 = 2.5, N1 can be pruned without being visited. In other words, even if there is a data point p at the upper-right corner of N1 and all the query points were at the lower right corner of Q, it would still be the case that dist(p,Q) > best_dist. The concept of heuristic 2 also applies to the leaf entries. When a point p is encountered, we first compute mindist(p,M) from p to the MBR of Q. If mindist(p,M) ≥ best_dist/n, p is discarded since it cannot be closer than the best_NN. In this way we avoid performing the distance computations between p and the points of Q.

Figure 3.5: Example of heuristic 2

The heuristic incurs minimum overhead, since for every node it requires a single distance computation. However, it is not very tight, i.e., it leads to unnecessary node accesses. For instance, node N2 (in Figure 3.5) passes heuristic 2 (and should be visited), although it cannot contain qualifying points. Heuristic 3 presents a tighter bound for avoiding such visits.

Heuristic 3: Let best_dist be the distance of the best GNN found so far. A node N can be safely pruned if:

$$ \sum_{q_i \in Q} \mathrm{mindist}(N,q_i) \ge \mathit{best\_dist} $$

where mindist(N,q_i) is the minimum distance between N and query point q_i ∈ Q. In Figure 3.5, since mindist(N2,q1) + mindist(N2,q2) = 6 > best_dist = 5, N2 is pruned. Because heuristic 3 requires multiple distance computations (one for each query point) it is applied only for nodes that pass heuristic 2.
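The two pruning tests can be written down directly. The sketch below assumes axis-aligned rectangles encoded as `(xlo, ylo, xhi, yhi)` tuples; the function names are hypothetical, and the cheap test (heuristic 2) is evaluated before the per-query-point test (heuristic 3), following the order suggested in the text.

```python
import math


def mindist_rect_rect(a, b):
    """Minimum distance between two rectangles (xlo, ylo, xhi, yhi)."""
    dx = max(a[0] - b[2], 0.0, b[0] - a[2])
    dy = max(a[1] - b[3], 0.0, b[1] - a[3])
    return math.hypot(dx, dy)


def mindist_point_rect(p, r):
    """Minimum distance between point p and rectangle r."""
    dx = max(r[0] - p[0], 0.0, p[0] - r[2])
    dy = max(r[1] - p[1], 0.0, p[1] - r[3])
    return math.hypot(dx, dy)


def mbr(points):
    """Minimum bounding rectangle of a non-empty list of points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def pruned_by_h2(N, M, n, best_dist):
    """Heuristic 2: one distance computation per node."""
    return mindist_rect_rect(N, M) >= best_dist / n


def pruned_by_h3(N, Q, best_dist):
    """Heuristic 3: one distance computation per query point."""
    return sum(mindist_point_rect(q, N) for q in Q) >= best_dist


def can_prune(N, Q, best_dist):
    """Apply heuristic 3 only to nodes that pass heuristic 2."""
    M = mbr(Q)
    return pruned_by_h2(N, M, len(Q), best_dist) or pruned_by_h3(N, Q, best_dist)
```

With Q = [(0, 0), (2, 2)] and best_dist = 5, the node (5, 0, 7, 2) is rejected by heuristic 2 alone, while (2, 4, 4, 6) passes heuristic 2 and is rejected only by heuristic 3.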
Note that (like heuristic 2) heuristic 3 does not represent the tightest condition for successful node visits; i.e., it is possible for a node to satisfy the heuristic and still not contain qualifying points. Consider, for instance, Figure 3.6, which includes 3 query points. The current best_dist is 7, and node N3 passes heuristic 3, since mindist(N3,q1) + mindist(N3,q2) + mindist(N3,q3) = 5. Nevertheless, N3 should not be visited, because the minimum distance that can be achieved by any point in N3 is greater than 7. The dotted lines in Figure 3.6 correspond to the distance between the best possible point p' (not necessarily a data point) in N3 and the three query points.

Figure 3.6: Example of a hypothetical optimal heuristic

Assuming that we can identify the best point p' in the node, we can obtain a tight heuristic as follows: if the distance of p' is smaller than best_dist visit the node; otherwise, reject it. The combination of the best-first approach with this heuristic would lead to an I/O optimal method (such as the algorithm of [HS99] for conventional NN queries). Finding point p', however, is similar to the problem of locating the query centroid (but this time in a region constrained by the node MBR), which, as discussed in Section 3.2, can only be solved numerically (i.e., approximately). Although an approximation suffices for SPM, for the correctness of best_dist it is necessary to have the precise solution (in order to avoid false misses). As a result, this hypothetical heuristic cannot be applied for exact GNN retrieval.

Heuristics 2 and 3 can be used with both the depth-first and best-first traversal paradigms. For simplicity, we discuss MBM based on depth-first traversal using the example of Figure 3.7. The root of the R-tree is retrieved and its entries are sorted by their mindist to M. Then, the node (N1) with the minimum mindist is visited, inside which the entry of N4 has the smallest mindist. Points p5, p6, p4 (in N4) are processed according to the value of mindist(p_j,M) and p5 becomes the current GNN of Q (best_dist=11). Points p6 and p4 have larger distances and are discarded. When backtracking to N1, the subtree of N3 is pruned by heuristic 2. Thus, MBM backtracks again to the root and visits nodes N2 and N6, inside which p10 has the smallest mindist to M and is processed first, replacing p5 as the GNN (best_dist=7). Then, p11 becomes the best NN (best_dist=6). Finally, N5 is pruned by heuristic 2, and the algorithm terminates with p11 as the final GNN. The extension to retrieval of kNN and the best-first implementation are straightforward.

Figure 3.7: Query processing of MBM

4. Algorithms for disk-resident queries

We now discuss the situation that the query set does not fit in main memory. Section 4.1 considers that Q is indexed by an R-tree, and shows how to adapt the R-tree closest pair (CP) algorithm [HS98, CMTV00] for GNN queries with additional pruning rules. We argue, however, that the R-tree on Q offers limited benefits towards reducing the query time. Motivated by this, in Sections 4.2 and 4.3 we develop two alternative methods, based on MQM and MBM, which do not require any index on Q. Again, for simplicity, we describe the algorithms for single NN retrieval before discussing k>1.

4.1 Group closest pairs method

Assume an incremental CP algorithm that outputs closest pairs <p_i,q_j> (p_i ∈ P, q_j ∈ Q) in ascending order of their distance. Consider that we keep the counter(p_i) of pairs in which p_i has appeared, as well as the accumulated distance (curr_dist(p_i)) of p_i in all these pairs. When the counter of p_i equals the cardinality of Q, the global distance of p_i, with respect to all query points, has been computed. If this distance is smaller than the best global distance (best_dist) found so far, p_i becomes the current NN. Two questions remain to be answered: (i) which are the qualifying data points that can lead to a better solution? (ii) when can the algorithm terminate? Regarding the first question, clearly all points encountered before the first complete NN is found are qualifying. Every such point p_i is kept in a list <p_i, counter(p_i), curr_dist(p_i)>. On the other hand, if we already have a complete NN, every data point that is encountered for the first time can be discarded since it cannot lead to a better solution. In general, the list of qualifying points keeps increasing until a complete NN is found. Then, non-qualifying points can be gradually removed from the list based on the following heuristic:

Heuristic 4: Assume that the current output of the CP algorithm is <p_i,q_j>. We can immediately discard all points p such that:

$$ (n - \mathrm{counter}(p)) \cdot \mathrm{dist}(p_i,q_j) + \mathit{curr\_dist}(p) \ge \mathit{best\_dist} $$

In other words, p cannot yield a global distance smaller than best_dist, even if all its un-computed distances are equal to dist(p_i,q_j). Heuristic 4 is applied in two cases: (i) for each output pair <p_i,q_j>, on the data point p_i and (ii) when the global NN changes, on all qualifying points. Every point p that fails the heuristic is deleted from the qualifying list. If p is encountered again in a subsequent pair, it will be considered as a new point and pruned. Figure 4.1a shows an example where the closest pairs are found incrementally according to their distance, i.e., (<p1,q1>, 2), (<p1,q2>, 2), (<p2,q1>, 3), (<p2,q3>, 3), (<p3,q3>, 4), (<p2,q2>, 5). After pair <p2,q2> is output, we have a complete NN, p2, with global distance 11. Heuristic 4 is applied to all qualifying points and p3 is discarded; even if its (not yet discovered) distances to q1 and q2 equal 5, its global distance will be 14 (i.e., greater than best_dist).

Figure 4.1: Example of GCP. (a) Discovery of 1st NN; (b) Termination

For each remaining qualifying point p_i, we compute a threshold t_i as: t_i = (best_dist − curr_dist(p_i)) / (n − counter(p_i)). In the general case, that multiple qualifying points exist, the global threshold T is the maximum of individual thresholds t_i, i.e., T is the largest distance of the output closest pair that can lead to a better solution than the existing one. In Figure 4.1a, for instance, T = t1 = 7, meaning that when the output pair has distance ≥7, the algorithm can terminate. Every application of heuristic 4 also modifies the corresponding thresholds, so that the value of T is always up to date. Based on these observations we are now ready to establish the termination condition, i.e., GCP terminates when (i) at least a GNN has been found (best_dist < ∞) and (ii) the qualifying list is empty, or the distance of the current pair becomes larger than the global threshold T. Figure 4.1b continues the example of Figure 4.1a. In this case the algorithm terminates after the pair (<p1,q3>, 6.3) is found, which establishes p1 as the best NN (and the list becomes empty). The pseudo-code of the GCP is shown in Figure 4.2.
We store the qualifying list as an in-memory hash table on point ids to facilitate the retrieval of information (i.e., counter(p_i), curr_dist(p_i)) about particular points (p_i). If the size of the list exceeds the available memory, part of the table is stored to the disk. In case of kNN queries, best_dist equals the global distance of the k-th complete neighbor found so far (i.e., pruning in the qualifying list can occur only after k complete neighbors are retrieved). In the worst case, the list may contain an entry for each point of P.
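The bookkeeping described in this section (per-point counters and accumulated distances, heuristic 4, and the global threshold T) can be sketched as follows. The sketch makes several assumptions of its own: the incremental closest-pair algorithm is abstracted as any iterable of `(p, q, d)` triples in ascending `d`, the brute-force `closest_pairs` helper is a hypothetical stand-in for it, points of P are assumed distinct and hashable, T is recomputed from the list on every step instead of being maintained incrementally, and the whole list lives in a Python dict.

```python
import math


def closest_pairs(P, Q):
    """Brute-force stand-in for an incremental closest-pair algorithm."""
    pairs = [(p, q, math.dist(p, q)) for p in P for q in Q]
    return sorted(pairs, key=lambda t: t[2])


def gcp(pairs, n):
    """pairs: (p, q, d) triples in ascending d; n: cardinality of Q."""
    best_nn, best_dist = None, math.inf
    cand = {}                               # p -> [counter(p), curr_dist(p)]
    for p, _q, d in pairs:
        if p not in cand:
            if best_dist < math.inf:
                continue                    # first seen after a complete NN: discard
            cand[p] = [1, d]
        else:
            cand[p][0] += 1
            cand[p][1] += d
        cnt, cur = cand[p]
        if cnt == n:                        # global distance of p is complete
            del cand[p]
            if cur < best_dist:
                best_nn, best_dist = p, cur
                # The global NN changed: apply heuristic 4 to all candidates.
                stale = [c for c, (k, s) in cand.items()
                         if (n - k) * d + s >= best_dist]
                for c in stale:
                    del cand[c]
        elif best_dist < math.inf and (n - cnt) * d + cur >= best_dist:
            del cand[p]                     # heuristic 4 on the current point
        if best_dist < math.inf:
            T = max(((best_dist - s) / (n - k) for k, s in cand.values()),
                    default=0.0)
            if not cand or d >= T:          # termination condition
                break
    return best_nn, best_dist
```

Feeding the pair sequence of the running example, followed by (<p1,q3>, 6.3), into `gcp` with n = 3 first establishes p2 with distance 11, prunes p3, and finally returns p1 with a global distance of about 10.3.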

GCP
  best_NN = NULL; best_dist = ∞; /* initialization */
  repeat
    output next closest pair <p_i,q_j> and dist(p_i,q_j)
    if p_i is not in list
      if best_dist < ∞ continue; /* discard p_i and process next pair */
      else add <p_i, 1, dist(p_i,q_j)> in list;
    else /* p_i has been encountered before and still resides in list */
      counter(p_i)++; curr_dist(p_i) = curr_dist(p_i) + dist(p_i,q_j);
      if counter(p_i) = n
        if curr_dist(p_i) < best_dist
          best_NN = p_i; // Update current GNN
          best_dist = curr_dist(p_i);
          T = 0;
          for each candidate point p in list
            if (n−counter(p))·dist(p_i,q_j) + curr_dist(p) ≥ best_dist
              remove p from list; /* pruned by heuristic 4 */
            else /* p not pruned by heuristic 4 */
              t = (best_dist − curr_dist(p)) / (n − counter(p));
              if t > T then T = t; /* update threshold */
        else remove p_i from list;
      else /* counter(p_i) < n */
        if best_dist < ∞ /* a NN has been found already */
          if (n−counter(p_i))·dist(p_i,q_j) + curr_dist(p_i) ≥ best_dist
            remove p_i from list; /* pruned by heuristic 4 */
          else /* not pruned by heuristic 4 */
            t_i = (best_dist − curr_dist(p_i)) / (n − counter(p_i));
            if t_i > T then T = t_i; /* update threshold */
  until (best_dist < ∞) and (dist(p_i,q_j) ≥ T or list is empty);
  return best_NN;

Figure 4.2: The GCP algorithm

When the workspace (i.e., MBR) of Q is small and contained in the workspace of P, GCP can terminate after outputting a small percentage of the total number of closest pairs. Consider, for instance, Figure 4.3a, where there exist some points of P (e.g., p2) that are near all query points. The number of closest pairs that must be considered depends only on the distance between p2 and its farthest neighbor (q5) in Q. Data point p3, for example, will not participate in any output closest pair since its nearest distance to any query point is larger than |p2 q5|. On the other hand, if the MBR of Q is large or partially overlaps (or is disjoint) with the workspace of P, GCP must output many closest pairs before it terminates. Figure 4.3b shows such an example, where the distance between the best_NN (p2) and its farthest query point (q2) is high. In addition to the computational overhead of GCP in this case, another disadvantage is its large heap requirements. Recall that GCP applies an incremental CP algorithm that must keep all closest pairs in the heap until the first NN is found.
The number of such pairs in the worst case equals the cardinality of the Cartesian product of the datasets (see footnote 2). To alleviate the problem, Hjaltason and Samet [HS99] proposed a heap management technique (included in our implementation), according to which part of the heap migrates to the disk when its size exceeds the available memory space. Nevertheless, as shown in Section 5, the cost of GCP is often very high, which motivates the subsequent algorithms.

Footnote 2: This may happen if there is a data point (on the corner of the workspace) such that (i) its distance to most query points is very small (so that the point cannot be pruned) and (ii) its distance to a query point (located on the opposite corner of the workspace) is the largest possible.

[Figure 4.3: Observations about the performance of GCP. Panels: (a) High pruning, (b) Low pruning. The drawing shows the workspaces of P and Q, data points p_1 to p_3 and query points q_1 to q_5.]

4.2 F-MQM

MQM can be applied directly for disk-resident, non-indexed Q, with, however, very high cost due to the large number of individual queries that must be performed (as shown in Section 5, its cost increases fast with the cardinality of Q). In order to overcome this problem, we propose F-MQM (file-multiple query method), which splits Q into blocks {Q_1, .., Q_m} that fit in memory. For each block, it computes the GNN using one of the main memory algorithms (we apply MBM due to its superior performance - see Section 5), and finally it combines their results using MQM. The complication is that once a NN of a group has been retrieved, we cannot effectively compute its global distance (i.e., with respect to all query points) immediately. Instead, we follow a lazy approach: first we find the GNN p_1 of the first group Q_1; then, we load in memory the second group Q_2 and retrieve its NN p_2. At the same time, we also compute the distance between p_1 and Q_2, so that the current distance of p_1 becomes curr_dist(p_1) = dist(p_1,Q_1) + dist(p_1,Q_2). Similarly, when we load Q_3, we update the current distances of p_1 and p_2 taking into account the objects of the third group. After the end of the first round, we only have one data point (p_1), whose global distance with respect to all query points has been computed.
This point becomes the current NN. The process is repeated in a round robin fashion and at each step a new global distance is derived. For instance, when we read again the first group (to retrieve its second NN), the distance of p_2 (first NN of Q_2) is completed with respect to all groups. Between p_1 and p_2, the point with the minimum global distance becomes the current NN. As in the case of MQM, the threshold t_j for each group Q_j equals dist(p_j,Q_j), where p_j is the last retrieved neighbor of Q_j. The global threshold T is the sum of all thresholds. F-MQM terminates when T becomes equal to or larger than the global distance of the best NN found so far.

The algorithm is illustrated in Figure 4.4. In order to achieve locality, we first sort (externally) the points of Q according to their Hilbert value. Then, each group is obtained by taking a number of consecutive pages that fit in memory. The extension for the retrieval of k (>1) GNNs is similar to main-memory MQM. In particular, best_NN is now a list of k pairs <p, dist(p,Q)> (sorted by the global dist(p,Q)) and best_dist equals the distance of the k-th NN. Then, it proceeds in the same way as in Figure 4.4.

F-MQM(Q: group of query points)
  best_NN = NULL; best_dist = ∞; T = 0;  /* initialization */
  sort points of Q according to Hilbert value and split them into groups {Q_1, .., Q_m} so that each group fits in memory;
  while (T < best_dist)
    read next group Q_j;
    get the next nearest neighbor p_j of group Q_j;
    curr_dist(p_j) = dist(p_j, Q_j);
    t_j = dist(p_j, Q_j); update T;
    if it is the first pass of the algorithm
      for each cur. neighbor p_i of Q_i (1 ≤ i < j)  /* update other NN */
        curr_dist(p_i) = curr_dist(p_i) + dist(p_i, Q_j);
    else  /* local NN have been computed for all m groups */
      for each cur. neighbor p_i of Q_i (1 ≤ i ≤ m, i ≠ j)  /* update other NN */
        curr_dist(p_i) = curr_dist(p_i) + dist(p_i, Q_j);
      next = (j+1) modulo m;  /* group whose global dist. is complete */
      if curr_dist(p_next) < best_dist
        best_NN = p_next;  /* update current GNN of Q */
        best_dist = curr_dist(p_next);
    next = (j+1) modulo m;  /* next group to process */
  end while;
  return best_NN;
Figure 4.4: The F-MQM algorithm

F-MQM is expected to perform well if the number of query groups is relatively small, minimizing the number of applications of the main memory algorithm. On the other hand, if there are numerous groups, the combination of the individual results may be expensive. Furthermore, as in the case of (main-memory) MQM, the algorithm may perform redundant computations if it encounters the same data point as a nearest neighbor of different query groups. A possible optimization is to keep each NN in memory, together with its distances to all groups, so that we avoid these computations if the same point is encountered later through another group. This, however, may not be possible if the main memory size is limited.
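The round-robin combination described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the groups are plain in-memory lists instead of disk pages, the per-group incremental retrieval (MBM over an R-tree in the paper) is replaced by a brute-force sort, and the names `f_mqm`, `dist_to_group`, `pending` and `seen` are hypothetical. The `seen` set is a simplified variant of the optional optimization of remembering points already encountered through another group.

```python
import math

def dist_to_group(p, group):
    """dist(p, Q_j): sum of Euclidean distances from p to the points of one group."""
    return sum(math.dist(p, q) for q in group)

def f_mqm(P, groups):
    """Sketch of F-MQM: returns (best_NN, best_dist) for data points P (tuples)."""
    m = len(groups)
    # Assumption: a sorted list stands in for the incremental per-group GNN search.
    streams = [iter(sorted(P, key=lambda p, g=g: dist_to_group(p, g))) for g in groups]
    thresholds = [0.0] * m      # t_j = dist(p_j, Q_j) of the last neighbor of Q_j
    pending = {}                # point -> [curr_dist, number of groups accumulated]
    seen = set()                # points already retrieved through some group
    best_nn, best_dist = None, math.inf
    j = 0
    while sum(thresholds) < best_dist:          # global threshold T < best_dist
        group = groups[j]                       # "read next group Q_j"
        p_new = next(streams[j], None)          # next nearest neighbor of Q_j
        thresholds[j] = math.inf if p_new is None else dist_to_group(p_new, group)
        if p_new is not None and p_new not in seen:
            seen.add(p_new)
            pending[p_new] = [0.0, 0]
        # Lazily extend the partial distance of every candidate with group Q_j.
        for p in list(pending):
            pending[p][0] += dist_to_group(p, group)
            pending[p][1] += 1
            if pending[p][1] == m:              # global distance is now complete
                total = pending.pop(p)[0]
                if total < best_dist:
                    best_nn, best_dist = p, total
        j = (j + 1) % m                         # next group, round robin
    return best_nn, best_dist

P = [(0, 0), (5, 5), (10, 10), (2, 3)]
groups = [[(4, 4), (6, 6)], [(5, 4)], [(6, 5), (4, 6)]]
print(f_mqm(P, groups)[0])  # → (5, 5)
```

In this toy input the point (5, 5) is the first neighbor of every group, so its global distance is complete after one round and the sum of the thresholds already equals it, which stops the loop.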
4.3 F-MBM

We can extend both SPM and MBM for the case that Q does not fit in memory. Since, as shown in the experiments, MBM is more efficient, here we describe F-MBM, an adaptation of the minimum bounding method. First, the points of Q are sorted by their Hilbert value and are inserted in pages according to this order. A page Q_i contains n_i points (it is possible that the number of points differs, e.g., the last page may be half-full). For each group Q_i, we keep in memory its MBR M_i and n_i (but not its contents). F-MBM descends the R-tree of P (in DF or BF traversal), only following nodes that may contain qualifying points. Given that we have the values of M_i and n_i for each query group in memory, we can quickly identify qualifying nodes as follows.

Heuristic 5: Let best_dist be the distance of the best GNN found so far and M_i be the MBR of group Q_i. A node N can be safely pruned if:

$$\sum_{Q_i \in Q} n_i \cdot mindist(N, M_i) \ge best\_dist$$

We refer to the left part of the inequality as the weighted mindist of N. Figure 4.5 shows an example, where 5 query points are split into two groups with MBRs M_1, M_2 and best_dist = 20. According to heuristic 5, N can be pruned because its weighted mindist (2·mindist(N,M_1) + 3·mindist(N,M_2)) is 20, and it cannot contain a better NN.

[Figure 4.5: Example of heuristic 5]

When a leaf node N is reached, we have to compute the global distance of its data points with all groups. Initially the current distance curr_dist(p_j) of each point p_j in N is set to 0. Then, for each new group Q_i (1 ≤ i ≤ m) that is loaded in memory, curr_dist(p_j) is updated as curr_dist(p_j) + dist(p_j,Q_i). We can reduce the CPU overhead of the distance computations based on the following heuristic.

Heuristic 6: Let curr_dist(p_j) be the accumulated distance of data point p_j with respect to groups Q_1, .., Q_{i-1}. Then, p_j can be safely excluded from further consideration if:

$$curr\_dist(p_j) + \sum_{l=i}^{m} n_l \cdot mindist(p_j, M_l) \ge best\_dist$$

Figure 4.6 shows an example of heuristic 6, where the first group Q_1 has been processed and curr_dist(p_j) = dist(p_j,Q_1) = 5+3. Point p_j is not compared with the query points of Q_2, since 8 + 3·mindist(p_j,M_2) = 20 is already equal to best_dist. Thus, p_j will not be considered for further computations (i.e., when subsequent groups are loaded in memory).

[Figure 4.6: Example of heuristic 6]
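The two pruning tests can be checked numerically with a short Python sketch. The coordinates below are hypothetical (the figures give no coordinates); they are chosen only so that the bounds reproduce the values quoted for Figures 4.5 and 4.6. Rectangles are represented as ((x_lo, y_lo), (x_hi, y_hi)), and the helper names are assumptions made for this illustration.

```python
import math

def mindist_rect_rect(a, b):
    """Minimum distance between two axis-parallel rectangles (0 if they intersect)."""
    dx = max(b[0][0] - a[1][0], a[0][0] - b[1][0], 0.0)
    dy = max(b[0][1] - a[1][1], a[0][1] - b[1][1], 0.0)
    return math.hypot(dx, dy)

def mindist_point_rect(p, rect):
    """Minimum distance between a point and an axis-parallel rectangle."""
    return mindist_rect_rect((p, p), rect)

def weighted_mindist(node_mbr, groups):
    """Left side of heuristic 5; groups is a list of (M_i, n_i) pairs."""
    return sum(n * mindist_rect_rect(node_mbr, mbr) for mbr, n in groups)

def prune_node(node_mbr, groups, best_dist):
    """Heuristic 5: the node cannot contain a better GNN."""
    return weighted_mindist(node_mbr, groups) >= best_dist

def prune_point(p, curr_dist, remaining_groups, best_dist):
    """Heuristic 6: remaining_groups holds (M_l, n_l) for the groups not yet read."""
    bound = curr_dist + sum(n * mindist_point_rect(p, mbr) for mbr, n in remaining_groups)
    return bound >= best_dist

# Hypothetical layout matching Figure 4.5: 2 and 3 query points, best_dist = 20.
N = ((0.0, 0.0), (1.0, 1.0))
M1 = ((5.0, 0.0), (6.0, 1.0))    # mindist(N, M1) = 4
M2 = ((0.0, 5.0), (1.0, 6.0))    # mindist(N, M2) = 4
print(weighted_mindist(N, [(M1, 2), (M2, 3)]))      # → 20.0
print(prune_node(N, [(M1, 2), (M2, 3)], 20.0))      # → True

# Hypothetical layout matching Figure 4.6: curr_dist = 5 + 3 = 8, mindist(p_j, M_2) = 4.
p_j = (0.0, 0.0)
M2_b = ((4.0, 0.0), (5.0, 1.0))
print(prune_point(p_j, 8.0, [(M2_b, 3)], 20.0))     # → True
```

Both tests use "greater than or equal", so a node or point whose bound exactly equals best_dist is discarded, as in the two examples of the text.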

The final clarification regards the order according to which qualifying nodes and query groups are accessed. For nodes we use the weighted mindist, based on the intuition that nodes with small values are likely to lead to neighbors with small global distance, so that subsequent visits can be pruned by heuristic 5. When a leaf node N has been reached, each group Q_i is read in memory in descending order of mindist(N,M_i). The motivation is that groups that are far from the node are likely to prune numerous data points (thus saving the distance computations for these points with respect to other groups). Figure 4.7 shows the pseudo-code of F-MBM based on DF traversal (the BF implementation is similar).

F-MBM(Node: R-tree node, Q: group of query points)  /* Q consists of {Q_1, .., Q_m} that fit in memory */
  if Node is an intermediate node
    sort entries N_j in Node (according to weighted mindist) in list;
    repeat
      get_next entry N_j from list;
      if weighted mindist(N_j) < best_dist  /* N_j passes heuristic 5 */
        F-MBM(N_j, Q);  /* Recursion */
    until weighted mindist(N_j) ≥ best_dist or end of list;
  else if Node is a leaf node
    sort points p_j in Node (according to weighted mindist) in list;
    for each point p_j in list: curr_dist(p_j) = 0;  /* initialization */
    sort groups Q_i in descending order of mindist(Node, M_i);
    repeat
      read next group Q_i (1 ≤ i ≤ m);
      for each point p_j in list
        if curr_dist(p_j) + Σ_{l=i..m} n_l · mindist(p_j, M_l) ≥ best_dist
          remove p_j from list;  /* p_j fails heuristic 6 */
        else  /* p_j passes heuristic 6 */
          curr_dist(p_j) = curr_dist(p_j) + dist(p_j, Q_i);
    until weighted mindist(p_j) ≥ best_dist or end of list or end of groups;
    for each point p that remains in list  /* after termination of loops */
      if curr_dist(p) < best_dist
        best_NN = p;  // Update current GNN
        best_dist = curr_dist(p);
  return best_NN;
Figure 4.7: The F-MBM algorithm

Starting from the root of the R-tree of P, entries are sorted by their weighted mindist and visited (recursively) in this order. Once the first node that fails heuristic 5 is found, all subsequent nodes in the sorted list can also be pruned. For leaf nodes, if a point violates heuristic 6, it is removed from the list and is not compared with subsequent groups. The extension to k NN is straightforward.

5. Experiments

In this section we evaluate the efficiency of the proposed algorithms, using two real datasets: (i) PP [Web1] with 24493 populated places in North America, and (ii) TS [Web2], which contains the centroids of 194971 MBRs representing streams (poly-lines) of Iowa, Kansas, Missouri and Nebraska. For all experiments we use a Pentium 2.4GHz CPU with 1 GByte memory. The page size of the R*-trees [BKSS90] is set to 1 KByte, resulting in a capacity of 50 entries per node. All implementations are based on the best-first traversal. Both versions of MQM and GCP require BF due to their incremental behavior. SPM and MBM (or F-MBM) could also be used with DF.

5.1 Comparison of algorithms for memory-resident queries

We first compare the methods of Section 3 (MQM, SPM and MBM) for main-memory queries. For this purpose, we use workloads of 100 queries. Each query has a number n of points, distributed uniformly in an MBR of area M, which is randomly generated in the workspace of P. The values of n and M are identical for all queries in the same workload (i.e., the only change between two queries in the same workload is the position of the query MBR). First we study the effect of the cardinality of Q, by fixing M to 8% of the workspace of P and the number k of retrieved group nearest neighbors to 8. Figure 5.1 shows the average number of node accesses (NA) and CPU cost as functions of n for datasets PP and TS.

[Figure 5.1: Cost vs. cardinality of Q (M=8%, k=8). Panels: (a) NA vs. n (PP dataset), (b) CPU vs. n (PP dataset), (c) NA vs. n (TS dataset), (d) CPU vs. n (TS dataset). Series: MQM, SPM, MBM. Horizontal axis: n from 4 to 1024; vertical axes (logarithmic): number of node accesses, CPU cost (sec).]

MQM is, in general, the worst method and its cost increases fast with the query cardinality, because this leads to multiple queries, some of which access the same nodes and retrieve the same points. These redundant computations affect both the node accesses and the CPU cost significantly (all diagrams are in logarithmic scale).
Although most queries access similar paths in the R-tree of P (and, therefore, MQM benefits from the existence of an LRU buffer), its total cost is still prohibitive for large n due to the high CPU overhead. On the other hand, the cardinality of Q has little effect on the node accesses of SPM and MBM because it does not play an important role in the pruning power of heuristic 1 (for SPM) and heuristics 2, 3 (for MBM). It affects, however, the CPU time, because the distance computations for qualifying data points increase with the number of query points. MBM is better than SPM due to the high pruning power of heuristic 3, as opposed to heuristic 1 (see footnote 3).

Footnote 3: We implemented a version of MBM with only heuristic 2 and we found it inferior to SPM. Nevertheless, heuristic 2 is useful (in conjunction with heuristic 3) because it reduces the CPU time requirements of the algorithm.

In order to measure the effect of the MBR size of Q, we set n=64, k=8 and vary M from 2% to 32% of the workspace of P. As shown in Figure 5.2, the cost (average NA and CPU time) of all algorithms increases with the query MBR. For MQM, the termination condition is that the total threshold T (i.e., sum of thresholds for each query point) should exceed best_dist, which, however, increases with the MBR size. Therefore, MQM retrieves more NNs for each query point. For SPM (MBM), the reason is the degradation of pruning power of heuristic 1 (heuristic 2 and 3) with the MBR size of Q.

[Figure 5.2: Cost vs. size of MBR of Q (n=64, k=8). Panels: (a) NA vs. M size (PP), (b) CPU vs. M size (PP), (c) NA vs. M size (TS), (d) CPU vs. M size (TS). Series: MQM, SPM, MBM. Horizontal axis: MBR size of Q (2%, 4%, 8%, 16%, 32%); vertical axes (logarithmic): number of node accesses, CPU cost (sec).]

Finally, in Figure 5.3, we set n=64, M=8% and vary the number k of retrieved neighbors from 1 to 32. The value of k does not influence the cost of any method significantly, because in most cases a large number of neighbors are found in the same node with a few extra computations. The relative performance of the algorithms is similar to the previous diagrams: MBM is clearly the most efficient method, followed by SPM.

[Figure 5.3: Cost vs. num. of retrieved NNs (n=64, M=8%). Panels: (a) NA vs. k (PP dataset), (b) CPU vs. k (PP dataset), (c) NA vs. k (TS dataset), (d) CPU vs. k (TS dataset). Series: MQM, SPM, MBM. Horizontal axis: k from 1 to 32; vertical axes (logarithmic): number of node accesses, CPU cost (sec).]

5.2 Comparison of algorithms for disk-resident queries

For this set of experiments we use both datasets (PP, TS) alternatively as query and data points. For GCP we assume that both datasets are indexed by R-trees, whereas for F-MQM and F-MBM, the dataset that plays the role of Q is sorted (according to Hilbert values) and split into blocks of 10000 points that fit in memory. The cost of sorting and building the R-trees is not taken into account. Since now the query cardinality n is fixed to that of the corresponding dataset, we perform experiments by varying the relative workspaces of the two datasets. First, we assume that the workspaces of P and Q have the same centroid, but the area M (of the MBR of Q) varies between 2% and 32% of the workspace of P (similar to the experiments of Figure 5.2). Figure 5.4 shows NA and CPU time assuming that PP is the query dataset and k=8. GCP has the worst performance and its cost increases fast with M for the reasons discussed in Section 4.1. When M exceeds 8% of the workspace of P, GCP does not terminate at all due to the huge heap requirements. The other two algorithms are more than an order of magnitude faster. F-MQM outperforms F-MBM, except for NA in the case of large (> 4%) query workspaces. The good performance of F-MQM (compared to the main-memory results) is due to the fact that the query set (PP) contains 24493 data points and, therefore, it generates only 3 query groups. Each query group is processed in memory (by MBM) and their results are combined with relatively small overhead.
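The group counts follow directly from the block size. As a small worked check (the function name `num_query_groups` is a hypothetical helper, and the last block is assumed to be allowed to be partially full), the 24493 points of PP yield 3 groups, while the TS dataset yields 20:

```python
import math

BLOCK_SIZE = 10_000  # points per query block, as stated in the experimental setup

def num_query_groups(cardinality, block_size=BLOCK_SIZE):
    """Number of memory-sized blocks needed to hold a query dataset."""
    return math.ceil(cardinality / block_size)

print(num_query_groups(24_493))   # PP as query set → 3
print(num_query_groups(194_971))  # TS as query set → 20
```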

[Figure 5.4: Cost vs. size of MBR of Q (k=8, P=TS, Q=PP). Panels: (a) NA vs. M size, (b) CPU vs. M size. Series: GCP, F-MQM, F-MBM. Horizontal axis: MBR area of Q (2%, 4%, 8%, 16%, 32%); vertical axes (logarithmic): number of node accesses, CPU time (sec).]

Figure 5.5 illustrates a similar experiment, where PP plays the role of the dataset and TS the role of the query set (recall that the cardinality of TS is almost an order of magnitude higher than that of PP). In this case F-MBM is clearly better, due to the large number (20) of query groups whose results must be combined by F-MQM. Comparing Figure 5.5 with 5.4, we observe that the performance of F-MBM is similar, while F-MQM is significantly worse. This is consistent with the main-memory behavior of MQM (Figure 5.1), where the cost increases fast with the cardinality of the query set. GCP is omitted from the diagrams because it incurs excessively high cost.

[Figure 5.5: Cost vs. size of MBR of Q (k=8, P=PP, Q=TS). Panels: (a) NA vs. M size, (b) CPU vs. M size. Series: F-MQM, F-MBM. Horizontal axis: MBR area of Q (2%, 4%, 8%, 16%, 32%); vertical axes (logarithmic): number of node accesses, CPU time (sec).]

In order to further investigate the effect of the relative workspace positions, for the next set of experiments we assume that both datasets lie in workspaces of the same size, and vary the overlap area between the workspaces from 0% (i.e., P and Q are totally disjoint) to 100% (i.e., on top of each other). Intermediate values are obtained by starting from the 100% case and shifting the query dataset on both axes. Figure 5.6 shows the cost of the algorithms assuming that Q=PP. The cost of all algorithms grows fast with the overlap area because it: (i) increases the number of potential candidates within the threshold of F-MQM, (ii) reduces the pruning power of the F-MBM heuristics and (iii) increases the number of closest pairs that must be output before the termination of GCP. F-MQM clearly outperforms F-MBM for up to 50% overlap. In order to explain this, let us consider the 0% overlap case assuming that the query workspace starts at the upper-right corner of the data workspace.
The nearest neighbors of all query groups must lie near this upper-right corner, since such points minimize the total distance. Therefore, F-MQM can find the best NN relatively fast, and terminate when all the points in or near the corner have been considered. On the other hand, because each query group has a large MBR (recall that it contains 10000 points), numerous nodes pass the pruning heuristic of F-MBM (i.e., they cannot be pruned) and are visited.

[Figure 5.6: Cost vs. overlap area (k=8, P=TS, Q=PP). Panels: (a) NA vs. overlap area, (b) CPU vs. overlap area. Series: GCP, F-MQM, F-MBM. Horizontal axis: overlap area (0%, 25%, 50%, 75%, 100%); vertical axes (logarithmic): number of node accesses, CPU time (sec).]

Figure 5.7 repeats the experiment by setting Q=TS. The clear winner is F-MBM, again due to the numerous queries that must be performed by F-MQM. We also performed experiments by varying the number of neighbors retrieved, while keeping the other parameters fixed. As in the case of main-memory queries, k does not have a significant effect on performance (and the diagrams are omitted).

[Figure 5.7: Cost vs. overlap area (k=8, P=PP, Q=TS). Panels: (a) NA vs. overlap area, (b) CPU vs. overlap area. Series: F-MQM, F-MBM. Horizontal axis: overlap area (0%, 25%, 50%, 75%, 100%); vertical axes (logarithmic): number of node accesses, CPU time (sec).]

In summary, the best algorithm for disk-resident queries depends on the number of query groups. F-MQM is usually preferable when the query dataset is partitioned in a small number of groups; otherwise, F-MBM is better. GCP has very poor performance in all cases. We also experimented with an alternative version of MBM that uses an R-tree on Q (instead of Hilbert sorting). The technique, however, did not provide performance benefits because for each qualifying point of P we have to compute its accumulated distance to all query points anyway.

6. Conclusion

Given a dataset P and a group of query points Q, a group nearest neighbor query retrieves the point of P that minimizes the sum of distances to all points in Q. In this paper we describe several algorithms for processing such queries, including main-memory and disk-resident Q, and experimentally evaluate their performance under a variety of settings. Since the problem is by definition expensive, the performance of different algorithms normally varies up to orders of magnitude, which motivates efficient processing methods. In the future we intend to explore the application of related techniques to variations of group nearest neighbor search. Consider, for instance, that Q represents a set of facilities and the goal is to assign each object of P to a single facility so that the sum of distances (of each object to its nearest facility) is minimized. Additional constraints (e.g., a facility may serve at most k users) may further complicate the solutions. Similar problems have been studied in the context of clustering and resource allocation, but the proposed methods are different from the ones presented in this paper. Furthermore, it would be interesting to study other distance metrics (e.g., network distance) that necessitate alternative pruning heuristics and algorithms.

Acknowledgements

This work was supported by grant HKUST 6180/03E from Hong Kong RGC.

References

[AMN+98] Arya, S., Mount, D., Netanyahu, N., Silverman, R., Wu, A. An Optimal Algorithm for Approximate Nearest Neighbor Searching. Journal of the ACM, 45(6): 891-923, 1998.
[AY01] Aggarwal, C., Yu, P. Outlier Detection for High Dimensional Data. SIGMOD, 2001.
[B00] Bohm, C. A Cost Model for Query Processing in High Dimensional Data Spaces. TODS, 25(2): 129-178, 2000.
[BCG02] Bruno, N., Chaudhuri, S., Gravano, L. Top-k Selection Queries over Relational Databases: Mapping Strategies and Performance Evaluation. TODS, 27(2): 153-187, 2002.
[BGRS99] Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U. When Is Nearest Neighbor Meaningful? ICDT, 1999.
[BJKS02] Benetis, R., Jensen, C., Karciauskas, G., Saltenis, S. Nearest Neighbor and Reverse Nearest Neighbor Queries for Moving Objects. IDEAS, 2002.
[BKSS90] Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD, 1990.
[CMTV00] Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M. Closest Pair Queries in Spatial Databases. SIGMOD, 2000.
[F02] Fagin, R. Combining Fuzzy Information: an Overview. SIGMOD Record, 31(2): 109-118, 2002.
[FLN01] Fagin, R., Lotem, A., Naor, M. Optimal Aggregation Algorithms for Middleware. PODS, 2001.
[FSAA01] Ferhatosmanoglu, H., Stanoi, I., Agrawal, D., Abbadi, A. Constrained Nearest Neighbor Queries. SSTD, 2001.
[G84] Guttman, A. R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD, 1984.
[JMF99] Jain, A., Murthy, M., Flynn, P. Data Clustering: A Review. ACM Comp. Surveys, 31(3): 264-323, 1999.
[HS98] Hjaltason, G., Samet, H. Incremental Distance Join Algorithms for Spatial Databases. SIGMOD, 1998.
[HS99] Hjaltason, G., Samet, H. Distance Browsing in Spatial Databases. TODS, 24(2): 265-318, 1999.
[HYC01] Hochreiter, S., Younger, A.S., Conwell, P. Learning to Learn Using Gradient Descent. ICANN, 2001.
[KGT99] Kollios, G., Gunopulos, D., Tsotras, V. Nearest Neighbor Queries in a Mobile Environment. STDBM, 1999.
[KM00] Korn, F., Muthukrishnan, S. Influence Sets Based on Reverse Nearest Neighbor Queries. SIGMOD, 2000.
[KMS02] Korn, F., Muthukrishnan, S., Srivastava, D. Reverse Nearest Neighbor Aggregates Over Data Streams. VLDB, 2002.
[NO97] Nakano, K., Olariu, S. An Optimal Algorithm for the Angle-Restricted All Nearest Neighbor Problem on the Reconfigurable Mesh, with Applications. IEEE Trans. on Parallel and Distributed Systems, 8(9): 983-990, 1997.
[PM97] Papadopoulos, A., Manolopoulos, Y. Performance of Nearest Neighbor Queries in R-trees. ICDT, 1997.
[PZMT03] Papadias, D., Zhang, J., Mamoulis, N., Tao, Y. Query Processing in Spatial Network Databases. VLDB, 2003.
[RKV95] Roussopoulos, N., Kelly, S., Vincent, F. Nearest Neighbor Queries. SIGMOD, 1995.
[S91] Sproull, R. Refinements to Nearest Neighbor Searching in K-Dimensional Trees. Algorithmica, 6(4): 579-589, 1991.
[SKS02] Shahabi, C., Kolahdouzan, M., Sharifzadeh, M. A Road Network Embedding Technique for K-Nearest Neighbor Search in Moving Object Databases. ACM GIS, 2002.
[SR01] Song, Z., Roussopoulos, N. K-Nearest Neighbor Search for Moving Query Point. SSTD, 2001.
[SYUK00] Sakurai, Y., Yoshikawa, M., Uemura, S., Kojima, H. The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation. VLDB, 2000.
[TP02] Tao, Y., Papadias, D. Time Parameterized Queries in Spatio-Temporal Databases. SIGMOD, 2002.
[TP03] Tao, Y., Papadias, D. Spatial Queries in Dynamic Environments. ACM TODS, 28(2): 101-139, 2003.
[TPS02] Tao, Y., Papadias, D., Shen, Q. Continuous Nearest Neighbor Search. VLDB, 2002.
[Web1] www.maproom.psu.edu/dcw/
[Web2] dke.cti.gr/people/ytheod/research/datasets/
[WSB98] Weber, R., Schek, H.J., Blott, S. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB, 1998.
[YOTJ01] Yu, C., Ooi, B., Tan, K., Jagadish, H. Indexing the Distance: An Efficient Method to KNN Processing. VLDB, 2001.