Efficient mining of group patterns from user movement data


Data & Knowledge Engineering 57 (2006)

Yida Wang a, Ee-Peng Lim a,*, San-Yih Hwang b

a Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Technological University, Blk N4, 2a-32, Nanyang Avenue, Singapore
b Department of Information Management, National Sun Yat-Sen University, Kaohsiung 80424, Taiwan

Received 3 February 2005; received in revised form 3 February 2005; accepted 27 April 2005
Available online 31 May 2005

Abstract

In this paper, we present a new approach to derive groupings of mobile users based on their movement data. We assume that the user movement data are collected by logging location data emitted from mobile devices tracking users. We formally define a group pattern as a group of users that are within a distance threshold from one another for at least a minimum duration. To mine group patterns, we first propose two algorithms, namely AGP and VG-growth. Our first set of experiments shows that, when both the number of users and the logging duration are large, AGP and VG-growth are inefficient at mining group patterns of size two. We therefore propose a framework that summarizes user movement data before group pattern mining. In the second series of experiments, we show that the methods using location summarization significantly reduce the mining overhead for group patterns of size two. We conclude that the cuboid-based summarization methods give better performance when the summarized database is small compared to the original movement database. In addition, we evaluate the impact of parameters on the mining overhead.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Group pattern mining; Mobile data mining; Location summarization

* Corresponding author. E-mail addresses: wyd66@pmail.ntu.edu.sg (Y. Wang), aseplim@ntu.edu.sg (E.-P. Lim), syhwang@mis.nsysu.edu.tw (S.-Y. Hwang).

1. Introduction

1.1. Group pattern mining

People form groups for many different reasons. Within an organization, formal groups are formed to carry out designated tasks or assignments. Group members have well-defined roles, and the groups may exist until the tasks or assignments are completed. Informal groups, on the other hand, are formed due to emotional, social or psychological needs. Group members have less defined roles and the memberships are less stable. Group dynamics and its influence on individual decision making have been well studied by sociologists [7], and it has been shown that peer pressure and group conformity can affect the behaviors of individuals.

Such group behaviors can be very useful to different applications. For example, by knowing the groups a customer belongs to, retailers can derive common buying interests among customers and develop group-specific pricing models and marketing strategies. Group discounts and product recommendation can be introduced to encourage more purchases that lead to higher sales. In fighting against terrorism, analyzing user group patterns is one of the important tasks that help to reveal the links between terrorists and their roles in the group.

In the past, different ways to discover groups using clustering techniques have been proposed [22]. Very often, they are based on different definitions of a similarity measure that represents the closeness between users. For example, users may be grouped by their common interests, job features, education level, and other attributes. Users can also be grouped based on the transactions they perform. For example, Amazon.com groups users together by the common books they purchase. However, in many cases, these methods suffer from a common pitfall, i.e., members in a derived group may not even know one another. Such group derivation approaches are therefore not suitable for the many applications that require group members to be acquaintances.
In our research, we propose a new way to derive grouping knowledge by performing data mining on user movement log data. These movement data are assumed to be generated by mobile devices that track the locations of their owners as they move from one place to another. These devices are equipped with GPS (Global Positioning System) and other related positioning technologies. GPS can achieve positioning errors ranging from 10 to 20 m [5,6,32], while the Assisted-GPS technology further reduces errors to between 1 and 10 m [8]. There are also terrestrial-based positioning technologies on the popularly used cellular networks, such as AOA, DOA, and TOA, which can achieve positioning accuracy around m [33]. In the indoor environment, users can also be tracked by RFID, with RFID receivers at different locations sensing the signals from RFID tags.

We also assume that each user location, in the form of x-/y-/z-coordinates, can be logged at regular intervals over a period of time. In practice, this assumption may not hold: mobile devices may experience failures, they may be switched off by their owners from time to time, and the data collection times may not be synchronized across users. The assumption is nevertheless reasonable if data cleaning or data transformation is performed before applying our proposed mining algorithms.

To keep the discussion focused, we leave privacy and legal issues outside the scope of this paper. As logging user locations over time can affect the privacy of users, we believe that these issues should be addressed within a legal framework. Furthermore, for several practical situations related to safety and security, user movement logging is considered necessary and has already been done.

Mining user groups from movement data is a kind of spatio-temporal data mining. That is, we are interested in discovering groupings of users such that members of the same group are spatially close to one another for a significant amount of time. Such user groupings are also known as group patterns [27]. The grouping knowledge derived in this way is unique compared to the other approaches due to the following:

Physical proximity between group members: The group members are expected to be physically close to one another when they act as a group. Such characteristics are common among many types of groups, e.g., shopping pals, game partners, etc.

Temporal proximity between group members: The group members are expected to stay together for some meaningful duration when they act as a group. This characteristic distinguishes an ad hoc cluster of people who are physically close but unaware of one another from a group of people who come together for some planned activity(ies).

Intuitively, people who spend significant time together are expected to be aware of one another and to maintain regular contact. Hence, the group members are expected to exert much stronger influence on one another.

1.2. Research objectives and contributions

Our research aims to formalize the concept of group pattern based on user movement. In this paper, we formally define the notion of group pattern. We introduce max_dis and min_dur as the maximum physical distance threshold among group members and the minimum duration threshold for group members to stay together, respectively, for deriving group patterns. In addition, we define the weight of a group pattern as a measure of its interestingness. With a min_wei threshold, we define the notion of valid group pattern and the valid group pattern mining problem. Our main research contributions are summarized as follows:

Algorithms for valid group pattern mining. Two algorithms, AGP and VG-growth, are developed to mine valid group patterns.
While the AGP algorithm is derived from the Apriori algorithm for classical association rule mining, VG-growth adopts a mining strategy similar to the FP-growth algorithm and is based on a novel data structure known as the VG-graph.

Location summarization based algorithms for mining valid 2-groups. We observe in our experiments that the time taken by AGP and VG-growth to mine valid group patterns of size two (also known as valid 2-groups) dominates the total mining time, because both algorithms require a large number of user pairs to be examined, especially when the number of users is large. We therefore propose a group pattern mining framework that can accommodate different location summarization methods. Four different location summarization methods have been proposed to reduce the overhead of mining valid 2-groups.

Performance evaluation of group pattern mining algorithms. We conduct comprehensive experiments to evaluate the performance of all the proposed algorithms using datasets synthetically generated by IBM City Simulator [14]. We observe that VG-growth is much faster than AGP for mining valid k-groups, where k > 2, while the location summarization based algorithms are much more efficient than AGP and VG-growth for mining valid 2-groups.

In our previous work [27,28], we proposed the AGP and VG-growth algorithms for mining valid group patterns and one location summarization method, SLS, for efficiently mining valid 2-groups. In this paper, we provide the formal definition of the conditional VG-graph and give the correctness and completeness proofs of the VG-growth algorithm. Several other location summarization methods are also included to further reduce the overhead of mining valid 2-groups. Comprehensive experiments are conducted to compare the performance of the different location summarization based algorithms and the influence of relevant parameters.

1.3. Paper outline

The rest of the paper is organized as follows. The formal definition of the group pattern mining problem is given in Section 2. Section 3 describes two valid group pattern mining algorithms, AGP and VG-growth, and their performance results. In Section 4, we introduce a general framework to incorporate location summarization into valid 2-group mining. Our location summarization methods are introduced in Section 5. In Section 6, we present an experimental study on the location summarization based algorithms. We look at some related work in Section 7. Finally, we draw conclusions in Section 8.

2. Problem definition

2.1. Preliminaries

Group pattern mining is to be conducted on a user movement database defined by $D = (D_1, D_2, \ldots, D_M)$, where $D_i$ is a time series of tuples $(t, (x, y, z))$ denoting the x-, y- and z-coordinates of user $u_i$ at time $t$. We assume that there are $N$ time points in the time series, ranging from $0$ to $N - 1$. For simplicity, we denote the location of a user $u_i$ at time $t$ by $u_i[t].p$, and his/her x-, y-, and z-values at time $t$ by $u_i[t].x$, $u_i[t].y$ and $u_i[t].z$ respectively. A very small user movement database example is shown in Table 1.

Definition 1. Given a set of users $G$, a maximum distance threshold max_dis, and a minimum time duration threshold min_dur, a set of consecutive time points $[t_a, t_b]$ is called a valid segment of $G$, if
(1) $\forall u_i, u_j \in G$ and $\forall t$ with $t_a \le t \le t_b$: $d(u_i[t].p, u_j[t].p) \le \mathit{max\_dis}$.
(2) If $t_a > 0$, $\exists u_i, u_j \in G$ such that $d(u_i[t_a - 1].p, u_j[t_a - 1].p) > \mathit{max\_dis}$.
(3) If $t_b < N - 1$, $\exists u_i, u_j \in G$ such that $d(u_i[t_b + 1].p, u_j[t_b + 1].p) > \mathit{max\_dis}$.
(4) $(t_b - t_a + 1) \ge \mathit{min\_dur}$.

In other words, within a valid segment of a set of users $G$, all members must be close to one another for at least a minimum time duration (min_dur). The function $d()$ returns the distance between two points.¹ Furthermore, valid segments are maximal, as no two valid segments of

¹ In this paper, Euclidean distance is adopted. However, other distance metrics can be used for different applications, such as Manhattan distance and weighted Euclidean distance, as long as they satisfy the following four properties: (1) $d(i,j) \ge 0$; (2) $d(i,i) = 0$; (3) $d(i,j) = d(j,i)$; and (4) $d(i,j) \le d(i,h) + d(h,j)$.

Table 1
User movement database D
[Columns: t, x, y, z, with one block of rows per user.]

the same set of users can overlap each other. The thresholds, max_dis and min_dur, are used to define the spatial and temporal proximity requirements between members of a group. In particular, the min_dur threshold helps to weed out short-lived or accidental closeness between users. Consider the user movement database in Table 1. For min_dur = 3 and max_dis = 10, $[0,3]$ is a valid segment of the set of users $\{u_1, u_2\}$.

Definition 2. Given a set of users $G$ and thresholds max_dis and min_dur, we say that $G$, max_dis and min_dur form a group pattern, denoted by $P = \langle G, \mathit{max\_dis}, \mathit{min\_dur} \rangle$, if $G$ has a valid segment. The valid segments of a group pattern $P$ are those of its $G$ component. We also call a group pattern with $k$ users a k-group pattern.

Definition 3. Given two group patterns, $P = \langle G, \mathit{max\_dis}, \mathit{min\_dur} \rangle$ and $P' = \langle G', \mathit{max\_dis}, \mathit{min\_dur} \rangle$, $P'$ is called a sub-group pattern of $P$ if $G' \subseteq G$.

In a movement database, a group pattern may have multiple valid segments. The combined length of these valid segments is called the weight-count of the pattern. We therefore measure the significance of the pattern by comparing its weight-count with the overall time duration.

Definition 4. Let $P$ be a group pattern with valid segments $s_1, \ldots, s_n$. The weight-count and weight of $P$ are defined as

\[ \text{weight-count}(P) = \sum_{i=1}^{n} |s_i| \tag{1} \]

\[ \text{weight}(P) = \frac{\text{weight-count}(P)}{N} = \frac{\sum_{i=1}^{n} |s_i|}{N} \tag{2} \]

Since weight represents the proportion of the time points during which a group of users stay close together, the larger the weight, the more significant (or interesting) the group pattern is. Furthermore, if the weight of a group pattern is no less than a threshold min_wei, we call it a valid group pattern, and the corresponding group of users a valid group. For example, suppose min_wei = 50%. The group pattern $\langle \{u_2, u_4, u_5\}, 10, 3 \rangle$ is valid, since it has a single valid segment $[3,9]$ and a weight of $7/10 \ge 0.5$.

Definition 5. Given the thresholds max_dis, min_dur, and min_wei, the problem of finding all the valid group patterns (or simply valid groups) is known as valid group (pattern) mining.

2.2. Discussions

There are some similarities between group pattern mining and classical association rule mining. In the latter problem, the goal is to discover all frequent itemsets, where a frequent itemset is a set of items with support exceeding a minimum support threshold. There are however several key differences that render the direct application of association rule mining methods infeasible in valid group pattern mining.

There is no explicit concept of transaction in a movement database. The movement database consists of multiple time series of locations, one for each user. One can try to organize the locations recorded at one time point into some kind of transactions, with each transaction representing a set of locations that are not more than max_dis apart at that time point. For example, 12 transactions can be derived based on the locations at time point 3 in Table 1: $\{u_1,u_2\}$, $\{u_1,u_3\}$, $\{u_1,u_4\}$, $\{u_1,u_5\}$, $\{u_2,u_4\}$, $\{u_2,u_5\}$, $\{u_2,u_6\}$, $\{u_4,u_5\}$, $\{u_4,u_6\}$, $\{u_5,u_6\}$, $\{u_1,u_4,u_5\}$, and $\{u_2,u_4,u_5\}$. However, this grouping of user locations at each time point into transactions can lead to an extremely large number of transactions, especially for a large population. This transactionizing overhead can be prohibitive if there are many time points and users.
The weight defined for valid group pattern mining is very different from the support defined in association rule mining. Transactionizing the movement database does not address the weight counting problem. For transactions derived from a single time point, we should not double count the transactions that contain the same set of users. In the above example, user pair $\{u_4,u_5\}$ is contained in three derived transactions $\{u_4,u_5\}$, $\{u_1,u_4,u_5\}$ and $\{u_2,u_4,u_5\}$. But the weight-count of $\{u_4,u_5\}$ should be incremented by only 1, instead of 3, since all three transactions occur at the same time point 3.

Therefore, it is necessary to design new algorithms for mining valid group patterns.
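Definitions 1 and 4 can be made concrete with a short sketch. The code below is only an illustration under stated assumptions: the function names `valid_segments` and `weight`, the in-memory list layout of the tracks, and the toy coordinates are all hypothetical and do not come from the paper. It scans the time points once, keeps maximal runs of closeness that last at least min_dur, and counts every time point at most once when computing the weight.

```python
from itertools import combinations
from math import dist


def valid_segments(tracks, max_dis, min_dur):
    """Return the valid segments [(t_a, t_b), ...] of the users in `tracks`.

    `tracks` is a list of equal-length position lists, one per user;
    tracks[i][t] is the (x, y, z) position of user i at time point t.
    """
    n_points = len(tracks[0])
    segments = []
    start = None  # first time point of the current run of closeness
    for t in range(n_points + 1):
        # The extra iteration t == n_points closes a run that reaches the end.
        close = t < n_points and all(
            dist(a[t], b[t]) <= max_dis for a, b in combinations(tracks, 2)
        )
        if close and start is None:
            start = t
        elif not close and start is not None:
            # The maximal run [start, t - 1] ends here; keep it if long enough.
            if t - start >= min_dur:
                segments.append((start, t - 1))
            start = None
    return segments


def weight(segments, n_points):
    """Weight-count divided by N; each time point is counted at most once."""
    weight_count = sum(t_b - t_a + 1 for t_a, t_b in segments)
    return weight_count / n_points


# Hypothetical toy data: one user stands still, the other drifts away and back.
u_a = [(0, 0, 0)] * 10
u_b = [(x, 0, 0) for x in (5, 5, 5, 5, 50, 50, 8, 8, 50, 9)]
segs = valid_segments([u_a, u_b], max_dis=10, min_dur=3)
print(segs, weight(segs, 10))
```

In the toy data the two users are also close at time points 6, 7 and 9, but those runs are shorter than min_dur and therefore contribute nothing to the weight.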

3. Group pattern mining algorithms

In [27], we proposed two algorithms to mine group patterns, known as the Apriori-like Group Pattern mining (AGP) algorithm and the Valid Group-Growth (VG-growth) algorithm. The former exploits the Apriori property of valid group patterns and extends the Apriori algorithm [3] to mine valid group patterns. The latter is based on an idea similar to the FP-growth algorithm. Both the Apriori and FP-growth algorithms were originally designed for association rule mining. In the following, we present the proposed AGP and VG-growth algorithms.

3.1. Apriori-like group pattern mining (AGP) algorithm

The AGP algorithm is built upon the Apriori property, which also holds for group patterns.

Property 1 (Apriori property of group patterns). If a group pattern is valid, then all of its sub-group patterns are valid as well.

Proof. Given min_wei and a group pattern $P = \langle G, \mathit{max\_dis}, \mathit{min\_dur} \rangle$, if $P$ is a valid group pattern, then $\frac{\sum_i |s_i|}{N} \ge \mathit{min\_wei}$, where the $s_i$ are the valid segments of $P$. Let $P'$ denote any sub-group pattern of $P$. Note that, for each valid segment $s_i$ of $P$, there must exist a valid segment $s'_j$ of $P'$ such that $s_i \subseteq s'_j$, and so $\sum_j |s'_j| \ge \sum_i |s_i|$. Hence, we have $\frac{\sum_j |s'_j|}{N} \ge \frac{\sum_i |s_i|}{N} \ge \mathit{min\_wei}$. That is, the sub-group pattern $P'$ is also a valid group pattern. □

The AGP algorithm is shown in Fig. 1. We use $C_k$ to denote the set of candidate k-groups, and $G_k$ to denote the set of valid k-groups. The AGP algorithm starts by mining $G_1$, the set of all distinct users. It then uses $G_1$ to find $G_2$, which in turn is used to find $G_3$. The process repeats until no more valid k-groups can be found. In each iteration, the Apriori property is used to generate candidate groups of larger size and to prune the unpromising candidate groups. However, there are two key differences between AGP and the classical Apriori algorithm:

(1) Instead of examining whether a transaction contains a candidate itemset, the AGP algorithm tests whether the users in a candidate group are close to one another at a given time point.
(2) Instead of simply incrementing support counts, the AGP algorithm accumulates the lengths of all valid segments so as to compute the weight of a candidate group.

For example, suppose we want to mine valid group patterns from D (see Table 1) with max_dis = 10, min_dur = 3, and min_wei = 50%. $G_1$ is first assigned the set $\{\{u_1\},\{u_2\},\{u_3\},\{u_4\},\{u_5\},\{u_6\}\}$. We then generate $C_2$ by a join operation, which is the same as that in the Apriori algorithm:

$C_2 = \{\{u_1,u_2\}, \{u_1,u_3\}, \{u_1,u_4\}, \{u_1,u_5\}, \{u_1,u_6\}, \{u_2,u_3\}, \{u_2,u_4\}, \{u_2,u_5\}, \{u_2,u_6\}, \{u_3,u_4\}, \{u_3,u_5\}, \{u_3,u_6\}, \{u_4,u_5\}, \{u_4,u_6\}, \{u_5,u_6\}\}$.

Then we scan D to compute the weights for each candidate 2-group and select the valid ones for $G_2$:

Fig. 1. Algorithm AGP.

$G_2 = \{\{u_1,u_2\}, \{u_1,u_4\}, \{u_1,u_5\}, \{u_2,u_4\}, \{u_2,u_5\}, \{u_2,u_6\}, \{u_4,u_5\}, \{u_4,u_6\}\}$.

From $G_2$, we generate $C_3$:

$C_3 = \{\{u_1,u_2,u_4\}, \{u_1,u_2,u_5\}, \{u_1,u_4,u_5\}, \{u_2,u_4,u_5\}, \{u_2,u_4,u_6\}, \{u_2,u_5,u_6\}, \{u_4,u_5,u_6\}\}$.

Note that $\{u_2,u_5,u_6\}$ and $\{u_4,u_5,u_6\}$ are subsequently pruned from $C_3$ since they have an invalid sub-group ($\{u_5,u_6\}$) which is not in $G_2$. After scanning D again to compute the weights, we obtain $G_3$:

$G_3 = \{\{u_1,u_4,u_5\}, \{u_2,u_4,u_5\}\}$.

The algorithm terminates here and the discovered valid groups are $G_2 \cup G_3$.

Time complexity analysis. In the Generate_Candidate_Groups procedure, each call to the Has_Invalid_Subgroups procedure (Line 05) requires $O(k \cdot |G_{k-1}|)$ time in the worst case. With the two loops in Lines 01 and 02, the time complexity of the Generate_Candidate_Groups procedure is $O(k \cdot |G_{k-1}|^3)$. In the main algorithm, the database scan that computes the weights costs² $O(M \cdot N,\ N \cdot |C_k| \cdot \binom{k}{2}) = O(M \cdot N,\ N \cdot |C_k| \cdot k^2)$, where $M$ is the number of distinct users and $N$ is the whole time span of D. Line 13 selects the valid groups, which costs $O(|C_k|)$. In total, the time cost of the AGP algorithm is $O\left(\sum_k \{k \cdot |G_{k-1}|^3,\ M \cdot N,\ N \cdot |C_k| \cdot k^2\}\right)$. This is a main memory based analysis, which does not consider the disk access overhead. That is, both the movement database and the candidate sets are assumed to reside in main memory. Note that the $(N \cdot |C_k| \cdot k^2)$ component represents the overhead of scanning D to check the distance between every two users of every candidate group. This is the main bottleneck of all Apriori-like algorithms, and we therefore develop the VG-growth algorithm to reduce this bottleneck.

3.2. VG-growth: an algorithm based on valid group graph

The AGP algorithm, like the original Apriori algorithm, involves much overhead in candidate k-group generation and multiple database scans. In [11], Han et al. proposed a novel data structure known as the FP-tree and a divide-and-conquer algorithm, FP-growth, that mines association rules without the above overhead. In this section, we borrow the idea and develop the Valid Group Graph data structure and the VG-growth algorithm.

In an FP-tree (Frequent Pattern tree), each node represents a frequent item, and the frequent items are ordered in support-descending order so that the more frequently occurring items are more likely to be shared and are thus located closer to the top of the FP-tree. The FP-growth method starts from a frequent item (as an initial suffix pattern), examines only its conditional pattern base (a sub-database which contains the set of frequent items co-occurring with the suffix pattern), constructs its conditional FP-tree, and performs mining recursively with such a tree. The major operations of mining are count accumulation and prefix count adjustment, which are usually much less costly than the candidate generation and pattern matching operations performed in Apriori-like algorithms.

² We use $O(A,B,C,\ldots)$ to denote $\max\{O(A),O(B),O(C),\ldots\}$.
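For contrast, the candidate generation step that Apriori-like mining pays for at every level can be sketched as follows. This is a minimal illustration, not the Generate_Candidate_Groups procedure of Fig. 1 (whose pseudocode is not reproduced in this text): the function name `generate_candidates` is hypothetical, users are represented by their integer ids, and the join and the Apriori pruning are folded into one loop rather than kept as separate procedures.

```python
from itertools import combinations


def generate_candidates(valid_prev):
    """Join valid (k-1)-groups sharing their first k-2 users, then prune."""
    prev = sorted(tuple(sorted(g)) for g in valid_prev)
    prev_set = set(prev)
    candidates = []
    for g1, g2 in combinations(prev, 2):
        if g1[:-1] != g2[:-1]:
            continue  # join only groups that differ in their last user
        cand = g1 + (g2[-1],)
        # Apriori pruning: every (k-1)-subgroup must itself be valid.
        if all(sub in prev_set for sub in combinations(cand, len(cand) - 1)):
            candidates.append(cand)
    return candidates


# G_2 from the running example, with u_i written as the integer i.
G2 = [(1, 2), (1, 4), (1, 5), (2, 4), (2, 5), (2, 6), (4, 5), (4, 6)]
print(generate_candidates(G2))
```

On the running example the join produces seven 3-groups, and the pruning test discards (2, 5, 6) and (4, 5, 6) because (5, 6) is not in G_2, which leaves the five candidates whose weights still have to be checked by another database scan.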

As we extend the FP-tree structure and the FP-growth algorithm to valid group pattern mining, the key differences between association rule mining and valid group pattern mining have to be considered. Moreover, some basic concepts adopted by FP-tree and FP-growth have to change as described below.

Each node in an FP-tree is a frequent item, which is also the smallest unit in association rule mining. In valid group pattern mining, the smallest unit is a valid 2-group. A direct construction of an FP-tree-like structure based on valid 2-groups would however lead to an excessive number of nodes in the tree.

The weight used in valid group pattern mining is more complicated than the support measure. Hence, it is necessary to store the list of valid segments for each valid 2-group so as to derive the weight of valid groups of larger sizes.

3.2.1. Valid group (VG) graph

In this section, we define the valid group graph on which our proposed VG-growth algorithm operates to mine valid group patterns.

Definition 6. A valid group graph (VG-graph) is a weighted directed graph $(V, E, s)$, where
(1) Each vertex in $V$ represents a user who participates in some valid 2-group, i.e., $V = \{u_i \mid u_i \in G, G \in G_2\}$. This set of users is also known as valid users.
(2) Each weighted directed edge in $E$ represents a valid 2-group, and the direction of an edge always originates from the vertex with the smaller user id.
(3) $s$ is a weighting function that maps each edge in $E$ to its valid segments.

Consider D in Table 1. Assume that max_dis = 10, min_dur = 3 and min_wei = 50%. We construct the VG-graph of D using a modified AGP algorithm that stores valid segments as it computes $G_2$. $G_2$ is shown in Table 2 and the constructed VG-graph is shown in Fig. 2. Note that a VG-graph can be constructed for a movement database by first deriving the set of all valid 2-groups, which requires only one scan of the movement database.
Table 2
G_2 and the valid segments

Valid 2-groups    Valid segment lists
{u_1,u_2}         s(u_1,u_2) = {[0,3],[7,9]}
{u_1,u_4}         s(u_1,u_4) = {[3,9]}
{u_1,u_5}         s(u_1,u_5) = {[1,3],[5,9]}
{u_2,u_4}         s(u_2,u_4) = {[0,9]}
{u_2,u_5}         s(u_2,u_5) = {[3,9]}
{u_2,u_6}         s(u_2,u_6) = {[0,3],[5,7]}
{u_4,u_5}         s(u_4,u_5) = {[3,9]}
{u_4,u_6}         s(u_4,u_6) = {[0,6]}
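One possible in-memory layout for the VG-graph of Definition 6 is sketched below, populated with the entries of Table 2. The names `TABLE_2`, `build_vg_graph` and `weight_count` are illustrative assumptions, users are keyed by their integer ids, and N = 10 is inferred from the 7/10 weight quoted in the running example rather than stated alongside the table.

```python
# Valid 2-groups and segment lists copied from Table 2; users are keyed by id.
TABLE_2 = {
    (1, 2): [(0, 3), (7, 9)],
    (1, 4): [(3, 9)],
    (1, 5): [(1, 3), (5, 9)],
    (2, 4): [(0, 9)],
    (2, 5): [(3, 9)],
    (2, 6): [(0, 3), (5, 7)],
    (4, 5): [(3, 9)],
    (4, 6): [(0, 6)],
}
N = 10  # assumed: time points 0..9, as implied by the 7/10 weight example


def build_vg_graph(valid_2_groups):
    """Return (vertices, incoming) for a VG-graph stored as adjacency lists.

    Each edge runs from the smaller user id to the larger one; incoming[v]
    maps every lower-id neighbour u of v to the segment list s(u, v).
    """
    vertices = sorted({u for pair in valid_2_groups for u in pair})
    incoming = {v: {} for v in vertices}
    for (u, v), segments in valid_2_groups.items():
        lo, hi = min(u, v), max(u, v)
        incoming[hi][lo] = segments
    return vertices, incoming


def weight_count(segments):
    """Total number of time points covered by a list of closed segments."""
    return sum(t_b - t_a + 1 for t_a, t_b in segments)


vertices, incoming = build_vg_graph(TABLE_2)
print(vertices)             # u_3 is absent: it belongs to no valid 2-group
print(sorted(incoming[5]))  # lower-id neighbours of u_5
```

Storing the edges by their higher-id endpoint is a design choice made here for convenience: it gives direct access to the lower-id neighbours of a vertex, which is the lookup needed when vertices are processed in id order.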

Fig. 2. The VG-graph for Table 1. [Vertices $u_1$, $u_2$, $u_4$, $u_5$, $u_6$; each directed edge is labelled with its valid segment list: $s(u_1,u_2)$, $s(u_1,u_4)$, $s(u_1,u_5)$, $s(u_2,u_4)$, $s(u_2,u_5)$, $s(u_2,u_6)$, $s(u_4,u_5)$, $s(u_4,u_6)$.]

Assuming an adjacency list representation, the space required for storing a VG-graph can be determined by:

\[ |\text{VG-graph}| = \alpha |V| + \beta |E| + vsl \tag{3} \]

where $\alpha$ and $\beta$ are the space required for storing a vertex and an edge respectively, and $vsl$ is the space required for storing the valid segment lists. Note that $|V| \le M$, $|E| = |G_2|$ and $vsl \le 2\delta |G_2| \left\lfloor \frac{N}{\mathit{min\_dur}+1} \right\rfloor$, where $\delta$ is the space required for storing one time stamp, and the number of valid segments for a valid 2-group is at most $\left\lfloor \frac{N}{\mathit{min\_dur}+1} \right\rfloor$. In addition, we define the compression ratio of a VG-graph as:

\[ \text{compression ratio of VG-graph} = \frac{|\text{VG-graph}|}{|D|} \tag{4} \]

where $|D|$ denotes the size of the original movement database.

Property 2. Given a movement database D and thresholds max_dis, min_dur and min_wei, its corresponding VG-graph contains the complete information of D relevant to valid group pattern mining.

Proof. In the VG-graph construction process, all the valid 2-groups, together with their valid segments, are stored in the VG-graph. From Property 1, we know that if a k-group ($k \ge 2$) pattern is valid, then all of its 2-subgroup patterns are valid as well. That is to say, each valid k-group can be generated from some valid 2-groups. Moreover, we can check the validity of a k-group by examining the intersections among the valid segments of all its 2-subgroups. Thus the property holds. □

3.2.2. VG-growth algorithm

In this subsection, we present the VG-growth algorithm, which uses the compact information in the VG-graph to mine the complete set of valid groups.
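As a preview, the sketch below shows how segment-list intersection (the operation Property 2 relies on) can drive a recursive traversal of the VG-graph. It is a compact illustration under stated assumptions, not the authors' pseudocode from Fig. 4: the names `intersect`, `adjust` and `vg_growth` are hypothetical, the graph is a plain dictionary keyed by (lower id, higher id), the thresholds are those of the running example, and N = 10 is inferred from that example.

```python
# Edges of the VG-graph with their valid segment lists, copied from Table 2.
TABLE_2 = {
    (1, 2): [(0, 3), (7, 9)], (1, 4): [(3, 9)], (1, 5): [(1, 3), (5, 9)],
    (2, 4): [(0, 9)], (2, 5): [(3, 9)], (2, 6): [(0, 3), (5, 7)],
    (4, 5): [(3, 9)], (4, 6): [(0, 6)],
}
N, MIN_DUR, MIN_WEI = 10, 3, 0.5  # N = 10 is an assumption (time points 0..9)


def intersect(a, b):
    """Intersect two sorted lists of closed integer intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def adjust(s_uv, s_uw, s_vw):
    """Intersect three segment lists and drop runs shorter than MIN_DUR."""
    common = intersect(intersect(s_uv, s_uw), s_vw)
    return [(lo, hi) for lo, hi in common if hi - lo + 1 >= MIN_DUR]


def vg_growth(edges, suffix=()):
    """Yield every valid group found in `edges`, a (conditional) VG-graph."""
    for v in sorted({hi for _, hi in edges}):
        prefix = sorted(u for u, hi in edges if hi == v)
        for u in prefix:
            yield (u, v) + suffix  # each edge stands for one valid group
        # Build the conditional graph of v over its lower-id neighbours.
        conditional = {}
        for i, a in enumerate(prefix):
            for b in prefix[i + 1:]:
                if (a, b) not in edges:
                    continue
                segs = adjust(edges[a, b], edges[a, v], edges[b, v])
                if sum(hi - lo + 1 for lo, hi in segs) >= MIN_WEI * N:
                    conditional[a, b] = segs
        if conditional:
            yield from vg_growth(conditional, (v,) + suffix)


print(sorted(vg_growth(TABLE_2), key=lambda g: (len(g), g)))
```

On the Table 2 data this yields the eight valid 2-groups plus (1, 4, 5) and (2, 4, 5), which matches the result $G_2 \cup G_3$ obtained by AGP in Section 3.1, without rescanning the movement database.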

Definition 7. If $(u \to v)$ is a directed edge in a VG-graph, $u$ is called a prefix-neighbor of $v$.

For example, in Fig. 2, $u_1$ and $u_2$ are the prefix-neighbors of $u_4$.

Definition 8. A suffix-group, denoted by $H$, is an ordered list of $k$ ($k \ge 0$) valid users in descending order of user id. In particular, the suffix-group is empty at the beginning of the VG-growth algorithm (see Line 05 in Fig. 4).

Definition 9. The conditional VG-graph of a suffix-group $H$ containing $k$ ($k \ge 0$) users is called a k-order conditional VG-graph, denoted by $VG^{(k)}(H) = (V_k, E_k, s_k)$, and can be constructed as follows.
(1) When $k = 0$, $VG^{(0)}(\emptyset)$ (i.e., the 0-order conditional VG-graph) is the VG-graph constructed from $G_2$.
(2) When $k \ge 1$, let $H = \{u_{a_1}, u_{a_2}, \ldots, u_{a_k}\}$, where $a_1 > a_2 > \cdots > a_k \ge 1$. Then $VG^{(k)}(H) = (V_k, E_k, s_k)$ can be constructed from $VG^{(k-1)}(H - \{u_{a_k}\}) = (V_{k-1}, E_{k-1}, s_{k-1})$ as:

\[ V_k = \{u_i \mid u_i \in V_{k-1},\ (u_i \to u_{a_k}) \in E_{k-1}\} \]

\[ E_k = \{(u_i \to u_j) \mid (u_i \to u_j) \in E_{k-1},\ u_i \in V_k,\ u_j \in V_k,\ |s_k(u_i,u_j)| \ge \mathit{min\_wei} \cdot N\} \tag{5} \]

where

\[ s_k(u_i,u_j) = \{s \mid s \in s_\cap,\ |s| \ge \mathit{min\_dur}\} \tag{6} \]

and

\[ s_\cap = s_{k-1}(u_i,u_j) \cap s_{k-1}(u_i,u_{a_k}) \cap s_{k-1}(u_j,u_{a_k}). \tag{7} \]

The VG-growth algorithm conducts a traversal of the VG-graph, visiting vertices according to their vertex ids. We illustrate the algorithm using the VG-graph in Fig. 2. The vertices are visited as follows:

Vertex $u_1$: Select the set of prefix-neighbors of vertex $u_1$, denoted by $V_{u_1}$. Since $V_{u_1}$ is empty, the mining process for $u_1$ terminates with no valid group generated.

Vertex $u_2$: Select the set of prefix-neighbors of vertex $u_2$, i.e., $V_{u_2} = \{u_1\}$. For each vertex $v$ in $V_{u_2}$, we generate a valid 2-group by concatenating $v$ with $u_2$, i.e., $\{u_2,u_1\}$. Select the set of edges on $V_{u_2}$, denoted by $E(V_{u_2})$. Here, $V_{u_2}$ contains only one vertex, and $E(V_{u_2}) = \emptyset$. The mining process for $u_2$ terminates.

Vertex $u_4$: $V_{u_4} = \{u_1,u_2\}$, which generates two valid 2-groups: $\{u_4,u_1\}$ and $\{u_4,u_2\}$. $E(V_{u_4}) = \{(u_1 \to u_2)\}$ with $s(u_1,u_2) = \{[0,3],[7,9]\}$. Adjust the valid segments of edge $(u_1 \to u_2)$ against $u_4$: $s(u_1,u_2) = s(u_1,u_2) \cap s(u_1,u_4) \cap s(u_2,u_4) = \{[7,9]\}$, after discarding the single time point $[3]$, which is shorter than min_dur. Since the adjusted valid segments do not meet the min_wei requirement, the mining process for $u_4$ terminates.

Vertex $u_5$: $V_{u_5} = \{u_1,u_2,u_4\}$. Generate three valid 2-groups: $\{u_5,u_1\}$, $\{u_5,u_2\}$, and $\{u_5,u_4\}$. Next, select the directed edges on $V_{u_5}$: $E(V_{u_5}) = \{(u_1 \to u_2), (u_1 \to u_4), (u_2 \to u_4)\}$ with associated segment lists $s(u_1,u_2) = \{[0,3],[7,9]\}$, $s(u_1,u_4) = \{[3,9]\}$, and $s(u_2,u_4) = \{[0,9]\}$. Now, we adjust the associated segment lists against $u_5$ as follows:

$s(u_1,u_2) = s(u_1,u_2) \cap s(u_1,u_5) \cap s(u_2,u_5) = \{[0,3],[7,9]\} \cap \{[1,3],[5,9]\} \cap \{[3,9]\} = \{[3],[7,9]\}$
$s(u_1,u_4) = s(u_1,u_4) \cap s(u_1,u_5) \cap s(u_4,u_5) = \{[3,9]\} \cap \{[1,3],[5,9]\} \cap \{[3,9]\} = \{[3],[5,9]\}$
$s(u_2,u_4) = s(u_2,u_4) \cap s(u_2,u_5) \cap s(u_4,u_5) = \{[0,9]\} \cap \{[3,9]\} \cap \{[3,9]\} = \{[3,9]\}$

Segments that do not meet the min_dur requirement are dropped, and edges whose adjusted segment lists do not meet the min_wei requirement are removed. The edge $(u_1 \to u_2)$ is removed in this step. $V_{u_5}$ and $E(V_{u_5})$ (after segment list adjustment) form the conditional VG-graph of $u_5$ ($VG(u_5)$), which contains three vertices $\{u_1,u_2,u_4\}$ and two edges $(u_1 \to u_4)$ and $(u_2 \to u_4)$ with associated segment lists. $u_5$ is a suffix-group as it will be incorporated as a suffix to every valid group found in $VG(u_5)$. We perform mining recursively on $VG(u_5)$. We have $V_{u_5 u_1} = V_{u_5 u_2} = \emptyset$ and $V_{u_5 u_4} = \{u_1,u_2\}$. From $V_{u_5 u_4}$, two valid 3-groups, $\{u_5,u_4,u_1\}$ and $\{u_5,u_4,u_2\}$, are derived. The mining process for $u_5$ then terminates; it is shown in Fig. 3.

Vertex $u_6$: From $V_{u_6} = \{u_2,u_4\}$, we derive two valid 2-groups: $\{u_6,u_2\}$ and $\{u_6,u_4\}$. $E(V_{u_6}) = \{(u_2 \to u_4)\}$ with $s(u_2,u_4) = \{[0,9]\}$. After adjustment, $s(u_2,u_4) = s(u_2,u_4) \cap s(u_2,u_6) \cap s(u_4,u_6) = \{[0,3],[5,6]\}$. The segment $[5,6]$ does not meet the min_dur requirement and is removed, leaving $[0,3]$ as the only valid segment. Since $[0,3]$ does not meet the min_wei requirement, the edge $(u_2 \to u_4)$ is removed. The mining process for $u_6$ therefore terminates.

After visiting all the vertices, VG-growth terminates. The complete VG-growth algorithm is shown in Fig. 4.

Lemma 1. Let $VG^{(k)}(H)$ be the conditional VG-graph of a suffix-group $H = \{u_{a_1}, \ldots, u_{a_k}\}$ ($k \ge 0$, $a_1 > a_2 > \cdots > a_k$). Then every edge $(u_i \to u_j)$ in $VG^{(k)}(H)$ represents a valid $(k+2)$-group $\{u_i, u_j, u_{a_1}, \ldots, u_{a_k}\}$.

Proof. We prove this lemma by induction.

[Base case]. When $k = 0$, $VG^{(0)}(\emptyset)$ is the original VG-graph. It is clear that every edge $(u_i \to u_j)$ in the original VG-graph represents a valid 2-group $\{u_i,u_j\}$, and the base case holds.

[Inductive hypothesis]. Suppose the lemma holds for some $n$, $0 \le n \le k$. That is, every edge $(u_i \to u_j)$ in $VG^{(n)}(H) = (V_n,E_n,s_n)$ with $H = \{u_{a_1}, \ldots, u_{a_n}\}$ represents a valid $(n+2)$-group $\{u_i, u_j, u_{a_1}, \ldots, u_{a_n}\}$.

Fig. 3. The mining process for vertex $u_5$. [Left: the edges $E(V_{u_5})$ picked among $u_1$, $u_2$, $u_4$. Right, after adjusting valid segments and discarding invalid ones: the conditional VG-graph $VG(u_5)$ with edges $(u_1 \to u_4)$ and $(u_2 \to u_4)$.]

Fig. 4. VG-growth algorithm.

[Inductive step]. Consider $VG^{(n+1)}(H \cup \{u_{a_{n+1}}\}) = (V_{n+1}, E_{n+1}, s_{n+1})$. Based on the definition of conditional VG-graph, for each edge $(u_i \to u_j) \in E_{n+1}$, there must exist $(u_i \to u_j)$, $(u_i \to u_{a_{n+1}})$, and $(u_j \to u_{a_{n+1}})$ in $E_n$. Given the inductive hypothesis, these three edges represent three valid $(n+2)$-groups: $\{u_i, u_j, u_{a_1}, \ldots, u_{a_n}\}$, $\{u_i, u_{a_{n+1}}, u_{a_1}, \ldots, u_{a_n}\}$, and $\{u_j, u_{a_{n+1}}, u_{a_1}, \ldots, u_{a_n}\}$, denoted by $G_{I}$, $G_{II}$, and $G_{III}$ respectively. In addition, the valid segments of $G_{I}$, $G_{II}$, and $G_{III}$ are $s_n(u_i,u_j)$, $s_n(u_i,u_{a_{n+1}})$, and $s_n(u_j,u_{a_{n+1}})$ respectively. Next, from Eqs. (6) and (7), we know that $s_{n+1}(u_i,u_j)$ satisfies min_dur, and thus $s_{n+1}(u_i,u_j)$ is the set of valid segments of group $G_{IV} = G_{I} \cup G_{II} \cup G_{III}$. From Eq. (5), $s_{n+1}(u_i,u_j)$ also satisfies min_wei; thus, $G_{IV}$ is valid. That is, each edge $(u_i \to u_j)$ in $VG^{(n+1)}(H \cup \{u_{a_{n+1}}\})$ represents a valid $(n+3)$-group $\{u_i, u_j, u_{a_{n+1}}, u_{a_1}, \ldots, u_{a_n}\}$. Thus, the lemma holds for $n + 1$. □

Theorem 1 (Correctness). VG-growth only generates valid groups.

Proof. Without loss of generality,³ let $G = \{u_{a_1}, \ldots, u_{a_k}\}$ be a valid group generated by VG-growth, where $a_1 > a_2 > \cdots > a_k$. According to the mining process of VG-growth, $G$ is generated when visiting vertex $u_{a_{k-1}}$ in $VG^{(k-2)}(H)$, where $H = \{u_{a_1}, \ldots, u_{a_{k-2}}\}$, and there is an edge $(u_{a_k} \to u_{a_{k-1}})$ in $VG^{(k-2)}(H)$. Based on Lemma 1, we know that this edge represents a valid k-group $\{u_{a_1}, \ldots, u_{a_k}\}$. Thus, the theorem is proved. □

Theorem 2 (Completeness). VG-growth generates all valid groups.

Proof. Let $G = \{u_{a_1}, \ldots, u_{a_k}\}$ be any valid group, where $a_1 > a_2 > \cdots > a_k$ and $k \ge 2$. We need to prove that VG-growth will generate $G$ as a valid group. We prove this in two cases: (1) $k = 2$; and (2) $k \ge 3$.

Case 1: $k = 2$. $G$ will be generated while mining the set of valid 2-groups using AGP (see Line 02 in Fig. 4).

Case 2: $k \ge 3$. Given that $G = \{u_{a_1}, \ldots, u_{a_k}\}$ is a valid group, $u_{a_1}, \ldots, u_{a_k}$ are valid users in the original VG-graph. Based on the definition of conditional VG-graph, there must exist a complete subgraph formed by vertices $u_{a_{i+1}}, \ldots, u_{a_k}$ in $VG^{(i)}(H)$ with $H = \{u_{a_1}, \ldots, u_{a_i}\}$, $\forall i$, $2 \le i < k$. Considering the case when $i = k - 2$, there must be an edge $(u_{a_k} \to u_{a_{k-1}})$ in $VG^{(k-2)}(\{u_{a_1}, \ldots, u_{a_{k-2}}\})$. Therefore, when visiting vertex $u_{a_{k-1}}$, $G$ will be generated as a valid group by VG-growth. Thus, the theorem is proven. □

3.3. Performance evaluation of AGP and VG-growth

In this section, we evaluate and compare the performance of AGP and VG-growth. The experiments were conducted using movement databases generated by IBM City Simulator [14] on a Pentium-IV machine with a CPU clock rate of 2.4 GHz and 1 GB of main memory. Note that both AGP and VG-growth were implemented assuming that the movement database resides in main memory. This reduces the time required for our experiments. City Simulator can generate realistic three-dimensional user movement over a city layout that includes streets and buildings.
A dataset M1kN1k that contains 1000 users and 1000 time points was generated, covering a 1000 m × 1500 m × 100 m area of 48 roads and 72 buildings with different heights (the highest is around 90 m). We recorded the total execution time ($T$), the time for mining valid 2-groups ($T_2$), and the time for mining all other valid groups ($T_k$) for different min_wei values ranging from 1% to 10%. Note that $T = T_2 + T_k$, as both AGP and VG-growth find valid 2-groups first before the rest. The max_dis and min_dur thresholds were 30 and 4 respectively. In the experiments, the unit of measurement adopted for distance was the meter, and the interval between every two consecutive time points represents 10 min. Thus, $N = 1000$ and min_dur = 4 represent about one week and 40 min respectively. Table 3 summarizes the parameters used in this set of experiments.

Footnote 3: Although there is no implicit ordering among the users of a group, we can always sort them by user ids.

Table 3
Performance comparison between AGP and VG-growth

Dataset   M      N      Dataset size (MB)   Thresholds
M1kN1k    1000   1000   12                  max_dis = 30, min_dur = 4, min_wei = 1%, 2%, 4%, 6%, 10%

Fig. 5. Experiment results: AGP vs. VG-growth: (a) $T$ (M1kN1k), (b) $T_2$ and $T_k$ (M1kN1k), (c) $|G|$ and $|G_2|$ (M1kN1k), (d) number of vertices in VG-graph. All panels are plotted against min_wei (%).

As shown in Fig. 5(a), VG-growth outperformed AGP in execution time $T$, especially when min_wei is small (<4%). In particular, when min_wei = 1%, VG-growth runs 10 times faster than AGP. Fig. 5(b) shows that the execution time differences come from $T_k$, since AGP and VG-growth share the same procedure for mining valid 2-groups. For small min_wei, VG-growth was more efficient than AGP because of the larger number of valid groups of size >2, which can be mined using the VG-graph. In contrast, AGP suffered from a large number of candidate groups, causing large overhead in database scans and validity checking of candidate groups.

As min_wei increased, $T_2$ began to dominate $T$ due to larger proportions of valid 2-groups, as shown in Fig. 5(c). For example, when min_wei = 10%, most valid groups were of size 2, and both AGP and VG-growth spent almost all the time finding valid 2-groups. In other words, the VG-graph does little to improve the execution time in this setting. This also motivates our proposed summarization approach to mining valid 2-groups, which will be described in Sections 4 and 5.

In the experiment, we implemented the VG-graph using an adjacency list structure. Each vertex in the VG-graph was stored with a list of its prefix-neighbors and the corresponding valid segment lists. Vertex ids and time stamps were represented as 4-byte integers and each list pointer required 4 bytes. The byte size of the VG-graph was obtained accordingly. As shown in Fig. 5(d), the number of vertices in the VG-graph was about the same as the number of users when min_wei was less than 6%. It decreased with increasing min_wei. For example, when min_wei = 10%, there were only 618 valid users (i.e., vertices) among the 1000 users in M1kN1k.

Fig. 6(a) shows the size of the VG-graph in KB for different min_wei values. Our experiments have shown that the compression ratio of the VG-graph for dataset M1kN1k, which occupies 12 MB of space on hard disk, was between 1% and 6%. This indicates very good compression ratios achieved by the VG-graph, as it only contains the set of valid users and only stores the valid segments of each valid 2-group rather than the actual location records of each user.

Fig. 6. Experiment results: AGP vs. VG-growth: (a) size of VG-graph, (b) size of VG-graph: M/N changes, (c) scale-up with N (VG-growth), (d) scale-up with M (VG-growth).

Table 4
Scalability of VG-growth

M (thousands)   N (thousands)   Dataset size (MB)
1               1/3/5/7/
1/3/5/7/
Thresholds: max_dis = 30, min_dur = 4, min_wei = 1%, 2%, 4%, 6%, 10%

The scalability results of VG-growth with respect to $M$ and $N$ are shown in Fig. 6(b)-(d). We measured both the execution time and the size of the VG-graph for different numbers of users ($M$) and time points ($N$), as shown in Table 4. We only provide the curves for min_wei = 4%, since the curves for other min_wei values have the same trends.

Fig. 6(b) shows the scalability of the VG-graph when $M$ or $N$ changes. The size of the VG-graph increased almost quadratically with $M$ but increased very little with $N$. This is because an increase of $M$ leads to not only more vertices in the VG-graph but also more valid segments. On the other hand, an increase of $N$ only causes more valid segments, while the number of vertices increases only a little. From Fig. 6(c) and (d), we find that $T$ increased almost linearly with $N$ and almost quadratically with $M$. This is due to the dominating $T_2$, which has $O(N \cdot M^2)$ time complexity.

4. Framework for mining valid 2-groups using location summarization

In this section, we propose to address the overhead of mining valid 2-groups. We first describe a common framework to incorporate different location summarization methods into valid 2-group mining. Subsequently, we present several location summarization methods that adopt different summarization models and assumptions on the summarization parameters.

The proposed framework consists of two steps: preprocessing and mining. In the preprocessing step, each user's movement data are first divided into time windows of the same size, and the locations within each time window are summarized using a summarization model. After that, the upper bounds of weight-count and valid segment length for each user pair are computed based on the summarized location data. In the mining step, we first generate a set of candidate 2-groups based on the upper bound information.
This set of candidate 2-groups is expected to be smaller than the set of all possible 2-groups. Moreover, instead of scanning the large movement database, the much smaller summarized location database is scanned to check the validity of each candidate 2-group. The original database is accessed only when valid segments cannot be determined from the summarized data. We elaborate the details in the following subsections.

4.1. Preprocessing of user movement data

Let $D'_i$ denote the summarized data of user $u_i$, in which the number of time points in the original movement data of $u_i$, $D_i$, is reduced to $N' = \lfloor N/w \rfloor$, where $w$ is the time window size and $N$ is the number of time points in $D_i$. For simplicity, we assume that $N/w$ is a whole number. Note that a time point $t'$ in the summarized database $D'_i$ corresponds to a time window $[t' \cdot w, (t'+1) \cdot w)$ in $D_i$. We use $u_i[t'].P$ to denote $\{u_i[t].p \mid t' \cdot w \le t < (t'+1) \cdot w\}$. Based on a summarization model (SM), which is some 3D geometric shape such as a sphere or a cube, $u_i[t'].P$ is summarized to an instance of the corresponding SM, denoted by $sm(u_i, t')$. $D'_i$ is therefore $\{sm(u_i, 0), sm(u_i, 1), \ldots, sm(u_i, N'-1)\}$. In addition, we define the summarization ratio of a summarized database as $|D'|/|D|$, where $|D'|$ and $|D|$ are the sizes of $D'$ and $D$ respectively.

Fig. 7. Example of instances of summarization model (user $u_1$, $N = 25$, $w = 5$, $N' = 5$; instances $sm(u_1, 0), \ldots, sm(u_1, 4)$).

For example, Fig. 7 illustrates the instances of a summarization model, where $N$ and $w$ are 25 and 5 respectively. Note that the location points within each time window are summarized into a sphere, i.e., an instance of the sphere summarization model, which will be described in detail later. The summarized database contains five spheres, each represented by a center and a radius.

In this paper, we will examine four different SMs, namely:

Sphere location summarization method (SLS);
Cuboid location summarization method (CLS);
Grid-sphere location summarization method (GSLS);
Grid-cuboid location summarization method (GCLS).

These methods will be further described in Section 5. Extensions to GSLS and GCLS to consider a maximum speed constraint on user movement are given in Appendix A. These extensions, however, yield performance results similar to GSLS and GCLS. Hence, we do not report their results in the paper.

With $D'$, the number of time points to be scanned is reduced from $N$ to $N/w$. However, this does not address the problem of scanning $D'$ for a large number of candidate 2-groups. Thus, in the preprocessing step, we pre-compute the upper bounds of weight-count and valid segment length for each user pair based on $D'$. The pre-computation is carried out under the assumption that the upper bound of max_dis, denoted by $\overline{max\_dis}$, is given.

Definition 10. Let $t'$ be a time point in the summarized database $D'$. Let $\overline{max\_dis}$ be the upper bound of max_dis (i.e., $\overline{max\_dis} \ge max\_dis$). Then $sm(u_i, t')$ and $sm(u_j, t')$ are said to be possibly close if:

$MinDistance(sm(u_i, t'), sm(u_j, t')) \le \overline{max\_dis}$    (8)

where $MinDistance(sm(u_i, t'), sm(u_j, t'))$ is a function returning the minimum distance between $sm(u_i, t')$ and $sm(u_j, t')$.

Definition 11. Given a summarized database $D'$, a user pair $\{u_i, u_j\}$, and $\overline{max\_dis}$, a set of consecutive time points $[t'_a, t'_b]$ is called a possibly close segment (PCS) of $\{u_i, u_j\}$ if:

(1) $\forall t' \in [t'_a, t'_b]$, $sm(u_i, t')$ and $sm(u_j, t')$ are possibly close.
(2) If $t'_a > 0$, $sm(u_i, t'_a - 1)$ and $sm(u_j, t'_a - 1)$ are not possibly close.
(3) If $t'_b < N' - 1$, $sm(u_i, t'_b + 1)$ and $sm(u_j, t'_b + 1)$ are not possibly close.

We use $S(\{u_i, u_j\})$ to denote the set of PCSs of $\{u_i, u_j\}$, i.e.,

$S(\{u_i, u_j\}) = \{[t'_a, t'_b] \mid [t'_a, t'_b] \subseteq [0, N'),\ [t'_a, t'_b] \text{ is a PCS of } \{u_i, u_j\}\}$    (9)

Property 3. $\forall s \in s(u_i, u_j)$, $\exists [t'_a, t'_b] \in S(\{u_i, u_j\})$ such that $s \subseteq [t'_a \cdot w, (t'_b + 1) \cdot w)$.

Proof. Recall that $s(u_i, u_j)$ is the set of valid segments of $\{u_i, u_j\}$. Given any valid segment $s$ of $\{u_i, u_j\}$, $s = [t_p, t_q]$ ($0 \le t_p < t_q < N$), $[t_p, t_q]$ must lie within one time window of size $w$, or across more than one time window; let the windows it touches be denoted by $[t'_m, t'_n]$. We have $t_p \ge t'_m \cdot w$ and $t_q < (t'_n + 1) \cdot w$. Since $u_i$ and $u_j$ are not more than max_dis apart $\forall t \in [t_p, t_q]$, $sm(u_i, t'_k)$ and $sm(u_j, t'_k)$ should be possibly close at $t'_k$ ($m \le k \le n$), since $\overline{max\_dis} \ge max\_dis$. Hence, we have (1) $[t'_m, t'_n]$ itself is a PCS of $\{u_i, u_j\}$, or (2) $[t'_m, t'_n]$ is covered by a PCS (say, $[t'_p, t'_q]$) of $\{u_i, u_j\}$. Let $[t'_a, t'_b]$ be $[t'_m, t'_n]$ (for case 1), or $[t'_p, t'_q]$ (for case 2). In both cases, we have $[t'_a, t'_b] \in S(\{u_i, u_j\})$, and $s = [t_p, t_q] \subseteq [t'_a \cdot w, (t'_b + 1) \cdot w)$. Therefore, this property holds. □

The above property says that $S(\{u_i, u_j\})$ consists of possibly close segments (in $D'$) that cover all the valid segments of $\{u_i, u_j\}$ in $D$. This property provides the foundation of the correctness and completeness of the summarization-based algorithms.

Definition 12. Given a user pair $\{u_i, u_j\}$, the longest possibly close segment length of $\{u_i, u_j\}$ is defined as:

$Q(\{u_i, u_j\}) = w \cdot \max_{[t'_a, t'_b] \in S(\{u_i, u_j\})} (t'_b - t'_a + 1)$    (10)

Property 4. $\forall s \in s(u_i, u_j)$, $Q(\{u_i, u_j\}) \ge |s|$.

Proof. Let $s_{max}$ be the longest valid segment of $\{u_i, u_j\}$. We want to show that $Q(\{u_i, u_j\}) \ge |s_{max}|$. Due to Property 3, there exists a PCS $[t'_a, t'_b] \in S(\{u_i, u_j\})$ such that $s_{max} \subseteq [t'_a \cdot w, (t'_b + 1) \cdot w)$. Since $[t'_a, t'_b] \in S(\{u_i, u_j\})$, $Q(\{u_i, u_j\}) \ge w \cdot (t'_b - t'_a + 1) \ge |s_{max}|$. Thus, the property is proven. □

This property asserts that the longest possibly close segment length of a user pair is an upper bound of the valid segment length of this pair of users.
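Definitions 10-12 can be illustrated with a minimal Python sketch. It assumes the sphere summarization model, with each instance stored as a (center, radius) pair, and it takes MinDistance between two spheres to be the distance between centers minus both radii, clamped at zero. The function names and the data layout are hypothetical and are not taken from the preprocessing algorithm of the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]
Sphere = Tuple[Point, float]  # (center, radius): assumed sphere SM instance


def min_distance(a: Sphere, b: Sphere) -> float:
    """Smallest possible distance between a point in a and a point in b."""
    (ca, ra), (cb, rb) = a, b
    return max(0.0, math.dist(ca, cb) - ra - rb)


def possibly_close_segments(sm_i: List[Sphere], sm_j: List[Sphere],
                            max_dis_bound: float) -> List[Tuple[int, int]]:
    """Return the maximal runs [t_a, t_b] of summarized time points at which
    the two users are possibly close (Definitions 10 and 11)."""
    assert len(sm_i) == len(sm_j), "both users need N' summarized time points"
    segments: List[Tuple[int, int]] = []
    start = None  # first time point of the run currently being extended
    for t, (a, b) in enumerate(zip(sm_i, sm_j)):
        close = min_distance(a, b) <= max_dis_bound
        if close and start is None:
            start = t
        elif not close and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:  # a run that extends to the last time point
        segments.append((start, len(sm_i) - 1))
    return segments


def longest_pcs_length(segments: List[Tuple[int, int]], w: int) -> int:
    """Q of Definition 12: w times the length of the longest PCS (0 if none)."""
    return w * max((tb - ta + 1 for ta, tb in segments), default=0)
```

With five windows of size w = 5 in which the pair is possibly close everywhere except the middle window, the sketch yields the two PCSs [0, 1] and [3, 4], so Q = 5 · 2 = 10 original time points.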

Definition 13. The upper bound weight-count of $\{u_i, u_j\}$ is defined as:

$R(\{u_i, u_j\}) = w \cdot \sum_{[t'_a, t'_b] \in S(\{u_i, u_j\})} (t'_b - t'_a + 1)$    (11)

Property 5. $R(\{u_i, u_j\}) \ge$ weight-count$(\{u_i, u_j\})$.

Proof. Recall that weight-count$(\{u_i, u_j\}) = \sum_{k=1}^{n} |s_k|$, where $s_k \in s(u_i, u_j)$. Let $S(\{u_i, u_j\})$ be the set of PCSs of $\{u_i, u_j\}$. Note that, for any PCS $\in S(\{u_i, u_j\})$, there are two possible cases: (1) this PCS covers one or more valid segment(s); or (2) this PCS does not cover any valid segment, since $\overline{max\_dis} \ge max\_dis$. Let $S'(\{u_i, u_j\})$ denote the set of PCSs that cover one or more valid segment(s). Obviously, $S'(\{u_i, u_j\}) \subseteq S(\{u_i, u_j\})$. Next, from Property 3, we know that, for each valid segment $s_k$, there exists a PCS covering $s_k$. Thus,

$\sum_{k=1}^{n} |s_k| \le w \cdot \sum_{PCS \in S'(\{u_i, u_j\})} |PCS|$

From the definition of upper bound weight-count, we know:

$R(\{u_i, u_j\}) = w \cdot \sum_{PCS \in S(\{u_i, u_j\})} |PCS| \ge w \cdot \sum_{PCS \in S'(\{u_i, u_j\})} |PCS|$

Therefore, $R(\{u_i, u_j\}) \ge$ weight-count$(\{u_i, u_j\})$. Thus, the property is proven. □

This property asserts that the upper bound weight-count of a user pair is indeed an upper bound on the weight-count for this pair of users.

Let $P$ denote the set of all user pairs together with their longest possibly close segment length and upper bound weight-count, i.e.,

$P = \{(\{u_i, u_j\}, Q(\{u_i, u_j\}), R(\{u_i, u_j\})) \mid 1 \le i < j \le M\}$    (12)

where $M$ is the number of distinct users. $P$ contains the pre-computed upper bound information about the valid segment length and the weight-count for each user pair. To efficiently find candidate 2-groups that satisfy the min_dur requirement, we sort $P$ by $Q$ value in descending order. We also use $(P_k.c_2, P_k.Q(c_2), P_k.R(c_2))$ to denote the $k$th tuple in $P$. The detailed algorithm for the preprocessing step is shown in Fig. 8.

Fig. 8. Preprocessing step of framework.

4.2. Mining of valid 2-groups

After the summarized database $D'$ and the precomputed upper bound information $P$ are constructed, the mining step can be carried out to find the set of valid 2-groups, as shown in Fig. 9. The user-specified max_dis, min_dur and min_wei are input to the mining step. From $P$, we first determine a set of candidate 2-groups, $C_2$, such that for each $c_2 \in C_2$, $Q(c_2) \ge$ min_dur and $R(c_2) \ge$ min_wei $\cdot N$. Next, we compute the weight-count of each $c_2 \in C_2$ by scanning the summarized database $D'$. We classify the closeness of two instances of SM at a summarized time point into three cases:

Case 1: all location points within the two instances of SM are no more than max_dis apart (see lines 06 and 07 in Fig. 9).
Case 2: all location points within the two instances of SM are more than max_dis apart (see Fig. 9).
Case 3: otherwise, i.e., only some location points inside the two instances of SM are no more than max_dis apart (see line 15 in Fig. 9).

Should case 3 arise, the corresponding time window in the original movement database $D$ will be examined to determine the exact weight-count.

Time complexity analysis. In Fig. 9, line 02 generates the set of candidate 2-groups based on min_dur and min_wei. The time complexity of procedure GetCandidate2Groups is $O(k)$, where $k$ is the number of pairs with $Q(c_2) \ge$ min_dur. Note that $|C_2| = k'$ ($k' \le k$), where $k'$ is the number of pairs that satisfy both $Q(c_2) \ge$ min_dur and $R(c_2) \ge$ min_wei $\cdot N$. The next lines of Fig. 9 compute the weight-count for each candidate 2-group. Their time cost is $n_1 \cdot T_{Max} + n_2 \cdot (T_{Max} + T_{Min}) + n_3 \cdot (T_{Max} + T_{Min} + T_{COD})$, where $n_1$, $n_2$ and $n_3$ are the numbers of times the above three cases are encountered respectively, and $T_{Max}$, $T_{Min}$, and $T_{COD}$ are the time costs of procedures MaxDistance, MinDistance and CheckOriginalDB respectively. Note that $n_1 + n_2 + n_3 = N' \cdot |C_2|$, as there are altogether $N' \cdot |C_2|$ iterations, and $T_{COD} = O(w)$. Thus, the time complexity of these lines is $O(n_1 + n_2 + w \cdot n_3)$. The remaining lines select and output the set of valid 2-groups, which costs $O(|C_2|) = O(k')$.

Fig. 9. Mining step of framework.

Thus, the total time cost is $O(k + n_1 + n_2 + w \cdot n_3 + k')$. In the best case, $n_3 = 0$, the total time cost becomes $O(k + N' \cdot k' + k') = O(\frac{N}{w} \cdot k')$.
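The mining step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the algorithm of Fig. 9: it assumes sphere instances stored as (center, radius), takes MaxDistance and MinDistance to be the center distance plus or minus both radii, and treats a valid segment as a maximal run of close time points lasting at least min_dur. All function names and data layouts are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Sphere = Tuple[Point, float]  # (center, radius): assumed sphere SM instance


def min_distance(a: Sphere, b: Sphere) -> float:
    (ca, ra), (cb, rb) = a, b
    return max(0.0, math.dist(ca, cb) - ra - rb)


def max_distance(a: Sphere, b: Sphere) -> float:
    (ca, ra), (cb, rb) = a, b
    return math.dist(ca, cb) + ra + rb


def upper_bound_weight_count(segments: Sequence[Tuple[int, int]], w: int) -> int:
    """R of Definition 13: w times the total length of all PCSs."""
    return w * sum(tb - ta + 1 for ta, tb in segments)


def select_candidates(P, min_dur: int, min_wei: float, N: int) -> list:
    """P holds (pair, Q, R) tuples sorted by Q in descending order."""
    candidates = []
    for pair, q, r in P:
        if q < min_dur:
            break  # P is sorted by Q, so no later tuple can satisfy min_dur
        if r >= min_wei * N:
            candidates.append(pair)
    return candidates


def closeness_flags(sm_i: Sequence[Sphere], sm_j: Sequence[Sphere],
                    d_i: Sequence[Point], d_j: Sequence[Point],
                    w: int, max_dis: float) -> List[bool]:
    """Per-time-point closeness of one pair, using summaries when decisive."""
    flags: List[bool] = []
    for t_prime, (a, b) in enumerate(zip(sm_i, sm_j)):
        if max_distance(a, b) <= max_dis:      # case 1: whole window is close
            flags.extend([True] * w)
        elif min_distance(a, b) > max_dis:     # case 2: whole window is apart
            flags.extend([False] * w)
        else:                                  # case 3: check original data
            lo, hi = t_prime * w, (t_prime + 1) * w
            flags.extend(math.dist(d_i[t], d_j[t]) <= max_dis
                         for t in range(lo, hi))
    return flags


def weight_count(flags: Sequence[bool], min_dur: int) -> int:
    """Sum the lengths of runs of close time points that last >= min_dur."""
    total = run = 0
    for close in list(flags) + [False]:  # sentinel closes the final run
        if close:
            run += 1
        else:
            if run >= min_dur:
                total += run
            run = 0
    return total
```

Under these assumptions, a candidate pair would be reported as a valid 2-group when `weight_count(flags, min_dur) >= min_wei * N`. Only case 3 touches the original database, which is why the cost of scanning a candidate grows with $w \cdot n_3$ rather than with $N$.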