PLWAP Sequential Mining: Open Source Code

C.I. Ezeife
School of Computer Science, University of Windsor, Windsor, Ontario N9B 3P4

Yi Lu
Department of Computer Science, Wayne State University, Detroit, Michigan

Yi Liu
School of Computer Science, University of Windsor, Windsor, Ontario N9B 3P4

(A full version of this paper is available in the Journal of Data Mining and Knowledge Discovery, 5-38, 2005.)

ABSTRACT
The PLWAP algorithm uses a preorder linked, position coded version of the WAP tree and eliminates the need to recursively re-construct intermediate WAP trees during sequential mining, as done by the WAP tree technique. PLWAP produces a significant reduction in the response time achieved by the WAP algorithm and provides a position code mechanism for remembering the stored database, thus eliminating the need to re-scan the original database, as would be necessary for applications like those incrementally maintaining mined frequent patterns, or performing stream or dynamic mining. This paper presents open source code for both the PLWAP and WAP algorithms, describing our implementations and an experimental performance analysis of these two algorithms on synthetic data generated with the IBM quest data generator. An implementation of the Apriori-like GSP sequential mining algorithm is also discussed and submitted.

Keywords
sequential patterns, web usage mining, WAP tree, pre-order linkage.

1. INTRODUCTION
Basic association rule mining with the Apriori algorithm [] finds database items (attributes) that occur most often together in database transactions. Thus, given a set of transactions (similar to database records), where each transaction is a set of items (attributes), an association rule is of the form X → Y, where X and Y are sets of items and X ∩ Y = ∅. Association rule mining algorithms generally first find all frequent patterns (itemsets) as all combinations of items or attributes with support (percentage occurrence in the entire database) greater than or equal to a pre-defined minimum support. Then, in the second stage of mining, association rules are generated from each frequent pattern by defining all possible combinations of rule antecedent (head) and consequent (tail) from the items composing the frequent pattern, such that antecedent ∩ consequent = ∅ and antecedent ∪ consequent = frequent pattern. Then, only rules with confidence (number of transactions that contain the rule divided by the number of transactions containing the antecedent) greater than or equal to a pre-defined minimum confidence are retained as valuable, while the rest are pruned.

Table 1: Sample Web Access Sequence Database (columns: TID, Web access sequences)

Sequential mining is an extension of basic association rule mining that accommodates an ordered set of items or attributes, where the same item may be repeated in a sequence. While a basic frequent pattern has a set of non-ordered items that have occurred together up to a minimum support threshold, a frequent sequential pattern has a sequence of ordered items that have occurred in database transactions at least as often as the minimum support threshold. Thus, the measures of support and confidence used in association rule mining for deciding frequent itemsets are used in sequential mining for deciding frequent sequences. Just as an i-itemset contains i items, an n-sequence contains n ordered items (events). One application of sequential mining is web usage mining for finding the relationship among different web users' accesses from web access logs [5], [], [4] and [9]. Analysis of these access data can help with server performance enhancement and direct marketing in e-commerce, as well as web personalization.
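The support and confidence measures defined above for association rules can be sketched in a few lines of C++. This is only an illustration under stated assumptions: items are encoded as integers, and the names Itemset, countContaining, support and confidence are hypothetical helpers that do not come from the paper's source code.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Assumption: items are integer ids and a transaction is a set of items.
using Itemset = std::set<int>;

// Number of transactions that contain every item of `items`.
int countContaining(const std::vector<Itemset>& db, const Itemset& items) {
    int count = 0;
    for (const Itemset& t : db) {
        // std::includes needs sorted ranges; std::set iterates in sorted order.
        if (std::includes(t.begin(), t.end(), items.begin(), items.end())) ++count;
    }
    return count;
}

// Support of an itemset as a fraction of all transactions.
double support(const std::vector<Itemset>& db, const Itemset& items) {
    if (db.empty()) return 0.0;
    return static_cast<double>(countContaining(db, items)) / db.size();
}

// Confidence of the rule antecedent -> consequent:
// transactions containing the whole rule / transactions containing the antecedent.
double confidence(const std::vector<Itemset>& db,
                  const Itemset& antecedent, const Itemset& consequent) {
    Itemset whole = antecedent;
    whole.insert(consequent.begin(), consequent.end());
    const int withAntecedent = countContaining(db, antecedent);
    if (withAntecedent == 0) return 0.0;
    return static_cast<double>(countContaining(db, whole)) / withAntecedent;
}
```

A rule would then be retained only if support and confidence both reach the chosen minimum thresholds.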
Before applying sequential mining techniques to web log data, the web log transactions are pre-processed to group them into a set of access sequences for each user identifier and to create web access sequences in the form of a transaction database. An example sequence database to be mined for frequent sequences, used in [7], is given as Table 1. An access sequence S′ = e′1 e′2 ... e′l is called a subsequence of an access sequence S = e1 e2 ... en, and S is a super-sequence of S′, denoted as S′ ⊆ S, if and only if for every event e′j in S′ there is an equal event ek in S, while the order in which events occurred in S′ is the same as the order of events in S. For example, with S′ = ab, S = babcd, we can say that S′ is a subsequence of S. We can also say that ac is a subsequence of S, although there is b occurring between a and c in S. In the sequence babcd, while cd is a suffix subsequence of bab, ba is a prefix subsequence of bcd.

Techniques for mining sequential patterns from web logs fall into Apriori or non-Apriori. The Apriori-like algorithms generate substantially huge sets of candidate patterns, especially for long sequential patterns [2], [3], [3], [4], [5]. WAP-tree mining [7] is a non-Apriori method which stores the web access patterns in a compact prefix tree, called the WAP-tree, and avoids generating long lists of candidate sets to scan support for. However, the WAP-tree algorithm has the drawback of recursively re-constructing numerous intermediate WAP-trees during mining in order to compute the frequent prefix subsequences of every suffix subsequence in a frequent sequence. This process is very time-consuming. The PLWAP algorithm [7] assigns a unique position code to each node of the WAP-tree and builds the WAP-tree head links in a pre-order fashion rather than in the order the nodes arrive, as done by the WAP-tree algorithm. With the pre-order linked, position coded WAP-tree, the PLWAP algorithm is able to mine frequent sequences, starting with the prefix sequence, without the need to recursively re-construct any intermediate WAP-trees. This approach results in a tangible performance gain. This paper presents discussions of the implementations of three key sequential mining algorithms used in the experimental and performance justification of the PLWAP algorithm in [7]. The C++ implementations of three sequential mining algorithms, PLWAP [7], WAP [7], and GSP [3], are discussed.

1.1 Related Work
Work on mining sequential patterns in web logs includes the GSP [3], the PSP [3], the G sequence [8] and the graph traversal [5] algorithms. Agrawal and Srikant proposed three algorithms (Apriori, AprioriAll, AprioriSome) for sequential mining in [2]. The GSP (Generalized Sequential Patterns) [3] algorithm, which is 20 times faster than the Apriori algorithm, was then proposed. The GSP algorithm makes multiple passes over the data.
The first pass determines the frequent 1-item patterns (L_1). Each subsequent pass starts with a seed set: the frequent sequences found in the previous pass (L_{k-1}). The seed set is used to generate new candidate sequences (C_k) by performing an Apriori-gen join of L_{k-1} with L_{k-1}. This join requires that every sequence s in the first L_{k-1} joins with other sequences s′ in the second L_{k-1} if the last k-2 elements of s are the same as the first k-2 elements of s′. For example, if the frequent 3-sequence set L_3 has the following 6 sequences: {((1,2)(3)), ((1,2)(4)), ((1)(3,4)), ((1,3)(5)), ((2)(3,4)), ((2)(3)(5))}, then to obtain frequent 4-sequences, every frequent 3-sequence should join with the other 3-sequences that have the same first two elements as its last two elements. Sequence s = ((1,2)(3)) joins with s′ = ((2)(3,4)) to generate the candidate 4-sequence ((1,2)(3,4)), since the last 2 elements of s, (2)(3), match the first 2 elements of s′. Similarly, ((1,2)(3)) joins with ((2)(3)(5)) to form ((1,2)(3)(5)). There are no more matching sequences to join in L_3.

The join phase is followed by the pruning phase, in which the candidate sequences with any of their contiguous (k-1)-subsequences having a support count less than the minimum support are dropped. The database is scanned for supports of the remaining candidate k-sequences to find the frequent k-sequences (L_k), which become the seed for the next pass, the candidate (k+1)-sequences. The algorithm terminates when there are no frequent sequences at the end of a pass, or when there are no candidate sequences generated. The GSP algorithm uses a hash tree to reduce the number of candidates that are checked for support in the database.

The PSP (Prefix Tree For Sequential Patterns) [3] approach is very similar to the GSP algorithm [3], but stores the database on a more concise prefix tree with the leaf nodes carrying the supports of the sequences. At each step k, the database is browsed for counting the support of the current candidates. Then, the frequent sequence set, L_k, is built.
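The join and support-counting steps described above can be sketched for the simpler case of web access sequences, where every element holds a single event. This is a minimal sketch under that assumption, not the GSP implementation discussed later in this paper: it omits the pruning phase and the hash tree, and the names Sequence, joinCandidates, isSubsequence and supportCount are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Assumption: a sequence is an ordered list of single integer events.
using Sequence = std::vector<int>;

// Join step: s joins t when s without its first event equals t without its
// last event; the candidate is s followed by the last event of t.
std::vector<Sequence> joinCandidates(const std::vector<Sequence>& frequent) {
    std::vector<Sequence> candidates;
    for (const Sequence& s : frequent) {
        for (const Sequence& t : frequent) {
            if (s.empty() || s.size() != t.size()) continue;
            // Compares the last k-2 events of s with the first k-2 events of t.
            if (std::equal(s.begin() + 1, s.end(), t.begin())) {
                Sequence c = s;
                c.push_back(t.back());
                candidates.push_back(c);
            }
        }
    }
    return candidates;
}

// True if `pattern` occurs in `seq` in the same order, gaps allowed.
bool isSubsequence(const Sequence& pattern, const Sequence& seq) {
    std::size_t matched = 0;
    for (int e : seq) {
        if (matched < pattern.size() && e == pattern[matched]) ++matched;
    }
    return matched == pattern.size();
}

// One database scan: number of sequences that contain the candidate.
int supportCount(const std::vector<Sequence>& db, const Sequence& candidate) {
    int count = 0;
    for (const Sequence& s : db) {
        if (isSubsequence(candidate, s)) ++count;
    }
    return count;
}
```

Each pass would call joinCandidates on the previous frequent set and keep the candidates whose supportCount reaches the minimum support.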
The Graph Traversal mining [4], [5] uses a simple unweighted graph to store web sequences and a graph traversal algorithm similar to the Apriori algorithm to traverse the graph in order to compute the k-candidate set from the (k-1)-candidate sequences without performing the Apriori-gen join. From the graph, if a candidate node is large, the adjacency list of the node is retrieved. The database still has to be scanned several times to compute the support of each candidate sequence, although the number of computed candidate sequences is drastically reduced from that of the GSP algorithm. Other tree based approaches include [8], called G sequence mining. This algorithm uses wildcards, templates and the construction of an Aggregate tree for mining.

The FP-tree structure [9] first reorders and stores the frequent non-sequential database transaction items on a prefix tree, in descending order of their supports, such that database transactions share common frequent prefix paths on the tree. Then, mining the tree is accomplished by recursive construction of conditional pattern bases for each frequent 1-item (in an ordered list called the f-list), starting with the lowest in the tree. A conditional FP-tree is constructed for each frequent conditional pattern having more than one path, while maximal mined frequent patterns consist of a concatenation of the items on each single path with their suffix f-list item. FreeSpan [8], like the FP-tree method, lists the f-list in descending order of support, but it is developed for sequential pattern mining. PrefixSpan [6] is a pattern-growth method like FreeSpan, which reduces the search space for extending an already discovered prefix pattern p by projecting a portion of the original database that contains all the necessary data for mining sequential patterns grown from p.

The web access pattern tree (WAP) is a non-Apriori algorithm proposed by Pei et al. [7]. The WAP-tree stores the web log data in a prefix tree format similar to the frequent pattern tree [9] (FP-tree).
The WAP algorithm first scans the web log to compute all frequent individual events, then it constructs a WAP-tree over the set of frequent individual events of each transaction before it recursively mines the constructed WAP tree by building a conditional WAP tree for each conditional suffix frequent pattern found. The process of recursive mining of a conditional suffix WAP tree ends when it has only one branch or is empty. An example application of the WAP-tree algorithm for finding all frequent events in the web log (constructing the WAP-tree and mining the access patterns from the WAP tree) is shown with the database in Table 2. Suppose the minimum support threshold is set at 75%, which means an access sequence, s, should have a count of 3 out of 4 records in our example to be considered frequent.

Table 2: Sample Web Access Sequence Database for WAP-tree

TID    Web access sequence    Frequent subsequence
100    abdac                  abac
200    eaebcac                abcac
300    babfaec                babac
400    afbacfc                abacc

Figure 1: Construction of the Original WAP Tree

Constructing the WAP tree entails first scanning the database once to obtain the events that are frequent: a, b, c. When constructing the WAP-tree, the non-frequent part (like d, e, f) of every sequence is discarded. Only the frequent sub-sequences shown in column three of Table 2 are used as input. With the frequent sequence in each transaction, the WAP-tree algorithm first stores the frequent items as header nodes for linking all nodes of their type in the WAP-tree in the order the nodes are inserted. When constructing the WAP-tree, a virtual root (Root) is first inserted. Then, each frequent sequence in the transaction is used to construct a branch from the Root to a leaf node of the tree. Each event in a sequence is inserted as a node with count 1 from Root if that node type does not yet exist, but the count of the node is increased by 1 if the node type already exists. Also, the head link for the inserted event is connected (in broken lines) to the newly inserted node from the last node of its type that was inserted, or from the header node of its type if it is the very first node of that event type inserted. Once the frequent sequential data are stored on the complete WAP-tree (Figure 1), the tree is mined for frequent patterns starting with the lowest frequent event in the header list, in our example starting from frequent event c, as the following discussion shows.
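The construction steps just described (discard non-frequent events, insert each frequent subsequence as a branch from the virtual root, and link nodes of the same event in insertion order) can be sketched as follows. This is an illustrative sketch, not the code of the WAP program discussed in Section 3.2: events are assumed to be single characters, and WapNode, WapTree, frequentPart and insertSequence are hypothetical names.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Assumption: events are single characters; 0 marks the virtual root.
struct WapNode {
    char event = 0;
    int count = 0;
    // Children keyed by event. Note that std::map orders children by event,
    // not by the order in which they were inserted.
    std::map<char, std::unique_ptr<WapNode>> children;
};

struct WapTree {
    WapNode root;
    // Header links: all nodes of each event, in the order they were inserted.
    std::map<char, std::vector<WapNode*>> header;
};

// Keeps only the events of `seq` that appear in `frequentEvents`.
std::string frequentPart(const std::string& seq, const std::string& frequentEvents) {
    std::string out;
    for (char e : seq) {
        if (frequentEvents.find(e) != std::string::npos) out.push_back(e);
    }
    return out;
}

// Inserts one frequent subsequence as a branch from the root.
void insertSequence(WapTree& tree, const std::string& seq) {
    WapNode* cur = &tree.root;
    for (char e : seq) {
        std::unique_ptr<WapNode>& child = cur->children[e];
        if (!child) {
            child.reset(new WapNode());             // new node starts at count 0
            child->event = e;
            tree.header[e].push_back(child.get());  // extend the event's link list
        }
        ++child->count;                             // 1 for a new node, +1 otherwise
        cur = child.get();
    }
}
```

For the four sequences of Table 2 with frequent events a, b and c, the root would get an a child with count 3 and a b child with count 1.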
From the WAP-tree of Figure 1, it first computes the prefix sequences of the base c, or the conditional sequence base of c, as:

aba : 2; ab : 1; abca : 1; ab : -1; baba : 1; abac : 1; aba : -1

The conditional sequence list of a suffix event is obtained by following the header link of the event and reading the path from the root to each node (excluding the node). After discarding the non-frequent part c in the above sequences, the conditional sequences based on c are listed below:

aba : 2; ab : 1; aba : 1; ab : -1; baba : 1; aba : 1; aba : -1

Figure 2: Reconstruction of WAP Trees for Mining Conditional Pattern Base c

Using these conditional sequences, a conditional WAP tree, WAP-tree|c, is built using the same method as shown in Figure 1. The re-construction of WAP trees that progressed as suffix sequences |c, |bc discovered the frequent patterns found along this line: c, bc and abc. The recursion continues with the suffix path |c, |ac. The algorithm keeps running, finding the conditional sequence bases of bac, b, a. Figure 2 shows the WAP trees for mining conditional pattern base c. After mining the whole tree, the discovered frequent pattern set is: {c, aac, bac, abac, ac, abc, bc, b, ab, a, aa, ba, aba}.

Although the WAP-tree algorithm scans the original database only twice and avoids the problem of generating explosive candidate sets as in Apriori-like algorithms, its main drawback is recursively re-constructing large numbers of intermediate WAP-trees and patterns during mining, which takes up computing resources. The Pre-Order Linked WAP tree algorithm (PLWAP) [7], [] is a version of the WAP tree algorithm that assigns a unique binary position code to each tree node and performs the header node linkages in pre-order fashion (root, left, right). Both the pre-order linkage and the binary position codes enable PLWAP to directly mine the sequential patterns from the one initial WAP tree, starting with the prefix sequence, without re-constructing the intermediate WAP trees. To assign position codes to a PLWAP node, the root has a null code, and the leftmost child of any parent node has a code that appends 1 to the position code of its parent, while the position code of any other node has 0 appended to the position code of its nearest left sibling. The PLWAP technique presents a much better performance than that achieved by the WAP-tree technique, as shown in extensive performance analysis.

1.2 Motivations and Contributions
The PLWAP algorithm [7], a recently proposed sequential mining tool in the Journal of Data Mining and Knowledge Discovery, has many attractive features that make it suitable as a building block for many other sophisticated sequential data mining approaches like incremental mining [6], web classification and personalization. Features of the PLWAP algorithm can also be applied to non-sequential mining structures like the FP-tree. PLWAP and two other key sequential mining algorithms (WAP and GSP) have been implemented and tested thoroughly on publicly available data sets ( index.shtml) and with real data. Meaningful use and application of these efficient tools will greatly increase if their implementations are made publicly available. This paper contributes by discussing our C++ implementations of three key sequential mining algorithms, PLWAP, WAP and GSP, used in the performance studies of the work in [7].

1.3 Outline of the Paper
Section 2 discusses example mining with the Pre-Order Linked WAP-Tree (PLWAP) algorithm. Section 3 discusses the C++ implementations of the PLWAP, WAP and GSP algorithms for sequential mining. Section 4 discusses experimental performance analysis, while Section 5 presents conclusions and future work.

2. AN EXAMPLE SEQUENTIAL MINING WITH PLWAP ALGORITHM
Unlike the conditional search in WAP-tree mining, which is based on finding the common suffix sequence first, the PLWAP technique finds the common prefix sequences first. The main idea is to find a frequent pattern by progressively finding its common frequent subsequences, starting with the first frequent event in a frequent pattern. For example, if abcd is a frequent pattern to be discovered, the WAP algorithm progressively finds the suffix sequences d, cd, bcd and abcd. The PLWAP tree, on the other hand, would find the prefix event a first, then, using the suffix trees of node a, it will find the next prefix subsequence ab, and continuing with the suffix tree of b, it will find the next prefix subsequence abc and finally, abcd.
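The prefix-first search order can be illustrated with a deliberately naive prefix-growth enumeration. It is not the PLWAP algorithm: it uses no tree, no position codes and no linkage, and it rescans every sequence for each candidate. It only shows how a frequent prefix is extended one event at a time. The names containsSubsequence, supportCount, growPrefix and mineByPrefixGrowth are hypothetical, and events are assumed to be single characters.

```cpp
#include <set>
#include <string>
#include <vector>

// True if `pattern` occurs in `seq` in the same order, gaps allowed.
bool containsSubsequence(const std::string& seq, const std::string& pattern) {
    std::size_t matched = 0;
    for (char e : seq) {
        if (matched < pattern.size() && e == pattern[matched]) ++matched;
    }
    return matched == pattern.size();
}

// Number of database sequences that contain `pattern`.
int supportCount(const std::vector<std::string>& db, const std::string& pattern) {
    int count = 0;
    for (const std::string& s : db) {
        if (containsSubsequence(s, pattern)) ++count;
    }
    return count;
}

// Extends a frequent prefix by one event at a time, keeping frequent extensions.
void growPrefix(const std::vector<std::string>& db, const std::string& events,
                int minCount, const std::string& prefix, std::set<std::string>& out) {
    for (char e : events) {
        const std::string extended = prefix + e;
        if (supportCount(db, extended) >= minCount) {
            out.insert(extended);
            growPrefix(db, events, minCount, extended, out);  // grow further
        }
    }
}

// Brute-force reference miner: all patterns with support >= minCount.
std::set<std::string> mineByPrefixGrowth(const std::vector<std::string>& db,
                                         const std::string& events, int minCount) {
    std::set<std::string> out;
    growPrefix(db, events, minCount, "", out);
    return out;
}
```

A brute-force miner of this kind is mainly useful as a correctness check for a tree-based implementation on small inputs, since both should report the same frequent pattern set.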
Thus, the idea of PLWAP is to use the suffix trees of the last frequent event in an m-prefix sequence to recursively extend the subsequence to an (m+1)-sequence by adding a frequent event that occurred in the suffix trees. Using the position codes of the nodes, PLWAP is able to know the descendant and sibling nodes of the oldest parent nodes on the suffix root set of a frequent header element being checked for appending to a prefix subsequence, if it is frequent in the suffix root set under consideration. An element is frequent if the sum of the supports of the oldest parent nodes on all its suffix root sets is greater than or equal to the minimum support. Assume we want to mine the web access sequence database (WASD) of Table 2 for frequent sequences given a minimum support of 75% or 3 transactions. Constructing and mining the PLWAP tree goes through the following steps.

(1) Scan WASD (column 2 of Table 2) once to find all frequent individual events, L1, as {a:4, b:4, c:4}. The events d:1, e:2, f:2 have supports less than the minimum support of 3.

(2) Scan WASD again and construct the PLWAP-tree over the set of individual frequent events (column 3 of Table 2) by inserting each sequence from root to leaf, labeling each node as (node event: count: position code). Then, after all events are inserted, traverse the tree in a pre-order way to connect the header link nodes. Figure 3 shows the completely constructed PLWAP tree with the pre-order linkages.

Figure 3: Construction of PLWAP-Tree Using Pre-Order Traversal

(3) Recursively mine the PLWAP-tree using common prefix pattern search: The algorithm starts to find the frequent sequences with the frequent 1-sequences in the set of frequent events (FE) {a, b, c}. For every frequent event in FE and the suffix trees of the current conditional PLWAP-tree being mined, it follows the linkage of this event to find the first occurrence of this frequent event in every current suffix tree being mined, and adds up the support counts of all first occurrences of this frequent event in all its current suffix trees.
If the count is greater than or equal to the minimum support threshold, then this event is appended (concatenated) to the last subsequence in the list of frequent sequences, F. The suffix trees of these first occurrence events in the previously mined conditional suffix PLWAP-trees are now, in turn, used for mining the next event. To obtain this conditional PLWAP-tree, we only need to remember the roots of the current suffix trees, which are stored for the next round of mining.

For example, the algorithm starts by mining the tree in Figure 4(a) for the first element in the header linkage list, a, following the a link to find the first occurrences of a nodes in a:3:1 and a:1:101 of the suffix trees of the Root, since this is the first time the whole tree is passed for mining a frequent 1-sequence. Now, the list of mined frequent patterns F is {a}, since the count of event a in the current suffix trees is 4 (the sum of the a:3:1 and a:1:101 counts), which is more than the minimum support of 3. The mining of frequent 2-sequences that start with event a would continue with the next suffix trees of a, rooted at {b:3:11, b:1:1011} and shown in Figure 4(b) as unshadowed nodes. The objective here is to find if the 2-sequences aa, ab and ac are frequent using these suffix trees. In order to confirm aa frequent, we need to confirm event a frequent in the current suffix tree set, and similarly, to confirm ab frequent, we should again follow the b link to confirm event b frequent using this suffix tree set; the same holds for ac. The process continues to obtain the same frequent sequence set {a, aa, aac, ab, aba, abac, abc, ac, b, ba, bac, bc, c} as the WAP-tree algorithm.
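The position code rule and the first step of the mining example can be sketched as below. This is a simplified illustration, not the PLWAP program of Section 3.1: codes are kept as strings of '0' and '1' rather than a linked list of unsigned integers, only the suffix trees of the Root are handled, and all names (PlwapNode, insertSequence, linkPreOrder, isAncestor, firstOccurrenceSupport) are hypothetical. The ancestor test follows from the coding rule: every child code is its parent's code, then 1, then zero or more 0s, so all descendants of a node share the prefix "node code followed by 1".

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct PlwapNode {
    char event = 0;        // 0 marks the virtual root
    int count = 0;
    std::string code;      // binary position code; empty (null) for the root
    std::vector<std::unique_ptr<PlwapNode>> children;  // left to right
};

// Inserts one frequent subsequence and assigns position codes to new nodes.
void insertSequence(PlwapNode& root, const std::string& seq) {
    PlwapNode* cur = &root;
    for (char e : seq) {
        PlwapNode* next = nullptr;
        for (auto& child : cur->children) {
            if (child->event == e) next = child.get();
        }
        if (!next) {
            std::unique_ptr<PlwapNode> node(new PlwapNode());
            node->event = e;
            // Leftmost child: parent code + 1. Otherwise: left sibling code + 0.
            node->code = cur->children.empty() ? cur->code + "1"
                                               : cur->children.back()->code + "0";
            next = node.get();
            cur->children.push_back(std::move(node));
        }
        ++next->count;
        cur = next;
    }
}

// Builds the header links by visiting the tree in pre-order.
void linkPreOrder(PlwapNode& node, std::map<char, std::vector<PlwapNode*>>& header) {
    for (auto& child : node.children) {
        header[child->event].push_back(child.get());
        linkPreOrder(*child, header);
    }
}

// True if the node with `ancestorCode` is a proper ancestor of the node with `code`.
bool isAncestor(const std::string& ancestorCode, const std::string& code) {
    const std::string prefix = ancestorCode + "1";
    return code.compare(0, prefix.size(), prefix) == 0;
}

// Sums the counts of the first occurrences of one event in the whole tree.
// In pre-order, the descendants of a kept node directly follow it in the link.
int firstOccurrenceSupport(const std::vector<PlwapNode*>& link) {
    int total = 0;
    const PlwapNode* lastKept = nullptr;
    for (const PlwapNode* n : link) {
        if (lastKept && isAncestor(lastKept->code, n->code)) continue;
        total += n->count;
        lastKept = n;
    }
    return total;
}
```

With the frequent subsequences abac, abcac, babac and abacc, this sketch yields the nodes a:3:1 and a:1:101 as the first occurrences of a, whose counts sum to 4.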

Figure 4: Mining PLWAP-Tree to Find Frequent Sequence Starting with a

3. C++ IMPLEMENTATIONS OF THE THREE SEQUENTIAL MINING ALGORITHMS
Although we lay more emphasis on the PLWAP algorithm developed in our lab, we also provide the source codes of the WAP and GSP algorithms used for performance analysis. The source codes are discussed under seven headings: (1) development and running environment, (2) input data format and files, (3) minimum support format, (4) output data format and files, (5) functions used in the program, (6) data structures used in the program and (7) additional information. All of the codes are documented with the information in this section and more, for code readability, maintainability and extendability. Each program is stored in a .cpp file and compiled with g++ filename.cpp.

3.1 C++ Implementation of the PLWAP Sequential Mining Algorithm
This is the program of the PLWAP algorithm, which is based on the description in [7]: C.I. Ezeife and Y. Lu, Mining Web Log Sequential Patterns with Position Coded Pre-Order Linked WAP-tree, DMKD 2005.

1. DEVELOPMENT ENVIRONMENT: Although the initial version was developed under the hardware/software environment specified below, the program runs on more powerful and faster multiprocessor UNIX environments as well. The initial environment is: (i) Hardware: Intel Celeron 400 PC, 64M memory; (ii) Operating system: Windows 98; (iii) Development tool: Inprise (Borland) C++ Builder 6.0. The algorithm was developed under C++ Builder 6.0, but compiles and runs under any standard C++ development tool.

2. INPUT FILES AND FORMAT: The input file is test.data. To simplify the input process of the program, we assume that all input data have been preprocessed such that all events belonging to the same user id have been gathered together, formed as a sequence and saved in a text file called test.data.
The test.data file may be composed of hundreds of thousands of lines of sequences, where each line represents a web access sequence for one user. Every line of the input data file (test.data) includes the UserID, the length of the sequence and the sequence itself, separated by tab spaces. An example input line consists of a UserID, followed by the sequence length 5, followed by the five events of the sequence.

3. MINIMUM SUPPORT FORMAT: The program also needs to accept a value between 0 and 1 as the minimum support. The minimum support is entered interactively by the user during the execution of the program when prompted. For a minimum support of 50%, the user should type 0.5, for a minimum support of 5%, the user should type 0.05, and so on.

4. OUTPUT FILES AND FORMAT: result_PLWAP.data. Once the program terminates, the resulting frequent patterns can be found in a file named result_PLWAP.data. It may contain lines of patterns, each representing a frequent pattern.

5. FUNCTIONS USED IN THE CODE: (i) BuildTree: builds the PLWAP tree; (ii) buildLinkage: builds the pre-order linkage for the PLWAP tree; (iii) makeCode: makes the position code for a node; (iv) checkPosition: checks the position between any two nodes in the PLWAP tree; (v) MiningProcess: mines sequential frequent patterns from the PLWAP tree using position codes.

6. DATA STRUCTURE: Three structs are used in this program:
(i) The node struct represents a PLWAP node, which contains the following information: a. the event name, b. the number of occurrences of the event, c. a link to the position code, d. the length of the position code, e. the linkage to the next node with the same event name in the PLWAP tree, f. a pointer to its left son, g. a pointer to its right sibling, h. a pointer to its parent, i. the number of its sons.
(ii) The position code struct, implemented as a linked list of unsigned integers to make it possible to handle data of any size without exceeding the maximum integer size.
(iii) The linkage struct.

7. ADDITIONAL INFORMATION: The run time is displayed on the screen with the start time, the end time and the total seconds for running the program.
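The input format and the minimum support convention described above can be sketched as follows. This is an illustration only, not the parsing code of the PLWAP program: it assumes that UserID and events are integers, that fields are separated by whitespace (tabs included), and that the fractional minimum support is converted to a record count by rounding up. AccessRecord, parseRecord and minSupportCount are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One pre-processed input line: UserID, sequence length, then the events.
struct AccessRecord {
    int userId = 0;
    std::vector<int> events;   // assumption: events are integer ids
};

// Parses "UserID <tab> length <tab> e1 <tab> e2 ...".
// Returns false if the line is malformed or has fewer events than announced.
bool parseRecord(const std::string& line, AccessRecord& out) {
    std::istringstream in(line);
    int length = 0;
    if (!(in >> out.userId >> length) || length < 0) return false;
    out.events.clear();
    for (int i = 0; i < length; ++i) {
        int e = 0;
        if (!(in >> e)) return false;
        out.events.push_back(e);
    }
    return true;
}

// Converts a minimum support between 0 and 1 into a minimum record count.
// Assumption: the count is rounded up, so 0.75 of 4 records gives 3.
int minSupportCount(double fraction, std::size_t recordCount) {
    return static_cast<int>(std::ceil(fraction * static_cast<double>(recordCount)));
}
```

A driver would read the input file line by line with std::getline, call parseRecord on each line and skip or report the lines for which it returns false.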
3.2 C++ Implementation of the WAP Sequential Mining Algorithm
This is the WAP algorithm program based on the description in [7]: Jian Pei, Jiawei Han, Behzad Mortazavi-asl, and Hua Zhu, Mining Access Patterns Efficiently from Web Logs, PAKDD 2000.

1. DEVELOPMENT ENVIRONMENT: The same as described for PLWAP; the program can run in any UNIX system as well. (i) Hardware: Intel Celeron 400 PC, 64M memory; (ii) Operating system: Windows 98; (iii) Development tool: Inprise (Borland) C++ Builder 6.0. The algorithm was developed under C++ Builder 6.0, and also compiles and runs under any standard C++ development tool.

2. INPUT FILES AND FORMAT:
(i) Input file test.data: Pre-processed sequential input records are read from the file test.data. The test.data file is composed of hundreds of thousands of lines of sequences, where each line represents a web access sequence for one user. Every line includes the UserID, the length of the sequence and the sequence itself, separated by tab spaces. For example, a line may consist of a UserID, followed by the sequence length 5, followed by the five events of the sequence.
(ii) Input file middle.data: used to save the conditional middle patterns. During the WAP tree mining process, following the linkages, once the sum of the supports for an event is found to be greater than the minimum support, all its prefix conditional patterns are saved in the middle.data file for the next round of mining. The format of middle.data is as follows: each line includes the length of the sequence, the occurrence of the sequence, and the events in the sequence. For example, a line in middle.data that starts with 5 and 4 describes a sequence of length 5 that occurred 4 times in the previous conditional WAP tree, followed by the five events of that sequence.

3. MINIMUM SUPPORT: The program also needs to accept a value between 0 and 1 for the minimum support or frequency. The user is prompted for the minimum support when the program starts.

4. OUTPUT FILES AND FORMAT: result_WAP.data. Once the program terminates, the resulting patterns are in a file named result_WAP.data, which may contain lines of patterns.

5. FUNCTIONS USED IN THE CODE: (i) BuildTree: builds the WAP tree or a conditional WAP tree; (ii) MiningProcess: produces sequential patterns or conditional prefix sub-patterns from the WAP tree or a conditional WAP tree.

6. DATA STRUCTURE: Three structs are used in this program:
(i) The node struct represents a WAP node, which contains the following information: a. the event name, b. the number of occurrences of the event, c. the linkage to the next node with the same event name in the WAP tree, d. a pointer to its left son, e. a pointer to its right sibling, f. a pointer to its parent.
(ii) The linkage struct described in the program.

7. ADDITIONAL INFORMATION: The run time is displayed on the screen with the start time, the end time and the total seconds for running the program.
3.3 C++ Implementation of the GSP Sequential Mining Algorithm
This is the GSP algorithm implementation, which demonstrates the result described in [3]: R. Srikant and R. Agrawal, Mining sequential patterns: Generalizations and performance improvements, 1996.

1. DEVELOPMENT ENVIRONMENT: The same as described for PLWAP; the program runs in any UNIX system as well. (i) Hardware: Intel Celeron 400 PC, 64M memory; (ii) Operating system: Windows 98; (iii) Development tool: Inprise (Borland) C++ Builder 6.0. The algorithm was developed under C++ Builder 6.0, but also compiles and runs under any standard C++ development tool.

2. INPUT FILES AND FORMAT: The input file is test.data. To simplify the input process of the program, we assume that all input data have been preprocessed such that all events belonging to the same user id have been gathered together, formed as a sequence and saved in test.data. The test.data file is composed of hundreds of thousands of lines of sequences, where each line represents a web access sequence for one user. Every line includes the UserID, the length of the sequence and the sequence itself, separated by tab spaces. For example, a line may consist of a UserID, followed by the sequence length 5, followed by the five events of the sequence.

3. MINIMUM SUPPORT: The program also needs to accept a value between 0 and 1 as the minimum support. The user is prompted for the minimum support when the program starts.

4. OUTPUT FILES AND FORMAT: result_GSP.data. At the termination of the program, the resulting patterns are in a file named result_GSP.data, which may contain lines of frequent patterns.

5. FUNCTIONS USED IN THE CODE: (i) GSP: reads the file and mines levelwise according to the algorithm.

6. DATA STRUCTURES: There are structs for (i) the candidate sequence list and its count and (ii) a sequence.

7. ADDITIONAL INFORMATION: The run time is displayed on the screen with the start time, the end time and the total seconds for running the program.

4. PERFORMANCE AND EXPERIMENTAL ANALYSIS OF THREE ALGORITHMS
The PLWAP algorithm eliminates the need to store numerous intermediate WAP trees during mining, thus drastically cutting off huge memory access costs.
PLWAP annotates each tree node with a binary position code for quicker mining of the tree. This section compares the experimental performance of PLWAP with that of WAP-tree mining and the Apriori-like GSP algorithm. The three algorithms were initially implemented in the C++ language running under the Inprise C++ Builder environment. All initial experiments were performed on a 400MHz Celeron PC machine with 64 megabytes of memory running Windows 98 (for the work in [7]). The current experiments are conducted with the same implementations of the programs and still on synthetic datasets generated using the resource data generator code from Resources/index.shtml. This synthetic dataset has been used by most pattern mining studies [3, 2, 7]. Experiments were also run on real datasets generated from web log data of the University of Windsor Computer Science server and preprocessed with our web log cleaner code. The correctness of the implementations was confirmed by checking that the frequent patterns generated for the same dataset by all algorithms are the same. A high speed UNIX SUN microsystem with a total of 6384 M memory and 8 x 2 MHz processor speed is used for these experiments. The synthetic datasets consist of sequences of events, where each event represents an accessed web page. The parameters shown below are used to generate the data sets.

D: Number of sequences in the database
C: Average length of the sequences
S: Average length of a maximal potentially frequent sequence
N: Number of events

For example, C10.S5.N2000.D60K means that C = 10, S = 5, N = 2000, and D = 60K. It represents a group of data in which the average length of the sequences is 10, the average length of a maximal potentially frequent sequence is 5, the number of individual events in the database is 2000, and the total number of sequences in the database is 60 thousand. The datasets with different parameters test different aspects of the algorithms. Generally, if the value of any of these four parameters becomes larger, the execution time becomes longer. Experiments are conducted to test the behavior of the algorithms with respect to four parameters: minimum support threshold, database size, number of database table attributes and the average length of sequences. For each experiment, while one of the parameters changes, the others are pegged at some constant values. Observations are made at three levels of small, medium and large (e.g., a small database size may consist of a table with records less than 4 thousands, a medium size table has between 5 and 2 thousands records, while a large one has over 3 thousand).

Experiment 1: Execution time trend with different minimum supports (small size database, 4K records): This experiment uses a fixed size database and different minimum supports to compare the performance of the PLWAP, WAP and GSP algorithms. The datasets are described as C2.5.S5.N5.D4K, and the algorithms are tested with minimum supports between .5% and 2% against the 4 thousand (4K) database with 5 attributes and an average sequence length of 2.5. The results of this experiment are shown in Table 3 and Figure 5, with the number of frequent patterns found reported as Fp. From this result, it can be seen that at the minimum support of .5%, where the number of frequent patterns found is highest at 2729, the PLWAP algorithm ran almost 3 times faster than the WAP algorithm. As the minimum support increases, the number of frequent patterns found decreases and the gain in performance by the PLWAP algorithm over the WAP algorithm decreases. It can be seen that the more frequent patterns are found in a dataset, the higher the performance gain achieved by the PLWAP algorithm over the WAP algorithm. This is because the two algorithms spend about the same amount of time scanning the database and constructing the tree, but the PLWAP algorithm saves on storing and reading intermediate re-constructed trees, which are as many as the number of frequent patterns found.
The execution times of the two algorithms are the same when there are nearly no frequent patterns.

Figure 5: Execution Times Trend with Different Minimum Supports (small database). Axis label: Minimum Support (in %).

Table 3: Execution Times for Dataset at Different Minimum Supports (small database). Columns: Alg, Runtime (in secs) at Different Supports (%), with Fp given for each support.

Experiment 2: Execution times trend with different minimum supports (medium size database, 2K records): This experiment uses a fixed size database and different minimum supports to compare the performance of the PLWAP, WAP and GSP algorithms. The algorithms are tested with minimum supports between .5% and 5% against the 2 thousand (2K) database with 5 attributes and an average sequence length of 2.5. The results of this experiment are shown in Table 4 and Figure 6, with the number of frequent patterns found reported as Fp. It can be seen that the trend in performance is the same as with the small database. When the minimum support reaches 5%, there are no frequent patterns found and the running times of the two algorithms become the same.

Table 4: Execution Times for Dataset at Different Minimum Supports (medium database). Columns: Alg, Runtime (in secs) at Different Supports (%), with Fp given for each support.

Figure 6: Execution Times Trend with Different Minimum Supports (medium database). Axis label: Minimum Support (in %).

Experiment 3: Execution Times for Dataset at Different Minimum Supports (large database): This experiment uses a fixed size database and different minimum supports to compare the performance of the PLWAP, WAP and GSP algorithms. The algorithms are tested with minimum supports between .5% and 5% against a one million (1M) record database. The results of this experiment are shown in Table 5 and Figure 7.

Table 5: Execution Times for Dataset at Different Minimum Supports (large database). Columns: Alg, Runtime (in secs) at Different Supports (%), with Fp given for each support.

Figure 7: Execution Times for Dataset at Different Minimum Supports (large database). Axis label: Minimum Support (in %).

Experiment 3: Execution Times for Dataset at Different Minimum Supports (large database but very low minimum supports): This experiment uses a fixed size database and different minimum supports to compare the performance of the PLWAP, WAP and GSP algorithms. The algorithms are tested with minimum supports between .% and .2% against a one million (1M) record database. The results of this experiment are shown in Table 6 and Figure 8.

Table 6: Execution Times for Dataset at Different Minimum Supports (large database and low minsupports). Columns: Alg, Runtime (in secs) at Different Supports (%), with Fp given for each support.

Figure 8: Execution Times for Dataset at Different Minimum Supports (large database and low minsupports). Axis label: Minimum Support (in %).

Experiment 4: Execution times trend with different database sizes (small size database, 2K to 4K): This experiment uses a fixed minimum support and different database sizes to compare the performance of the PLWAP, WAP and GSP algorithms. The algorithms are tested on databases of sizes 2K to 4K at minimum support of %. The gain in performance by the PLWAP algorithm is constant across different sizes because the number of frequent patterns in different sized datasets generated by the data generator seems to be about the same at a particular minimum support. The results of this experiment are shown in Table 7 and Figure 9.

Experiment 5: Execution times trend with different database sizes (medium size database, 2K to 2K): This experiment uses a fixed minimum support and different database sizes to compare the performance of the PLWAP, WAP and GSP algorithms. The algorithms are tested with database sizes between 2 and 2 thousand at minimum support of M. The results of this experiment are shown in Table 8 and Figure 10.
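Each experiment above states its minimum support as a percentage of the database, so the absolute number of sequences a pattern must occur in grows with the database size. The following is a minimal sketch of that conversion and of counting the support of one candidate sequence. It is not the submitted source code: the function names and the four toy access sequences are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
import math


def min_support_count(min_support_percent, num_sequences):
    """Smallest number of sequences a pattern must occur in to be frequent.

    Fraction avoids floating-point rounding when the percentage is fractional.
    """
    return math.ceil(Fraction(str(min_support_percent)) / 100 * num_sequences)


def is_subsequence(pattern, sequence):
    """True if the events of pattern occur in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def support_count(pattern, database):
    """Number of database sequences that contain pattern as a subsequence."""
    return sum(1 for sequence in database if is_subsequence(pattern, sequence))


# Hypothetical toy database: one string of single-letter events per sequence.
toy_db = ["abdac", "eaebcac", "babfaec", "afbacfc"]
threshold = min_support_count(75, len(toy_db))

for candidate in ("abac", "ef"):
    count = support_count(candidate, toy_db)
    print(candidate, count, "frequent" if count >= threshold else "not frequent")
```

Because the threshold is relative, doubling the number of sequences doubles the required count, which is consistent with the observation that the number of frequent patterns stays about the same across database sizes at a given minimum support.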
Table 7: Execution Times for Different Database Sizes at Minsupport (small database). Columns: Alg, Runtime (in secs) at each database size.

Figure 9: Execution Times Trend with Different Minimum Supports (large database). Axis label: Database Sizes.

Figure 10: Execution Times Trend with Different Database sizes (medium database). Axis label: Database Sizes.

Experiment 6: Execution times trend with different database sizes (large size database, 3K to 9K): This experiment uses a fixed minimum support and different database sizes to compare the performance of the PLWAP, WAP and GSP algorithms. The algorithms are tested with database sizes between 3K and 9K records at minimum support of %. The results of this experiment are shown in Table 9 and Figure 11.

Experiments were also run to check the behavior of the algorithms with varying sequence lengths and numbers of database attributes. PLWAP always runs faster than the WAP algorithm when the average sequence length is less than 2 and some frequent patterns are found. However, PLWAP performance starts to degrade when the average sequence length of the database is more than 2, because with extremely long sequences there are nodes whose position codes are larger than maxint. In our current implementation, we store a node's position code in a number of variables that are linked together. Thus, managing and retrieving the position code for excessively long sequences could turn out to be time consuming.

Table 8: Execution Times for Different Database Sizes at Minsupport (medium database). Columns: Alg, Runtime (in secs) at each database size.

Figure 11: Execution Times Trend with Different Database sizes (large database). Axis label: Database Sizes.

5. CONCLUSIONS
This paper discusses the source code implementation of the PLWAP algorithm presented in [7], as well as our implementations of two other sequential mining algorithms, WAP [17] and GSP [3], that PLWAP was compared with. Extensive experimental studies were conducted on the three implemented algorithms using IBM quest synthetic datasets. From the experiments, it was discovered that the PLWAP algorithm, which mines a pre-order linked, position coded version of the WAP tree, always outperforms the other two algorithms.
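The degradation reported for long sequences follows from position codes outgrowing a machine integer. As a hedged illustration, the sketch below assumes the position-code rule as we understand it from [7]: the n-th child of a node receives the parent's code followed by a 1 and n-1 zeros, and a node is an ancestor of another when its code followed by 1 is a prefix of the other's code. The class and function names are hypothetical, support filtering of events is omitted, and Python strings stand in for the linked variables used in the actual implementation.

```python
class Node:
    """One tree node: an event label plus its binary position code."""

    def __init__(self, event, code):
        self.event = event
        self.code = code          # position code kept as a string of '0'/'1'
        self.count = 0
        self.children = []        # left-to-right order of creation


def insert(root, sequence):
    """Insert one event sequence and return the node of its last event."""
    node = root
    for event in sequence:
        child = next((c for c in node.children if c.event == event), None)
        if child is None:
            # Assumed rule: the n-th child gets parent code + '1' + (n-1) zeros.
            child = Node(event, node.code + "1" + "0" * len(node.children))
            node.children.append(child)
        child.count += 1
        node = child
    return node


def is_ancestor(a, b):
    """Assumed test: a is an ancestor of b if a's code + '1' prefixes b's code."""
    return b.code.startswith(a.code + "1")


root = Node(None, "")
c_node = insert(root, "abc")          # codes: a=1, b=11, c=111
d_node = insert(root, "abd")          # d is b's second child: 1110
a_node = root.children[0]
print(is_ancestor(a_node, d_node), is_ancestor(c_node, d_node))

deep = insert(root, "xy" * 20)        # a single sequence of 40 events
print(len(deep.code), int(deep.code, 2) > 2**31 - 1)
```

Python strings and integers are unbounded, so no overflow occurs in this sketch. The point is only that a node at depth d needs a code of at least d bits, so a 32-bit integer is exhausted once sequences become long enough.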
Table 9: Execution Times for Different Database Sizes at Minsupport (large database). Columns: Alg, Runtime (in secs) at each database size.

PLWAP improves on the performance of the efficient tree-based WAP tree algorithm by mining with position codes and their suffix tree root sets, rather than storing intermediate WAP trees recursively. Thus, it saves on processing time, more so when the number of frequent patterns increases and the minimum support threshold is low. The PLWAP algorithm runs faster than the WAP algorithm even with very large databases. PLWAP's performance seems to degrade somewhat with very long sequences having sequence length more than 2, because of the increase in the size of the position codes that it needs to build and process for such long sequences, which leads to a very deep PLWAP tree. The experiments show that mining web logs using the PLWAP algorithm is much more efficient than with the WAP-tree and GSP algorithms.

For mining sequential patterns from web logs, the following aspects may be considered for future work. The PLWAP algorithm could be extended to handle sequential pattern mining in large traditional databases to handle concurrency of events. Also, efficient web usage mining could benefit from relating usage to the content of web pages. Other areas of interest for future work include distributed mining with PLWAP trees and applying these techniques to incremental mining of web logs and sequential patterns.

6. ACKNOWLEDGMENTS
This research was supported by the Natural Science and Engineering Research Council (NSERC) of Canada under an operating grant (OGP-9434) and a University of Windsor grant.

7. REFERENCES
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, 1994.
[2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE 95), Taipei, Taiwan, pages 3-14, 1995.
[3] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the Fifth International Conference on Extending Database Technology (EDBT 96), Avignon, France, pages 3-17, 1996.
[4] J. Borges and M. Levene. Data mining of user navigation patterns. In Proceedings of the KDD Workshop on Web Mining, San Diego, California, pages 31-36, 1999.
[5] O. Etzioni. The world wide web: Quagmire or gold mine. Communications of the ACM, 39(11):65-68, 1996.
[6] C. Ezeife and M. Chen. Mining web sequential patterns incrementally with revised plwap tree. In Proceedings of the Fifth International Conference on Web-Age Information Management (WAIM 2004), Springer Verlag, July 2003.
[7] C. Ezeife and Y. Lu. Mining web log sequential patterns with position coded pre-order linked wap-tree. International Journal of Data Mining and Knowledge Discovery (DMKD), Kluwer Publishers, 10(1):5-38, 2005.
[8] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. Freespan: Frequent pattern-projected sequential pattern mining. In Proceedings of the 2000 Int. Conference on Knowledge Discovery and Data Mining (KDD 00), 2000.
[9] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. International Journal of Data Mining and Knowledge Discovery, 8(1):53-87, Jan 2004.
[10] Y. Lu and C. Ezeife. Position coded pre-order linked wap-tree for web log sequential pattern mining. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2003), Springer, May 2003.
[11] S. Madria, S. Bhowmick, W. Ng, and E. Lim. Sampling large databases for finding association rules. In Proceedings of the First International Data Warehousing and Knowledge Discovery (DaWaK99), 1999.
[12] F. Masseglia, F. Cathala, and P. Poncelet. Psp: Prefix tree for sequential patterns. In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD 98), Nantes, France, LNAI, pages 176-184, 1998.
[13] F. Masseglia, P. Poncelet, and R. Cicchetti. An efficient algorithm for web usage mining. Networking and Information Systems Journal (NIS), 2(5-6):571-603, 1999.
[14] A. Nanopoulos and Y. Manolopoulos. Finding generalized path patterns for web log data mining. Data and Knowledge Engineering, 37(3), 2001.
[15] A. Nanopoulos and Y. Manolopoulos. Mining patterns from graph traversals. Data and Knowledge Engineering, 37(3), 2001.
[16] J. Pei, J. Han, B. Mortazavi-Asl, and H. Pinto. Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the International Conference on Data Engineering (ICDE 01), 2001.
[17] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu. Mining access patterns efficiently from web logs. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 00), Kyoto, Japan, 2000.
[18] M. Spiliopoulou. The laborious way from data mining to web mining. Journal of Computer Systems Science and Engineering, Special Issue on Semantics of the Web, 14:113-126, 1999.
[19] J. Srivastava, R. Cooley, M. Deshpande, and P. Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1, 2000.


