Personalized Web Search by User Interest Hierarchy

Abstract

Most web search engines are designed to serve all users, independent of the needs of any individual user. Personalization of web search means carrying out retrieval for each user while incorporating that user's interests. Current personalization techniques use the hierarchy of hyperlinks, map a user query to a set of categories, and so on. However, they do not use both the contents of a web page and an implicitly learned User Interest Hierarchy (UIH). We propose a novel technique that orders retrieved results according to the user's interests, using the contents of a web page and a UIH. The UIH is learnt implicitly from a bookmark. Experimental results indicate that our approach to personalized web search satisfies a user's request more accurately than Google in retrieving both interesting web pages and potentially interesting web pages.

1. Introduction

Our contributions are:
- Our methods (WS and WS_Max) do not need a process for selecting important/significant words/phrases;
- We developed methods that provide higher performance than Google (Google, 2004) in retrieving interesting web pages;
- We developed methods that provide higher performance than Google in retrieving potentially interesting web pages;
- When we gave more weight to C than to L, F, and G, the performance increased;
- The number of outside words is the best indicator for identifying index pages.

2. Related work

Page et al. (1998) first proposed personalized web search by modifying the global PageRank algorithm with the input of a bookmark or homepage. Their work, however, mainly focused on global importance by taking advantage of the link structure of the web. Haveliwala (Haveliwala, 1999) showed that PageRank could be computed for very large subgraphs of the web on machines with limited main memory. Brin et al. (Brin et al., 1998) suggested the idea of biasing the PageRank computation for the purpose of personalization, but it was never fully explored. Bharat and Mihaila (2001) suggested an approach called Hilltop, which generates a query-specific authority score by detecting and indexing pages that appear to be good experts for certain keywords, based on their outlinks. Hilltop is designed to improve results for popular queries; however, query terms for which no experts are found are not handled by the Hilltop algorithm.
Haveliwala (2002) used personalized PageRank scores to enable topic-sensitive web search. He concluded that the use of personalized PageRank scores can improve web search, but the number of hub vectors used was limited to 16 due to the computational requirements. Jeh and Widom (2003) scaled the number of hub pages beyond 16 for finer-grained personalization. Our method does not use the structure of hyperlinks.

Liu et al. (2002) also tried to map a user query to a set of categories that represent the user's search intention. This set of categories served as a context to disambiguate the words in the user's query, which is similar to Vivisimo (Vivisimo, 2004). They studied how to supply, for each user, a small set of categories as a context for each query submitted by the user, based on his/her search history. Our approach does not personalize the set of categories; it personalizes the whole set of retrieved results.

Another approach to web personalization is to predict forward references based on partial knowledge about the history of the session. Zukerman et al. (1999) and Cadez et al. (2000) use a Markov model to learn and represent significant dependencies among page references. Shahabi and Banaei-Kashani (2003) proposed a web-usage-mining framework using navigation pattern information. They introduced a feature-matrices (FM) model to discover and interpret users' access patterns. This approach differs from ours, since ours uses the contents of web pages rather than navigation pattern information.

PowerBookmarks (Li et al., 1999) is a web information organization, sharing, and management tool that monitors and utilizes users' access patterns and provides useful personalized services, such as automated URL bookmarking, document refreshing, bookmark expiration, and subscription services for new or updated documents. BookmarkOrganizer (Maarek and Ben-Shaul, 1996) is an automated system that maintains a hierarchical organization of a user's bookmarks. It uses the classical HAC algorithm (Voorhees, 1986), but applies a slicing technique (slice the tree at regular intervals and collapse all levels between two slices into one single level). Both BookmarkOrganizer and PowerBookmarks reduce the effort required to maintain the bookmark, but they are insensitive to the context browsed by users and do not have a reordering function.
3. Expanded problem statement

Personalization of web search means carrying out retrieval for each user while incorporating that user's interests. Our approach orders retrieved results according to the user's interests, using the contents of a web page and a UIH (User Interest Hierarchy). The UIH is learnt implicitly from a bookmark by the Divisive Hierarchy Clustering algorithm (DHC) (Kim and Chan, 2003). The parent and child nodes in a UIH are related to each other, and two siblings in a UIH are related to each other. Instead of building a web search engine database, we use the results retrieved from Google (2004). Since the purpose of this paper is to produce a more accurate ordering than the public ordering, as shown in the dashed box in Figure 1, using Google's results also helps our performance comparison.

[Figure 1. Diagram of scoring: a bookmark is clustered by DHC into a UIH; the UIH and the results retrieved from the search engine are the inputs to scoring, which outputs a personalized ordering.]

This paper focuses on devising a scoring method that receives two inputs (UIH and retrieved results) and outputs a personalized ordering. The UIH is learnt from a bookmark. We assume that the links in a bookmark are interesting to the user. The top 100 retrieved links are used for our experiment. Our system calculates a personalized score for each page and combines it with the page's public score (Google order). The output is a list of reordered links based on the personal UIH and the public order.

Input: User Interest Hierarchy (UIH); retrieved results from Google.
Output: personalized reordering.

4. Approach

In order to provide personalized, reordered search results to a user, we need to recalculate a personalized score for each page. The goal is to assign higher scores to web pages that are interesting to the user and lower scores to uninteresting ones. For a given web page $p_0$, the personalized page score (PPS) equation can be written as:

$$PPS = (1 - c)\,\mathrm{Public}(p_0) + c\,\mathrm{Personal}(p_0) \qquad \text{(Eq. 1)}$$

The rank order returned by Google is used as the public score; a new equation for the Personal function is explored in this paper. Depending on the value of the constant c, the public function and the personal function are weighted differently. For simplicity, both functions are weighted equally in this paper, i.e., c is 0.5. We applied different methods to calculate the personalized score: the Uniform Scoring model (US), the Weighted Scoring model (WS), WS with Max, and WS with Sum.

4.1. Uniform Scoring

The personalized score of a web page is calculated by adding the scores of all the terms in the web page. The personalized page score of a web page $p_j$ can be formulated as:

$$S_{p_j} = \sum_{i=1}^{m} S_{t_i} \qquad \text{(Eq. 2)}$$

where m is the total number of distinct phrases in the web page and $S_{t_i}$ is the score of each distinct phrase. In order to calculate the term score, each method uses the C, L, F, and G components. C is the level of the cluster that a phrase belongs to, L is the length of a phrase (how many words are in the phrase), F is the frequency of a phrase, and G is the significance of a phrase. For example, titled phrases are considered more significant than normal phrases. A simple personalized term score for a phrase can be formulated as:

$$S_{t_i} = P(C_{t_i}) + P(L_{t_i}) + P(F_{t_i}) + P(G_{t_i}) \qquad \text{(Eq. 3)}$$

We want to assign higher relevance towards the leaf nodes. However, how much more score should be assigned to the clusters as the level goes deeper is open to question. Previous research indicates that user-defined query scores can be used effectively
(Salton and Waldstein, 1978; Harper, 1980; Croft and Das, 1989). From the acquisition point of view, it is not clear how many levels of importance users can specify if we ask them directly. I3R (Croft and Thompson, 1987) used only two levels: important or default. Harper's thesis (1980) used 5 levels of importance, and Croft and Das (1989) used 4 levels. We provide query scores depending on the level of a cluster in the UIH instead of asking the user.

The phrases in the leaf nodes are more specific/important than the phrases in the root node, so the percentage of phrases at the level of a cluster can be used. If the number of phrases in a leaf cluster is the same as in the root node, those phrases are not more significant than the phrases in the root node; however, if only a few phrases remain in the leaf cluster, they are more significant and carry more information. The percentage of a cluster for a phrase is calculated by dividing the number of phrases at the level to which the phrase's cluster belongs by the total number of phrases in the root cluster.

$$P(C_{t_i}) = \frac{\text{\# of phrases in the clusters of a certain level}}{\text{total \# of distinct phrases}}$$

Phrases have different lengths. Longer phrases are more specific than shorter ones. In order to assign more score to longer phrases, the percentage can also be used. The percentage of a certain length for a phrase is calculated by dividing the number of phrases with the same length as the phrase by the total number of distinct phrases.

$$P(L_{t_i}) = \frac{\text{\# of phrases with a certain length}}{\text{total \# of distinct phrases}}$$

More frequent phrases are more significant than less frequent phrases. The frequency of each phrase in a web page is counted. The percentage is calculated by dividing the number of phrases with the same frequency as the phrase by the total number of distinct phrases.

$$P(F_{t_i}) = \frac{\text{\# of phrases with a certain frequency}}{\text{total \# of distinct phrases}}$$

Some phrases appear in a title, in bold, or in italics. We call those formatted phrases significant phrases. They are more important than phrases with normal formatting. In general, significant phrases are fewer than normal phrases. The percentage of significance is calculated by dividing the number of significant phrases by the total number of distinct phrases.

$$P(G_{t_i}) = \frac{\text{\# of significant phrases}}{\text{total \# of distinct phrases}}$$

The smaller the component value is, the more significant the phrase. In information theory, the entropy is calculated by multiplying the proportion of a term belonging to class k by the expected encoding length measured in bits (the logarithmic part) (Mitchell, 1997):

$$\mathrm{Entropy}(E) = -\sum_{k=1}^{2} p_k \log_2 p_k$$

where a component has only two classes: belonging to a category or not. If we apply the entropy function to Eq. 3, the derived formula is:

$$\begin{aligned} S_{t_i} = {} & -P(C_{t_i})\log_2 P(C_{t_i}) - P(\neg C_{t_i})\log_2 P(\neg C_{t_i}) \\ & -P(L_{t_i})\log_2 P(L_{t_i}) - P(\neg L_{t_i})\log_2 P(\neg L_{t_i}) \\ & -P(F_{t_i})\log_2 P(F_{t_i}) - P(\neg F_{t_i})\log_2 P(\neg F_{t_i}) \\ & -P(G_{t_i})\log_2 P(G_{t_i}) - P(\neg G_{t_i})\log_2 P(\neg G_{t_i}) \end{aligned} \qquad \text{(Eq. 4)}$$

Grouping the proportion parts and the complement parts gives:

$$\begin{aligned} S_{t_i} = {} & -P(C_{t_i})\log_2 P(C_{t_i}) - P(L_{t_i})\log_2 P(L_{t_i}) - P(F_{t_i})\log_2 P(F_{t_i}) - P(G_{t_i})\log_2 P(G_{t_i}) \\ & -P(\neg C_{t_i})\log_2 P(\neg C_{t_i}) - P(\neg L_{t_i})\log_2 P(\neg L_{t_i}) - P(\neg F_{t_i})\log_2 P(\neg F_{t_i}) - P(\neg G_{t_i})\log_2 P(\neg G_{t_i}) \end{aligned} \qquad \text{(Eq. 5)}$$

In Eq. 5, the complement part is more ambiguous than the proportion of an object. For example, when the length of a term is 3, other terms can have length 4, 5, or 7. All information other than that about length 3 belongs to the complement part. We therefore removed the complement parts and derived:

$$S_{t_i} = -P(C_{t_i})\log_2 P(C_{t_i}) - P(L_{t_i})\log_2 P(L_{t_i}) - P(F_{t_i})\log_2 P(F_{t_i}) - P(G_{t_i})\log_2 P(G_{t_i}) \qquad \text{(Eq. 6)}$$

The proportion part is not significant in our equation, because P(C), P(L), P(F), and P(G) are all independent of each other. We therefore also drop the proportion part. The derived formula for a personalized term score, which emphasizes a component based on the expected encoding length measured in bits, is:

$$S_{t_i} = -\log_2 P(C_{t_i}) - \log_2 P(L_{t_i}) - \log_2 P(F_{t_i}) - \log_2 P(G_{t_i}) \qquad \text{(Eq. 7)}$$

The time complexity is the number of distinct phrases in a web page, because all percentages for the components can be calculated in preprocessing. This formula is called uniform scoring (US), since C, L, F, and G are equally weighted.
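The uniform scoring of Section 4.1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the four example phrases, and their component values are hypothetical, and the sketch assumes each proportion is the share of phrases that have the same component value as the phrase being scored (for G, the same significant/normal flag).

```python
import math
from collections import Counter


def proportions(values):
    """For each phrase, the fraction of phrases that share its component value."""
    counts = Counter(values.values())
    total = len(values)
    return {phrase: counts[value] / total for phrase, value in values.items()}


def term_scores(level, length, freq, significant, weights=(1.0, 1.0, 1.0, 1.0)):
    """Per-phrase score in the style of Eq. 7: sum of -log2 of each proportion.

    `weights` is an illustrative extension; the default gives uniform scoring.
    """
    components = [proportions(c) for c in (level, length, freq, significant)]
    return {
        phrase: sum(-w * math.log2(p[phrase]) for w, p in zip(weights, components))
        for phrase in level
    }


def page_score(scores):
    """Eq. 2: the page score is the sum of the scores of its distinct phrases."""
    return sum(scores.values())


# Hypothetical page with four distinct phrases (values are made up).
level = {"web search": 1, "user interest hierarchy": 2, "bookmark": 1, "ranking": 1}
length = {"web search": 2, "user interest hierarchy": 3, "bookmark": 1, "ranking": 1}
freq = {"web search": 3, "user interest hierarchy": 1, "bookmark": 1, "ranking": 3}
significant = {"web search": False, "user interest hierarchy": True,
               "bookmark": False, "ranking": False}

scores = term_scores(level, length, freq, significant)
print(scores["user interest hierarchy"])  # rare level, length and formatting
print(page_score(scores))
```

A phrase whose level, length, and formatting are rare on the page receives a larger score, which is the behaviour the derivation aims for. Passing non-uniform `weights` turns the same function into a weighted variant.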
4.2. Weighted Scoring

The US model uses uniform weights for each component. It is possible that some components are more important than others. For instance, the level of a cluster (C) may be more significant than the frequency (F). Therefore, we attempted to differentiate the weights of the components. The personalized page score for the weighted scoring method (WS) is obtained by substituting Eq. 8 into Eq. 2:

$$S_{t_i} = -w_1 \log_2 P(C_{t_i}) - w_2 \log_2 P(L_{t_i}) - w_3 \log_2 P(F_{t_i}) - w_4 \log_2 P(G_{t_i}) \qquad \text{(Eq. 8)}$$

Since the performance may depend on the weights, how we assign them is important. A simple heuristic is used: we assume the cluster level is at least two times more important than the other components. Based on this heuristic, $w_1 = 0.4$, $w_2 = 0.2$, $w_3 = 0.2$, and $w_4 = 0.2$ are assigned.

4.3. Weighted Scoring with MAX

[Figure 2. A UIH with scores. Cluster 1 (root) contains phrases a, b, c, d, e, f, g, h; its children are cluster 2 (a, b, c, d, e) and cluster 3 (f, g). Cluster 2 has children cluster 4 (a) and cluster 5 (b, c, d). Cluster 3 has child cluster 6 (g).]

| Phrase | Score | Phrase | Score |
|---|---|---|---|
| a | 50 | e | 24 |
| b | 115 | f | 80 |
| c | 40 | g | 110 |
| d | 30 | h | 66 |

| Cluster Id | MAX | SUM |
|---|---|---|
| 1 | 115 | 515 |
| 2 | 115 | 256 |
| 3 | 110 | 190 |
| 4 | 50 | 50 |
| 5 | 115 | 185 |
| 6 | 110 | 110 |

The previous two models (US and WS) are designed under the assumption that all branches are well balanced. In most cases, however, the branches are not balanced. In that case, we prefer to assign higher importance to thick branches, because those branches may be more important than thin branches. First the maximum phrase score in each cluster is found, and then the maximum scores of all parent clusters are added to the score of a phrase. The personalized page score of WS with MAX can be calculated as:

$$S_{p_j} = \sum_{i=1}^{m} \left( T_{max} + S_{t_i} \right) \qquad \text{(Eq. 9)}$$

where m is the total number of distinct phrases in the web page and $T_{max}$ is the sum of the maximum scores of the parent clusters of phrase $t_i$. For example, the score of phrase d is (115 + 115) + 30. The first 115 is from cluster 1 in Figure 2, the second 115 is from cluster 2, and the 30 is the score of d itself. As another example, the score of phrase g is (115 + 110) + 110. The 115 is from cluster 1, the first 110 is from cluster 3, and the last 110 is the score of g itself. The time complexity remains the same as for WS, since the maximum score in a cluster can be calculated in preprocessing.

4.4. Weighted Scoring with SUM

We present this method only for comparison with the other methods. The WS with MAX method above may not put enough emphasis on thick branches. The following approach assigns more score to thick branches than WS with MAX does, by using the sum of the scores in a cluster in place of the maximum. The personalized page score is calculated as:

$$S_{p_j} = \sum_{i=1}^{m} \left( T_{sum} + S_{t_i} \right) \qquad \text{(Eq. 10)}$$

where m is the total number of distinct phrases in the web page and $T_{sum}$ is the sum of the SUM scores of the parent clusters of phrase $t_i$. For instance, the score of phrase d is (515 + 256) + 30. The 515 is from cluster 1 in Figure 2, the 256 is from cluster 2, and the 30 is the score of d itself. As another example, the score of phrase g is (515 + 190) + 110. The 515 is from cluster 1, the 190 is from cluster 3, and the 110 is the score of g itself. The time complexity is the same as for WS with MAX, since the complexity of calculating the sum is the same as that of finding the maximum.
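The worked examples for phrases d and g can be reproduced with a short sketch. The cluster tree and phrase scores below are read off Figure 2; the score of phrase e is taken to be 24, which is an assumption chosen to be consistent with cluster 1's SUM of 515. The function and variable names are illustrative only.

```python
# Phrase scores from Figure 2 (the value for "e" is an assumption, see above).
PHRASE_SCORES = {"a": 50, "b": 115, "c": 40, "d": 30,
                 "e": 24, "f": 80, "g": 110, "h": 66}

# Cluster membership and parent links of the example UIH.
CLUSTERS = {1: set("abcdefgh"), 2: set("abcde"), 3: set("fg"),
            4: {"a"}, 5: set("bcd"), 6: {"g"}}
PARENT = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}


def deepest_cluster(phrase):
    """The most specific (smallest) cluster that contains the phrase."""
    containing = [cid for cid, members in CLUSTERS.items() if phrase in members]
    return min(containing, key=lambda cid: len(CLUSTERS[cid]))


def parent_clusters(cid):
    """All ancestors of a cluster, from its parent up to the root."""
    ancestors = []
    cid = PARENT[cid]
    while cid is not None:
        ancestors.append(cid)
        cid = PARENT[cid]
    return ancestors


def boosted_score(phrase, aggregate=max):
    """Phrase score plus the aggregated score of each parent cluster.

    aggregate=max follows WS with MAX; aggregate=sum follows WS with SUM.
    """
    bonus = sum(
        aggregate(PHRASE_SCORES[p] for p in CLUSTERS[cid])
        for cid in parent_clusters(deepest_cluster(phrase))
    )
    return bonus + PHRASE_SCORES[phrase]


print(boosted_score("d"))                 # (115 + 115) + 30
print(boosted_score("g"))                 # (115 + 110) + 110
print(boosted_score("g", aggregate=sum))  # (515 + 190) + 110
```

Because the per-cluster maximum or sum depends only on the UIH, it could be cached once per user rather than recomputed for every phrase, which matches the remark that these values can be calculated in preprocessing.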
4.6. Measuring Index Pages

We observed that users tend not to rate index pages as "Good". This is mainly because index pages usually contain a long list of hyperlinks with few descriptions. It is hard to define an index page, because most web pages provide an index in addition to some description. The decision is also rather subjective. We roughly define an index page as a web page that aims to provide links instead of description. There are many ways to identify index pages. We simplify our approach in this paper by using hyperlink information only. There are 3 ways of using hyperlink information: the hyperlink structure (Jeh and Widom, 2003), the number of hyperlinks in a web page (Page et al., 1998), and the words inside/outside hyperlinks. In this paper, the number of hyperlinks and the words inside the hyperlinks are used.

When counting the number of hyperlinks, hyperlinks can be classified into two groups: those linking to a descriptive page and those linking to a re-linkable page. A descriptive page is like a text file or a PDF file, which usually does not contain hyperlinks to other web pages. The list of link targets and extensions treated as descriptive pages is: (empty), /, ./, mailto:, javascript:, links starting with # (anchor), .pdf, .txt, .ppt, .java, .zip, .gz, .rdf, .jpg, .bmp, .gif, .jpeg, .wmv, .exe, .mp, .doc, .rtf, .dll, .h, .cpp, .xls, .mdb, .dvi, .tex, ..v, .libgcj. We also manually checked hyperlinks and did not count these files either: .am, .in, .mdf.

A re-linkable page is like an HTML page, which usually contains links to other web pages. The list of extensions for re-linkable pages is: .htm, .html, .shtml, .jhtml, .phtml, .asp, .gsp, .aspx, .asx, .php, .php3, .cfm, etc. Since an index page generally links to many other re-linkable pages, we decided not to count the hyperlinks connected to descriptive pages. In the following, "the number of hyperlinks" means the number of re-linkable hyperlinks in a web page. "The words inside hyperlinks" means the underlined words that mark a link; "outside words" means all other words. The words inside/outside hyperlinks can be collected by building a parser program.

Two models were developed for scoring index pages: the number of outside words, and the number of outside words divided by the number of hyperlinks. We assume that the more outside words a web page has, the less likely it is to be an index page.
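The link classification and the two index-page indicators described above could be collected with a small parser along the following lines. This is a sketch under stated assumptions rather than the parser used in the experiments: it relies on Python's standard `html.parser`, treats a link as re-linkable only when its path ends in one of a handful of the listed extensions, and returns `None` for the ratio when a page has no re-linkable links (a case the text does not specify). The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# A subset of the re-linkable extensions listed in the text.
RELINKABLE_EXTENSIONS = (".htm", ".html", ".shtml", ".jhtml", ".phtml",
                         ".asp", ".aspx", ".php", ".php3", ".cfm")


class LinkWordCounter(HTMLParser):
    """Counts re-linkable hyperlinks and words inside/outside anchors."""

    def __init__(self):
        super().__init__()
        self.relinkable_links = 0
        self.inside_words = 0
        self.outside_words = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
            href = dict(attrs).get("href") or ""
            path = href.lower().split("?")[0]  # ignore any query string
            if path.endswith(RELINKABLE_EXTENSIONS):
                self.relinkable_links += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        words = len(data.split())
        if self._anchor_depth:
            self.inside_words += words
        else:
            self.outside_words += words


def index_page_indicators(html):
    """Return (outside word count, outside words per re-linkable link)."""
    counter = LinkWordCounter()
    counter.feed(html)
    counter.close()
    ratio = None
    if counter.relinkable_links:
        ratio = counter.outside_words / counter.relinkable_links
    return counter.outside_words, ratio


sample = ('<p>Intro text here</p><a href="a.html">one</a> '
          '<a href="doc.pdf">two words</a><a href="b.php">three</a>')
print(index_page_indicators(sample))
```

The sketch counts every text node, so script and style content would need to be stripped beforehand if it should not contribute to the word counts.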
The second model's formula is the number of outside words divided by the number of hyperlinks. It is needed because the first model cannot distinguish between two web pages that have the same number of outside words when one of them has more hyperlinks. The final formula for the page score adjusted with the index page measure (IPM) is:

$$S^{adj}_{p_j} = \mathit{IPM} + S_{p_j} \qquad \text{(Eq. 11)}$$

5. Experiment

In our experiment, twenty-two data sets were collected from 11 different users. We asked each volunteer to submit 2 search terms that can contain any Boolean operators (e.g., review forum +"scratch remover"). All search terms used are listed in Appendix 1. Of the 11 human subjects, 4 were undergraduate students and 7 were graduate students. In terms of major, 7 were in Computer Sciences, 2 in Aeronautical Sciences, 1 in Chemical Engineering, and 1 in Marine Biology. We then used Google to retrieve 100 related links for each search term. The collected links were classified/scored by the user based on two categories: interest and potential interest. The definition of interest was whether users found what they wanted; the definition of potential interest was whether a web page would be interesting to the user in the future. The data set of interest has more authority because it meets the purpose of our research directly. The data set of potential interest is more flexible in reflecting a user's personal interests. It also partially describes whether users get what they want. We categorized interest as "Good", "Fair", and "Poor"; potential interest is categorized as "Yes" and "No". A web page is scored as "Good", "Fair", or "Poor" depending on each individual's subjective opinion based on the definition of interest. It is also marked as "Yes" or "No" based on the user's potential interest.

Web pages that contained no text were excluded because we handled only text. Furthermore, broken links that were still erroneously ranked high by Google were excluded from the test, because those web pages would be scored "Poor" by the user. Our method also assigns low scores to those sites, since there is not much data on the web page. In this way we attempted to remove the negative bias arising from errors in Google, which prevents the performance of Google from being lowered artificially. We also requested each volunteer to submit the links in their bookmarks.
If there were fewer than 50 links in a bookmark, we asked the volunteer to collect more links, up to around 50. The minimum number of links is 38 and the maximum number is 72. The data used in this study is accessible at http://cs.fit.edu/~hkim/dissertation/dissertation.htm.

Web pages from both bookmarks and Google were parsed to retrieve only the text, without grammar. Much time was spent making the parser resilient to the many problems that reside in the network or in web articles. A large fraction of web pages had incorrect HTML, which made it difficult to build a parser. All script languages were removed because they tended to contain functions rather than text. All words in combo boxes were also removed because they do not appear on the screen until the user clicks on them. All unimportant contexts such as comments, style, etc. were also removed. A Microsoft .NET language was used, and the program ran on an Intel Pentium 4 CPU.

Since the Vector Space Model (VSM) has been widely used in traditional Information Retrieval (IR) (Gravano et al., 1999; Grossman et al., 1997), it is useful to implement this method and compare its results with those of our methods. VSM calculates the similarity between two vectors. In our case, one is from a web page (V) and the other is from a user's UIH (H): $V = \langle v_1, v_2, v_3, \ldots, v_m \rangle$ and $H = \langle h_1, h_2, h_3, \ldots, h_m \rangle$, where each element is a score corresponding to a distinct phrase in the web page.

$$S_{p_j} = \mathrm{sim}(H, V) = \frac{\sum_{i=1}^{m} h_i v_i}{\sqrt{\sum_{i=1}^{m} h_i^2 \; \sum_{i=1}^{m} v_i^2}}$$

where each score $h_i$ and $v_i$ is calculated as:

$$h_i = -\ln P(C_{t_i}) - \ln P(L_{t_i}) + \mathit{TFIDF}_{t_i}, \qquad v_i = -\ln P(F_{t_i}) - \ln P(G_{t_i}) - \ln P(L_{t_i})$$

The C, L, and TFIDF (term frequency inverse document frequency) values for the vector H were derived from the UIH, in other words from bookmarks. The F, G, and L values for the vector V were derived from the web page to be scored.

We evaluated the ordering algorithms based on how many "Good" or potentially interesting links each method collected within a certain number of top links, that is, the accuracy within a certain number of top links (Bharat and Mihaila, 2001). This is realistic in the sense that many information retrieval systems are interested in the top 10 or 20 groups. The precision/recall graph (van Rijsbergen, 1979) is also used for evaluation. It is one of the most common measuring methods in information retrieval. Traditional precision/recall graphs are very sensitive to the initial rank positions and evaluate entire rankings (Croft and Das, 1989).
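The similarity used by the VSM baseline is the standard cosine between the UIH vector H and the page vector V. A minimal sketch is given below; the function name is hypothetical, the example vectors are made up, and returning 0.0 for an all-zero vector is a convention chosen here rather than something stated in the text.

```python
import math


def cosine_similarity(h, v):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(x * y for x, y in zip(h, v))
    norm = math.sqrt(sum(x * x for x in h) * sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Made-up scores for three distinct phrases.
uih_vector = [1.0, 2.0, 3.0]
page_vector = [2.0, 4.0, 6.0]
print(cosine_similarity(uih_vector, page_vector))
```

Because the cosine ignores vector length, a page whose phrase scores are proportional to the UIH scores gets the maximum similarity regardless of how large the scores are.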
When the set is the top number of web pages, the formulas for precision and recall are:

Precision = Number of "Good" pages retrieved in the set / size of the set

Recall = Number of "Good" pages retrieved in the set / Number of "Good" pages in the set

6. Analysis

Our methods were analyzed with two data sets: the set of pages chosen by users as interesting web pages and the set of pages chosen by users as potentially interesting web pages. We focused more on the set of interesting web pages, because it describes the performance of our methods more directly. We also present and analyze the results for potentially interesting web pages, because they describe users' satisfaction with the retrieved results as well.

6.1. Interesting Web Page

6.1.1. Average and Standard Deviation.

The average and standard deviation (SD) of precision for each method were examined first, because we suspected that the result for top 1 (the initial rank positions) would be less reliable than for top 10, as mentioned in (Croft and Das, 1989). For instance, the result for top 1 would have a higher SD than the result for top 10. The average and standard deviation over search terms were summarized with respect to the top numbers. The detailed values of the average and SD for each top number are shown in Table 6 and Table 7 for Google and WS_Max respectively. Each column shows the average and SD corresponding to a top number. WS_Max is one of the methods that returned the highest results. Google had 0.36 accuracy on average and 0.49 SD for Top 1, and 0.8 accuracy on average and 0.0 SD for Top 10, as shown in Table 6. To improve readability, the results are depicted in graphs. Figure 3 depicts the results from Google and Figure 4 depicts the results from WS_Max. The x-axis shows the top numbers and the y-axis represents the average precision over search terms. The results show that the SD was high in general, but the points from top 1 through 6 generally had higher SD than the remaining points. Both Google and WS_Max showed a similar phenomenon. These results suggest that the initial rank positions of the precision/recall graph are less reliable.
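The precision and recall definitions, and the per-cutoff average and SD across search terms, can be sketched as follows. The rankings here are made-up examples, and two points are assumptions: recall is taken relative to all "Good" pages among the retrieved results for that search term, and the sample standard deviation is used (the text does not say which variant was reported).

```python
from statistics import mean, stdev


def precision_recall_at(labels, n):
    """Precision and recall of "Good" pages within the top n ranked labels."""
    top = labels[:n]
    hits = sum(1 for label in top if label == "Good")
    total_good = sum(1 for label in labels if label == "Good")
    precision = hits / len(top) if top else 0.0
    recall = hits / total_good if total_good else 0.0
    return precision, recall


# Two hypothetical search terms, each with user ratings in ranked order.
rankings = [
    ["Good", "Poor", "Fair", "Good", "Poor"],
    ["Poor", "Good", "Good", "Poor", "Poor"],
]

for n in (1, 4):
    precisions = [precision_recall_at(r, n)[0] for r in rankings]
    print(n, mean(precisions), stdev(precisions))
```

With only one link in the set, each search term contributes a precision of either 0 or 1, which is why the spread at the first rank positions is so much larger than at Top 10.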
6.1.2. Top Links Analysis.

Each method was evaluated based on how many web pages met the user's search purpose. The precision was the accuracy of how many web pages were interesting to users among the top number of web pages returned by each method. The value was the number of "Good" pages retrieved in the set divided by the number of top pages. Since users are usually interested in the links in the top 10 or 20 (Chen and Sycara, 1998), we do not present further results. The first column in Table 1 lists the methods, the second column is the precision in the top 1, the third column is the precision in the top 5, and so on. The values in each cell are the average over the 22 search cases from 11 volunteers. The top values in each column were formatted in bold.

The results showed that Random was the lowest, as we expected. VSM was lower than Google. All our proposed methods were more accurate than Google. WS, which had user-assigned weights, was more accurate than US, which had uniform weights, because WS showed higher performance in all columns. WS_Sum adapted the structure of the UIH, but had lower performance than WS in top 6 and top 20. We assume that the score of a phrase could not have much impact on the score of a web page, because the sum of the scores of the phrases in the clusters is much bigger than the score of a single phrase. WS_Max prevented this problem by using the maximum values instead of the sum. It made some minor improvement, producing the best results in 3 columns and results higher than Google in 4 columns. These results indicated that our methods achieved higher performance than Google.

[Figure 3. Google]

[Figure 4. WS_Max]

Table 1. Precision in Top 1, 5, 6, 10, and 20 on Interesting Web Page

| Method | Top 1 | Top 5 | Top 6 | Top 10 | Top 20 |
|---|---|---|---|---|---|
| Google | 0.36 | 0.34 | 0.333 | 0.77 | 0.70 |
| VSM | 0.18 | 0.5 | 0.65 | 0.73 | 0.43 |
| Random | 0.14 | 0.5 | 0. | 0.05 | 0.09 |
| US | 0.3 | 0.31 | 0.311 | 0.33 | 0.305 |
| WS | 0.36 | 0.34 | 0.356 | 0.314 | 0.309 |
| WS_Sum | 0.36 | 0.34 | 0.341 | 0.33 | 0.300 |
| WS_Max | 0.41 | 0.33 | 0.341 | 0.33 | 0.311 |

We conducted a t-test between Google and WS for Top 10 and Top 20. Those points were more reliable than the earlier points. Since users tend to look more closely at the links in the Top 10 than in the Top 20, the result for Top 10 is more important than the one for Top 20.
There was no statistically significant difference between the Google and WS methods, either for Top 10 with 95% confidence (P=0.59 and t=.0) or for Top 20 with 95% confidence (P=0.51 and t=.0). A t-test between Google and WS_Max for Top 10 and for Top 20 was also conducted. There was no statistically significant difference between the Google and WS_Max methods, either for Top 10 with 95% confidence (P=0.41 and t=.0) or for Top 20 with 95% confidence (P=0.49 and t=.0). Table 2 shows the summary. Since the average accuracy of Google or other search engines is around 55% (Delaney, 2004), it was hard for the results to show any statistically significant differences. We therefore claim that our methods had higher performance than Google on average, even though there was no statistical significance.

Table 2. Statistical significance between our methods and Google

| Method | Top 10 | Top 20 |
|---|---|---|
| Google vs. WS | P=0.59, t=.0 | P=0.51, t=.0 |
| Google vs. WS_MAX | P=0.41, t=.0 | P=0.49, t=.0 |
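For readers who want to reproduce this kind of comparison, the statistic of a paired two-sample t-test can be computed as sketched below. The per-search-term precision values are invented for illustration and are not the experimental data; the function name is hypothetical. The sketch returns only the t statistic, which would then be compared with the critical value for n - 1 degrees of freedom from a t table.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / math.sqrt(n))


# Invented precision values for four search terms under two methods.
method_a = [0.3, 0.5, 0.4, 0.6]
method_b = [0.2, 0.5, 0.3, 0.4]
print(paired_t_statistic(method_a, method_b))
```

Pairing by search term matters here because precision varies far more between search terms than between methods; the paired test looks only at the per-term differences.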
6.1.3. Precision/Recall Analysis.

The precision/recall graph in Figure 5 helps visualize the results shown in Table 1. The x-axis is recall and the y-axis is precision. The line closest to the upper-right corner shows the best performance. Note that the beginning portion is less reliable, since a precision/recall graph has a high standard deviation at the initial points. This graph clearly shows that all our methods had higher performance than Google except in the beginning part (0-0.1 on the x-axis). Both WS and WS_Max drew lines closer to the upper-right corner than US and WS_Sum in general. When a t-test (paired two sample for means) was conducted between Google and WS in order to measure the difference between the two lines, there was no statistically significant difference with 95% confidence (P=0.35 and t=.160). Google and WS_Max did not show a statistically significant difference with 95% confidence either (P=0.1 and t=.08). The data points used were {Top 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 18, and 20}. This was because Google showed higher performance than the other methods at the initial points (Top 2, 4, and 5). However, both methods showed a statistically significant difference after Top 5 with 95% confidence (P=0.00 and t=.31 for Google and WS; P=0.00 and t=.31 for Google and WS_Max). The data points used for the second statistical test were {Top 6, 7, 8, 9, 10, 12, 15, 18, and 20}. It was hard to compare WS and WS_Max. We preferred WS_Max because it showed more competitive performance at top 1, even though the beginning part is less reliable.

[Figure 5. Precision/recall graph for interesting web pages. x-axis: Recall; y-axis: Precision; lines: Google, VSM, Random, US, WS, WS_Sum, WS_Max.]

6.2. Potentially Interesting Web Page

6.2.1. Top Links Analysis

Potentially interesting web pages are those web pages that will be interesting to a user in the future. The area of a user's potential interest often goes beyond the boundary of the search term's specific meaning. Sometimes these unexpected results satisfy users. Therefore, it is also a contribution if our method shows higher performance in finding potentially interesting web pages. Our results in Table 3 show how many potentially interesting web pages each method found. This table is read in the same way as the table for interesting web pages.
The Random method showed the lowest performance. Both Google and VSM showed lower performance than our methods, with lower average accuracy in all 5 columns. We predicted these results, because the UIH built from a bookmark supports the user's potential interests, whereas Google, which uses global interest, has difficulty predicting an individual user's broad potential interests. WS showed higher performance than WS_Sum and WS_Max, with the top performance in 4 columns.

Table 3. Precision in Top 1, 5, 6, 10, and 20 on Potentially Interesting Web Page

| Method | Top 1 | Top 5 | Top 6 | Top 10 | Top 20 |
|---|---|---|---|---|---|
| Google | 0.59 | 0.53 | 0.515 | 0.514 | 0.475 |
| VSM | 0.45 | 0.48 | 0.500 | 0.491 | 0.439 |
| Random | 0.36 | 0.39 | 0.371 | 0.350 | 0.364 |
| US | 0.59 | 0.58 | 0.561 | 0.536 | 0.493 |
| WS | 0.64 | 0.6 | 0.614 | 0.541 | 0.498 |
| WS_Sum | 0.64 | 0.59 | 0.591 | 0.545 | 0.495 |
| WS_Max | 0.64 | 0.59 | 0.606 | 0.545 | 0.489 |

We conducted a t-test (paired two sample for means) between Google and WS for Top 10 and Top 20. There was no statistically significant difference between the two methods, either for Top 10 with 95% confidence (P=0.78 and t=.0) or for Top 20 with 95% confidence (P=0.80 and t=.0). We also conducted a t-test between Google and WS_Max for Top 10 and Top 20. There was no statistically significant difference between the two methods, either for Top 10 with 95% confidence (P=0.74 and t=.0) or for Top 20 with 95% confidence (P=0.88 and t=.0). Even though there was no statistical significance, we claim that our methods had higher performance than Google on average, for the same reason as explained for interesting web pages.
Table 4. Statistical significance between our methods and Google

| Method | Top 10 | Top 20 |
|---|---|---|
| Google vs. WS | P=0.78, t=.0 | P=0.80, t=.0 |
| Google vs. WS_MAX | P=0.74, t=.0 | P=0.88, t=.0 |

6.2.2. Precision/Recall Analysis

The precision/recall graph in Figure 6, which was derived from Table 3, showed similar results. Random had the lowest performance; both Google and VSM showed lower performance than our methods. Our methods were positioned closer to the upper-right corner than Google most of the time, which means our methods had higher performance. When a t-test (paired two sample for means) was conducted between Google and WS, there was a clear statistically significant difference between the two methods with 95% confidence (P=0.00 and t=.16). Google and WS_Max also showed a statistically significant difference with 95% confidence (P=0.00 and t=.16). The 14 data points used were the same as the set for interesting web pages. WS showed higher performance than US, but the other methods are difficult to compare. This result also indicated that our methods performed better than Google.

[Figure 6. Precision/recall graph for potentially interesting web pages. x-axis: Recall; y-axis: Precision; lines: Google, VSM, Random, US, WS, WS_Sum, WS_Max.]

6.3. Indicator for Index Pages

In order to find index pages, we wanted to know which indicator is a good one. We used hyperlink information. Most of the possible indicators are compared: PageSize, HREFnum, InsideWordNum, OutsideWordNum, OutsideWordNum/PageSize, InsideWordNum/PageSize, OutsideWordNum/HREFnum, InsideWordNum/HREFnum, PageSize/HREFnum, OutsideWordNum/InsideWordNum, and HREFnum/PageSize. In Table 5, the first column shows the different indicators, and the second column shows the correlation between the values returned by each indicator and the interest score of a web page. All correlation values were close to 0. However, we believe the small differences tell us at least which indicator is more useful than the others. It is intrinsically very hard to find any strong correlation among them. OutsideWordNum, followed by InsideWordNum/HREFnum, was the best indicator, with correlation values of 0.0606 and -0.0534 respectively. As long as an indicator exhibits a stronger correlation, it does not matter whether the correlation is positive or negative.
Whn w addd ths ndx pag masur (IPM) by ths ndcator as shown n Eq. 11, th prformancs of mthods dd not mprov much. o, w put ths problm of dvsng a scorng functon combnd wth IPM as futur work. 7. Concluson Th purpos of ths rsarch s to dvs a nw mthod of ordrng rtrvd wb sarch rsults n a bst way to ach ndvdual usng UIH. W valuat mthods basd on how many ntrstng wb pags or potntally ntrstng wb pags ach algorthm collcts wthn crtan numbr of top lnks (Bharat and Mhala, 001). Tradtonal Prcson/rcall graphs (van Rjsbrgn, 1979) ar also usd for valuaton. W compard 7 mthods: Googl, Random, U, W, W_um, W_Max, and VM. Googl s th most popular sarch ngn and poss th bst ordrng rsults currntly. Random mthod was chosn to s th mprovd prformanc of Googl and our nw mthods compard to th random cas. VM was chosn bcaus t s on of th most popular nformaton rtrval mthods. W usd th sam words/phrass for our mthods and VM. W usd two data sts: ntrstng wb pags and potntally ntrstng wb pags. Th rsult showd that our 4 mthods mad hghr prformanc than Googl, VM, and Random n both data sts. W could obsrv that both W and W_Max wr mor accurat than U n both data sts and n both masurs: Top tabl and
prcson/rcall graph. W_Max mad th hghst prcson valus of 0.41, 0.33, and 0.31 n Top 1, 10, and 0 rspctvly n ntrstng data st. It also drw th closst ln towards uppr-rght cornr n th prcson/rcall graph n ntrstng wb pags data sts. Howvr, W mad th hghst prcson valus of 0.64, 0.6, 0.614, and 0.498 n Top 1, 5, 6, and 0 n th top tabl for th potntally ntrstng wb pags data st; but t drw smlar lns to W_Max n a prcson/rcall graph. W choos W as th most prfrabl mthod, bcaus t was smplr than W_Max. Ths rsults concludd that our mthods provdd mor accurat ordr than Googl n both data sts. W trd to fnd th bst ndcator for ndx pags and ncorporat t wth our scorng mthods, bcaus usr dos not tnd to prfr ndx pags. Our approach was to us hyprlnk nformaton. Our rsults showd us that th numbr of outsd words was th bst ndcator to dntfy ndx pags out of many othr ndcators t had th hghst corrlaton valu of 0.0606. Evn though th corrlaton valu was clos to 0, t was th hghr than th othr corrlaton valus by th othr ndcators. As mntond abov, w wr usng th whol words xcpt stop words. Ths approach rmovd th procss of slctng mportant/sgnfcant words/phrass unlk othr Informaton rtrval tchnqus (Pazzan t al., 1997). Du to ths advantag w can handl smallr data st and rduc th dangr of lmnatng nw mportant words/phrass. W wll apply ndx pag nformaton to our scorng mthods for mprovng th accuracy of ordrng n th futur. Masurng th prformanc wth th clustrd sarch rsults such as Vvsmo may b dffrnt from Googl s. In a clustrd sarch ngn, a lnk that dos not blong to th top 10 n whol can bcom th top n som clustrs. nc our mthod showd hghr prformanc spcally aftr th top 5, w assum that our mthod may gt hghr prformanc also n th clustrd sarch ngns. W can also xtnd UIH to mor ntractv wb sarch ngn, bcaus UIH has gnral to spcfc ntrsts. For xampl whn th usr qury rsds n an ntrmdat clustr, w can ask a usr to choos mor spcfc ntrsts provdng th phrass n th chld clustrs, or shft to anothr branch n a UIH. 8. 
Acknowldgmnt W apprcat all voluntrs who partcpatd n our xprmnt: Akk, Mchl, Tmmy, Matt crptr, Ayanna, Da-h, Ja-gon, J-hoon, Jun-on, Chrs Tannr, and Grant Bms. 9. Rfrncs 1. (Bharat and Mhala, 001) K. Bharat and G.A. Mhala (001), Whn dxprts agr: usng nonafflatd xprts to rank popular topcs. In Procdngs of th 10th Intl. World Wd Wb Confrnc.. (Brn t al., 1998). Brn, R. Motwan, L. Pag, and T. Wnograd (1998), What can you do wth a wb n your pockt. In Bulltn of th IEEE Computr octy Tchncal Commtt on Data Engnrng. 3. (Cadz t al., 000) I. Cadz, D. Hckrman, C. Mk, P. myth,. Wht (000), Vsualzaton of Navgaton Pattrns on Wb t Usng Modl Basd Clustrng. Tchncal Rport MR-TR-00-18, Mcrosoft Rsarch, Mcrosoft Corporaton, Rdmond, WA. 4. (Chn and ycara, 1998) L. Chn and K. ycara (1998), WbMat: A prsonal agnt for browsng and sarchng, In Proc. of th nd Intl. conf. on Autonomous Agnts, pp.13-139. 5. (Croft and Das, 1989) W.B. Croft and R. Das (1989), Exprmnts wth qury acquston and us n documnt rtrval systms, In Proc. of 13th ACM IGIR. 6. (Croft and Thompson, 1987) Croft, W.B. and Thompson, R.T. (1987), I3R: A nw approach to th dsgn of documnt rtrval systms, Journal of th Amrcal octy for Informaton cnc, 38: 389-404. 7. (Dlany, 004) K.J. Dlany (004), tudy Qustons Whthr Googl Rally Is Bttr, Wall trt Journal. (Eastrn dton). Nw York, May 5, 004. pp. B.1 http://proqust.um.com/pqdwb?rqt=309&vinst=prod&vnam =PQD&VTyp=PQD&sd=5&ndx=45&rchMod=1&Fmt=3&d d=000000641646571&clntid=15106 8. (Googl, 004) Googl co. (004) http://www.googl.com/ 9. (Gravano t al., 1999) L. Gravano, H. Garca-Molna, and A. Tomasc (1999), Gloss: Txt-sourc Dscovry ovr th Intrnt. ACM Transactons on Databas ystms, 4():9-64, Jun. 10. (Grossman t al., 1997) D. Grossman, O. Frdr, D. Holms, and D. Robrts (1997), Intgratng tructurd Data and Txt: A Rlatonal Approach. Journal of th Amrcan octy for Informaton cnc, 48(), Fbruary. 11. (Harpr, 1980) Harpr, D.J. (1980), Rlvanc fdback n documnt rtrval systms: An valuaton of probablstc stratgs. 
Ph.D. Thesis, Computer Laboratory, University of Cambridge.
12. (Haveliwala, 1999) T.H. Haveliwala (1999), Efficient computation of PageRank, Technical Report, Stanford University Database Group. http://dbpubs.stanford.edu/pub/1999-31
13. (Haveliwala, 2002) T.H. Haveliwala (2002), Topic-sensitive PageRank, In Proc. of the 11th Intl. World Wide Web Conference, Honolulu, Hawaii, May 2002.
14. (Jeh and Widom, 2003) G. Jeh and J. Widom (2003), Scaling personalized web search, In Proc. of the 12th Intl. Conference on World Wide Web, Budapest, Hungary, pp. 20-24, May.
15. (Kim and Chan, 2003) H. Kim and P.K. Chan (2003), Learning implicit user interest hierarchy for context in personalization. International Conference on Intelligent User Interfaces, pp. 101-108.
16. (Li et al., 1999) W.S. Li, Q. Vu, D. Agrawal, Y. Hara, and H. Takano (1999), PowerBookmarks: A System for personalizable web information organization, Sharing, and Management, In Proc. of the 8th Intl. World Wide Web Conference, Toronto, Canada.
17. (Liu et al., 2002) F. Liu, C. Yu, and W. Meng (2002), Personalized Web Search by Mapping User Queries to Categories, CIKM 02, ACM Press, Virginia, USA.
18. (Maarek, and Ben-Shaul, 1996) Y.S. Maarek, and I.Z. Ben-Shaul (1996), Automatically organizing bookmarks per contents. In Proceedings of the Fifth International World Wide Web Conference.
19. (Mitchell, 1997) T.M. Mitchell (1997), Machine Learning, New York: McGraw Hill.
20. (Page et al., 1998) L. Page, S. Brin, R. Motwani, and T. Winograd (1998), The PageRank citation ranking: Bringing order to the web, Technical Report, Stanford University Database Group, 1998. http://citeseer.nj.nec.com/368196.html
21. (Pazzani et al., 1997) Pazzani, M., and Billsus, D. (1997), Learning and Revising User Profiles: The Identification of Interesting Web Sites, Machine Learning, 27(3), 313-331, 1997.
22. (Salton and Waldstein, 1978) Salton, G. and Waldstein, R.G. (1978), Term relevance weights in on-line information retrieval, Information Processing and Management, 14, 29-35.
23. (Shahabi and Banaei-Kashani, 2003) C. Shahabi and F. Banaei-Kashani (2003), Efficient and Anonymous Web-Usage Mining for Web Personalization, INFORMS Journal on Computing-Special Issue on Data Mining, 15 (2), Spring.
24. (van Rijsbergen, 1979) C.J. van Rijsbergen (1979), Information Retrieval, Butterworths, London, 68 176.
25. (Vivisimo, 2004) Vivisimo co. (2004) http://www.vivisimo.com
26. (Voorhees, 1986) E.M. Voorhees (1986), Implementing agglomerative hierarchic clustering algorithms for use in document retrieval, Information Processing & Management, 22(6) 465-476.
27. (Zukerman et al., 1999) I. Zukerman, D.W. Albrecht, A.E. Nicholson (1999), Predicting Users' Requests on the WWW. Proc. of the 7th Intl. Conference on User Modeling (UM), Banff, Canada. 275-284.

Table 5. Correlations between MtarchPurpos vs. other methods

Methods                        wmtarchpurpos
vpagz                           0.047
vhrefnum                       -0.0104
vinsdwordnum                   -0.090
voutsdwordnum                   0.0606
voutsdwordnumovrpagz           -0.0051
vinsdwordnumovrpagz            -0.0340
voutsdwordnumovrhrefnum         0.007
vinsdwordnumovrhrefnum         -0.0534
vpagzovrhrefnum                -0.0056
voutsdwordnumovrinsdwordnum    -0.0148
vhrefnumovrpagz                -0.031
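As an illustration of how an indicator comparison like the one in Table 5 could be carried out, the following sketch computes one correlation per candidate indicator against a per-page judgment and reports the indicator with the highest value. It assumes the Pearson coefficient (the text does not state which coefficient was used), and the function names, indicator names, and toy numbers are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def best_indicator(indicators, judgments):
    """Return (name, r) for the indicator with the highest signed correlation.

    Assumption: "highest" is taken as the largest signed value, matching
    how the text singles out 0.0606 among mostly negative values.
    """
    scores = {name: pearson(values, judgments)
              for name, values in indicators.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical toy data: one value per page for each indicator,
# and a 0/1 judgment per page.
toy_indicators = {
    "outside_word_count": [120, 15, 200, 10, 90],
    "href_count": [300, 20, 15, 250, 40],
}
toy_judgments = [1, 0, 1, 0, 1]

if __name__ == "__main__":
    name, r = best_indicator(toy_indicators, toy_judgments)
    print(f"best indicator: {name} (r = {r:.4f})")
```

With real data, each list would hold one measurement per retrieved page, and a value as small as 0.0606 would indicate only a very weak linear relationship.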
Table 6. Google

Top #    Avg.    S.D.
1        0.36    0.49
2        0.45    0.41
3        0.38    0.33
4        0.36    0.28
5        0.34    0.26
6        0.33    0.28
7        0.31    0.23
8        0.30    0.21
9        0.28    0.21
10       0.28    0.20
12       0.28    0.20
15       0.28    0.20
18       0.28    0.19
20       0.27    0.19
30       0.26    0.18
50       0.24    0.17
100      0.20    0.14

Table 7. W_Max

Top #    Avg.    S.D.
1        0.41    0.50
2        0.39    0.41
3        0.39    0.35
4        0.33    0.29
5        0.33    0.26
6        0.34    0.25
7        0.33    0.24
8        0.32    0.23
9        0.34    0.25
10       0.33    0.24
12       0.34    0.23
15       0.32    0.22
18       0.32    0.20
20       0.31    0.20
30       0.30    0.21
50       0.24    0.17
100      0.20    0.14

Appendix 1. Search terms used

Id    Search Terms
1     aerospace
2     aeronautical
3     Caribbean History
4     Free cross-stitch scenic patterns
5     boston pcs
6     complex variables
7     DMC(digital media center)
8     XML Repository
9     bios operating system
10    artificial intelligence
11    ddr memory
12    cpu benchmark
13    sniper rifle
14    military weapons
15    review forum +"scratch remover"
16    windows xp +theme +skin
17    extreme programming principles
18    java design patters
19    woodworking tutorial
20    neural networks tutorial
21    Australia adventure tours
22    Australia ecology