Entropy-Based Link Analysis for Mining Web Informative Structures

Hung-Yu Kao, Shian-Hua Lin *, Jan-Ming Ho *, Ming-Syan Chen

Electrical Engineering Department, National Taiwan University, Taipei, Taiwan, ROC
E-Mail: {bobby@arbor.ee.ntu.edu.tw, mschen@cc.ee.ntu.edu.tw}

* Institute of Information Science, Academia Sinica, Taipei, Taiwan, ROC
E-Mail: {shlin, hoho}@iis.sinica.edu.tw

ABSTRACT

In this paper, we study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (also referred to as TOC, i.e., table of contents, pages) and a set of article pages linked by TOC pages through informative links. The Hyperlink Induced Topics Search (HITS) algorithm has been employed to analyze the authorities and hubs of pages. However, most content sites tend to contain extra hyperlinks, such as navigation panels, advertisements and banners, so as to increase the add-on value of their Web pages. Because of the structure induced by these extra hyperlinks, HITS is found to be insufficient to provide good precision in solving the problem. To remedy this, we develop an algorithm that utilizes entropy-based link analysis to mine Web informative structures. This algorithm is referred to as LAMIS, standing for entropy-based Link Analysis on Mining web Informative Structures. The key idea of LAMIS is to utilize information entropy to represent the knowledge that corresponds to the amount of information in a link or a page in the link analysis. Experiments on several real news Web sites show that the precision and recall of LAMIS are much superior to those obtained by heuristic methods, and that the link analysis techniques derived are very powerful for mining the informative structures of news Web sites. On average, the augmented LAMIS leads to prominent performance improvement and increases the precision by a factor ranging from 33% to 232% when the desired recall falls between 0. and .

Keywords

Informative structure, link analysis, hubs and authorities, anchor text, entropy, information extraction.

1. Introduction

Recently, there has been explosive progress in the development of the World Wide Web.
This progress creates numerous and various information contents published as HTML pages on the Internet. Furthermore, for the purpose of maintenance and scalability of Web sites, most Web contents are migrating from static pages and general text files to dynamic forms that are generated from predefined templates and diverse content retrieved from back-end databases. In general, pages in most commercial Web sites, e.g., search engines, e-commerce stores, news, etc., are dynamically generated in this way. Such Web sites are called systematic Web sites in this paper. A news Web site that generates pages with daily hot news and archives historic news is a typical example of a systematic Web site.

Due to the evolution of automatic generation of Web pages, the number of Web pages grows explosively [6] [2]. However, there is a lot of redundant and irrelevant information in Web sites [], especially in automatically generated pages of systematic Web sites. Examples of redundant and irrelevant information include advertisement banners, browsing menus, catalogs of services, announcements of copyright and privacy policy, and those contents tagged with hyperlinks for easy access to related information. In systematic Web sites, e.g., news Web sites, the use of redundant and irrelevant links makes it convenient and easy for users to browse and extract informative parts using fewer clicks or shortcuts in any page. However, these redundant links increase the difficulty for search engines or text miners to extract useful information exactly, since they are usually designed to index or process everything, including redundant and irrelevant information.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'02, November 4-9, 2002, McLean, Virginia, USA. Copyright 2002 ACM 1-58113-492-4/02/0011 $5.00.
In general, such redundant and irrelevant information is not closely related to the theme of the corresponding pages, thereby making it difficult to retrieve and classify the topics of a content page correctly. Consider the example in Figure 1. We divide the root page of WashingtonPost (http://www.washingtonpost.com, a popular English news Web site) into several parts with different styles and contents, i.e., (1) a banner with links "News", "OnPolitics", "Entertainment", "Live Online", etc. on the top, (2) a menu with 22 links of news categories on the left, (3) a banner with advertisement links, (4) general announcements about WashingtonPost, (5) a block with promoted hot news and advertisements, (6) a TOC block, and (7) a list with headline news. In this case, parts (1) and (2) are distributed among most pages in WashingtonPost and are redundant information for users. However, they are still indexed by search engines. Such indexing increases the index size while being useless for users and harmful to the quality of search results. Parts (3), (4) and (5) are irrelevant to the context of the page and are called irrelevant information. These parts make the topic of the page drift when terms in these parts are indexed. The last two parts, (6) and (7), in fact draw more attention from users and are the primary contents. Users can get news articles with one click from anchors in these two parts. The following example describes their impact in more detail:

Example 1: After searching "game hardware tech jobs" in Google (http://www.google.com), one of the most popular search engines, we found 20 pages of CNET (http://www.cnet.com, a news and product review Web site for computing and technology) in the top-00 results (see footnote 1). However, none of these pages came from the Web pages categorized in CNET Job Seeker (see footnote 2), which contains the desired information. There are matched query terms in redundant parts of pages among all these 20 pages, and three of them do not contain any matched terms in the informative parts of pages. The matched terms in redundant parts of a page increase the rank of the page, even though they are usually ignored by users. Note that six of the 20 pages are ranked as top-6 and the page with the highest ranking does not contain any desirable information in the informative parts.

Figure 1: A sample page of news Web sites

In news Web sites, news article pages are updated daily, and news information agents only need to crawl TOC pages first and find news pages through links in TOC pages with the help of the informative structure. They therefore do not need to crawl and index redundant and irrelevant information daily. In addition to increasing precision and decreasing the cost of search engines, crawlers and information agents, informative structures of Web sites are useful for wrapper generation in information extraction, e.g., WIEN [7], IEPAD [], etc. This is because repeated patterns would be more evident in clustered article pages. From our observations in systematic Web sites, the document structures of pages with nearly equal authority and hub values are analogous to one another. The precision of wrapper generation will be improved after the preprocessing of mining informative structures on pages.

Consequently, with the continuous growth in the number of pages with redundant and tenuous contexts, finding desired information on the Web has become an important and difficult task. Explicitly, extracting informative structures of Web sites to recognize useful pages and links is a crucial issue for increasing precision and decreasing the cost of search engines and information agents.
In this paper, we focus on the mining of news Web sites to demonstrate the problem and the proposed solution in detail. From observing TOC pages and article pages in a news Web site, one may note that TOC pages tend to hold more outlinks than article pages, and also that the non-anchor context of article pages is longer than that of TOC pages. However, these characteristics vary among systematic Web sites and depend on the presentation styles, policies, and types of contents of Web sites. Moreover, the characteristics are blurred in a large context and become less discriminating when redundant/irrelevant information and links are introduced. The works [4][] provide good learning mechanisms to recognize advertisements and redundant and irrelevant links of Web pages. However, these methods need to build the training data first, and related domain knowledge must be included to extract features for the generation of classification rules. Therefore, it is difficult to extract the informative structures of present systematic Web sites by an automatic and heuristic technique.

Specifically, when information characteristics are considered in link graphs, TOC pages in the informative structure of a Web site present the characteristics of good hubs, and article pages are good authorities linked by TOC pages. Therefore, link analysis algorithms, e.g., HITS (Hyperlink Induced Topic Search [6]), provide a reasonable solution to mining the informative structures of news Web sites. Explicitly, based on link analysis, HITS [6] and the PageRank algorithm [4] applied in Google have provided a new ranking technology for Web search engines. The HITS algorithm, based on the mutual reinforcement relationship, provides an innovative methodology for Web searching and topic distillation. According to the definition in [6], a Web page is an authority on a topic if it provides good information, and is a hub if it provides links to good authorities. In recent research work on link analysis of hyperlinked documents, HITS is applied to the research area of topic distillation, and several kinds of link weights are involved to indicate the significance of links in hyperlinked documents.

1 The result is queried from www.google.com on February, 2002.
2 CNET Job Seeker: http://dice.cnet.com/seeker.epl?rel_code&op
In the Clever system [0], weights tuned empirically are added to distinguish same-site links from others. In [3], the similarity between the whole contents of linked documents is applied to link weights, and the use of text surrounding the links as keyword-based evidence to determine a weight for each link is proposed in [9]. Considering the distribution of terms in documents, Chakrabarti et al. [7] combine the TFIDF-weighted model and micro-hubs to represent the significance of anchors in regions with the information needed.

Note that topic distillation differs from informative structure mining in several aspects:

(1) The former distills hubs and authorities for a given query, and the latter mines the informative structure consisting of TOC and article pages on a given Web site.
(2) The base set of topic distillation consists of the root set, which is generated from a subset of query results, and the neighboring nodes of the root set. Informative structure mining targets all pages of a Web site.
(3) Algorithms for topic distillation usually omit intra-links and nepotistic links in order to perform the mutual reinforcement between sites. Such links are important and should be considered as link candidates in the informative structure of a Web site.
(4) Most adaptive topic distillation algorithms based on HITS take the relationship between queries and documents into consideration. However, these algorithms do not work well for mining informative structures in the absence of a target query.

Furthermore, as described in [8][8], link analysis algorithms, e.g., HITS, are vulnerable to the effects of nepotistic clique attacks and Tightly-Knit Communities (TKC). The effects are more significant for mining informative structures of Web sites due to the huge number of nepotistic links and cliques in a Web site.

Consequently, we propose in this paper an approach called Entropy-based Link Analysis on Mining Informative Structure (LAMIS). LAMIS is an automatic informative structure extracting system based on the weighted link analysis of Web pages. When the root URL of a site is given, the system ranks its Web pages according to distinct degrees of hub and authority. However, because a page, especially a systematic page in commercial Web sites, usually carries multiple characteristics of hub, authority and noise, it is in general difficult to discriminate pages according to their hub or authority values when a general link analysis algorithm is applied. One may assign weights to each link of a page to reduce the effects of noise links. In recent research on topic distillation and link analysis, some weighting mechanisms have been proposed [3][7][0]. However, due to the lack of query terms, these weighting mechanisms do not work well when we want to mine the informative structure of a Web site. Therefore, a new weighting mechanism, which considers the significance of links and the information content they carry in a Web site, is needed.

The key idea of LAMIS is to utilize information entropy and weight links with the knowledge that corresponds to the amount of information in a link or a page in the link analysis. Results of experiments on several real news Web sites show that LAMIS significantly outperforms the companion heuristic algorithms and avoids the drawbacks of general link analysis algorithms for mining informative structures. On average, the augmented LAMIS leads to prominent performance improvement and increases the precision by a factor ranging from 33% to 232% when the desired recall falls between 0. and . Benefiting from the improvement in TOC recognition, LAMIS is shown to be able to mine the informative structure more efficiently and precisely.

The remainder of this paper is organized as follows. In Section 2, we describe the basic idea of LAMIS. The entropy of anchor text and the enhanced link analysis are presented in this section. The system design and implementation are described in Section 3. In Section 4, we evaluate the performance of LAMIS by several experiments on Chinese and English news Web sites.
We discuss one of the effects of noise and the augmented feature of LAMIS in this section. The paper concludes with Section 5.

2. Model of Link Analysis

In this section, we describe the details of our approach with an illustrative example. The introduction of entropy and its application to anchor text are described in Section 2.1. The proposed entropy-based link analysis is presented in Section 2.2.

2.1. Entropy of Anchor Text

While users are browsing the Web, the anchor text is an important clue for users to track and search for their desired information. Hence, we extract terms from anchor texts to denote the significance of anchors. In this paper, a term corresponds to a meaningful keyword. The motivation of our approach is that terms distributed in more pages in a Web site usually carry less information to users. In contrast, those appearing in fewer pages carry more information of interest. Hence, we extract terms from anchor texts and use the entropy, which is determined from the probability distribution of terms over the whole document set, to represent the information strength (rich or poor) of terms.

Shannon's information entropy [22] is applied to the term-document matrix, which is generated by the term extraction module, to calculate the entropy. By definition, the entropy E can be expressed as $-\sum_{i=1}^{n} p_i \log p_i$, where $p_i$ is the probability of event $i$ and $n$ is the number of events. By normalizing the weight of a term to be in [0, 1], the entropy of term $T_i$ is:

\[ E(T_i) = -\sum_{j=1}^{n} w_{ij} \log_2 w_{ij}, \qquad (1) \]

in which $w_{ij}$ is the value of the normalized term frequency. $w_{ij}$ is an entry in the term-document matrix that represents the weight of a term in a page, i.e., $w_{ij} = tf_{ij} / \sum_{k} tf_{ik}$, where $tf_{ij}$ is the term frequency of term $i$ in page $j$. To normalize the entropy value of a term to the range [0, 1], the base of the logarithm is chosen to be the number of pages. Equation (1) thus becomes:

\[ E(T_i) = -\sum_{j=1}^{n} w_{ij} \log_n w_{ij}, \quad \text{where } n = |D| \text{ and } D \text{ is the set of pages.} \qquad (2) \]

We then define the entropy of anchor AN as the average entropy of all terms in AN:

\[ E(AN) = \frac{1}{k} \sum_{j=1}^{k} E(T_j), \quad \text{where } T_1, T_2, \ldots, T_k \text{ are the terms in anchor } AN. \qquad (3) \]

Note that if an anchor contains no terms, the entropy of the anchor is assigned to one.
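Equations (2) and (3) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `term_entropy` and `anchor_entropy`, the dictionary layout of the T-D matrix, and the handling of an all-zero row are hypothetical choices, not part of the LAMIS implementation.

```python
import math


def term_entropy(term_freqs, num_pages):
    """Normalized entropy of one term, following Equation (2).

    term_freqs: term frequencies of the term in each page
                (one row of the term-document matrix).
    num_pages:  n = |D|, used as the base of the logarithm.
    """
    total = sum(term_freqs)
    if total == 0 or num_pages < 2:
        # Assumption: an unused term (or a one-page site) gets entropy 0.
        return 0.0
    entropy = 0.0
    for tf in term_freqs:
        if tf > 0:
            w = tf / total                      # normalized term frequency w_ij
            entropy -= w * math.log(w, num_pages)
    return entropy


def anchor_entropy(anchor_terms, td_matrix, num_pages):
    """Average entropy of the terms in an anchor, following Equation (3)."""
    if not anchor_terms:
        return 1.0  # an anchor without terms is assigned entropy one
    total = sum(term_entropy(td_matrix[t], num_pages) for t in anchor_terms)
    return total / len(anchor_terms)


# Hypothetical 5-page site: "sales" occurs once on four pages, "sports" on one.
td = {"sales": [1, 1, 0, 1, 1], "sports": [0, 3, 0, 0, 0]}
print(round(term_entropy(td["sales"], 5), 3))
print(term_entropy(td["sports"], 5))
print(round(anchor_entropy(["sales", "sports"], td, 5), 3))
```

A term spread evenly over many pages approaches entropy one, while a term confined to a single page has entropy zero, which is the behavior the weighting scheme relies on.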
Figure 2: A simple Web site, |D| = 5 (link graph of pages P0 to P4 with their anchors, the adjacency matrix, and the T-D matrix)

Anchor texts in Figure 2:

  P0: AN00 "hot news", AN01 "sales"
  P1: AN10 "sports", AN11 "business", AN12 "sales", AN13 "home"
  P2: AN20 "home"
  P3: AN30 "hot news", AN31 "sales", AN32 "home", AN33 "business news"
  P4: AN40 "hot news", AN41 "sales", AN42 "home", AN43 "sports news"

T-D matrix in Figure 2:

                   P0  P1  P2  P3  P4
  T0 (hot)          1   0   0   1   1
  T1 (news)         1   0   0   2   2
  T2 (sales)        1   1   0   1   1
  T3 (sports)       0   1   0   0   1
  T4 (business)     0   1   0   1   0
  T5 (home)         0   1   1   1   1

Consider, for example, the simple news Web site in Figure 2. Page P0 is the homepage of the Web site. Page P1 is the TOC page with two news anchors linking to P3 and P4, which are news article pages in the Web site. Page P2 is an advertisement page linked by the other four pages. Most pages contain the anchors home, hot news, and sales, linking to pages P0, P1, and P2 respectively. Pages P3 and P4 have cross-reference links to each other as related news. The entropy of the terms in the anchor texts of the Web site can be calculated as below:

\[ E(T_0) = -3 \cdot \tfrac{1}{3} \log_5 \tfrac{1}{3} = 0.682, \]
\[ E(T_1) = -\tfrac{1}{5} \log_5 \tfrac{1}{5} - 2 \cdot \tfrac{2}{5} \log_5 \tfrac{2}{5} = 0.655, \]
\[ E(T_2) = E(T_5) = -4 \cdot \tfrac{1}{4} \log_5 \tfrac{1}{4} = 0.861, \text{ and} \]
\[ E(T_3) = E(T_4) = -2 \cdot \tfrac{1}{2} \log_5 \tfrac{1}{2} = 0.430. \]

From the example in Figure 2, we note that terms uniformly distributed over the documents have entropy values close to one, meaning that these terms provide very little information for users. In general, these terms come from links in navigation panels and advertisement banners, like T1, T2, and T5. Hence, anchors that contain high-entropy terms are deemed less informative. For example, the entropy of anchor AN00 is $(E(T_0) + E(T_1)) / 2 = 0.669$, and anchor AN00 is thus considered a less informative link. In contrast, when a term can be found in only one page, it has the lowest entropy, 0, meaning that users can only find information relevant to this term through the anchors in which it is located. Users are usually more interested in such information of smaller entropy. This in turn means that an anchor with a smaller-entropy term should be assigned a larger weight than one with a larger-entropy term. The entropy values of the anchors in the sample site are listed in Table 1. The most informative links are AN10 and AN11, which link to article pages P3 and P4 from TOC page P1.

Table 1: Entropy values of anchors in Figure 2

  Page  Anchor  Entropy
  P0    AN00    0.669
  P0    AN01    0.861
  P1    AN10    0.430
  P1    AN11    0.430
  P1    AN12    0.861
  P1    AN13    0.861
  P2    AN20    0.861
  P3    AN30    0.669
  P3    AN31    0.861
  P3    AN32    0.861
  P3    AN33    0.543
  P4    AN40    0.669
  P4    AN41    0.861
  P4    AN42    0.861
  P4    AN43    0.543

2.2. Enhanced Link Analysis with Anchor Texts

In the link graph of a Web site G(V, E), the HITS algorithm computes two scores for each node v in V, i.e., the hub score H(v) and the authority score A(v). Initially, the two scores of each node are assigned an equal positive number. Then, HITS iteratively updates the scores as follows until H and A converge:

\[ A(v) = \sum_{(u,v) \in E} H(u) \quad \text{and} \quad H(u) = \sum_{(u,v) \in E} A(v). \qquad (4) \]

Note that the scores are normalized in each iteration. It is proved in [6] that H and A will eventually converge, i.e., termination of HITS is guaranteed.

In our approach, we incorporate the entropy values of anchors as link weights to represent the significance of links in a page. Therefore, Equation (4) is modified as follows:

\[ A(v) = \sum_{(u,v) \in E} H(u) \cdot \alpha_{uv} \quad \text{and} \quad H(u) = \sum_{(u,v) \in E} A(v) \cdot \alpha_{uv}, \qquad (5) \]

where $\alpha_{uv}$ is the weight of the anchor $AN_{uv}$ that links u to v. According to the definition of entropy, $\alpha_{uv}$ is defined as follows:

\[ \alpha_{uv} = 1 - E(AN_{uv}). \qquad (6) \]

Equation (6) means that the more information a link carries, the larger the weight of the link.

When a mutual reinforcement approach, such as HITS, is used to perform link analysis on the pages of a Web site, the TKC effect [8] is more obvious than on the base set of a specific query. This is because cycle links and nepotistic links dominate the set of links in a Web site. As defined in [8], a tightly-knit community is a small but highly interconnected set of sites. Sites in the TKC will score high in link analysis algorithms, even though they are not authoritative on the given topic. In mining the informative structure of a Web site, numerous intra-domain links form several complex TKCs. They unavoidably affect the result of the HITS algorithm significantly. The SALSA algorithm proposed in [8] is designed to resist this effect and will be empirically evaluated in our experimental studies later. The entropy-based SALSA can be described as below:

\[ A(v) = \sum_{(u,v) \in E} \frac{H(u) \cdot \alpha_{uv}}{\text{out-degree}(u)} \quad \text{and} \quad H(u) = \sum_{(u,v) \in E} \frac{A(v) \cdot \alpha_{uv}}{\text{in-degree}(v)}. \qquad (7) \]

The notation out-degree(u) means the number of out-links of node u, and in-degree(u) is the number of links pointing to node u.

We apply Equation (6) to the entropy values of the anchors in Figure 2 to obtain the corresponding weights. Then we use Equations (4) and (5) to calculate the hub and authority values of all pages to show the effect of entropy-based weighting. In Table 2, hub and authority values are obtained after 0 iterations. Page P1 is ranked as the top hub page by entropy-based HITS, and pages P3 and P4 are ranked with the highest authority as well. The result agrees with our expectation. It can also be seen that HITS ranks the advertisement page as the best authoritative page and the news article pages as good hubs.

Table 2: Results of link analysis of HITS and entropy-based HITS

        HITS                 Entropy-based HITS
  Page  Authority  Hub       Authority  Hub
  P0    0.3        0.297     0.229      0.42
  P1    0.49       0.24      0.338      0.76
  P2    0.76       0.60      0.244      0.03
  P3    0.32       0.3       0.622      0.4
  P4    0.32       0.3       0.622      0.4

3. The Design of LAMIS

In this section, the LAMIS system is designed to explore hyperlink information to extract and identify the hub and authority pages. The only parameter of the extraction system is the starting URL of a Web site, and LAMIS requires no manual intervention and no prior knowledge about the Web site. We introduce the system architecture in Section 3.1, and the process of our adaptive link analysis is described in Section 3.2.
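The entropy-weighted update of Equations (5) and (6) in Section 2.2 can be tried out on the sample site of Figure 2 with a short Python sketch. Several things here are assumptions rather than details given in the paper: the link targets are read off the prose description of Figure 2 (for instance, that AN10 points to P3 and AN11 to P4), the entropy of AN33 and AN43 is taken as the average from Equation (3), scores are normalized by their Euclidean norm, and the function name `hits` and the default of 10 iterations are illustrative.

```python
import math

# (source page, target page, anchor entropy) for the sample site of Figure 2.
# Targets are inferred from the textual description and are an assumption.
LINKS = [
    (0, 1, 0.669), (0, 2, 0.861),                                # P0
    (1, 3, 0.430), (1, 4, 0.430), (1, 2, 0.861), (1, 0, 0.861),  # P1 (TOC)
    (2, 0, 0.861),                                               # P2 (ads)
    (3, 1, 0.669), (3, 2, 0.861), (3, 0, 0.861), (3, 4, 0.543),  # P3
    (4, 1, 0.669), (4, 2, 0.861), (4, 0, 0.861), (4, 3, 0.543),  # P4
]


def normalize(scores):
    """Scale a score vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm for s in scores] if norm > 0 else scores


def hits(links, num_nodes, weighted, iterations=10):
    """Plain HITS (Equation 4) or entropy-weighted HITS (Equations 5 and 6)."""
    hub = [1.0] * num_nodes
    auth = [1.0] * num_nodes
    for _ in range(iterations):
        new_auth = [0.0] * num_nodes
        for u, v, entropy in links:
            alpha = (1.0 - entropy) if weighted else 1.0  # Equation (6)
            new_auth[v] += hub[u] * alpha
        auth = normalize(new_auth)
        new_hub = [0.0] * num_nodes
        for u, v, entropy in links:
            alpha = (1.0 - entropy) if weighted else 1.0
            new_hub[u] += auth[v] * alpha
        hub = normalize(new_hub)
    return auth, hub


plain_auth, plain_hub = hits(LINKS, 5, weighted=False)
ent_auth, ent_hub = hits(LINKS, 5, weighted=True)
for page in range(5):
    print(f"P{page}: HITS A={plain_auth[page]:.3f} H={plain_hub[page]:.3f} | "
          f"entropy-based A={ent_auth[page]:.3f} H={ent_hub[page]:.3f}")
```

Under these assumptions the sketch shows the same qualitative ranking the text describes: unweighted HITS gives the advertisement page P2 the highest authority and the article pages the highest hub scores, whereas the entropy-weighted variant ranks P1 as the top hub and P3 and P4 as the top authorities.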
3.1. The System Architecture

Figure 3 shows the three main components of LAMIS: (1) a Web crawler that crawls pages and builds the link graph of a Web site, (2) a feature extractor module that extracts features of pages, including the in-degree and out-degree of pages, the length of the context, and the terms in links, and (3) an entropy-based link analysis module that recognizes TOC pages and builds up the informative structure of a Web site.

In the crawler module, a starting URL, i.e., the root node in the link graph of a Web site, is given and pages are crawled according to the link structure of the Web site. In LAMIS, we can assign different crawl depths to get different views of the Web site. The deeper the crawl depth, the more precise the analysis will be, at the cost of higher computing complexity. From Equation (5), it can be verified that the complexity of link analysis algorithms is O(|E|). In our observation of several systematic Web sites, the informative structures of Web sites are mostly located from depth-1 to depth-2 of the link graphs (the root node is at depth-0). Therefore, without loss of generality, the crawl depth is set to three in our experiments.

Once a page has been crawled, the related features in the page are extracted by the feature extractor module. Then, the anchor texts in all links are parsed to extract meaningful terms, and the associated term frequencies are also counted. Extracting English terms is relatively simple: by applying stemming algorithms and removing stop words based on a stop-list, English keywords (terms) can be extracted [2]. Extracting terms in oriental languages is more difficult because of the lack of separators in these languages. In LAMIS, we use an algorithm that extracts keywords from Chinese sentences based on a Chinese term base, which is generated by collecting hot queries, excluding stop words, from our search engine (see footnote 3).
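For English anchor texts, the term extraction step and the construction of the term-document matrix might look roughly like the following Python sketch. It is a simplification under stated assumptions: the stop-word list is a tiny illustrative one, stemming is omitted, and the names `extract_terms` and `build_td_matrix` are hypothetical. The sample input reuses the anchor texts of Figure 2.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-list; a real system would use a full stop-list
# and a stemming algorithm, both of which are omitted in this sketch.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def extract_terms(anchor_text):
    """Lowercase the anchor text, split it into words, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", anchor_text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_td_matrix(pages):
    """Build the T-D matrix from anchor texts.

    pages: one list of anchor texts per page.
    Returns a dict mapping each term to its per-page term frequencies.
    """
    td = defaultdict(lambda: [0] * len(pages))
    for j, anchors in enumerate(pages):
        for text in anchors:
            for term in extract_terms(text):
                td[term][j] += 1
    return dict(td)


# Anchor texts of pages P0..P4 in the sample site of Figure 2.
site = [
    ["hot news", "sales"],
    ["sports", "business", "sales", "home"],
    ["home"],
    ["hot news", "sales", "home", "business news"],
    ["hot news", "sales", "home", "sports news"],
]
td_matrix = build_td_matrix(site)
for term, freqs in sorted(td_matrix.items()):
    print(term, freqs)
```

Each row of the resulting matrix is the per-page frequency vector of one term, which is the input the entropy computation of Section 2.1 expects.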
Figure 3: Flowchart of the LAMIS system (Crawler Module, Page Archive, Content Block & Feature Extraction Module, Feature Database, and Information Structure Distillation Module)

3 The searching service is a project sponsored by Yam (http://yam.com/). It served Web users from November, 998 to December, 2000.

For each term, we maintain a term-document matrix (abbreviated as T-D matrix) to represent the corresponding term frequency. Once the associated features are extracted and the T-D matrix is constructed, the entropy-based link analysis module ranks pages according to their hub and authority values. TOC pages are selected from the set of high-ranking pages, and then we find the corresponding article pages through the informative links in TOC pages. The extracted sets of nodes and links form the informative structure of the Web site. This in turn means that if we can rank and recognize TOC pages more precisely, we directly get a more accurate informative structure.

3.2. The process of link analysis

In the entropy-based link analysis module, we calculate the hub and authority values of each page according to Equations (5) and (6) after the link weights are assigned. In HITS, hub and authority converge to the principal eigenvector of the link matrix [6]. The weighted HITS algorithm is also proved to converge if the weighting factors are positive [3]. In LAMIS, the weighting factor α is bounded in [0, 1], so the convergence of LAMIS follows. In our experiments, 9.7% of the hub values are zero after 0 iterations on average. Moreover, from our inspection, TOC pages tend to hold high hub values, and we hence use the ranked list to retrieve TOC pages among the whole page set.

4. Experiments and Evaluation

In this section, we describe several experiments conducted on real news Web sites to evaluate the performance improvement attained by LAMIS. The details of the datasets are described in Section 4.1. The improvement of the entropy-based link analysis is presented in Section 4.2.
We discuss one of the noise factors and devise a technique to remedy its effect in Section 4.3. Section 4.4 shows an overall comparison of all experiments.

4.1. Datasets

In our experiments, the datasets (see footnote 4) contain eight Chinese and five English news Web sites, as described in Table 3. All of these news sites provide real-time news and historical news browsing services. News in these Web sites covers several domains, including politics, finance, sports, life, international issues, entertainment, health, cultures, etc., and is updated from time to time. In our experiments, the crawl depth is set to 3. After pages have been crawled, the domain experts (see footnote 5) inspect the content of each page in the Chinese news Web sites and mark the classes of pages to build the answer set of the datasets, according to their previous experience and domain knowledge in the maintenance of the news search engine (NSE) of Yam. As shown in Table 3, the numbers of TOC pages vary among sites, even though the total numbers of pages of some sites are similar. As will be seen later, the diversity of information structures in the datasets in fact indicates the general applicability of LAMIS.

After extracting features from crawled pages, we compute the entropy values of the extracted terms and anchors by Equations (2) and (3). As illustrated in Figure 4, on average 7.6% of links are not informative, i.e., their entropy values are larger than 0.8. As we expect, they are mainly links in navigation panels, advertisements and copyright banners, and are indeed distributed quite uniformly among all pages.

4 Pages of Web sites in the datasets were crawled at 200/2/27 and 2002/4/. The datasets can be retrieved from our research site http://kp06.iis.sinica.edu.tw/isd/index.html.
5 The domain experts are the site managers of Yam News Search Engine (NSE, http://news.yam.com).
Table 3: Datasets and related information

  Site Abbr.   URL of News Web Site      Total pages  TOC pages  Links   Content blocks
  CDN          www.cdn.com.tw            26           2          339     892
  CTIMES       news.chinatimes.com       3747         79         26848   79077
  CNA          www.cna.com.tw            400          33         849     444
  CNET         taiwan.cnet.com           433          78         2844    92
  CTS          www.cts.com.tw            36           3          89      649
  TVBS         www.tvbs.com.tw           740          3          330     937
  TTV          www.ttv.com.tw            86           22         330     4990
  UDN          udnnews.com               4676         22         34882   844
  CNN          www.cnn.com               626          N/A *      2276    643
  WP           www.washingtonpost.com    30           N/A        0367    8203
  LATIMES      www.latimes.com           9            N/A        2069    8720
  CSMONITOR    www.csmonitor.com         368          N/A        3972    4260
  DISPATCH     www.dispatch.com          603          N/A        7       862

*: We only consider the top-20 precision in the experiments on English Web sites. Hence, we do not find all TOC pages in the English Web sites.

Figure 4: Accumulating distribution of link entropy (x-axis: link entropy; y-axis: accumulating distribution (%); one curve each for CDN, CNA, CTS, TTV, CTIMES, CNET, TVBS, and UDN)

4.2. The Improvement of Entropy-based Link Analysis

In order to show the improvement achieved by LAMIS, we construct several experiments under the different criteria listed in Table 4. We evaluate the performance of the experiments by measuring the precision and recall of the hub-ranking list. For each Web site, we examine the precision at the standard recall levels, i.e., 0%, 10%, 20%, ..., 100%. Then, as in [2], we average the precision at each recall level as follows:

\[ \bar{P}(r) = \sum_{i=1}^{N} \frac{P_i(r)}{N}, \]

where $\bar{P}(r)$ is the average precision at recall level r, N is the number of datasets used, and $P_i(r)$ is the precision at recall level r for the i-th dataset. For performance comparison among individual datasets, we use R-Precision and Precision Histograms as single-value summaries [2]. The R-Precision $RP_A(i)$ of algorithm A over dataset i is defined as the precision at the R-th position in the ranking, where R is the total number of answers, and the precision histogram is defined as:

\[ RP_{A/B}(i) = RP_A(i) - RP_B(i). \]

We scale up the precision histogram by a factor calculated from the number of answers in each dataset, to highlight the improvement in the number of retrieved answers. At first, we compare the performance of HITS, SALSA, and LAMIS.
In Table 4, LAMIS can be described as the combination PA-AEN-HITS, meaning entropy-based HITS in the page mode. The notation PA means that the algorithm is operated in the page mode, i.e., each page is treated as a node. Note that we use the notation LN (i.e., link normalization) to indicate the difference between HITS and SALSA, i.e., SALSA is equal to HITS-LN. In Figure 5, we can see that HITS does not work well, since its average R-precision is only 0.27 (shown in Table 5). Moreover, the R-precisions of five datasets are smaller than 0.0, showing that HITS cannot be directly applied to mining the informative structures. However, though the improvements in the precision of entropy-based HITS and entropy-based SALSA over the original ones range from 0. to 0.2, all four of these algorithms perform poorly when the desired recall is larger than 0.7.

It is observed that TOC pages in general have more links than article pages. This observation suggests ranking pages by the number of outlinks on a page, i.e., the out-degree of nodes in the link graph. It is interesting to see from Figure 5 that the simple heuristic OL performs much better than general link analysis algorithms when the desired recall is under 0.7, whereas its precision decreases suddenly when the recall is larger than 0.7. It is noted that some noise effects influence the entropy-based link analysis, and about 0% of TOC pages are hard to discriminate when these four algorithms are applied. In the following section, we describe these noise effects and our solutions to them.

Table 4: The list of components of experiments

  Experiment Abbr.  Description
  OL                rank by number of outlinks in a page
  PA                the page mode
  CB                the content block mode
  HITS              Kleinberg's HITS
  SALSA             the stochastic approach for link-structure analysis
  LN                link normalization
  AEN               weighted by anchor text entropy

Figure 5: The effect of weights on link analysis (precision versus recall for OL, HITS, HITS-LN (SALSA), LAMIS, and LAMIS-LN)

4.3. Augmented Features of LAMIS

On inspection of the experimental results in Figure 5, we find several noise influences on link analysis algorithms during the mining of informative structures. The most influential effect is examined in Section 4.3.1, where we devise a technique to remedy it.

4.3.1. Hybridization and infection of hubs and authorities

When the contexts of pages are more complex, pages may contain more hybrid characteristics of hub and authority. Such a page is called a hybridized page. For example, the leading story page in a news Web site contains informative links to other hot-news pages and is considered a TOC page for these hot-news pages. However, this page is in fact a highly authoritative page as well, because the hottest news is also pointed to by other pages. Due to the influence of the mutual reinforcement relationship, the hybridization of hubs and authorities affects not only the characteristics of hybridized pages but also those of their neighboring pages. The infection blurs the hub and authority values. To address this effect, we propose an adaptive approach, namely the content block mode (CB).

Content Block Mode

In Figure 6, we can see the difference in mutual reinforcement propagation between the page mode and the content block mode. In the page mode, the authority of P2, i.e., A_p2, can affect the value A_p3 through the hub of P1, H_p1. If page P2 is authoritative, A_p3 will also be promoted, even though P3 is not an authoritative page. In the content block mode, we divide page P1 into two parts: one contains a link to page P2, and the other contains a link to page P3. They are treated as separate nodes. Hence, the propagation of the high authority of page P2 is now terminated at CB1, and page P3 is not incorrectly promoted. The block-level hub values also help us to extract the informative part of a page. The work [8] also proposes a fine-grained model based on the Document Object Model (DOM [23]) to perform micro-hub computation. In our approach, blocks in the content block mode are delimited by pre-defined HTML tags, e.g., <table>, rather than by the DOM tree used in [8]. Our experimental results show that this delimitation mechanism performs well in discovering informative content blocks of Web documents.

The effect of the content block mode is shown in Figure 7. We note that LAMIS-LN-CB outperforms the other schemes when the desired recall is larger than 0.6. We can see that the average precision of LAMIS-LN-CB is smaller than that of LAMIS-LN when the recall rate is smaller than 0. This is because LAMIS-LN-CB ranks two article pages of the TVBS and TTV news sites in the top-0 hub ranking.
Because the sizes of the TOC answer sets of these two sites are small, i.e., 3 and 22 respectively, the corresponding precision rates decrease suddenly, which in turn reduces the average precision rate in the low-recall range. The performance of the other two experiments on the content block mode is similar to that in the page mode, because the noise effects of TKC and redundant/irrelevant links conceal the improvement of the content block mode.

4.4. Overall Performance Comparison among All Schemes

We summarize the R-precision values of all experiments in Table 1. The augmented LAMIS (LAMIS-LN-CB), which integrates the content block mode with entropy-based link analysis, attains the best performance. Its average R-precision is 0.75. The augmented LAMIS is in fact ranked first for R-precision in four of the eight Chinese datasets and ranked second in three others, improving the R-precision over general HITS by a factor of 1.78. The improvement of the augmented LAMIS over all Chinese news Web sites can also be seen in Figure 8. LAMIS-based schemes in general outperform the others, and the technique devised in Section 4.3 is powerful in dealing with such effects.

Figure 6: Propagations of mutual reinforcement in different modes (page mode: P1 links to P2 and P3; content block mode: blocks CB1 and CB2 of P1 link to P2 and P3, respectively)

Figure 7: Effects of the content block mode and the page mode (precision versus recall; series include HITS-CB, LAMIS, LAMIS-CB, LAMIS-LN, and LAMIS-LN-CB)

Table 1: R-Precision of all experiments

R-Precision       CDN   CTIMES  CNA   CNET  CTS   TVBS  TTV   UDN   AVG.
outlinks          0.84  0.67    0.36  0.44  0.42  0.77  0.64  0.43  0.57
HITS              0.48  0.63    0.94  0.03  0.03  0.0   0.0   0.02  0.27
HITS-LN (SALSA)   0.92  0.77    0.97  0.42  0.29  0.77  0.36  0.2   0.59
LAMIS             0.88  0.63    0.30  0.33  0.6   0.92  0.0   0.06  0.42
LAMIS-LN          0.96  0.77    0.2   0.    0.32  1.00  0.0   0.3   0.64
CB-HITS           0.48  0.22    0.24  0.46  0.2   0.0   0.36  0.28  0.32
CB-HITS-LN        0.96  0.33    0.30  0.4   0.39  0.69  0.23  0.6   0.49
LAMIS-CB          0.88  0.49    0.2   0.3   0.3   0.92  0.8   0.09  0.39
LAMIS-LN-CB       0.92  0.9     0.97  0.    0.26  1.00  0.86  0.3   0.75

OL: rank by number of outlinks in a page; PA: page mode; CB: content block mode; HITS: Kleinberg's HITS; LN: link normalization; HITS-LN (SALSA): the stochastic approach for link-structure analysis; AEN: weighted by anchor text entropy.

Figure 8: R-Precision improvement of augmented LAMIS (R-precision per dataset, CDN through UDN and AVG.; series include outlinks, HITS-LN (SALSA), and LAMIS-LN-CB)

Figure 9: Top-20 precision of English news Web sites (precision in %; series: LAMIS-LN-CB, outlinks, SALSA)
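The propagation difference between the page mode and the content block mode (Figure 6) can be illustrated with a small sketch. It runs plain HITS power iteration, without the entropy-based link weights of LAMIS, on two toy graphs. The pages Q1 and Q2 (extra pages that point to P2 so that P2 is authoritative) and the node names "P1#CB1" and "P1#CB2" are assumptions made only for this illustration and do not come from the experiments.

```python
import math


def hits(edges, iterations=50):
    """Run plain HITS power iteration over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all nodes linking in.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        # Hub update: sum the new authority scores of all link targets.
        hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        # Normalize both vectors to unit Euclidean length.
        auth_norm = math.sqrt(sum(v * v for v in auth.values()))
        hub_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {node: v / auth_norm for node, v in auth.items()}
        hub = {node: v / hub_norm for node, v in hub.items()}
    return hub, auth


# Page mode: P1 is a single node linking to both P2 and P3.
# Q1 and Q2 are hypothetical pages that make P2 authoritative.
page_edges = [("P1", "P2"), ("P1", "P3"), ("Q1", "P2"), ("Q2", "P2")]

# Content block mode: P1 is split into two block nodes, one per link.
block_edges = [("P1#CB1", "P2"), ("P1#CB2", "P3"), ("Q1", "P2"), ("Q2", "P2")]

page_hub, page_auth = hits(page_edges)
block_hub, block_auth = hits(block_edges)

print("page mode  A_p3 =", round(page_auth["P3"], 4))
print("block mode A_p3 =", round(block_auth["P3"], 4))
```

In this toy graph, P3 keeps a noticeable authority score in the page mode (about 0.38) only because it shares the hub P1 with P2, whereas in the content block mode its score decays toward zero, since the block CB2 no longer inherits hub weight from the link to P2.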
We also conduct several experiments on English news Web sites and compare the top-20 precision rates in Figure 9. Some results are not as good as those for the Chinese news Web sites, though the augmented LAMIS still emerges as the winner. The reason is that the noise effects described in Section 4.3 are prominent in English news Web sites and thus affect the top-20 ranking further. For example, four of the top-10 hub-ranking pages in DISPATCH and CSMONITOR are article pages with local related-news menus, and they appear to receive high hub values due to highly weighted links in the irrelevant block.

5. Conclusions

In this paper, we addressed the problem of mining the informative structure of Web sites and its corresponding issues. We devised an entropy-based link analysis mechanism to mine the informative structures with high precision. Our approach, i.e., LAMIS, incorporates the entropy value of a link, which is in essence the average of the entropy values of the terms in its anchor text, into the coefficients of the authority-hub equations so as to enhance the link analysis for Web structure mining. LAMIS is further augmented to improve the precision and recall of the Web structure mining. LAMIS gives a precise TOC recognition methodology and builds an informative structure from these TOC pages and the article pages that are linked by informative links in the TOC pages. Experimental results on several news Web sites have shown that the augmented LAMIS is capable of mining the informative structures of news Web sites with very good precision. Moreover, based on the experimental studies, we enhanced LAMIS to make it practically useful for mining real news Web sites.

6. References

[1] B. Amento, L. Terveen, and W. Hill. Does Authority Mean Quality? Predicting Expert Quality Ratings of Web Documents. Proc. of 23rd ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000.
[2] R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[3] K. Bharat, M. R. Henzinger. Improved Algorithms for Topic Distillation in a Hyperlinked Environment. Proc. of 21st ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998.
[4] S. Brin, L. Page. The anatomy of a large-scale hypertextual Web search engine. Proc. of 7th World Wide Web Conference, 1998.
[5] A. Broder, S. Glassman, M. Manasse, G. Zweig. Syntactic Clustering of the Web. Proc. of 6th World Wide Web Conference, 1997.
[6] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener. Graph structure in the Web. Proc. of 9th World Wide Web Conference, 2000.
[7] S. Chakrabarti, M. Joshi, V. Tawde. Enhanced Topic Distillation using Text, Markup Tags, and Hyperlinks. Proc. of 24th ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
[8] S. Chakrabarti. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction. Proc. of 10th World Wide Web Conference, 2001.
[9] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, J. M. Kleinberg. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Proc. of 7th World Wide Web Conference, 1998.
[10] S. Chakrabarti, B. Dom, S. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. M. Kleinberg. Mining the Web's link structure. IEEE Computer, 32(8), pages 60-67, August 1999.
[11] C. H. Chang, S. C. Lui. IEPAD: Information Extraction Based on Pattern Discovery. Proc. of 10th World Wide Web Conference, 2001.
[12] M.-S. Chen, J.-S. Park, and P. S. Yu. Efficient Data Mining for Path Traversal Patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2): 209-221, April 1998.
[13] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: towards automatic data extraction from large Web sites. Proc. of 27th International Conference on Very Large Data Bases, 2001.
[14] B. D. Davison. Recognizing Nepotistic Links on the Web. Proc. of AAAI 2000.
[15] N. Kushmerick. Learning to remove Internet advertisements. Proc. of 3rd International Conf. on Autonomous Agents, 1999.
[16] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. ACM-SIAM Symposium on Discrete Algorithms, 1998.
[17] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. Proc. of the 15th International Joint Conference on Artificial Intelligence (IJCAI), 1997.
[18] R. Lempel, S. Moran. The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC effect. Proc. of 9th International World Wide Web Conference, Amsterdam, Netherlands, May 2000.
[19] W. S. Li, N. F. Ayan, O. Kolak, Q. Vu. Constructing Multi-Granular and Topic-Focused Web Site Maps. Proc. of 10th World Wide Web Conference, 2001.
[20] P. Pirolli, J. Pitkow, R. Rao. Silk from a sow's ear: Extracting usable structures from the Web. Proc. of ACM SIGCHI Conference on Human Factors in Computing, 1996.
[21] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison Wesley, 1989.
[22] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:398-403, 1948.
[23] W3C DOM. Document Object Model (DOM). http://www.w3.org/dom/.