Title Language Model for Information Retrieval




Rong Jin, Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Alex G. Hauptmann, Computer Science Department, School of Computer Science, Carnegie Mellon University
ChengXiang Zhai, Language Technologies Institute, School of Computer Science, Carnegie Mellon University

ABSTRACT

In this paper, we propose a new language model, namely, a title language model, for information retrieval. Different from the traditional language model used for retrieval, we define the conditional probability P(Q|D) as the probability of using query Q as the title for document D. We adopt the statistical translation model learned from the title and document pairs in the collection to compute the probability P(Q|D). To avoid the sparse data problem, we propose two new smoothing methods. In experiments with four different TREC document collections, the title language model for information retrieval with the new smoothing methods significantly outperforms both the traditional language model and the vector space model for IR.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval Models - language model; machine learning for IR

General Terms
Algorithms

Keywords
title language model, statistical translation model, smoothing, machine learning

1. INTRODUCTION

Using language models for information retrieval has been studied extensively recently [1,3,7,8,10]. The basic idea is to compute the conditional probability P(Q|D), i.e. the probability of generating a query Q given the observation of a document D. Several different methods have been applied to compute this conditional probability. In most approaches, the computation is conceptually decomposed into two distinct steps: (1) estimating a document language model; (2) computing the query likelihood using the estimated document model based on some query model. For example, Ponte and Croft [8] emphasized the first step, and used several heuristics to smooth the Maximum Likelihood Estimate (MLE) of the document language model, and assumed that the query is generated under a multivariate Bernoulli model. The BBN method [7] emphasized the second step and used a two-state hidden Markov model as the basis for generating queries, which, in effect, is to smooth the MLE with linear interpolation, a strategy also adopted in Hiemstra and Kraaij [3]. In Zhai and Lafferty [11], it has been found that the retrieval performance is affected by both the estimation accuracy of document language models and the appropriate modeling of the query, and a two-stage smoothing method was suggested to explicitly address these two distinct steps.
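
The two-step recipe just described can be made concrete with a minimal Python sketch. It is our own illustration, not the formulation of any of the cited systems, and all function and variable names are hypothetical: it estimates a document language model by maximum likelihood, smooths it by linear interpolation with a collection-wide "general English" model, and scores a query by its log-likelihood under the smoothed model.

    import math
    from collections import Counter

    def collection_model(documents):
        """A 'general English' unigram model estimated from the whole collection."""
        counts = Counter(w for doc in documents for w in doc)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def query_log_likelihood(query, doc, ge_model, lam=0.5):
        """Step 1: MLE document model with linear-interpolation smoothing.
        Step 2: log-likelihood of the query under that smoothed model."""
        doc_counts = Counter(doc)
        doc_len = len(doc)
        score = 0.0
        for qw in query:
            mle = doc_counts[qw] / doc_len if doc_len else 0.0
            p = lam * mle + (1.0 - lam) * ge_model.get(qw, 1e-10)
            score += math.log(p)
        return score

    # Toy usage: documents and queries are lists of tokens.
    docs = [["title", "language", "model"], ["vector", "space", "model"]]
    ge = collection_model(docs)
    print(query_log_likelihood(["language", "model"], docs[0], ge))

Linear interpolation is used here purely for concreteness; the approaches cited above differ mainly in how this smoothing and the query model are chosen.
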
A common deficiency in these approaches is that they all apply an estimated document language model directly to generating queries, but presumably queries and documents should be generated through different stochastic processes, since they have quite different characteristics. Therefore, there exists a gap between a document language model and a query language model. Indeed, such a gap has been well recognized in [4], where separate models are proposed to model queries and documents respectively. The gap has also been recognized in [6], where a document model is estimated based on a query by averaging over document models according to how well they explain the query. In most existing approaches that use the query likelihood for scoring, this gap has been implicitly addressed through smoothing. Indeed, in [11] it has been found that the optimal setting of the smoothing parameters is actually query-dependent, which suggests that smoothing may have helped bridge this gap. Although filling the gap by simple smoothing has been shown to be empirically effective, ideally we should estimate a query language model directly based on the observation of a document, and apply the estimated query language model, instead of the document language model, to generate queries. The question then is, "What evidence do we have for estimating a query language model given a document?" This is a very challenging question, since the information available to us in a typical ad hoc retrieval setting includes no more than a database of documents and queries.

In this paper, we propose to use the titles of documents as the evidence for estimating a query language model for a given document -- essentially, to approximate the query language model given a document by the title language model for that document, which is easier to estimate. The motivation of this work is based on the observation that queries are more like titles than documents in many respects. For example, both titles and queries tend to be very short and concise descriptions of information. The reasoning process in an author's mind when making up the title for a document is similar to what is in a user's mind when formulating a query based on some ideal document -- both would be trying to capture what the document is about. Therefore, it is reasonable to assume that titles and queries are created through a similar generation process. Title information has been exploited previously for improving information retrieval, but, so far, only heuristic methods, such as increasing the weight of title words, have been tried (e.g., [5,10]). Here we use the title information in a more principled way, by treating a title as an observation from a document-title statistical translation model.

Technically, the title language model approach falls into the general source-channel framework proposed in Berger and Lafferty [1], where the difference between a query and a document is explicitly addressed by treating query formulation as a corruption of the ideal document in the information theoretic sense. Conceptually, however, the title language model is different from the synthetic query translation model explored in [1]. The use of synthesized queries provides an interesting way to train a statistical translation model that can address important issues such as synonymy and polysemy, whereas the title language model is meant to directly approximate queries with titles. Moreover, training with the titles poses special difficulties due to data sparseness, which we discuss below.

A document can potentially have many different titles, but the author only provides one title for each document. Thus, if we estimate title language models only based on the observation of the author-given titles, the estimation will suffer severely from the problem of sparse data. The use of a statistical translation model can alleviate this problem. The basic idea is to treat the document-title pairs as translation pairs observed from some translation model that captures the intrinsic document-to-query translation patterns. This means we would train the statistical translation model based on the document-title pairs in the whole collection. Once we have this general translation model in hand, we can estimate the title language model for a particular document by applying the learned translation model to the document. Even if we pool all the document-title pairs together, the training data is still quite sparse given the large number of parameters involved. Since titles are typically much shorter than documents, we would expect that most words in a document never occur in any of the titles in the collection. To address this problem, we extend the standard learning algorithms of the translation models by adding special parameters to model the self-translation probabilities of words. We propose two such techniques: one assumes that all words have the same self-translation probability, and the other assumes that each title has an extra unobserved null word slot that can only be filled by a word generated through self-translation.

The proposed title language model and the two self-translation smoothing methods are evaluated with four different TREC databases. The results show that the title language model approach consistently performs better than both the simple language modeling approach and the Okapi retrieval function. We also observe that the smoothing of the self-translation probabilities has a significant impact on the retrieval performance. Both smoothing methods improve the performance significantly over the non-smoothed version of the title language model. The null-word-based smoothing method consistently performs better than the method of tying the self-translation probabilities.

The rest of the paper is organized as follows: we first present the title language model approach in Section 2, describing the two self-translation smoothing methods.
We then present the experiments and results in Section 3. Section 4 gives the conclusions and future work.

2. A TITLE LANGUAGE MODEL FOR IR

The basic idea of the title language model approach is to estimate the title language model for a document and then to compute the likelihood that the query would have been generated from the estimated model. Therefore, the key issue is how to estimate the title language model for a document based on the observation of a collection of documents. A simple approach would be to estimate the title language model for a document using only the title of that document. However, because of the flexibility in choosing different titles and the fact that each document has only one title given by the author(s), it would be almost impossible to obtain a good estimation of the title language model directly from the titles. Our approach is to exploit statistical translation models to find the title language model based on the observation of a document. More specifically, we use a statistical translation model to convert the language model of a document to the title language model for that document. To accomplish this conversion process, we need to answer two questions:

1. How to estimate such a statistical translation model?

2. How to apply the estimated statistical translation model to convert a document language model to a title language model, and use the estimated title language model to score documents with respect to a query?

Sections 2.1 and 2.2 address these two questions respectively.

2.1 Learning a Statistical Title Translation Model

The key component in a statistical title translation model is the word translation probability P(tw|dw), i.e. the probability of using word tw in the title, given that word dw appears in the document. Once we have the set of word translation probabilities P(tw|dw), we can easily calculate the title language model for a document based on the observation of that document. To learn the set of word translation probabilities, we can take advantage of the document-title pairs in the collection. By viewing documents as samples of a verbose language and titles as samples of a concise language, we can treat each document-title pair as a translation pair, i.e. a pair of texts written in the verbose language and the concise language respectively. Formally, let {<t_i, d_i>, i = 1, 2, ..., N} be the title-document pairs in the collection. According to the standard statistical translation model [2], we can find the optimal model M* by maximizing the probability of generating titles from documents, or

    M^* = \arg\max_M \prod_{i=1}^{N} P(t_i | d_i, M)                                   (1)

Based on the model 1 formulation of the statistical translation model [2], Equation (1) can be expanded as

    M^* = \arg\max_M \prod_{i=1}^{N} \frac{\varepsilon}{(|d_i|+1)^{|t_i|}} \prod_{tw \in t_i} \sum_{dw \in d_i \cup \{\phi\}} P(tw|dw)
        \approx \arg\max_M \prod_{i=1}^{N} \prod_{tw \in t_i} \sum_{dw \in d_i \cup \{\phi\}} P(tw|dw)\, \frac{c(dw, d_i)}{|d_i| + 1}                      (2)

where ε is a constant, φ stands for the null word, |d_i| is the length of document d_i, and c(dw, d_i) is the number of times that word dw appears in document d_i. In the last step of Equation (2), we throw out the constant ε and use the approximation P(dw|d_i) ≈ c(dw, d_i)/(|d_i| + 1). To find the optimal word translation probabilities P(tw|dw, M*), we can use the EM algorithm. The details of the algorithm can be found in the literature on statistical translation models, such as [2]. We call this model model 1 for easy reference.

2.1.1 The problem of under-estimating self-translation probabilities

There is a serious problem with using model 1 described above directly to learn the correlation between the words in documents and titles. In particular, the self-translation probability of a word (i.e., P(w' = w|w)) will be under-estimated significantly. A document can potentially have many different titles, but authors generally only give one title for every document. Because titles are usually much shorter than documents, only an extremely small portion of the words in a document can be expected to actually appear in the title. We measured the vocabulary overlap between titles and documents on three different TREC collections, AP(1988), WSJ(1990-1992) and SJM(1991), and found that, on average, only 5% of the words in a document also appear in its title. This means that most of the document words would never appear in any title, which will result in a zero self-translation probability for most of the words. Therefore, if we follow the learning algorithm for the statistical translation model directly, the following scenario may occur: for some documents, even though they contain every single query word, the probability P(Q|D) can still be very low due to the zero self-translation probability. In the following subsections, we propose two different learning algorithms that can address this problem. As will be shown later, both algorithms improve the retrieval performance significantly over model 1, indicating that the proposed methods for modeling the self-translation probabilities are effective.

2.1.2 Tying self-translation probabilities (Model 2)

One way to avoid the problem of a zero self-translation probability is to tie all the self-translation probabilities P(w' = w|w) with a single parameter P_self. Essentially, we assume that all the self-translation probabilities have approximately the same value, and so can be replaced with a single parameter. Since there are always some title words actually coming from the body of documents, the unified self-translation probability P_self will not be zero. We call the corresponding model Model 2. We can also apply the EM algorithm to estimate all the word translation probabilities, including the smoothing parameter P_self. The updating equations are as follows. Let P(w'|w) and P_self stand for the parameters obtained from the previous iteration, and let P'(w'|w) and P'_self stand for the updated values of the parameters in the current iteration. According to the EM algorithm, the updating equation for the self-translation probability P'_self is

    P'_{self} = \frac{1}{Z_{self}} \sum_{i=1}^{N} \sum_{w \in t_i \wedge w \in d_i} \frac{P_{self}\, P(w|d_i)}{P_{self}\, P(w|d_i) + \sum_{w' \in d_i,\, w' \neq w} P(w|w')\, P(w'|d_i)}                      (3)

where the variable Z_self is the normalization constant and is defined as

    Z_{self} = \sum_{i=1}^{N} \sum_{w \in t_i \wedge w \in d_i} \frac{P_{self}\, P(w|d_i)}{P_{self}\, P(w|d_i) + \sum_{w' \in d_i,\, w' \neq w} P(w|w')\, P(w'|d_i)} + \sum_{i=1}^{N} \sum_{w \in t_i} \sum_{w' \in d_i,\, w' \neq w} \frac{P(w|w')\, P(w'|d_i)}{P_{self}\, P(w|d_i) + \sum_{w'' \in d_i,\, w'' \neq w} P(w|w'')\, P(w''|d_i)}                      (4)
For those non-self-translation probabilities, i.e. P(w'|w) with w' ≠ w, the EM updating equations are identical to the ones used for the standard learning algorithm of a statistical translation model, except that in the normalization equations the self-translation probability should be replaced with P_self, i.e.

    \sum_{w' \neq w} P(w'|w) = 1 - P_{self}                      (5)
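
Before turning to the second smoothing method, the training procedure described so far can be illustrated with the following Python sketch. It runs model-1-style EM over title-document pairs to estimate P(tw|dw) and, when tie_self is set, ties all self-translation probabilities to a single P_self as in Model 2. This is a simplified reading of Equations (2)-(5), not the authors' implementation: the function name is ours, and the P_self update shown is a straightforward pooled-count approximation of Equations (3)-(4).

    from collections import Counter, defaultdict

    NULL = "<null>"  # stands for the empty word phi

    def train_title_translation_model(pairs, iterations=5, tie_self=False):
        """Model-1-style EM over (title_tokens, document_tokens) pairs.
        Returns (trans, p_self), where trans[dw][tw] approximates P(tw|dw).
        With tie_self=True, all self-translation probabilities are tied to a
        single P_self (Model 2)."""
        title_vocab = {tw for title, _ in pairs for tw in title}
        uniform = 1.0 / len(title_vocab)
        trans = {}              # empty table => uniform P(tw|dw) in the first round
        p_self = uniform

        def t_prob(dw, tw):
            if tie_self and dw == tw:
                return p_self
            return trans.get(dw, {}).get(tw, uniform)

        for _ in range(iterations):
            counts = defaultdict(Counter)           # expected counts c(tw, dw)
            for title, doc in pairs:
                doc_counts = Counter(doc)
                norm_len = len(doc) + 1             # |d| + 1, leaving room for the null word
                for tw in title:
                    # Posterior over which document word (or the null word) generated tw,
                    # with P(dw|d) = c(dw, d) / (|d| + 1) as in Equation (2).
                    post = {NULL: t_prob(NULL, tw) / norm_len}
                    for dw, c in doc_counts.items():
                        post[dw] = t_prob(dw, tw) * c / norm_len
                    z = sum(post.values())
                    for dw, v in post.items():
                        counts[dw][tw] += v / z
            if tie_self:
                # Pooled approximation of Equations (3)-(4): expected self-translation
                # mass divided by the total expected translation mass.
                self_mass = sum(c[dw] for dw, c in counts.items())
                total_mass = sum(sum(c.values()) for c in counts.values())
                p_self = self_mass / total_mass
            trans = {}
            for dw, c in counts.items():
                if tie_self and dw != NULL:
                    other = sum(v for tw, v in c.items() if tw != dw)
                    # Non-self entries share the remaining 1 - P_self mass (Equation (5)).
                    trans[dw] = {tw: (1.0 - p_self) * v / other
                                 for tw, v in c.items() if tw != dw and other > 0}
                    trans[dw][dw] = p_self      # keep the tied value in the table too
                else:
                    total = sum(c.values())
                    trans[dw] = {tw: v / total for tw, v in c.items()}
        return trans, (p_self if tie_self else None)

A call such as train_title_translation_model(pairs, tie_self=True) yields the translation table that Section 2.2 turns into a retrieval score.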

2.1.3 Adding a Null Title Word Slot (Model 3)

One problem with tying all the self-translation probabilities for different words to a single unified self-translation probability is that we lose some information about the relative importance of words. Specifically, those words with a higher probability in the titles should have a higher self-translation probability than those with a lower probability in the titles. Tying them would cause under-estimation of the former and over-estimation of the latter. As a result, the self-translation probability may be less than the translation probability for other words, which is not desirable. In this subsection, we propose a better smoothing model that is able to discriminate the self-translation probabilities for different document words. It is based on the idea of introducing an extra NULL word slot in the title. An interesting property of this model is that the self-translation probability is guaranteed to be no less than the translation probability for any other word, i.e. P(w|w) ≥ P(w'|w) for any w' ≠ w. We call this model Model 3.

Titles are typically very short and therefore only provide us with very limited data. Now suppose we had sampled more title words from the title language model of a given document; what kinds of words would we expect to have seen? Given no other information, it would be reasonable to assume that we will more likely observe a word that occurs in the document. To capture this intuition, we assume that there is an extra NULL, unobserved, word slot in each title that can only be filled in by self-translating a word in the body of the document. Let e_t stand for the extra word slot in the title t. With the count of this extra word slot, the standard statistical translation model between the document d and the title t is modified as

    P(t|d, M) = P(e_t|d, M) \prod_{tw \in t} P(tw|d, M)
              = \Big( \sum_{dw \in d} P(dw|dw, M)\, P(dw|d) \Big) \prod_{tw \in t} \Big( \frac{P(tw|\phi, M)}{|d| + 1} + \sum_{dw \in d} P(tw|dw, M)\, P(dw|d) \Big)                      (6)

To find the optimal statistical translation model, we still maximize the translation probability from documents to titles. Substituting the document-title translation probability P(t|d, M) with Equation (6), the optimization goal (Equation (1)) can be written as

    M^* = \arg\max_M \prod_{i=1}^{N} \Big( \sum_{dw \in d_i} P(dw|dw, M)\, P(dw|d_i) \Big) \prod_{tw \in t_i} \Big( \frac{P(tw|\phi, M)}{|d_i| + 1} + \sum_{dw \in d_i} P(tw|dw, M)\, P(dw|d_i) \Big)                      (7)

Because the extra word slot in every title provides a chance for any word in the document to appear in the title through the self-translation process, it is not difficult to prove that this model will ensure that the self-translation probability P(w|w) is no less than P(w'|w) for any word w'. The EM algorithm can again be applied to maximize Equation (7) and learn the word translation probabilities. The updating equations for the word translation probabilities are essentially the same as those used for the standard learning algorithm for statistical translation models, except for the inclusion of the extra counts due to the null word slot.

2.2 Computing Document Query Similarity

In this section, we discuss how to apply the learned statistical translation model to find the title language model for a document and use the estimated title language model to compute the relevance value of a document with respect to a query. To accomplish this, we define the conditional probability P(Q|D) as the probability of using query Q as the title for document D, or, equivalently, the probability of translating document D into query Q using the statistical title translation model, which is given below:

    P(Q|D, M) = \prod_{qw \in Q} \varepsilon \Big( \frac{P(qw|\phi, M)}{|D| + 1} + \sum_{dw \in D} P(qw|dw, M)\, P(dw|D) \Big)                      (8)

As can be seen from Equation (8), the document language model P(dw|D) is not directly used to compute the probability of a query term. Instead, it is converted into a title language model through the word translation probabilities P(qw|dw). Such a conversion also happens in the model proposed in [1], but there the translation model is meant to capture synonymy and polysemy relations, and is trained with synthetic queries. Similar to the traditional language modeling approach, to deal with the query words that cannot be generated from the title language model, we need to do further smoothing, i.e.

    P(Q|D, M) = \prod_{qw \in Q} \Big( \lambda\, \varepsilon \big( \frac{P(qw|\phi, M)}{|D| + 1} + \sum_{dw \in D} P(qw|dw, M)\, P(dw|D) \big) + (1 - \lambda)\, P(qw|GE) \Big)                      (8')

where the constant λ is the smoothing constant and P(qw|GE) is the general English language model, which can be easily estimated from the collection [11]. In our experiments, we set the smoothing constant λ to 0.5 for all different models and all different collections. Equation (8') is the general formula that can be used to score a document with respect to a query with any specific translation model. A different translation model would thus result in a different retrieval formula. In the next section, we will compare the retrieval performance using different statistical title translation models, including Model 1, Model 2 and Model 3.
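
To connect Equation (8') to the training sketch given after Section 2.1.2, a document can be scored as follows. This is again our own simplified reading (the constant ε is dropped, since it does not affect ranking, and λ = 0.5 as in the experiments); trans is a translation table of the form produced by that sketch, and ge_model is a general English unigram model such as the one in the introduction's sketch.

    import math
    from collections import Counter

    NULL = "<null>"  # same sentinel used for the empty word in the training sketch

    def title_lm_score(query, doc, trans, ge_model, lam=0.5):
        """log P(Q|D, M) in the spirit of Equation (8'): the query is scored as a
        candidate title for the document, then interpolated with P(qw|GE)."""
        doc_counts = Counter(doc)
        norm_len = len(doc) + 1
        score = 0.0
        for qw in query:
            # Title language model probability of qw given the document:
            # null-word contribution plus translation from every document word.
            p_title = trans.get(NULL, {}).get(qw, 0.0) / norm_len
            for dw, c in doc_counts.items():
                p_title += trans.get(dw, {}).get(qw, 0.0) * c / norm_len
            # Small floor on P(qw|GE) keeps the logarithm defined for unseen words.
            p = lam * p_title + (1.0 - lam) * ge_model.get(qw, 1e-10)
            score += math.log(p)
        return score

Documents are then ranked by this score for each query, which is all that the experiments in Section 3 require of the model.
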
3. EXPERIMENT

3.1 Experiment Design

The goal of our experiments is to answer the following three questions:

1. Will the title language model be effective for information retrieval? To answer this question, we compare the performance of the title language model with that of state-of-the-art information retrieval methods, including the Okapi method and the traditional language model for information retrieval.

2. How general is the trained statistical title translation model? Can a model estimated on one collection be applied to another? To answer this question, we conduct an experiment that applies the statistical title translation model learned from one collection to other collections. We then compare the performance of using a foreign translation model with that of using no translation model.

3. How important is the smoothing of self-translation in the title language model approach for information retrieval? To answer this question, we can compare the results for title language model 1 with those for model 2 and model 3.

We used three different TREC testing collections for evaluation: AP88 (Associated Press, 1988), WSJ90-92 (Wall Street Journal from 1990 to 1992) and SJM (San Jose Mercury News, 1991). We used TREC4 queries (201-250) and their relevance judgments for evaluation. The average length of the titles in these collections is four to five words. The different characteristics of the three databases allow us to check the robustness of our models.

3.1.1 Baseline Methods

The two baseline methods are the Okapi method [9] and the traditional language modeling approach. The exact formula for the Okapi method is shown in Equation (9):

    Sim(Q, D) = \sum_{qw \in Q} \log\Big( \frac{N - df(qw) + 0.5}{df(qw) + 0.5} \Big) \cdot \frac{tf(qw, D)}{0.5 + 1.5\, \frac{|D|}{avg\_dl} + tf(qw, D)}                      (9)

where tf(qw, D) is the term frequency of word qw in document D, df(qw) is the document frequency of the word qw, N is the total number of documents in the collection, and avg_dl is the average document length over all the documents in the collection. The exact equation used for the traditional language modeling approach is shown in Equation (10):

    P(Q|D) = \prod_{qw \in Q} \big( (1 - \lambda)\, P(qw|GE) + \lambda\, P(qw|D) \big)                      (10)

The constant λ is the smoothing constant (similar to the λ in Equation (8')), and P(qw|GE) is the general English language model estimated from the collection. To make the comparison fair, the smoothing constant for the traditional language model is set to 0.5, which is the same as for the title language model.
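
For reference, Equation (9) can be sketched directly; the traditional language-model baseline of Equation (10) is the interpolated query-likelihood already sketched in the introduction. The function below is our own illustration of the formula as written above, not of the full Okapi system, and df, num_docs and avg_dl are collection statistics the caller must supply.

    import math
    from collections import Counter

    def okapi_score(query, doc, df, num_docs, avg_dl):
        """Okapi-style score of Equation (9).
        df: word -> document frequency in the collection;
        num_docs: total number of documents N; avg_dl: average document length."""
        doc_tf = Counter(doc)
        score = 0.0
        for qw in query:
            tf = doc_tf.get(qw, 0)
            if tf == 0 or qw not in df:
                continue
            idf = math.log((num_docs - df[qw] + 0.5) / (df[qw] + 0.5))
            score += idf * tf / (0.5 + 1.5 * len(doc) / avg_dl + tf)
        return score
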
3.2 Experiment Results

The results on AP88, WSJ and SJM are shown in Table 1, Table 2, and Table 3, respectively. In each table, we include the precisions at different recall points and the average precision. Several interesting observations can be made on these results.

Table 1: Results for the AP88 collection. LM stands for the traditional language model, Okapi stands for the Okapi formula, and Model 1, Model 2 and Model 3 stand for title language models 1, 2 and 3.

    Recall      LM       Okapi    Model 1   Model 2   Model 3
    0.1         0.4398   0.4798   0.206     0.4885    0.5062
    0.2         0.3490   0.3789   0.409     0.4082    0.4024
    0.3         0.3035   0.3286   0.54      0.347     0.3572
    0.4         0.2492   0.2889   0.0680    0.2830    0.333
    0.5         0.24     0.2352   0.0525    0.2399    0.2668
    0.6         0.689    0.20     0.0277    0.856     0.207
    0.7         0.369    0.596    0.074     0.460     0.742
    0.8         0.08     0.0833   0.074     0.0897    0.84
    0.9         0.067    0.06     0.05      0.065     0.0738
    1.0         0.0580   0.0582   0.05      0.068     0.0639
    Avg. Prec.  0.2238   0.2463   0.208     0.256     0.2677

First, let us compare the results between the different title language models, namely model 1, model 2 and model 3. As seen from Tables 1, 2 and 3, for all three collections, model 1 is inferior to model 2, which is inferior to model 3, in terms of both average precision and the precisions at different recall points. In particular, on the WSJ collection, title language model 1 performs extremely poorly compared with the other two methods. This result indicates that title language model 1 may fail to find relevant documents in some cases due to the problem of a zero self-translation probability, as discussed in Section 2. Indeed, we computed the percentage of title words that cannot be found in their documents. This number is 25% for the AP88 collection, 34% for the SJM collection and 45% for the WSJ collection. This high percentage of missing title words strongly suggests that the smoothing of the self-translation probability will be critical. Indeed, for the WSJ collection, which has the highest percentage of missing title words, title language model 1, without any smoothing of the self-translation probability, degrades the performance more dramatically than for the collections AP88 and SJM, where more title words can be found in the documents and the smoothing of the self-translation probability is not as critical.

Table 2: Results for the WSJ collection. LM stands for the traditional language model, Okapi stands for the Okapi formula, and Model 1, Model 2 and Model 3 stand for title language models 1, 2 and 3.

    Recall      LM       Okapi    Model 1   Model 2   Model 3
    0.1         0.4308   0.4539   0.206     0.4055    0.427
    0.2         0.3587   0.3546   0.409     0.3449    0.368
    0.3         0.272    0.2724   0.54      0.2674    0.2878
    0.4         0.2272   0.87     0.0680    0.2305    0.2432
    0.5         0.82     0.265    0.0525    0.723     0.874
    0.6         0.33     0.0840   0.0277    0.72      0.369
    0.7         0.0525   0.0308   0.074     0.0764    0.0652
    0.8         0.0328   0.028    0.074     0.0528    0.0465
    0.9         0.053    0.006    0.05      0.0350    0.0204
    1.0         0.053    0.006    0.05      0.032     0.0204
    Avg. Prec.  0.844    0.79     0.076     0.85      0.950

Table 3: Results for the SJM collection. LM stands for the traditional language model, Okapi stands for the Okapi formula, and Model 1, Model 2 and Model 3 stand for title language models 1, 2 and 3.

    Recall      LM       Okapi    Model 1   Model 2   Model 3
    0.1         0.4009   0.4054   0.4226    0.4249    0.4339
    0.2         0.3345   0.3232   0.328     0.3650    0.3638
    0.3         0.283    0.2348   0.272     0.2890    0.309
    0.4         0.2076   0.692    0.99      0.2236    0.2296
    0.5         0.85     0.378    0.670     0.874     0.99
    0.6         0.046    0.0986   0.095     0.393     0.43
    0.7         0.086    0.057    0.0782    0.0862    0.0974
    0.8         0.0460   0.032    0.0688    0.059     0.0788
    0.9         0.0375   0.032    0.0524    0.0386    0.0456
    1.0         0.0375   0.032    0.0524    0.0386    0.0456
    Avg. Prec.  0.845    0.727    0.90      0.983     0.208

The second dimension of comparison is between the title language models and the traditional language model. As already pointed out by Berger and Lafferty [1], the traditional language model can be viewed as a special case of the translation language model, i.e. one in which all the translation probabilities P(w'|w) become delta functions δ(w', w). Therefore, the comparison along this dimension can indicate whether the translation probabilities learned from the correlation between titles and documents are effective in improving retrieval accuracy. As seen from Table 1, Table 2, and Table 3, title language model 3 performs significantly better than the traditional language model over all three collections in terms of all the performance measures. Thus, we can conclude that the translation probabilities learned from title-document pairs appear to be helpful for finding relevant documents.

Lastly, we can also compare the performance of the title language model approach with the Okapi method [9]. For all three collections, title language model 3 outperforms Okapi significantly in terms of all the performance measures, except in one case -- the precision at 0.1 recall on the WSJ collection is slightly worse than both the traditional language model approach and Okapi.

To test the generality of the estimated translation model, we applied the statistical title translation model learned from the AP88 collection to the AP90 collection. We hypothesize that, if two collections are similar, the statistical title translation model learned from one collection should be able to give a good approximation of the correlation between documents and titles of the other collection. Therefore, it would make sense to apply the translation model learned from one collection to another similar collection.

Table 4: Results for AP90. LM stands for the traditional language model, Okapi stands for the Okapi formula, and Model 3 stands for title language model 3. Different from the previous experiments, in which the translation model is learned from the retrieved collection itself, this experiment applies the translation model learned from AP88 to retrieve relevant documents in the AP90 collection.

    Recall      LM       Okapi    Model 3
    0.1         0.4775   0.495    0.537
    0.2         0.48     0.4308   0.4454
    0.3         0.324    0.3374   0.3628
    0.4         0.2700   0.2894   0.3248
    0.5         0.2280   0.2567   0.2665
    0.6         0.733    0.223    0.2222
    0.7         0.294    0.230    0.372
    0.8         0.099    0.0969   0.36
    0.9         0.0782   0.0659   0.0963
    1.0         0.064    0.0550   0.0733
    Avg. Prec.  0.24     0.25     0.277

Table 4 gives the results of applying the translation model learned from AP88 to AP90. Since title language model 3 already demonstrated its superiority to model 1 and model 2, we only considered model 3 in this experiment. From Table 4, we see that title generation model 3 outperforms the traditional language model and the Okapi method significantly in terms of all measures. We also applied the statistical title translation model learned from AP88 to WSJ to further examine the generality of the model and our learning method. This time, the performance of title language model 3 with the statistical title translation model learned from AP88 is only about the same as that of the traditional language model and the Okapi method on the WSJ collection. Since the statistical title translation model learned from AP88 can be expected to be a much better approximation of the correlation between documents and titles for AP90 than for WSJ, these results suggest that applying the translation model learned from a foreign database is helpful only when the foreign database is similar to the native one. But it is interesting to note that it has never resulted in any degradation of performance.

4. CONCLUSIONS

Bridging the gap between a query language model and a document language model is an important issue when applying language models to information retrieval.
In this paper, we propose bridging this gap by exploiting document titles to estimate a title language model, which can be regarded as an approximate query language model. The essence of our work is to approximate the query language model for a document with the title language model for that document. Operationally, we first estimate a statistical translation model by using all the document-title pairs in a collection. The translation model can then be used to convert a regular document language model to a title language model. Finally, the title language model estimated for each document is used to compute the query likelihood. Intuitively, the scoring is based on the likelihood that the query could have been a title for the document. Based on the experiment results, we can draw the following conclusions.

Based on the comparison between the title language models, the traditional language model and the Okapi method, we can conclude that the title language model for information retrieval is an effective retrieval method. In all our experiments, the title language model gives better performance than both the traditional language model and the Okapi method.

Based on the comparison between the three different title language models for information retrieval, we can conclude that title generation models 2 and 3 are superior to model 1, and model 3 is superior to model 2. Since the difference between the three title language models is in how they handle the self-translation probability, we can conclude that, first, it is crucial to smooth the self-translation probability to avoid a zero self-translation probability. Second, a better smoothing method for the self-translation probability can improve the performance. The results show that adding an extra null word slot to the title is a reasonable smoothing method for the self-translation probabilities.

The success of applying the title language model learned from AP88 to AP90 appears to indicate that, when two collections are similar, the correlation between documents and titles in one collection also tends to be similar to that in the other. Therefore, it would seem to be appropriate to apply the statistical title translation model learned from one collection to the retrieval task of another similar collection. Even if the collections are not similar, applying a learned statistical title translation model from a foreign database does not seem to degrade the performance either. Thus, the statistical title translation model learned from title-document pairs may be used as a general resource that can be applied to the retrieval tasks of different collections.

There are several directions for future work. First, it would be interesting to see how the style or quality of titles would affect the effectiveness of our model. One possibility is to use collections where the quality of titles has high variance (e.g., Web data). Second, we have assumed that queries and titles are similar, but there may be queries (e.g., long and verbose queries) that are quite different from titles. So, it would be interesting to further evaluate the robustness of our model by using many different types of queries. Finally, using title information is only one way to bridge the query-document gap; it would be very interesting to further explore other effective methods that can generate an appropriate query language model for a document.

5. ACKNOWLEDGEMENTS

We thank Jamie Callan, Yiming Yang, Luo Si, and the anonymous reviewers for their helpful comments on this work. This material is based in part on work supported by the National Science Foundation under Cooperative Agreement No. IRI-9817496. Partial support for this work was provided by the National Science Foundation's National Science, Mathematics, Engineering, and Technology Education Digital Library Program under grant DUE-0085834. This work was also supported in part by the Advanced Research and Development Activity (ARDA) under contract number MDA908-00-C-0037. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or ARDA.

6. REFERENCES

[1] A. Berger and J. Lafferty (1999). Information retrieval as statistical translation. In Proceedings of SIGIR 1999, pp. 222-229.
[2] P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), pp. 263-311.
[3] D. Hiemstra and W. Kraaij (1999). Twenty-One at TREC-7: Ad-hoc and cross-language track. In Proceedings of the Seventh Text REtrieval Conference (TREC-7), NIST Special Publication 500-242, pp. 227-238.
[4] J. Lafferty and C. Zhai (2001). Document language models, query models, and risk minimization for information retrieval. In Proceedings of SIGIR 2001, pp. 111-119.
[5] A. M. Lam-Adesina and G. J. F. Jones (2001). Applying summarization techniques for term selection in relevance feedback. In Proceedings of SIGIR 2001, pp. 1-9.
[6] V. Lavrenko and W. B. Croft (2001). Relevance-based language models. In Proceedings of SIGIR 2001, pp. 120-127.
[7] D. Miller, T. Leek, and R. M. Schwartz (1999). A hidden Markov model information retrieval system. In Proceedings of SIGIR 1999, pp. 214-221.
[8] J. Ponte and W. B. Croft (1998). A language modeling approach to information retrieval. In Proceedings of SIGIR 1998, pp. 275-281.
[9] S. E. Robertson et al. (1993). Okapi at TREC-4. In The Fourth Text REtrieval Conference (TREC-4).
[10] E. Voorhees and D. Harman (eds.) (1996). The Fifth Text REtrieval Conference (TREC-5), NIST Special Publication 500-238.
[11] C. Zhai and J. Lafferty (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR 2001, pp. 334-342.