Using Content-Based Filtering for Recommendation 1

Robin van Meteren 1 and Maarten van Someren 2

1 NetlinQ Group, Gerard Brandtstraat 26-28, 1054 JK, Amsterdam, The Netherlands, robin@netlinq.nl
2 University of Amsterdam, Roeterstraat 18, The Netherlands, marten@swi.psy.uva.nl

Abstract

Finding information on a large web site can be a difficult and time-consuming process. Recommender systems can help users find information by providing them with personalized suggestions. This paper describes the recommender system PRES, which uses content-based filtering techniques to suggest small articles about home improvements. Such a domain implies that the user model has to be very dynamic and learned from positive feedback only. The relevance feedback method seems to be a good candidate for learning such a user model, as it is both efficient and dynamic.

1 Introduction

As the World Wide Web continues to grow at an exponential rate, the size and complexity of many web sites grow along with it. For the users of these web sites it becomes increasingly difficult and time consuming to find the information they are looking for. To help users find the information that matches their interests, a web site can be personalized. Recommender systems can improve a web site for individual users by dynamically adding hyperlinks. This paper introduces the recommender system PRES (an acronym for Personalized Recommender System). PRES creates dynamic hyperlinks for a web site that contains a collection of advice about do-it-yourself home improvement. The purpose of these dynamic hyperlinks is to make it easier for a user to find interesting items and thus to improve the interaction between the system and the user.

When users browse through a web site they are usually looking for items they find interesting. Interest items can consist of a number of things. For example, textual information can be considered an interest item, or an index on a certain topic could be the item a user is looking for. Another example, applicable to a web vendor, is to consider purchased products as interest items.
Whatever the items consist of, a web site can be seen as a collection of these interest items. Every large collection needs a certain structure to make it easy for visitors to find what they are looking for. A web site can be structured by dividing its web pages into content pages and navigation pages. The content pages provide the user with the interest items, while the navigation pages help the user to search for the interest items. This is not a strict classification, however. Pages can also be hybrid in the sense that they provide both content and navigation facilities. Furthermore, what is a navigation page for one user may be a content page for another, and vice versa. In general, however, this classification provides a way of describing the structure of a web site and how this structure can be improved for individual users by dynamically adding hyperlinks.

1 This research has been supported by NetlinQ.

Figure 1 Structure of sample web site (legend: content page, navigation page, hybrid page, hypertext link)

Figure 1 shows an example of a web site with a typical tree structure. The content pages are found at the bottom of the tree, while the navigation pages are found at the top. A recommender system can display its recommendations by dynamically creating hypertext links to content pages that contain the items a user might be interested in. Several factors determine whether or not a recommended page should be linked to the page that is shown to the user. Sometimes content pages are only recommended if they contain items that are similar to the item(s) shown on the current page. Another consideration for dynamic linking is the proximity to the recommended page. The distance between two pages is the minimal number of links it takes to navigate from one page to the other. There is little use in linking the current page to a recommended page if the distance between the two pages is 1, because a direct link already exists. The greater the distance, the more useful a dynamically created link becomes.

2 Content-based filtering

Recommender systems are a special type of information filtering system. Information filtering deals with the delivery of items, selected from a large collection, that the user is likely to find interesting or useful, and can be seen as a classification task. Based on training data a user model is induced that enables the filtering system to classify unseen items into a positive class $c^+$ (relevant to the user) or a negative class $c^-$ (irrelevant to the user). The training set consists of the items that the user found interesting. These items form training instances that all have an attribute. This attribute specifies the class of the item, based on either the rating of the user or on implicit evidence.

Formally, an item is described as a vector $X = (x_1, x_2, \ldots, x_n)$ of $n$ components. The components can have binary, nominal or numerical attributes and are derived from either the content of the items or from information about the user's preferences.
The task of the learning method is to select, based on a training set of $m$ input vectors, a function $h(X)$ that can classify any item in the collection. The function $h(X)$ will either classify an unseen item as positive or negative directly by returning a binary value, or return a numerical value. In the latter case a threshold can be used to determine whether the item is relevant or irrelevant to the user.
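The thresholding step can be illustrated with a minimal sketch. It assumes, purely for illustration, that the numerical $h(X)$ is a weighted sum of the item's components; the weights, the threshold value and the function names are hypothetical and are not taken from PRES.

```python
from typing import Sequence


def score(item: Sequence[float], weights: Sequence[float]) -> float:
    """Numerical h(X): a weighted sum of the components (an assumed form)."""
    return sum(x * w for x, w in zip(item, weights))


def is_relevant(item: Sequence[float], weights: Sequence[float],
                threshold: float) -> bool:
    """Turn the numerical score into a binary relevant/irrelevant decision."""
    return score(item, weights) >= threshold


weights = [0.5, 0.0, 1.0]  # hypothetical learned weights
threshold = 0.6            # hypothetical relevance threshold

print(is_relevant([1, 1, 0], weights, threshold))  # score 0.5 -> False
print(is_relevant([1, 0, 1], weights, threshold))  # score 1.5 -> True
```

Whether the comparison is strict or inclusive at the threshold is a design choice; the sketch treats a score equal to the threshold as relevant.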

A content-based filtering system selects items based on the correlation between the content of the items and the user's preferences, as opposed to a collaborative filtering system, which chooses items based on the correlation between people with similar preferences. PRES is a content-based filtering system. It makes recommendations by comparing a user profile with the content of each document in the collection. The content of a document can be represented with a set of terms. Terms are extracted from documents by running through a number of parsing steps. First, all HTML tags and stop words (words that occur very often and cannot be used as discriminators) are removed. The remaining words are reduced to their stem by removing prefixes and suffixes [Porter 1980]. For instance, the words computer, computers and computing could all be reduced to comput. The user profile is represented with the same terms and built up by analyzing the content of documents that the user found interesting. Which documents the user found interesting can be determined by using either explicit or implicit feedback. Explicit feedback requires the user to evaluate examined documents on a scale. With implicit feedback the user's interests are inferred by observing the user's actions, which is more convenient for the user but more difficult to implement.

There are several ways in which terms can be represented in order to be used as a basis for the learning component. A representation method that is often used is the vector space model. In the vector space model a document $D$ is represented as an $m$-dimensional vector, where each dimension corresponds to a distinct term and $m$ is the total number of terms used in the collection of documents. The document vector is written as $D = (w_1, \ldots, w_m)$, where $w_i$ is the weight of term $t_i$ that indicates its importance. If document $D$ does not contain term $t_i$ then weight $w_i$ is zero. Term weights can be determined by using the tf-idf scheme.
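The parsing steps described above (tag removal, stop-word removal, stemming) can be sketched as follows. This is a minimal illustration, not the PRES parser: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a toy suffix stripper that stands in for the Porter algorithm, which applies a much richer set of rules.

```python
import re
from typing import List

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "on", "with"}

# Toy suffix list, checked in order. This is NOT the Porter algorithm.
SUFFIXES = ("ing", "ers", "er", "s")


def strip_html(html: str) -> str:
    """Replace every HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", html)


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(html: str) -> List[str]:
    """Tags out, lowercase, stop words out, remaining words stemmed."""
    words = re.findall(r"[a-z]+", strip_html(html).lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


print([crude_stem(w) for w in ("computer", "computers", "computing")])
# ['comput', 'comput', 'comput']
print(extract_terms("<p>Painting the <b>walls</b> and ceilings</p>"))
# ['paint', 'wall', 'ceiling']
```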
In the tf-idf scheme, terms are assigned a weight that is based on how often a term appears in a particular document and how frequently it occurs in the entire document collection:

\[ w_i = \mathit{tf}_i \cdot \log\left(\frac{n}{\mathit{df}_i}\right) \]

where $\mathit{tf}_i$ is the number of occurrences of term $t_i$ in document $D$, $n$ is the total number of documents in the collection and $\mathit{df}_i$ is the number of documents in which term $t_i$ appears at least once. The assumptions behind tf-idf are based on two characteristics of text documents. First, the more times a term appears in a document, the more relevant it is to the topic of the document. Second, the more times a term occurs in all documents in the collection, the more poorly it discriminates between documents.

In the vector space model user profiles can be represented just like documents, by one or more profile vectors. The degree of similarity between a document vector $D$ and a profile vector $P$, where $P = (u_1, \ldots, u_m)$, can be determined by using the cosine measure:

\[ \mathit{sim}(D, P) = \frac{D \cdot P}{|D|\,|P|} = \frac{\sum_{i} u_i w_i}{\sqrt{\sum_{i} u_i^2 \cdot \sum_{i} w_i^2}} \]
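To make the two formulas concrete, the following sketch computes tf-idf weights for a toy collection and compares sparse vectors with the cosine measure. The dictionary-based vector representation, the function names and the sample terms are assumptions made for illustration. The natural logarithm is used because the text does not specify a base.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]  # sparse vector: term -> weight, absent terms are 0


def tfidf_vectors(docs: List[List[str]]) -> List[Vector]:
    """Compute w_i = tf_i * log(n / df_i) for every term of every document."""
    n = len(docs)
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # count each term once per document
    vectors = []
    for terms in docs:
        tf = Counter(terms)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(d: Vector, p: Vector) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * p.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_p = math.sqrt(sum(u * u for u in p.values()))
    if norm_d == 0.0 or norm_p == 0.0:
        return 0.0
    return dot / (norm_d * norm_p)


docs = [
    ["paint", "wall", "paint"],
    ["wall", "tile"],
    ["tile", "grout"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])                      # "paint" outweighs the more common "wall"
print(cosine(vectors[0], vectors[1]))  # share "wall", so similarity is positive
print(cosine(vectors[0], vectors[2]))  # no shared terms, so similarity is 0.0
```

Note that a term occurring in every document receives weight zero, since $\log(n/n) = 0$, which matches the second assumption behind tf-idf.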

3 PRES

Figure 2 The PRES architecture (components: document collection, recommender, web page, user profile and user; labelled flows: interaction, feedback)

Figure 2 shows the PRES architecture. A user profile is learned from feedback provided by the user. The recommender system compares the user profile with the documents in the collection. The documents are then ranked on the basis of criteria such as similarity, novelty, proximity and relevancy, and the best-ranked documents appear as hyperlinks on the current web page.

3.1 Domain characteristics

As stated earlier, PRES recommends documents that consist of textual information about do-it-yourself home improvements. The most important aspect of this type of information is that a certain topic is only interesting to a user for a short period of time. Once an improvement has been carried out or enough information has been provided, the user will lose interest in that topic. As a consequence the user model that PRES learns has to be very dynamic.

A domain in which users' interests change quickly also has implications for how the feedback is acquired. Explicit feedback is not a good choice for two reasons: users would have to rate items frequently, and they would be very reluctant to do so because their ratings expire quickly. Acquiring feedback by observing the user's actions is therefore preferable. A user who selects a document and reads it for a certain amount of time provides a strong indication that the document contains information the user is interested in. After such an action the document is therefore classified as a positive example.

Finding negative examples is much more problematic. Users ignoring links to documents could be seen as a clue. This does not provide strong evidence, however, as users might not have noticed the link or might visit the document later. Another possible clue is a document that is read for a very short time, but this could be because the document is similar to what the user has already seen, even though the topic of the document is still interesting to the user.
Classifying documents as negative based on weak assumptions leads to much noise in the training data and may result in inaccurate predictions. Negative examples are therefore not used. In short, the user model has to be dynamic and learned from positive examples only.
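One way to derive positive examples from observed behavior is sketched below. The text only says that a document counts as positive after the user has read it for "a certain amount of time"; the 30-second threshold, the `PageView` record and the function name are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageView:
    doc_id: str
    seconds_read: float


# Hypothetical minimum reading time; the paper does not state a value.
MIN_READING_SECONDS = 30.0


def positive_examples(views: List[PageView],
                      min_seconds: float = MIN_READING_SECONDS) -> List[str]:
    """Return the documents read long enough to count as positive examples.

    Short visits are simply ignored: they are NOT labelled negative,
    because a short visit is only weak evidence of disinterest.
    """
    return [v.doc_id for v in views if v.seconds_read >= min_seconds]


session = [
    PageView("painting-walls", 95.0),
    PageView("laying-tiles", 4.0),    # skimmed: no label at all
    PageView("choosing-brushes", 48.0),
]
print(positive_examples(session))  # ['painting-walls', 'choosing-brushes']
```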

3.2 Learning algorithm

There exist a variety of machine learning methods that can be used for learning a user model. PRES employs the relevance feedback method because it is both efficient and dynamic. Relevance feedback was introduced by Rocchio as an information retrieval technique that implements retrieval in several passes, but it can also be applied to information filtering. Documents and profiles are represented as vectors in the vector space model. The term weights of a document are calculated using the standard tf-idf scheme. The profile consists of one vector that represents a topic of interest.

In several information filtering systems that use relevance feedback the profile consists of more than one vector. For example, WebMate [Chen & Sycara 1998], a personal agent that helps users browse the web, uses clustering to maintain several profile vectors that each represent a different topic. The problem with using more than one vector is that it takes a while before a vector represents a topic accurately enough. A document about a certain topic will therefore not always be assigned to a profile vector about the same topic. Because PRES operates in an environment in which the topic interest of users changes constantly, only one vector is used. This vector is initially empty and is adjusted as the user navigates through the web site. When a user has spent a certain amount of time reading a document $D$, the user profile $P$ is updated with the following equation:

\[ P' = \alpha P + \beta D \]

Note that if the document or the profile does not contain a term, the corresponding weight is zero. Terms in the profile that have very low weights are removed after the profile vector has been updated. Weight β determines the relative importance of a document to the user. In information filtering systems that rely on explicit feedback, β usually corresponds to the rating a user has given to a document and can be either a positive or a negative value. Systems that use implicit feedback may use different values for β for different kinds of feedback.
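The update rule and the pruning of low-weight terms can be sketched with sparse dictionaries, as below. The concrete values of `alpha` and `min_weight`, the function name and the sample documents are assumptions for illustration only; the text does not report the values used in PRES.

```python
from typing import Dict

Vector = Dict[str, float]  # sparse vector: term -> weight, absent terms are 0


def update_profile(profile: Vector, document: Vector, alpha: float,
                   beta: float = 1.0, min_weight: float = 0.01) -> Vector:
    """Apply P' = alpha * P + beta * D, then drop terms with very low weight.

    alpha scales the existing profile, beta scales the document that was
    just read, and min_weight is a hypothetical pruning cutoff.
    """
    terms = set(profile) | set(document)
    updated = {
        t: alpha * profile.get(t, 0.0) + beta * document.get(t, 0.0)
        for t in terms
    }
    return {t: w for t, w in updated.items() if abs(w) >= min_weight}


profile: Vector = {}  # the profile starts out empty
profile = update_profile(profile, {"paint": 2.0, "wall": 0.5}, alpha=0.8)
profile = update_profile(profile, {"tile": 1.0}, alpha=0.8)
print(sorted(profile.items()))
# "paint" and "wall" have been scaled by alpha once; "tile" enters at full weight
```

With `alpha` below 1, the weight of a term that stops appearing in newly read documents shrinks geometrically, so old topics fade and are eventually pruned.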
In the recommender system Slider [Balabanovic 1998], for example, β is set to -3 when a user has deleted an article and to 0.5 when a user has read an article. Because PRES only determines whether or not a document is relevant, weight β is always set to 1. The profile vector is adjusted to the diminishing of the user's interests by α, a weight between 0 and 1 that reduces the term weights in the profile. This weight is determined via experimentation.

Besides the profile vector, a user profile also contains information about which pages the user has visited and the current page. All this information is necessary to generate recommendations that appear as hyperlinks on web pages. Several factors can be considered in determining which documents should be suggested to the user:

- The similarity between a document vector and the profile vector can be calculated using several similarity measurements.
- The novelty of a document is determined by the existence of information in a document that is new to the user.
- The proximity of a document is determined by the minimal number of links it takes to navigate from the current page to a page that presents the document. Mobasher et al. [Mobasher et al. 1999b], for example, take the log of the number of links as a measurement of distance.
- Some recommender systems also check if a document is relevant to the information shown on the current page.
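The proximity factor can be illustrated with a breadth-first search over the link graph, which finds the minimal number of links between two pages. This is a sketch under stated assumptions: links are treated as directed, the site is given as a hypothetical adjacency dictionary, and `log_distance` uses the natural logarithm, since the text does not say which base is taken.

```python
import math
from collections import deque
from typing import Dict, List, Optional


def link_distance(links: Dict[str, List[str]], start: str,
                  target: str) -> Optional[int]:
    """Minimal number of links from start to target, or None if unreachable."""
    if start == target:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in links.get(page, []):
            if nxt == target:
                return dist + 1
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def log_distance(links: Dict[str, List[str]], start: str,
                 target: str) -> Optional[float]:
    """Log of the link count; undefined (None) for the same or unreachable page."""
    d = link_distance(links, start, target)
    if d is None or d == 0:
        return None
    return math.log(d)


# Hypothetical tree-shaped site: navigation pages link down to content pages.
site = {
    "home": ["painting", "flooring"],
    "painting": ["walls", "ceilings"],
    "flooring": ["tiles"],
}
print(link_distance(site, "home", "tiles"))    # 2
print(link_distance(site, "walls", "tiles"))   # None (no outgoing links)
print(log_distance(site, "home", "painting"))  # 0.0
```

A page at distance 1 gets a log distance of 0, which fits the observation that a dynamic link adds little when a direct link already exists.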

In PRES the similarity between the profile vector and a document is determined by using the cosine measure. Documents that have already been visited by the user or already appear as hyperlinks on the current page are filtered out. The remaining documents are then sorted by rank. Two different strategies for determining which of these documents appear as hyperlinks were examined. In the first approach only a fixed number of top-ranked documents are suggested. In the second approach documents are suggested if their score is above a relevance threshold [Yan & Garcia-Molina 1994].

3.3 Implementation

PRES has been largely written in the object-oriented programming language Java. Java is often used to make applets that run on client machines. Client-side Java has several limitations, however. In order to run applets the client machine has to support Java. Although this is a standard feature in most browsers, many of them also have the option to disable applets. It also takes a considerable amount of time to load the applet on the client machine and for the applet to request or receive data from the server machine. PRES does not make use of applets but runs entirely on the server side by using Java servlets and Java server pages (JSP). A servlet is a Java class that expands the functionality of a server. Because servlets operate solely within the domain of a server they do not have the limitations that applets impose. Java server pages are files that contain both HTML and embedded Java code. JSP files are compiled to servlets by a servlet engine that also runs the servlets. PRES uses the servlet engine of JRun, which functions as a plug-in to an existing server. This means that the servlets run independently from the web server. The advantage of an add-on servlet engine is that if the server changes, the same servlet engine can still be used and hence no alterations have to be made to the servlet code. In many servers that support servlets a request for a JSP file is handled by a special servlet.
Usually this servlet can be replaced by another servlet or by a sequence of servlets. The request is then sent to the first servlet in the chain, while the response from the last servlet is sent back to the browser. In between, the output of each servlet is passed to the next servlet. This technique is called servlet chaining and can be used to place a special servlet behind the servlet that normally handles JSP files, so that it is called upon every time a user requests a JSP file. This servlet carries out the relevance feedback method by updating the user profile with the terms of the JSP file. It also replaces a predefined tag that could be included in the JSP file with personalized recommendations.

The architecture of the PRES implementation is shown in Figure 3. The database contains two different types of data: information about users and information about web pages. The information about users is obtained online, either from the users themselves or by observing their behavior. The information about web pages is gathered offline, largely by the parser component, which is run every time new pages are added to the web site. It analyzes the collection of web pages, extracts terms and calculates their tf-idf weights, which are then stored in the database.

The profiler and membership components both react to user requests. The membership component enables the user to become a member by creating an account. Information that the users provide, such as a unique user name and a password, is stored in the database. A user needs to be a member in order to receive personalized recommendations. The profiler component keeps track of the pages that members visit. It also carries out the relevance feedback method for every member. All this information is stored in the user profile, which is initially empty for new members. The recommender component tries to find relevant web pages by comparing the user profile with the pages that were indexed. Recommendations are presented to the user by inserting hyperlinks into the next JSP file that the user requests.

Figure 3 Architecture of the PRES implementation (components: database holding the user profile and page index, profiler, recommender, membership, parser; inputs: feedback, JSP files)

4 Performance

To evaluate PRES, three fictitious users were created, each with different interests. Two topics about home improvement were assigned to the first user, three to the second user and four to the third user. For each topic, documents in the collection were classified as relevant if they contained information that was associated with the topic. The percentage of documents that the users selected per topic also differed. The first user selected 80% of all the relevant documents about the two topics, the second user 50% and the third user 30%. The training sessions were then processed just like a normal user session would have been.

Figure 4 shows the results of this experiment. From left to right, the precision (relevant retrieved / retrieved) and recall (relevant retrieved / relevant) at different points in time are shown for the first, second and third training session respectively. The precision ratios differ significantly for different topics of interest. This is probably caused by the fact that documents about several topics contain many different terms, which makes it difficult for the learning method to find a relation between these documents. Precision ratios also differ at different points in time for the same topics of interest. As more documents about a certain topic are selected it becomes easier for a learning method to make recommendations, as it is provided with more similar terms. On the other hand, as more documents are selected the number of remaining documents that are relevant decreases, and it therefore also becomes more difficult for the learning method to find other relevant documents. This last factor will have little effect in the case of the third user session, as only 30% of the total number of documents about a topic are selected. In this case, precision usually increases over time. The exact effects, however, remain difficult to measure, as other factors, such as a page change, also have an influence on these ratios.

Figure 4 Precision and recall for the three training sessions (vertical axis from 0 to 1)

5 Related work

Many of the content-based filtering techniques that were described are borrowed from the field of information retrieval. Content-based filtering, however, differs from information retrieval in the manner in which the interests of a user are represented. Instead of using a query, an information filtering system tries to model the user's long-term interests.

There are several other systems that use content-based filtering to help users find information on the World Wide Web. Letizia [Lieberman 1995] is a user interface that assists users browsing the web. The system tracks the browsing behavior of a user and tries to anticipate what pages may be of interest to the user. Syskill & Webert [Pazzani et al. 1996] is a software agent that tries to determine which web pages might interest a user by using a naive Bayesian classifier. A user provides training instances by rating explored pages as either hot or cold. Jennings and Higuchi [Jennings & Higuchi 1992] describe a neural network that models the interests of a user in a Usenet news environment. The neural network is formed and modified as a result of the articles a user has read or rejected.

The work presented in this paper differs from these systems by focusing on single web sites. Information filtering systems that personalize web sites often use a collaborative approach to filtering. Amazon.com, for example, uses the GroupLens system [Resnick et al. 1994] to make recommendations about books and videos. Mobasher et al. [Mobasher et al. 1999b] describe several mining techniques for personalization. User access logs are examined to discover clusters of users that exhibit similar browsing behavior. These clusters can be used to predict the browsing behavior of individual users.

6 Conclusions

The test results seem to indicate that, on average, slightly more than one out of two of the suggestions that PRES makes is relevant. The results are negatively influenced by the fact that the same concept can usually be described with several terms and many terms have more than one meaning. This makes the user profile less accurate, especially because the documents in the collection are relatively short and normally only a few documents about the same topic are selected by the user. Better results might be obtained by improving the vector space model.

What content-based filtering systems cannot do, however, is make predictions about the future interests of users. Collaborative filtering systems can make suggestions to a user that are outside the scope of previously selected items. The effectiveness of PRES could therefore be further improved if content-based and collaborative filtering were combined.

The menu structure also has a great influence on the effectiveness of PRES. Because it is not useful to recommend items that already appear in the current menu, it becomes more difficult to make recommendations if the user has selected a menu that already contains most of the relevant items. The average precision of the recommendations that PRES makes drops about 15% if the menu is taken into account. So in general, the better the web site is organized, the harder it will be to make recommendations. Most web sites, especially larger ones, however, can never be perfectly optimized for all users. Users have different interests, and personalizing a web site could help them find information faster than they otherwise would have, or find information they would not have found at all.

References

[Armstrong et al. 1995] Armstrong, R., Freitag, D., Joachims, T. and Mitchell, T. (1995). WebWatcher: A Learning Apprentice for the World Wide Web. In Proceedings of AAAI Spring Symposium on Information Gathering, Stanford, CA.
[Atsumi 1997] Atsumi, M. (1997). Extraction of User's Interests from Web Pages based on Genetic Algorithm. IPSJ SIG Vol. 97 No. 51, pp. 13-18.
[Balabanovic 1998] Balabanovic, M. (1998). An Interface for Learning Multi-topic User Profiles from Implicit Feedback. AAAI-98 Workshop on Recommender Systems, Madison, Wisconsin.
[Balabanovic & Shoham 1995] Balabanovic, M. and Shoham, Y. (1995). Learning Information Retrieval Agents: Experiments with Automated Web Browsing. In Working Notes of the AAAI Spring Symposium Series, pp. 13-18. Stanford University.
[Chen & Sycara 1998] Chen, L. and Sycara, K. (1998). WebMate: A Personal Agent for Browsing and Searching. In Proceedings of the 2nd International Conference on Autonomous Agents and Multi Agent Systems, Minneapolis, MN.
[Jennings & Higuchi 1992] Jennings, A. and Higuchi, H. (1992). A Personal News Service based on a User Model Neural Network. IEICE Transactions on Information and Systems, Vol. E75-D, No. 2, pp. 198-209.
[Lieberman 1995] Lieberman, H. (1995). Letizia: An Agent that Assists Web Browsing. In Proceedings of the 1995 International Joint Conference on Artificial Intelligence, Montreal, Canada.
[Mobasher et al. 1999a] Mobasher, B., Cooley, R. and Srivastava, J. (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems, Vol. 1, No. 1.
[Mobasher et al. 1999b] Mobasher, B., Cooley, R. and Srivastava, J. (1999). Automatic Personalization Based on Web Usage Mining. Technical Report TR99-010, Department of Computer Science, DePaul University.
[Nichols 1997] Nichols, D. M. (1997). Implicit Rating and Filtering. In Proceedings of the Fifth DELOS Workshop on Filtering and Collaborative Filtering, pp. 31-36. Budapest, Hungary.

[Nilsson 1996] Nilsson, N. J. (1996). Introduction to Machine Learning. Unpublished draft, Department of Computer Science, Stanford University.
[Pazzani et al. 1996] Pazzani, M., Muramatsu, J. and Billsus, D. (1996). Syskill & Webert: Identifying Interesting Web Sites. In Proceedings of the Thirteenth National Conference on Artificial Intelligence AAAI 96, pp. 54-61.
[Porter 1980] Porter, M. (1980). An Algorithm for Suffix Stripping. Program, 14(3), pp. 130-137.
[Resnick et al. 1994] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P. and Riedl, J. (1994). GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of ACM 1994 Conference on Computer Supported Cooperative Work, Chapel Hill, NC, pp. 175-186.
[Schwab & Pohl 1999] Schwab, I. and Pohl, W. (1999). Learning User Profiles from Positive Examples. In Proceedings of the International Conference on Machine Learning & Applications, pp. 15-20. Chania, Greece.
[Yan & Garcia-Molina 1994] Yan, T. W. and Garcia-Molina, H. (1994). Index Structures for Information Filtering Under the Vector Space Model. In Proceedings of the International Conference on Data Engineering.