10 Content-based Recommendation Systems

Michael J. Pazzani (1) and Daniel Billsus (2)

(1) Rutgers University, ASB III, 3 Rutgers Plaza, New Brunswick, NJ 08901. pazzani@rutgers.edu
(2) FX Palo Alto Laboratory, Inc., 3400 Hillview Ave, Bldg. 4, Palo Alto, CA 94304. billsus@fxpal.com

Abstract. This chapter discusses content-based recommendation systems, i.e., systems that recommend an item to a user based upon a description of the item and a profile of the user's interests. Content-based recommendation systems may be used in a variety of domains, ranging from web pages and news articles to restaurants, television programs, and items for sale. Although the details of various systems differ, content-based recommendation systems share in common a means for describing the items that may be recommended, a means for creating a profile of the user that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend. The profile is often created and updated automatically in response to feedback on the desirability of items that have been presented to the user.

10.1 Introduction

A common scenario for modern recommendation systems is a Web application with which a user interacts. Typically, a system presents a summary list of items to a user, and the user selects among the items to receive more details on an item or to interact with the item in some way. For example, online news sites present web pages with headlines (and occasionally story summaries) and allow the user to select a headline to read a story. E-commerce sites often present a page with a list of individual products and then allow the user to see more details about a selected product and purchase the product. Although the web server transmits HTML and the user sees a web page, the web server typically has a database of items and dynamically constructs web pages with a list of items. Because there are often many more items available in a database than would easily fit on a web page, it is necessary to select a subset of items to display to the user or to determine an order in which to display the items.
Content-based recommendation systems analyze item descriptions to identify items that are of particular interest to the user. Because the details of recommendation systems differ based on the representation of items, this chapter first discusses alternative item representations. Next, recommendation algorithms suited for each representation are discussed. The chapter concludes with a discussion of variants of the approaches, the strengths and weaknesses of content-based recommendation systems, and directions for future research and development.

10.1.1 Item Representation

Items that can be recommended to the user are often stored in a database table. Table 10.1 shows a simple database with records (i.e., "rows") that describe three restaurants. The column names (e.g., Cuisine or Service) are properties of restaurants. These properties are also called attributes, characteristics, fields, or variables in different publications. Each record contains a value for each attribute. A unique identifier, ID in Table 10.1, allows items with the same name to be distinguished and serves as a key to retrieve the other attributes of the record.

Table 10.1. A restaurant database

  ID     Name            Cuisine  Service  Cost
  10001  Mike's Pizza    Italian  Counter  Low
  10002  Chris's Cafe    French   Table    Medium
  10003  Jacques Bistro  French   Table    High

The database depicted in Table 10.1 could be used to drive a web site that lists and recommends restaurants. This is an example of structured data, in which there is a small number of attributes, each item is described by the same set of attributes, and there is a known set of values that the attributes may have. In this case, many machine learning algorithms may be used to learn a user profile, or a menu interface can easily be created to allow a user to create a profile. The next section of this chapter discusses several approaches to creating a user profile from structured data.

Of course, a web page typically has more information than is shown in Table 10.1, such as a text description of the restaurant, a restaurant review, or even a menu. These may easily be stored as additional fields in the database, and a web page can be created with templates to display the text fields (as well as the structured data). However, free text data creates a number of complications when learning a user profile. For example, a profile might indicate that there is an 80% probability that a particular user would like a French restaurant. This might be added to the profile because a user gave a positive review of four out of five French restaurants. However, unrestricted text fields are typically unique, and there would be no opportunity to provide feedback on five restaurants described as "A charming cafe with attentive staff overlooking the river." An extreme example of unstructured data may occur in news articles. Table 10.2 shows an example of a part of a news article. The entire article can be treated as a large unrestricted text field.
Table 10.2. Part of a newspaper article

Lawmakers Fine-Tuning Energy Plan

SACRAMENTO, Calif. -- With California's energy reserves remaining all but depleted, lawmakers prepared to work through the weekend fine-tuning a plan Gov. Gray Davis says will put the state in the power business for "a long time to come." The proposal involves partially taking over California's two largest utilities and signing long-term contracts of up to 10 years to buy electricity from wholesalers.

Unrestricted texts such as news articles are examples of unstructured data. Unlike structured data, there are no attribute names with well-defined values. Furthermore, the full complexity of natural language may be present in the text field, including polysemous words (the same word may have several meanings) and synonyms (different words may have the same meaning). For example, in the article in Table 10.2, "Gray" is a name rather than a color, and "power" and "electricity" refer to the same underlying concept. Many domains are best represented by semi-structured data, in which there are some attributes with a set of restricted values and some free-text fields. A common approach to dealing with free text fields is to convert the free text to a structured representation. For example, each word may be viewed as an attribute, with a Boolean value indicating whether the word is in the article or with an integer value indicating the number of times the word appears in the article.
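To make this conversion concrete, the sketch below turns free text into Boolean or count-valued word attributes. It is a minimal illustration rather than the chapter's own code: the regular-expression tokenizer and the small vocabulary are invented, and a real system would also reduce words to root forms with a stemmer such as Porter's [30], as discussed next.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic tokens; a production system would also
    # apply a stemming algorithm such as Porter's [30].
    return re.findall(r"[a-z]+", text.lower())

def to_boolean_features(text, vocabulary):
    # One Boolean attribute per vocabulary word: does the word appear at all?
    words = set(tokenize(text))
    return {w: w in words for w in vocabulary}

def to_count_features(text, vocabulary):
    # One integer attribute per vocabulary word: how often does it appear?
    counts = Counter(tokenize(text))
    return {w: counts[w] for w in vocabulary}

article = "Lawmakers prepared to work through the weekend fine-tuning an energy plan."
vocabulary = ["energy", "plan", "lawmakers", "restaurant"]
print(to_boolean_features(article, vocabulary))
print(to_count_features(article, vocabulary))
```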
Many personalization systems that deal with unrestricted text use a technique to create a structured representation that originated with text search systems [34]. In this formalism, rather than using words, the root forms of words are typically created through a process called stemming [30]. The goal of stemming is to create a term that reflects the common meaning behind words such as "compute," "computation," "computer," "computes," and "computers." The value of a variable associated with a term is a real number that represents the importance or relevance of that term. This value is called the tf*idf weight (term frequency times inverse document frequency). The tf*idf weight, w(t,d), of a term t in a document d is a function of the frequency of t in the document (tf_{t,d}), the number of documents that contain the term (df_t), and the number of documents in the collection (N): (1)

$$ w(t,d) = \frac{\mathrm{tf}_{t,d} \cdot \log_2(N/\mathrm{df}_t)}{\sqrt{\sum_{t' \in d} \left( \mathrm{tf}_{t',d} \cdot \log_2(N/\mathrm{df}_{t'}) \right)^2}} \qquad (10.1) $$

(1) Note that in the description of tf*idf weights, the word "document" is traditionally used, since the original motivation was to retrieve documents. While the chapter will stick with the original terminology, in a recommendation system the documents correspond to text descriptions of items to be recommended. Note also that the equation here is representative of the class of formulae called tf*idf. In general, tf*idf systems have weights that increase monotonically with term frequency and decrease monotonically with document frequency.

Table 10.3 shows the tf*idf representation (also called the vector space representation) of the complete article excerpted in Table 10.2. The terms are ordered by their tf*idf weight. The intuition behind the weight is that the terms with the highest weight occur more often in that document than in the other documents, and are therefore more central to the topic of the document. Note that terms such as "util" (a stem of "utility"), "power," and "megawatt" are among the highest weighted terms, capturing the meaning.

Table 10.3. tf*idf representation of the article in Table 10.2

util-0.339 power-0.329 megawatt-0.309 electr-0.217 energ-0.206 california-0.181 debt-0.128 lawmak-0.128 state-0.122 wholesal-0.119 partial-0.106 consum-0.105 alert-0.103 scroung-0.096 advoc-0.09 test-0.088 bail-out-0.088 crisi-0.085 amid-0.084 price-0.083 long-0.082 bond-0.081 plan-0.081 term-0.08 grid-0.078 reserv-0.077 blackout-0.076 bid-0.076 market-0.074 fine-0.073 deregul-0.07 spiral-0.068 deplet-0.068 liar-0.066

Of course, this representation does not capture the context in which a word is used. It loses the relationships between words in the description. For example, a description of a steak house might contain the sentence "there is nothing on the menu that a vegetarian would like," while the description of a vegetarian restaurant might mention "vegan" rather than "vegetarian." In a manually created structured database, the cuisine attribute having a value of "vegetarian" would indicate that the restaurant is indeed a vegetarian one. In contrast, when converting an unstructured text description to structured data, the presence of the word "vegetarian" does not always indicate that a restaurant is vegetarian, and the absence of the word "vegetarian" does not always indicate that the restaurant is not a vegetarian restaurant. As a consequence, techniques for creating user profiles that deal with structured data need to differ somewhat from those techniques that deal with unstructured data, or with unstructured data automatically and imprecisely converted to structured data.

One variant on using words as terms is to use sets of contiguous words as terms. For example, in the article in Table 10.2, terms such as "energy reserves" and "power business" might be more descriptive of the content than these words treated as individual terms. Of course, terms such as "all but" would also be included, but one would expect that these have very low weights, in the same way that "all" and "but" individually have low weights and are not among the most important terms in Table 10.3.
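The following sketch computes one member of the tf*idf family in the spirit of Eq. (10.1): raw term frequency multiplied by log2(N/df), with the resulting vector normalized by its length so that long documents do not dominate. The three-document "corpus" of pre-stemmed terms is invented for illustration.

```python
import math
from collections import Counter

def tfidf_vector(doc_terms, corpus_term_sets):
    """Length-normalized tf*idf weights for one document, per Eq. (10.1).
    doc_terms: list of (stemmed) terms in the document.
    corpus_term_sets: one set of terms per document in the collection."""
    n_docs = len(corpus_term_sets)
    tf = Counter(doc_terms)
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for terms in corpus_term_sets if term in terms)
        weights[term] = freq * math.log2(n_docs / df)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

docs = [["power", "electr", "util", "power"],
        ["restaur", "menu", "cuisin"],
        ["power", "restaur"]]
corpus = [set(d) for d in docs]
print(sorted(tfidf_vector(docs[0], corpus).items(), key=lambda kv: -kv[1]))
```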
10.2 User Profiles

A profile of the user's interests is used by most recommendation systems. This profile may consist of a number of different types of information. Here, we concentrate on two types of information:

1. A model of the user's preferences, i.e., a description of the types of items that interest the user. There are many possible alternative representations of this description, but one common representation is a function that, for any item, predicts the likelihood that the user is interested in that item. For efficiency purposes, this function may be used to retrieve the n items most likely to be of interest to the user.

2. A history of the user's interactions with the recommendation system. This may include storing the items that a user has viewed, together with other information about the user's interaction (e.g., whether the user has purchased the item or a rating that the user has given the item). Other types of history include saving queries typed by the user (e.g., that a user searched for an Italian restaurant in the 90210 zip code).

There are several uses for the history of user interactions. First, the system can simply display recently visited items to facilitate the user returning to these items. Second, the system can filter out from a recommendation system an item that the user has already purchased or read. (2) Another important use of the history in content-based recommendation systems is to serve as training data for a machine learning algorithm that creates a user model. The next section will discuss several different approaches to learning a user model. Here, we briefly describe approaches for manually providing the information used by recommendation systems: user customization and rule-based recommendation systems.

(2) Of course, in some situations it is appropriate to recommend an item the user has purchased and in other situations it is not. For example, a system should continue to recommend an item that wears out or is expended, such as a razor blade or print cartridge, while there is little value in recommending a CD or DVD a user owns.

In user customization, a recommendation system provides an interface that allows users to construct a representation of their own interests. Often check boxes are used to allow a user to select from the known values of attributes, e.g., the cuisine of restaurants, the names of favorite sports teams, the favorite sections of a news site, or the genre of favorite movies. In other cases, a form allows a user to type words that occur in the free text descriptions of items, e.g., the name of a musician or author that interests the user. Once the user has entered this information, a simple database matching process is used to find items that meet the specified criteria and display them to the user.
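A minimal sketch of this matching step, using the records of Table 10.1, appears below. The dictionary-based profile format and the matches helper are illustrative choices, not a description of any particular deployed system.

```python
# The records of Table 10.1, keyed by their unique ID.
restaurants = {
    10001: {"Name": "Mike's Pizza",   "Cuisine": "Italian", "Service": "Counter", "Cost": "Low"},
    10002: {"Name": "Chris's Cafe",   "Cuisine": "French",  "Service": "Table",   "Cost": "Medium"},
    10003: {"Name": "Jacques Bistro", "Cuisine": "French",  "Service": "Table",   "Cost": "High"},
}

def matches(record, profile):
    # True if the record has one of the allowed values for every attribute
    # the user constrained (check-box style customization).
    return all(record[attr] in allowed for attr, allowed in profile.items())

# A hypothetical profile: French restaurants with table service.
profile = {"Cuisine": {"French"}, "Service": {"Table"}}
print([r["Name"] for r in restaurants.values() if matches(r, profile)])
```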
There are several limitations of user customization systems. First, they require effort from the user, and it is difficult to get many users to make this effort. This is particularly true when the user's interests change, e.g., a user may not follow football during the season but then become interested in the Super Bowl. Second, customization systems do not provide a way to determine the order in which to present items and can find either too few or too many matching items to display.

Figure 10.1 shows book recommendations at Amazon.com. Although Amazon.com is usually thought of as a good example of collaborative recommendation (see Chapter 9 of this book [35]), parts of the user's profile can be viewed as a content-based profile. For example, Amazon contains a feature called "favorites" that represents the categories of items preferred by users. These favorites are either calculated by keeping track of the categories of items purchased by users or may be set manually by the user. Figure 10.2 shows an example of a user customization interface in which a user can select the categories.

Fig. 10.1. Book recommendations by Amazon.com

In rule-based recommendation systems, the recommendation system has rules to recommend other products based on the user history. For example, a system may contain a rule that recommends the sequel to a book or movie to people who have purchased the earlier item in the series. Another rule might recommend a new CD by an artist to users who purchased earlier CDs by that artist. Rule-based systems may capture several common reasons for making recommendations, but they do not offer the same detailed personalized recommendations that are available with other recommendation systems.
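The sketch below implements one such rule, the "new CD by a previously purchased artist" example above. The catalog representation and item identifiers are hypothetical.

```python
def artist_rule(purchases, catalog):
    # Recommend items by artists whose earlier items the user has purchased,
    # excluding anything already owned.
    liked_artists = {catalog[item] for item in purchases if item in catalog}
    return [item for item, artist in catalog.items()
            if artist in liked_artists and item not in purchases]

catalog = {"cd1": "Artist A", "cd2": "Artist A", "cd3": "Artist B"}
print(artist_rule({"cd1"}, catalog))  # -> ['cd2']
```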
Fig. 10.2. User customization in Amazon.com

10.3 Learning a User Model

Creating a model of the user's preferences from the user history is a form of classification learning. The training data of a classification learner is divided into categories, e.g., the binary categories "items the user likes" and "items the user doesn't like." This is accomplished either through explicit feedback, in which the user rates items via some interface for collecting feedback, or implicitly, by observing the user's interactions with items. For example, if a user purchases an item, that is a sign that the user likes the item, while if the user purchases and returns the item, that is a sign that the user doesn't like the item. In general, there is a tradeoff, since implicit methods can collect a large amount of data with some uncertainty as to whether the user actually likes the item. In contrast, when the user explicitly rates items, there is little or no noise in the training data, but users tend to provide explicit feedback on only a small percentage of the items they interact with.

Figure 10.3 shows an example of a recommendation system with explicit user feedback. The recommender MyBestBets by ChoiceStream is a web-based interface to a television recommendation system. Users can click on the thumbs-up or thumbs-down buttons to indicate whether they like the program that is recommended. By necessity, this system requires explicit feedback, because it is not integrated with a television [1] and cannot infer the user's interests by observing the user's behavior.

Fig. 10.3. A recommendation system using explicit feedback
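On the implicit side of this tradeoff, the following sketch shows how an interaction log might be reduced to binary training labels using the purchase-and-return heuristic described above. The event names and the log format are illustrative simplifications.

```python
def labels_from_events(events):
    # A kept purchase counts as "likes" (1); a purchase that was later
    # returned counts as "doesn't like" (0). Other events are ignored.
    labels = {}
    for item, event in events:
        if event == "purchase":
            labels[item] = 1
        elif event == "return":
            labels[item] = 0
    return labels

log = [("item1", "purchase"), ("item2", "purchase"), ("item2", "return")]
print(labels_from_events(log))  # -> {'item1': 1, 'item2': 0}
```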
The next section reviews a number of classification learning algorithms. Such algorithms are the key component of content-based recommendation systems, because they learn a function that models each user's interests. Given a new item and the user model, the function predicts whether the user would be interested in the item. Many of the classification learning algorithms create a function that will provide an estimate of the probability that a user will like an unseen item. This probability may be used to sort a list of recommendations. Alternatively, an algorithm may create a function that directly predicts a numeric value such as the degree of interest.

Some of the algorithms below are traditional machine learning algorithms designed to work on structured data. When they operate on free text, the free text is first converted to structured data by selecting a small subset of the terms as attributes. In contrast, other algorithms are designed to work in high-dimensional spaces and do not require a preprocessing step of feature selection.

10.4 Decision Trees and Rule Induction

Decision tree learners such as ID3 [31] build a decision tree by recursively partitioning training data, in this case text documents, into subgroups until those subgroups contain only instances of a single class. A partition is formed by a test on some feature, in the context of text classification typically the presence or absence of an individual word or phrase. Expected information gain is a commonly used criterion to select the most informative features for the partition tests [38].
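The sketch below computes expected information gain for a single word-presence test, the selection criterion cited above [38]. The three labeled documents are invented, and documents are reduced to sets of words for simplicity.

```python
import math

def entropy(labels):
    # Entropy, in bits, of a list of class labels.
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(examples, word):
    # Expected reduction in entropy from partitioning on one word's presence.
    # examples: list of (set_of_words, label) pairs.
    labels = [label for _, label in examples]
    present = [label for words, label in examples if word in words]
    absent = [label for words, label in examples if word not in words]
    remainder = sum((len(part) / len(labels)) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder

examples = [({"power", "energy"}, "likes"),
            ({"power", "grid"}, "likes"),
            ({"menu", "cuisine"}, "dislikes")]
print(information_gain(examples, "power"))  # ~0.918 bits
```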
Decision trees have been studied extensively in use with structured data such as that shown in Table 10.1. Given feedback on the restaurants, a decision tree can easily represent and learn a profile of someone who prefers to eat in expensive French restaurants or inexpensive Mexican restaurants.

Arguably, the decision tree bias is not ideal for unstructured text classification tasks [29]. As a consequence of the information-theoretic splitting criteria used by decision tree learners, the inductive bias of decision trees is a preference for small trees with few tests. However, it can be shown experimentally that text classification tasks frequently involve a large number of relevant features [17]. Therefore, a decision tree's tendency to base classifications on as few tests as possible can lead to poor performance on text classification. However, when there are a small number of structured attributes, the performance, simplicity, and understandability of decision trees for content-based models are all advantages. Kim et al. [18] describe an application of decision trees for personalizing advertisements on web pages.

RIPPER [9] is a rule induction algorithm closely related to decision trees that operates in a similar fashion to the recursive data partitioning approach described above. Despite the problematic inductive bias, however, RIPPER performs competitively with other state-of-the-art text classification algorithms. In part, the performance can be attributed to a sophisticated post-pruning algorithm that optimizes the fit of the induced rule set with respect to the training data as a whole. Furthermore, RIPPER supports multi-valued attributes, which leads to a natural representation for text classification tasks, i.e., the individual words of a text document can be represented as multiple feature values for a single feature. While this is essentially a representational convenience if rules are to be learned from unstructured text documents, the approach can lead to more powerful classifiers for semi-structured text documents. For example, the text contained in separate fields of an email message, such as sender, subject, and body text, can be represented as separate multi-valued features, which allows the algorithm to take advantage of the document's structure in a natural fashion. Cohen [10] shows how RIPPER can classify e-mail messages into user-defined categories.
10.5 Nearest Neighbor Methods

The nearest neighbor algorithm simply stores all of its training data, here textual descriptions of implicitly or explicitly labeled items, in memory. In order to classify a new, unlabeled item, the algorithm compares it to all stored items using a similarity function and determines the "nearest neighbor" or the k nearest neighbors. The class label or numeric score for a previously unseen item can then be derived from the class labels of the nearest neighbors.

The similarity function used by the nearest neighbor algorithm depends on the type of data. For structured data, a Euclidean distance metric is often used. When using the vector space model, the cosine similarity measure is often used [34]. In the Euclidean distance function, the same feature having a small value in two examples is treated the same as that feature having a large value in both examples. In contrast, the cosine similarity function will not have a large value if corresponding features of two examples have small values. As a consequence, it is appropriate for text, where we want two documents to be similar when they are about the same topic, but not when they are both not about a topic. The vector space approach and the cosine similarity function have been applied to several text classification applications ([11], [39], [2]) and, despite the algorithm's unquestionable simplicity, it performs competitively with more complex algorithms.
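A minimal sketch of k-nearest-neighbor classification with cosine similarity over sparse term-weight vectors follows; the stored examples and the choice of k are purely illustrative.

```python
import math

def cosine(u, v):
    # Cosine similarity of two sparse vectors given as {term: weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(query, training, k=3):
    # Majority class label among the k stored items most similar to the query.
    neighbors = sorted(training, key=lambda ex: cosine(query, ex[0]),
                       reverse=True)[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

training = [({"power": 0.8, "grid": 0.5}, "likes"),
            ({"energy": 0.9, "plan": 0.3}, "likes"),
            ({"menu": 0.7, "cuisine": 0.6}, "dislikes")]
print(knn_predict({"power": 0.6, "energy": 0.4}, training, k=1))  # -> 'likes'
```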
The Daily Learner system uses the nearest neighbor algorithm to create a model of the user's short-term interests [7]. Gixo, a personalized news system, also uses text similarity as a basis for recommendation (Figure 10.4). The headlines are preceded by an icon that indicates how popular the item is (the first bar) and how similar the story is to stories that have been read by the user before (the second bar). The fact that these bars differ shows the value of personalizing to the individual.

Fig. 10.4. Gixo presents personalized news based on similarity to articles that have previously been read

10.6 Relevance Feedback and Rocchio's Algorithm

Since the success of document retrieval in the vector space model depends on the user's ability to construct queries by selecting a set of representative keywords [34], methods that help users to incrementally refine queries based on previous search results have been the focus of much research. These methods are commonly referred to as relevance feedback. The general principle is to allow users to rate documents returned by the retrieval system with respect to their information need. This form of feedback can subsequently be used to incrementally refine the initial query. In a manner analogous to rating items, there are explicit and implicit means of collecting relevance feedback data.

Rocchio's algorithm [33] is a widely used relevance feedback algorithm that operates in the vector space model. The algorithm is based on the modification of an initial query through differently weighted prototypes of relevant and non-relevant documents. The approach forms two document prototypes by taking the vector sum over all relevant and non-relevant documents. The following formula summarizes the algorithm formally:

$$ Q_{i+1} = \alpha Q_i + \beta \frac{1}{|D_{rel}|} \sum_{D_j \in D_{rel}} D_j - \gamma \frac{1}{|D_{nonrel}|} \sum_{D_j \in D_{nonrel}} D_j \qquad (10.2) $$

Here, Q_i is the user's query at iteration i, and α, β, and γ are parameters that control the influence of the original query and the two prototypes on the resulting modified query. The underlying intuition of the above formula is to incrementally move the query vector towards clusters of relevant documents and away from irrelevant documents. While this goal forms an intuitive justification for Rocchio's algorithm, there is no theoretically motivated basis for the above formula, i.e., neither performance nor convergence can be guaranteed. However, empirical experiments have demonstrated that the approach leads to significant improvements in retrieval performance [33].

In more recent work, researchers have used a variation of Rocchio's algorithm in a machine learning context, i.e., for learning a user profile from unstructured text ([15], [3], [29]). The goal in these applications is to automatically induce a text classifier that can distinguish between classes of documents. In this context, it is assumed that no initial query exists, and the algorithm forms prototypes for classes analogously to Rocchio's approach, as vector sums over documents belonging to the same class. The result of the algorithm is a set of weight vectors, whose proximity to unlabeled documents can be used to assign class membership. Similar to the relevance feedback version of Rocchio's algorithm, the Rocchio-based classification approach does not have theoretical underpinnings, and there are no performance or convergence guarantees.
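The sketch below performs one feedback iteration in the spirit of Eq. (10.2), forming relevant and non-relevant prototypes and combining them with the current query. The α, β, and γ values are arbitrary illustrative settings, since the chapter prescribes none.

```python
def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """One relevance feedback step per Eq. (10.2): move the query vector
    toward the centroid of the relevant documents and away from the
    centroid of the non-relevant ones. Vectors are {term: weight} dicts."""
    def centroid(docs):
        total = {}
        for doc in docs:
            for term, weight in doc.items():
                total[term] = total.get(term, 0.0) + weight
        return {t: w / len(docs) for t, w in total.items()} if docs else {}

    rel, nonrel = centroid(relevant), centroid(nonrelevant)
    terms = set(query) | set(rel) | set(nonrel)
    return {t: alpha * query.get(t, 0.0)
               + beta * rel.get(t, 0.0)
               - gamma * nonrel.get(t, 0.0)
            for t in terms}

query = {"energy": 1.0}
print(rocchio_update(query, [{"energy": 0.5, "power": 0.8}], [{"menu": 0.9}]))
```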
10.7 Linear Classifiers

Algorithms that learn linear decision boundaries, i.e., hyperplanes separating instances in a multi-dimensional space, are referred to as linear classifiers. There are a large number of algorithms that fall into this category, and many of them have been successfully applied to text classification tasks [20]. All linear classifiers can be described in a common representational framework. In general, the outcome of the learning process is an n-dimensional weight vector w, whose dot product with an n-dimensional instance, e.g., a text document represented in the vector space model, results in a numeric score prediction. Retaining the numeric prediction leads to a linear regression approach. However, a threshold can be used to convert continuous predictions to discrete class labels. While this general framework holds for all linear classifiers, the algorithms differ in the training methods used to derive the weight vector w. For example, the equation below is known as the Widrow-Hoff rule, delta rule, or gradient descent rule; it derives the weight vector w by incremental vector movements in the direction of the negative gradient of the example's squared error [37]. This is the direction in which the error falls most rapidly.

$$ w_{i+1,j} = w_{i,j} - 2\eta\,(w_i \cdot x_i - y_i)\,x_{i,j} \qquad (10.3) $$

The equation shows how the weight vector w can be derived incrementally. The inner product of instance x_i and weight vector w_i is the algorithm's numeric prediction for instance x_i. The prediction error is determined by subtracting the instance's known score, y_i, from the predicted score. The resulting error is then multiplied by the original instance vector x_i and the learning rate η to form a vector that, when subtracted from the weight vector w_i, moves w_i towards the correct prediction for instance x_i. The learning rate η controls the degree to which every additional instance affects the previous weight vector.
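A sketch of this update applied online to a small stream of instances follows; sparse dictionaries stand in for the weight and instance vectors, and the learning rate is an arbitrary illustrative value.

```python
def widrow_hoff_step(w, x, y, eta=0.05):
    # One Widrow-Hoff (delta rule) update per Eq. (10.3): adjust the weight
    # vector against the gradient of this instance's squared error.
    prediction = sum(weight * x.get(feature, 0.0)
                     for feature, weight in w.items())
    error = prediction - y
    new_w = dict(w)
    for feature, value in x.items():
        new_w[feature] = new_w.get(feature, 0.0) - 2 * eta * error * value
    return new_w

w = {}  # start with an empty (all-zero) weight vector
stream = [({"power": 1.0}, 1.0), ({"menu": 1.0}, 0.0), ({"power": 1.0}, 1.0)]
for x, y in stream:
    w = widrow_hoff_step(w, x, y)
print(w)
```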
An alternative algorithm that has experimentally been shown to outperform the approach above on text classification tasks with many features is the exponentiated gradient (EG) algorithm. Kivinen and Warmuth [19] prove a bound for EG's error that depends only logarithmically on the number of features. This result offers a theoretical argument for EG's performance on text classification problems, which are typically high-dimensional.

An important advantage of the above learning schemes for linear algorithms is that they can be performed on-line, i.e., the current weight vector can be modified incrementally as new instances become available. This is a crucial advantage for applications that operate under real-time constraints.

Finally, it is important to note that while the above approaches tend to converge on hyperplanes that separate the training data accurately, the hyperplane's generalization performance might not be optimal. A related approach aimed at improving generalization performance is known as support vector machines [36]. The central idea underlying support vector machines is to maximize the classification margin, i.e., the distance between the decision boundary and the closest training instances, the so-called support vectors. A series of empirical experiments on a variety of benchmark data sets indicated that linear support vector machines perform particularly well on text classification tasks [17]. The main reason for this is that margin maximization is an inherently built-in overfitting protection mechanism. A reduced tendency to overfit training data is particularly useful for text classification, because in this domain high-dimensional concepts must often be learned from limited training data, a scenario prone to overfitting.

10.8 Probabilistic Methods and Naïve Bayes

In contrast to the lack of theoretical justification for the vector space model, there has been much work on probabilistic text classification approaches. This section describes one such example, the naïve Bayesian classifier. Early work on a probabilistic classifier and its text classification performance was reported by Maron [24]. Today, this algorithm is commonly referred to as a naïve Bayesian classifier [13]. Researchers have recognized naïve Bayes as an exceptionally well-performing text classification algorithm and have frequently adopted it in recent work ([27], [28], [25]). The algorithm's popularity and performance for text classification applications have prompted researchers to empirically evaluate and compare different variations of naïve Bayes that have appeared in the literature (e.g., [26], [21]).

In summary, McCallum and Nigam [26] note that there are two frequently used formulations of naïve Bayes, the multivariate Bernoulli and the multinomial model. Both models share the following principles. It is assumed that text documents are generated by an underlying generative model, specifically a parameterized mixture model:

$$ P(d \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d \mid c_j; \theta) \qquad (10.4) $$

Here, each class c_j corresponds to a mixture component that is parameterized by a disjoint subset of θ, and the sum of total probability over all mixture components determines the likelihood of a document. Once the parameters θ have been learned from training data, the posterior probability of class membership given the evidence of a test document can be determined according to Bayes' rule:

$$ P(c_j \mid d; \hat{\theta}) = \frac{P(c_j \mid \hat{\theta})\, P(d \mid c_j; \hat{\theta})}{P(d \mid \hat{\theta})} \qquad (10.5) $$

While the above principles hold for naïve Bayes classification in general, the multivariate Bernoulli and multinomial models differ in the way P(d | c_j; θ) is estimated from training data. The multivariate Bernoulli formulation was derived with structured data in mind. For text classification tasks, it assumes that each document is represented as a binary vector over the space of all words from a vocabulary V. Each element B_t in this vector indicates whether word w_t appears at least once in the document. Under the naïve Bayes assumption that the probability of each word occurring in a document is independent of other words given the class label, P(d | c_j; θ) can be expressed as a simple product:
$$ P(d \mid c_j; \theta) = \prod_{t=1}^{|V|} \bigl( B_t\, P(w_t \mid c_j; \theta) + (1 - B_t)\,(1 - P(w_t \mid c_j; \theta)) \bigr) \qquad (10.6) $$

Bayes-optimal estimates for P(w_t | c_j; θ) can be determined by word occurrence counting over the data:

$$ P(w_t \mid c_j; \theta) = \frac{1 + \sum_{i=1}^{|D|} B_{it}\, P(c_j \mid d_i)}{2 + \sum_{i=1}^{|D|} P(c_j \mid d_i)} \qquad (10.7) $$

In contrast to the binary document representation of the multivariate Bernoulli model, the multinomial formulation captures word frequency information. This model assumes that documents are generated by a sequence of independent trials drawn from a multinomial probability distribution. Again, the naïve Bayes independence assumption allows P(d_i | c_j; θ) to be determined based on individual word probabilities:

$$ P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{t=1}^{|V|} P(w_t \mid c_j; \theta)^{N_{it}} \qquad (10.8) $$

Here, N_it is the number of occurrences of word w_t in document d_i. Taking word frequencies into account, maximum likelihood estimates for P(w_t | c_j; θ) can be derived from training data:

$$ P(w_t \mid c_j; \theta) = \frac{1 + \sum_{i=1}^{|D|} N_{it}\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{is}\, P(c_j \mid d_i)} \qquad (10.9) $$

Empirically, the multinomial naïve Bayes formulation was shown to outperform the multivariate Bernoulli model. This effect is particularly noticeable for large vocabularies [26]. Even though the naïve Bayes assumption of class-conditional attribute independence is clearly violated in the context of text classification, naïve Bayes performs very well. Domingos and Pazzani [12] offer a possible explanation for this paradox by showing that class-conditional feature independence is not a necessary condition for the optimality of naïve Bayes. The naïve Bayes classifier has been used in several content-based recommendation systems, including Syskill & Webert [29].
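To make the multinomial model concrete, the sketch below trains and applies a classifier in the spirit of Eqs. (10.8) and (10.9), with one simplification: class membership P(c_j | d_i) is taken to be a hard 0/1 label rather than a probability. The two training documents are invented.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    # Estimate class priors and smoothed word probabilities per Eq. (10.9),
    # assuming hard class labels. docs: list of (list_of_terms, label).
    vocab = {t for terms, _ in docs for t in terms}
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for terms, label in docs:
        word_counts[label].update(terms)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    cond = {c: {t: (1 + word_counts[c][t])
                   / (len(vocab) + sum(word_counts[c].values()))
                for t in vocab}
            for c in class_counts}
    return priors, cond, vocab

def classify(terms, priors, cond, vocab):
    # Choose the class maximizing log P(c) + sum_t log P(w_t|c), following
    # Eqs. (10.5) and (10.8); words outside the vocabulary are ignored.
    scores = {c: math.log(p) + sum(math.log(cond[c][t])
                                   for t in terms if t in vocab)
              for c, p in priors.items()}
    return max(scores, key=scores.get)

docs = [(["power", "energy", "power"], "likes"),
        (["menu", "cuisine"], "dislikes")]
print(classify(["power", "grid"], *train_multinomial_nb(docs)))  # -> 'likes'
```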
10.9 Trends in Content-Based Filtering

Belkin and Croft [5] surveyed some of the first content-based recommendation systems and noted that they made use of technology related to information retrieval, such as tf*idf and Rocchio's method. Indeed, some of the early work on content-based recommendation used the term "query" to refer to user models. In this view, a user model is a saved query (or a set of saved queries) that can retrieve additional or new information of interest to the user. Some representative early systems include a system at Bellcore [14] that found new technical reports related to previously read reports, and LyricTime [22], which recommended songs in a multimedia player based on a profile learned from the user's feedback on prior songs played.

The creation and rapid growth of the World Wide Web in the mid 1990s made access to vast amounts of information possible and created problems of locating and identifying personally relevant information. Some in the machine learning community applied traditional machine learning methods to user modeling of document interests. These methods reduced the text training data to a few hundred highly relevant words using techniques such as information theory or tf*idf. Some representative systems included WebWatcher [16] and Syskill & Webert [29].

Fig. 10.5. The Syskill & Webert system learns a model of the user's preference for web pages
10.10 Limitations and Extensions

Although there are different approaches to learning a model of the user's interest with content-based recommendation, no content-based recommendation system can give good recommendations if the content does not contain enough information to distinguish items the user likes from items the user doesn't like. In recommending some items, e.g., jokes or poems, there often isn't enough information in the word frequency to model the user's interests. While it would be possible to tell a lawyer joke from a chicken joke based upon word frequencies, it would be difficult to distinguish a funny lawyer joke from other lawyer jokes. As a consequence, other recommendation technologies, such as collaborative recommenders [35], should be used in such situations.

In some situations, e.g., recommending movies, restaurants, or television programs, there is some structured information (e.g., the genre of the movie as well as actors and directors) that can be used by a content-based system. However, this information might be supplemented by the opinions of other users. One way to include the opinions of other users in the frameworks discussed in Section 10.2 is to add additional data associated with the representation of the examples. For example, Basu et al. [4] add features to examples that indicate the identifiers of other users who like an item. RIPPER was applied to the resulting data and could learn profiles with both collaborative and content-based features (e.g., a user might like a science fiction movie if USER-109 likes it); a sketch of such hybrid feature vectors appears at the end of this section. Although this is not strictly a content-based system, the same technology as content-based recommenders is used to learn a user model. Indeed, Billsus and Pazzani [6] have shown that any machine learning algorithm may be used as the basis for collaborative filtering by transforming user ratings to attributes. Chapter 12 of this book [8] discusses a variety of other approaches to combining content and collaborative information in recommendation systems.

A final usage of content in recommendations is worth noting. Simple content-based rules may be used to filter the results of other methods such as collaborative filtering. For example, even if it is the case that people who buy dolls also buy adult videos, it might be important not to recommend adult items in a particular application. Similarly, although not strictly content-based, some systems might not recommend items that are out of stock.
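Returning to the Basu et al. [4] approach referenced above, the following illustration builds hybrid feature vectors that mix content features (words) with collaborative features (identifiers of other users who like the item). The feature-name encoding is an invented convention, not one from the original paper.

```python
def hybrid_features(item_words, likers):
    # Content features and collaborative features side by side, so that a
    # single rule learner such as RIPPER can use both kinds of evidence.
    features = {"word=" + w: True for w in item_words}
    features.update({"liked-by=" + u: True for u in likers})
    return features

print(hybrid_features({"science", "fiction"}, {"USER-109", "USER-212"}))
```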
References 1. Al, K., van Stam, W.: TVo: Makng Show Recommendatons Usng a Dstrbuted Collaboratve Flterng Archtecture. In: Proceedngs of the Tenth ACM SIGKDD Internatonal Conference on Knowledge Dscovery and Data Mnng. Seattle, WA. (2004) 394-401 2. Allan, J., Carbonell, J., Doddngton, G., Yamron, J., Yang, Y.: Topc Detecton and Trackng Plot Study Fnal Report. In: Proceedngs of the DARPA Broadcast News Transcrpton and Understandng Workshop. Lansdowne, VA (1998) 194-218 3. Balabanovc, M., Shoham Y.: FAB: Content-based, Collaboratve Recommendaton. Communcatons of the Assocaton for Computng Machnery 40(3) (1997) 66-72 4. Basu, C., Hrsh, H., Cohen W.: Recommendaton as Classfcaton: Usng Socal and Content-Based Informaton n Recommendaton. In: Proceedngs of the 15th Natonal Conference on Artfcal Intellgence, Madson, WI (1998) 714-720 5. Belkn, N., Croft, B.: Informaton Flterng and Informaton Retreval: Two Sdes of the Same Con? Communcatons of the ACM 35(12) (1992) 29-38 6. Bllsus, D., Pazzan, M.: Learnng Collaboratve Informaton Flters. In: Proceedngs of the Internatonal Conference on Machne Learnng. Morgan Kaufmann Publshers. Madson, WI (1998) 46-54 7. Bllsus, D., Pazzan, M., Chen, J.: A Learnng Agent for Wreless News Access. In: Proceedngs of the Internatonal Conference on Intellgent User Interfaces (2002) 33-36 8. Burke, R.: Hybrd Web Recommender Systems. In: Bruslovsky, P., Kobsa, A., Nedl, W. (eds.): The Adaptve Web: Methods and Strateges of Web Personalzaton. Lecture Notes n Computer Scence, Vol. 4321. Sprnger-Verlag, Berln Hedelberg New York (2007) ths volume 9. Cohen, W.: Fast Effectve Rule Inducton. In: Proceedngs of the Twelfth Internatonal Conference on Machne Learnng, Tahoe Cty, CA. (1995) 115-123 10. Cohen, W.: Learnng Rules that Classfy E-mal. In: Papers from the AAAI Sprng Symposum on Machne Learnng n Informaton Access (1996) 18-25 11. Cohen, W., Hrsh, H. Jons that Generalze: Text Classfcaton Usng WHIRL. In: Proceedngs of the Fourth Internatonal Conference on Knowledge Dscovery & Data Mnng, New York, NY (1998) 169-173 12. Domngos, P., Pazzan, M. Beyond Independence: Condtons for the Optmalty of the Smple Bayesan Classfer. Machne Learnng 29 (1997) 103-130. 13. Duda, R., Hart, P.: Pattern Classfcaton and Scene Analyss. New York, NY: Wley and Sons (1973) 14. Foltz, P., Dumas, S.: Personalzed Informaton Delvery: An Analyss of Informaton Flterng Methods. Communcatons of the ACM 35(12) (1992) 51-60 15. Ittner, D., Lews, D., Ahn, D.: Text Categorzaton of Low Qualty Images. In: Symposum on Document Analyss and Informaton Retreval, Las Vegas, NV (1995) 301-315 16. Joachms, T., Fretag, D., Mtchell, T.: WebWatcher: A Tour Gude for the World Wde Web. In: Proceedngs of the 15th Internatonal Jont Conference on Artfcal Intellgence. Nagoya, Japan (1997) 770-775 17. Joachms, T.: Text Categorzaton Wth Support Vector Machnes: Learnng wth Many Relevant Features. In: European Conference on Machne Learnng, Chemntz, Germany (1998) 137-142 18. Km, J., Lee, B., Shaw, M., Chang, H., Nelson, W.: Applcaton of Decson-Tree Inducton Technques to Personalzed Advertsements on Internet Storefronts. Internatonal Journal of Electronc Commerce 5(3) (2001) 45-62 19. Kvnen, J., Warmuth, M.: Exponentated Gradent versus Gradent Descent for Lnear Predctors. Informaton and Computaton 132(1) (1997) 1-63
20. Lewis, D., Schapire, R., Callan, J., Papka, R.: Training Algorithms for Linear Text Classifiers. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Konstanz, Germany (1996) 298-306
21. Lewis, D.: Naïve (Bayes) at Forty: The Independence Assumption in Information Retrieval. In: European Conference on Machine Learning, Chemnitz, Germany (1998) 4-15
22. Loeb, S.: Architecting Personal Delivery of Multimedia Information. Communications of the ACM 35(12) (1992) 39-48
23. Mandel, M., Poliner, G., Ellis, D.: Support Vector Machine Active Learning for Music Retrieval. ACM Multimedia Systems Journal 12(1) (2006) 3-13
24. Maron, M.: Automatic Indexing: An Experimental Inquiry. Journal of the Association for Computing Machinery 8(3) (1961) 404-417
25. McCallum, A., Rosenfeld, R., Mitchell, T., Ng, A.: Improving Text Classification by Shrinkage in a Hierarchy of Classes. In: Proceedings of the International Conference on Machine Learning, Madison, WI. Morgan Kaufmann (1998) 359-367
26. McCallum, A., Nigam, K.: A Comparison of Event Models for Naive Bayes Text Classification. In: AAAI/ICML-98 Workshop on Learning for Text Categorization, Technical Report WS-98-05, AAAI Press (1998) 41-48
27. Mitchell, T.: Machine Learning. McGraw-Hill (1997)
28. Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Learning to Classify Text from Labeled and Unlabeled Documents. In: Proceedings of the 15th National Conference on Artificial Intelligence, Madison, WI (1998) 792-799
29. Pazzani, M., Billsus, D.: Learning and Revising User Profiles: The Identification of Interesting Web Sites. Machine Learning 27(3) (1997) 313-331
30. Porter, M.: An Algorithm for Suffix Stripping. Program 14(3) (1980) 130-137
31. Quinlan, J.: Induction of Decision Trees. Machine Learning 1 (1986) 81-106
32. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
33. Rocchio, J.: Relevance Feedback in Information Retrieval. In: Salton, G. (ed.): The SMART System: Experiments in Automatic Document Processing. Prentice Hall, NJ (1971) 313-323
34. Salton, G.: Automatic Text Processing. Addison-Wesley (1989)
35. Schafer, B., Frankowski, D., Herlocker, J., Sen, S.: Collaborative Filtering Recommender Systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.): The Adaptive Web: Methods and Strategies of Web Personalization. Lecture Notes in Computer Science, Vol. 4321. Springer-Verlag, Berlin Heidelberg New York (2007) this volume
36. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
37. Widrow, B., Hoff, M.: Adaptive Switching Circuits. WESCON Convention Record 4 (1960) 96-104
38. Yang, Y., Pedersen, J.: A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN (1997) 412-420
39. Yang, Y.: An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval 1(1) (1999) 67-88