QDquaderni. UP-DRES User Profiling for a Dynamic REcommendation System E. Messina, D. Toscani, F. Archetti. university of milano bicocca



University of Milano Bicocca. QDquaderni, Department of Informatics, Systems and Communication. UP-DRES User Profiling for a Dynamic REcommendation System. E. Messina, D. Toscani, F. Archetti. Research report n. 1, March 2006

Copyright MMVI ARACNE editrice S.r.l. via Raffaele Garofalo, 133 A/B Roma (06) ISBN ISSN All rights of translation, electronic storage, reproduction and adaptation, in whole or in part and by any means, are reserved for all countries. Photocopying without the written permission of the Publisher is strictly forbidden. First edition: April 2006

Contents
1. Introduction
2. The Proposed System
3. Content Extractor
4. Taxonomy Builder
5. Recommendation Manager
6. Conclusions and Future Work
References


UP-DRES User Profiling for a Dynamic REcommendation System
Enza Messina 1, Daniele Toscani 1,2, Francesco Archetti 1,2
1 DISCO, Università degli Studi di Milano Bicocca, Via Bicocca degli Arcimboldi, Milano, Italy messina@disco.unimib.it
2 Consorzio Milano Ricerche, Via Cicognara 7, Milano, Italy {archetti,toscani}@milanoricerche.it

Abstract. The WWW is currently the most dynamic and attractive place for exchanging information. Finding useful information is hard because of the huge amount of data, the variety of topics and the unstructured contents. In this paper we present a web browsing support system that proposes personalized contents. It is integrated in the content management system and runs on the server hosting the site. It periodically processes the site contents, extracting vectors of the most significant words. A topology tree is defined by applying hierarchical clustering. During online browsing, the viewed contents are processed and mapped into the previously defined vector space. The centroid of these vectors is compared with the centroids of the topology tree nodes to find the most similar node; its contents are presented to the user as link suggestions or dynamically created pages. The personal profile is saved after every session and included in the analysis during the same user's subsequent visits, avoiding the cold start problem.

1. Introduction
Today's world is sometimes called the information society, to point out the growing importance that information is assuming. It is easy for everyone to consult knowledge sources and to publish them. Automatic systems help in this process, but they also generate a huge amount of monitoring and derived data. The practical effect is that, at a certain stage, people will be confronted with more information than they can effectively process: this situation is known as information overload [4] [17]. This means that part of that information will be ignored, forgotten, distorted or otherwise lost. The web is the fastest-evolving medium and reflects these trends: finding information on it is becoming more and more difficult and time consuming.
Users want to find useful and interesting contents during navigation; on the other hand, administrators of e-commerce and service portals want to attract visitors. Every person perceives what is useful and interesting in a different way: this is the reason why systems that provide personalized suggestions based on user preferences, a.k.a. recommendation systems, are required.
Three different approaches to deriving models that represent web users and identify their interests may be found in the literature: collaborative filtering, content-based analysis and browsing behaviour modelling. This classification depends on the data source used.
People interacting with systems based on collaborative filtering have to actively express an interest by rating the contents they are viewing. This allows the system to give suggestions (filter) based on the opinions of other users of the same service (hence the term collaborative). In [12], for example, the authors proposed a filter which asks a small group of users to formulate queries in a special language, in order to determine usefulness. Other collaborative filtering systems have been proposed in [18] and [25]. Even in these cases, active and explicit participation from the user community is required: each user has to rate the content of Usenet news articles. A form of automation is introduced here by applying a k-nearest neighbour algorithm to find groups with similar interests. In [24] rating weights are defined to be proportional to the time spent viewing a page. In [31] Usenet news postings are used to rate the popularity of web sites, creating a list of the top endorsed sites.
In a recent work by Sugiyama [30], user profiles are derived from the choices made after a query is submitted to a search engine and from the contents of the pages selected from the query results. A modified collaborative filtering is then applied to a user-term matrix (instead of the user-item matrix of classic collaborative filtering). The users' term vectors are then clustered to find homogeneous communities. Content-based recommendation systems build a model of the web page contents and compare it with the contents which are of interest to the user.

Collaborative filtering is implicit here, in the sense that the user's choices help to establish the relevance of similar items. The main techniques applied in this field can be grouped into clustering [3] [6], Bayesian networks [6] and rule-based systems [27]. A content-based approach that learns human interests automatically through a divisive hierarchical clustering algorithm has been proposed in [16]. Each page can be assigned to one or more nodes in the hierarchy, which is used for learning and predicting interests: the root is the user's general long-term interest and the leaves represent short-term specific domains. In [28] information coming from multiple information resources is aggregated in order to create a recommendation list in reply to queries in which different query elements can be assigned by the user. An interesting application can be found in [13], which presents a system that shows links of interest in a box integrated into the Internet Explorer browser. Here an ontology is built by clustering vectors of words extracted from web pages. In [23] the computer science ontology described in [22] is used for bootstrapping the current user's interests, in order to overcome the cold start problem that arises when the user is unknown to the system. Documents viewed by the user are associated with a topic by using a variant of the nearest neighbour algorithm. Collaborative filtering is then performed on a user-topic matrix. In another system the content-based approach is combined with collaborative filtering [1]. It ranks web pages through a topic filter and this information is reinforced by the user's feedback. Content-personalized web pages present different information to different users and differ from link personalization, which only adapts the link anchor structure and leaves the substantive information unmodified.
Early studies in [5] present the idea of a newspaper that allows for interactive personalization. In My Yahoo! [21] the user's preferences are collected from explicit indication or semi-automated inference from navigation activity, asking the user to choose from general areas down to more specific topics. The browsing behaviour modelling approach analyzes the interactions between the user and the web. As in [35], web-server logs are used as the data source to track users' browsing patterns within web sites. These logs, which are collected automatically by web server applications, provide information about the activities performed by a user from the moment he/she enters a web site to the moment he/she leaves it [8], including the time spent viewing a page, and allow us to separate browsing sessions. Session clustering is useful to discover both groups of users exhibiting similar browsing patterns and groups of pages with related contents (pages are clustered on the basis of how often they appear together across navigation patterns). Algorithms for session clustering can be classified into two approaches: similarity-based and model-based (or probabilistic) [7]. Compared to similarity-based methods, which assign a user to a cluster only on the basis of a given session similarity measure, model-based methods offer better interpretability: each model directly characterizes the corresponding cluster. Model-based clustering techniques have been widely used and have shown promising results in many applications involving web data [2] [33]. More specifically, in the model-based approach the clusters of user sessions are generated as follows:
1. A user arrives at the web site at a particular time and is assigned to a cluster with some probability. The number of clusters is determined by using probabilistic methods such as BIC (Bayesian Information Criterion), Bayesian approximations, or bootstrap methods [11].
2. The behaviour of each cluster is governed by a statistical model and the user's behaviour is generated from this model. Each cluster has a data-generating model with different components.
Clusters are defined by learning the number of components and the parameters of one or more (in the case of a mixture) probability distribution functions, which are used to assign people to the various clusters. The number of components of the model can be determined by model selection techniques, and the parameters can be estimated using maximum likelihood algorithms, e.g. the EM (Expectation-Maximization) algorithm [9].
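The model-based approach described above can be written compactly as a finite mixture model. The formulas below are the standard textbook forms and are given only as a reminder; the notation (sessions x_i, K clusters, mixing weights \pi_k, component parameters \theta_k, d free parameters, N sessions) is an assumption made here and is not taken from the cited works.

```latex
% Mixture density over a session x with K clusters (components)
p(x \mid \Theta) = \sum_{k=1}^{K} \pi_k \, p_k(x \mid \theta_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

% E-step of EM: probability that session x_i belongs to cluster k
\gamma_{ik} = \frac{\pi_k \, p_k(x_i \mid \theta_k)}
                   {\sum_{j=1}^{K} \pi_j \, p_j(x_i \mid \theta_j)}

% M-step update of the mixing weights over N sessions
\pi_k^{\text{new}} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}

% BIC for a model with d free parameters and maximized likelihood \hat{L}
\mathrm{BIC} = -2 \ln \hat{L} + d \ln N
```

With this convention, candidate values of K are compared by fitting each model with EM and preferring the one with the lowest BIC.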
Other approaches that do not need the user's active participation in the creation of the model are WebWatcher [14] and Letizia [19] [20], which extract information on users from their browsing behaviour. They can be criticized for proposing a persistent model that does not account for changes in the user's interests. For a complete review of the systems based on implicit user participation see [15]. In this paper we propose a web profiling system particularly suitable for improving the services offered by dynamic web sites, whose contents are composed from a repository of documents related to different topics. It combines content-based analysis with browsing behaviour modelling, in the sense that we follow the users during their visits and, on the basis of the contents that they are viewing, we identify their behaviour and consequently their interests. Sometimes people have to answer many questions about preferences or demographic data when they register on a web site. Profiles created in this way are generally static and have to be kept up to date by the users themselves. However, few users are willing to spend time on seemingly useless operations, even if this would ensure better personalization. The results are incomplete, unreliable profiles. The proposed approach does not require human interaction, because it extracts information about user preferences from the contents of the visited web pages. Another advantage of our system is that, being integrated in the content management application, it operates online, collecting the requests made by the user without the need for web server log data. In fact, log files ideally represent a good source of data from which to infer browsing behaviour but in practice, as stated in [2] [33], they have to be cleaned and processed to reconstruct the users' navigation sessions; this process can be very hard and sometimes impossible, for technical reasons mainly concerned with privacy and security procedures that hide personal data. In addition, the world wide web is today migrating towards a dynamic structure, in which pages are not published in simple HTML format but contain executable code and dynamic access to resources. Logs are thus losing their traditional function as lists of requested web pages and are becoming records of the status of content management applications, from which it is difficult to obtain useful information.
The rest of the paper is organized as follows. The general architecture of the system is described in Section 2, where we introduce all of its modules: the Content Extractor, responsible for managing documents and converting them into a machine-tractable form; the Taxonomy Builder, which creates a document hierarchy based on topics; and the Recommendation Manager, which creates the Short and Long Term Profiles of users on the basis of the contents that they view. Sections 3 to 5 give detailed descriptions of each of these modules. Finally, in Section 6 we present our conclusions and future work directions.

2. The Proposed System
In this section we present a synthesis of the architecture of the system that allows us to profile web users dynamically, in order to help them during the navigation process. In Fig. 1 we show the system's main modules: Content Extractor, Taxonomy Builder and Recommendation Manager. The activation of these modules and the data exchanges between them are governed by the super-module UP-DRES, which acts as a supervisor.
Fig. 1. Overview of the system
Some external elements take part in the functioning of UP-DRES. The Document Repository contains all the textual elements that can be used to compose the Web Site pages. The application that manages the Web Site is able to intercept the User's requests and send them to the Document Repository, in order to select the documents to pass to the UP-DRES system for the classification process.

The system, through the Recommendation Manager module, combines the user's Short Term Profile (STP), obtained by analysing the user's behaviour during the current session, with a Long Term Profile (LTP) built as a weighted sum of the user's previously constructed STPs. Typical web pages are composed of text, images, multimedia contents and applications stored in a file system area called the Document Repository. Profiles are obtained by considering the contents of the pages visited by the user. They are used by the Recommendation Manager module to decide, through a matching maximization procedure, which information to present next on the web site, choosing it from the currently available Document Repository. The contents shown on the web page should therefore automatically capture the visitor's preferences, using as indicators of interest the choices made by the user by clicking on a given page and the time spent visiting that page. The system runs on the server side, as a process integrated in the content management system which manages the publication of the web pages. In order to maximize the matching between the user's preferences expressed during navigation and the information currently available in the Document Repository, the Content Extractor periodically browses the Document Repository (offline) to take snapshots of the web site contents and builds a matrix, which is its vector space representation, as described in Section 3. This matrix is then used as input by the Taxonomy Builder module to generate the Web Site Taxonomy, as explained in Section 4. As a visiting session starts, the sequence of pages viewed by the user is processed by the Content Extractor and an STP is dynamically updated at each click. The Recommendation Manager suitably combines the STP with the Long Term Profile, as described in Section 5.
This profile combination produces as output a vector of terms, which is classified according to the Web Site Taxonomy in order to find the taxonomy node whose contents best match the user's browsing behaviour and his/her general interests. The recommendation is therefore made by generating a self-adapting, personalized web site: the contents of the matching class are presented to the user as link suggestions or composed dynamically in a web page. At the end of each session, the STP is integrated into the LTP, which synthesizes the user's browsing history and will be used in the next sessions to refine the recommendation process.
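The online part of this workflow can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the raw term-count weighting, the blending factor alpha, the decay factor used to fold an STP into the LTP and the use of cosine similarity are all assumptions made for illustration, and the taxonomy is reduced to two hand-made nodes instead of the output of hierarchical clustering.

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    """Map a text to raw term counts over a fixed vocabulary (assumed weighting)."""
    counts = Counter(text.lower().split())
    return [float(counts[t]) for t in vocabulary]

def centroid(vectors, weights=None):
    """Weighted mean of equal-length vectors (uniform weights by default)."""
    weights = weights or [1.0] * len(vectors)
    total = sum(weights)
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def combine(stp, ltp, alpha=0.7):
    """Hypothetical blend of profiles: alpha weights the current session."""
    return [alpha * s + (1 - alpha) * l for s, l in zip(stp, ltp)]

def update_ltp(ltp, stp, decay=0.5):
    """Hypothetical weighted sum that folds a finished session into the LTP."""
    return [decay * l + (1 - decay) * s for l, s in zip(ltp, stp)]

def best_node(profile, node_centroids):
    """Return the taxonomy node whose centroid is most similar to the profile."""
    return max(node_centroids, key=lambda n: cosine(profile, node_centroids[n]))

vocabulary = ["football", "match", "stock", "market"]

# Offline stage (simplified): node centroids of a two-node taxonomy.
node_centroids = {
    "sport": centroid([term_vector("football match football", vocabulary),
                       term_vector("match football", vocabulary)]),
    "finance": centroid([term_vector("stock market stock", vocabulary),
                         term_vector("market stock", vocabulary)]),
}

# Online stage: pages viewed in the session, with seconds spent on each.
viewed = [("football match", 40.0), ("stock market", 5.0)]
stp = centroid([term_vector(text, vocabulary) for text, _ in viewed],
               weights=[seconds for _, seconds in viewed])

ltp = [0.0, 0.0, 1.0, 1.0]   # assumed history: past interest in finance
profile = combine(stp, ltp)
suggested = best_node(profile, node_centroids)
ltp_after_session = update_ltp(ltp, stp)
```

In this toy session the time-weighted STP is dominated by the sport page, so the combined profile is matched to the "sport" node even though the LTP alone would point to "finance". Lowering alpha gives more weight to the browsing history.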