Acquisition of User Profile for Domain Specific Personalized Access 1

Plaban Kumar Bhowmick, Samiran Sarkar, Sudeshna Sarkar, Anupam Basu

Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur, India
{plaban, samiran, sudeshna, anupam}@cse.iitkgp.ernet.in

Abstract. The Internet is a large and ever-increasing database of structured and unstructured data, including audio, video and text. As a consequence, a standard search engine often returns a huge number of references to web pages in response to a user's query. Finding information in this huge pool is tedious and time-consuming. A model of the user's interest is needed in order to identify relevant information personalized to his needs. In this paper, we describe our representation of the user's profile, which is used to identify relevant information for the user. We also describe our system for acquiring and maintaining the user's profile so that the system can adapt itself to shifts in the interest of the user. For each user, the system maintains a separate user profile that should gradually converge to the actual interest of the user.

1 Introduction

The Internet is an enormous and growing source of information belonging to varying domains. Finding information relevant to a user's query in this huge pool is a challenging task, because personalized search interfaces are largely lacking. In a general-purpose search engine, the results returned for a search query are the same for all users, irrespective of their background and interests. For example, in response to the query "reflection", a standard search engine returns results from varying domains: among the first 9 results returned by Google, only three are related to the domain of physics. This is because the keyword "reflection" has different meanings in different contexts and for different categories of users.
We have developed a system that school students can use to query the Internet and retrieve personalized information. The domain of our interest is school-level topics. In this domain, users belonging to different grades and with different abilities have different requirements. The requirement of a class seven student is somewhat different from that of a tenth standard student, so the same set of documents may not be understandable to students of every level. Traditional search engines do not address this issue. There is a need to develop a system that personalizes the view of the global database depending on the personal preferences of the users. To deliver the appropriate set of documents to the user, the system needs knowledge of the underlying domain, of the concepts of interest to the user, and of her level of knowledge. To achieve this, there is a need for a proper representation of the domain knowledge about the different subjects and of the knowledge about the user's requirements.

In this paper, we describe the representation of the domain knowledge and its use in modeling and acquiring the user's profile. In Section 3, we describe different aspects of domain knowledge and its role in the process of acquiring the user's interest. In Section 4, we discuss the model of user interest that helps in providing users with relevant information.

1 This research work is funded in part by Media Lab Asia, under the auspices of the Communication Empowerment Laboratory, IIT Kharagpur.

2 Related Work

Retrieving personalized information from the information space of the Internet is a broad area of research. Several research works look at various aspects of personalized systems. They differ in the way they represent the user profile and in the adaptivity of the system. We call a personalized system adaptive if it is able to tune itself according to the requirements of the user. Among adaptive systems, the algorithm for learning the user's interest varies from system to system. The presence of domain knowledge also distinguishes the systems from one another.

FAB[1] is a content-based recommendation system in which the user profile is represented as a list of keywords and is maintained by relevance feedback. SIFT[2] provides users with personalized information by modeling user interest. The user profile is represented by a weighted vector of keywords and is specified during the subscription procedure. No adaptation is apparent here.
ifweb[3] is a user-model-based intelligent agent that supports navigation of the WWW and document search according to the needs of the user. The user profiles are represented as weighted semantic networks. The nodes and the relations between nodes are derived from the co-occurrence of related terms in documents. The user profile is updated by relevance feedback. SmartPush[4] is a personalized news delivery system that depends on a special type of content authoring. Each document is augmented with metadata in an ontological form that describes its content. The user profile is represented by a hierarchy of concepts, or ontology, and is either created explicitly by the user or chosen from a set of default profiles. The user profile and the metadata of each document are matched to decide on the relevance of the document. WebWatcher[5] assists the user like a tour guide while she browses the World Wide Web, suggesting relevant hyperlinks by analyzing past experience. The user profile is represented as a list of keywords and is provided at the beginning of the tour. Each link that the user selects is annotated with her interest. However, no adaptation of the user profile is supported. Syskill & Webert[6] provides both search and recommendation facilities.

In Syskill & Webert, the user profile is a set of unrelated classes, each of which represents the content of an index page. Each class contains a separate profile, which is a list of boolean features. Letizia[7] is an agent that automates a browsing strategy consisting of a best-first search augmented by heuristics that infer user interest from browsing behavior. The observation process is passive in the sense that it sits idle when the user browses. It provides recommendations on demand. In [8], the user profile is represented as a hierarchy of concepts adopted from a reference ontology. Each concept is assigned ten documents, from which a super-document vector is generated for that concept. The user profile is maintained automatically by observing the surfing behavior of the user. The surfed pages are classified to the appropriate concept node by measuring the cosine similarity[9] between the document vector and the vector assigned to the concept node. The time spent on a page plays an important role in calculating the weights.

All the systems described above address the issue of personalization based on an ontology of a general domain. The domain of our interest, i.e., the domain of school topics, has a well-defined structure, and our ontology has been tuned to that structure. To identify relevant documents for a school student, a system needs not only to identify the important concepts in the domain but also to consider whether a concept is easily understandable to a student given her background. The above-mentioned systems do not fully meet this need. Our work describes how these issues can be handled by representing the domain knowledge and the user interest in a special way.

3 Domain Knowledge

The knowledge representation database, the ontology[10], is organized into a three-level hierarchical structure, as shown in Figure 1.

Fig. 1. Structure of Ontology

Topic-Subtopic Level: At the top level, the topics share a parent-child relationship. This provides a way of generalizing from a specific topic to a more general one. The hierarchy of topics is stored as an n-ary tree, with the exception that a node may have multiple parents. This is because a subtopic may be placed under two or more topics. For example, in the domain of biology, animal nutrition and plant nutrition are two subtopics of the topic nutrition.

Concept Level: A topic consists of several concepts, which form the next level of the ontology, and a concept may belong to one or more topics. A set of empirical relations can be defined among the concepts in a domain. We notice that if a concept is significant in a document, the document usually contains a number of references to related concepts. The breadth and depth of the ontology are used by the ranking algorithm, because concepts that are directly or remotely connected to the concepts in the query are used to calculate the document scores. In fact, the occurrence of related concepts is taken as a very strong indication of the relevance of a document. Pages that do not contain related concepts are suspect and may be spurious. The relations stored in the ontology are very important for this reason. To keep the system simple, the relations must be broad and general. The chosen list of relations must also cover the most important forms of relation that occur, so that the ranking process has a sufficiently rich ontological web. For example, if a document contains material relevant to reflection in optics, it will have references to related concepts such as light, ray, mirror, lens and angle of incidence.

To capture the strength of a relation, we introduce the notion of distance between two concepts. This distance is not symmetric. The distances have been devised and tuned experimentally for each domain. The types of relations in the context of a domain are listed in Table 1. The concepts in the domain are organized into a directed graph (digraph). The existence of an edge between two concepts in the digraph indicates that the concepts are related. Each edge is assigned a weight that depends on the relation by which it links the two concepts.
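A minimal Python sketch of such a concept digraph is given here for illustration. The distance values in `RELATION_DISTANCE`, the example edges, and the rule that the distance along a path is the sum of its edge distances are all assumptions; the text only states that distances are asymmetric and tuned experimentally per domain.

```python
import heapq
from collections import defaultdict

# Hypothetical distance per relation type (not taken from the paper).
RELATION_DISTANCE = {
    "Has Part": 1.0,
    "Has Prerequisite": 2.0,
    "Functionally Related": 3.0,
}


class ConceptGraph:
    """Directed graph of concepts with relation-dependent distances."""

    def __init__(self):
        self.edges = defaultdict(dict)  # source -> {target: distance}

    def relate(self, source, relation, target):
        # Only the source -> target direction is stored, so the reverse
        # distance may differ or may not exist at all.
        self.edges[source][target] = RELATION_DISTANCE[relation]

    def related_concepts(self, start, max_distance):
        """Return {concept: distance} for concepts within max_distance.

        Assumption: path distance is the sum of edge distances, so this
        is a bounded shortest-path search (Dijkstra).
        """
        best = {start: 0.0}
        queue = [(0.0, start)]
        while queue:
            dist, node = heapq.heappop(queue)
            if dist > best.get(node, float("inf")):
                continue  # stale queue entry
            for nxt, step in self.edges[node].items():
                cand = dist + step
                if cand <= max_distance and cand < best.get(nxt, float("inf")):
                    best[nxt] = cand
                    heapq.heappush(queue, (cand, nxt))
        del best[start]
        return best


graph = ConceptGraph()
# Illustrative edges only; the real ontology is built by domain experts.
graph.relate("reflection", "Has Prerequisite", "light")
graph.relate("reflection", "Functionally Related", "mirror")
graph.relate("light", "Has Part", "ray")
graph.relate("mirror", "Functionally Related", "reflection")

near_reflection = graph.related_concepts("reflection", max_distance=3.0)
near_light = graph.related_concepts("light", max_distance=3.0)
```

In this toy graph, `near_reflection` contains light, mirror and ray, while `near_light` contains only ray, because no edge leads from light back to reflection. This is one way the asymmetry of the distance can show up.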
The weight is an indication of the strength of the relationship.

Table 1. Inter-concept relationships

Relations: Has Part, Inherited From, Has Prerequisite, Functionally Related, Part Of, Procedure, Is Caused By

The relations in Table 1 provide a way of storing the structure of a domain without storing any information about a particular concept. This structure may be used as a conceptual co-occurrence function that shows which concepts can logically co-occur. These relations make it possible to find the concepts that are close to a particular concept, and this information may be used in many ways.

Keyword Level: Each domain has a set of keywords, together with the concepts that the keywords are associated with. This list also contains the specificity index of each keyword with respect to each of the concepts it is associated with. The specificity index stores the likelihood of the keyword representing a particular concept. These keywords are used to extract concepts from documents and queries. The association of keywords with concepts has several advantages. In particular, different keywords having the same meaning are mapped to a common concept, which removes the ambiguity caused by synonymous keywords.

4 Model of User Interest

There is a need to model the interest of the user in order to filter web documents with respect to the user's needs. Students belonging to the same class have a common set of interests that is defined by the curriculum. We have therefore defined a set of group profiles that represent the syllabus. The model of the user's interest is captured in the form of a user profile, which by default is derived from the group profiles. The individual interest of a user can, however, vary from the predefined group profiles. We define two types of attributes to model two different aspects of the user's interest.

1. Domain knowledge specific attributes: These attributes capture the interest of the user in terms of the knowledge of the domain. The same ontological structure as the domain knowledge is adopted for representing the domain knowledge specific attributes of the user profile. Each concept in the user profile is further annotated with a score indicating the user's interest in that concept.

2. Information presentational attributes: The format and the view of a document largely depend on the personal preferences of the user. For example, the user may like to have images in the presented document, or may like to view the document with his personal color preferences. So there is a need to personalize the presentational view of the document. These attributes are used by the transcoding module of the system during presentation of the document.

4.1 Creation and Maintenance of the User Profile

The user profile is acquired in two phases.
In the first phase, the user is asked explicitly to provide her initial profile as a goal. The user can also update the profile manually. Static scores are assigned to each of the concepts in this profile. The user may not be able to enumerate all her interests initially, so the user's browsing history is used to update her profile. The next phase (user profile acquisition) monitors the browsing behavior of the user and, with the help of the content analysis scheme, gradually discovers the concepts of interest to the user.

4.2 Profile Editing and Monitoring Architecture

In Figure 2, we present an architecture for the creation and automatic updating of the user profile.
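As a rough illustration of the profile that this architecture creates and maintains, the sketch below models the two attribute types of Section 4 and the explicit first phase of Section 4.1. The class and function names, the presentational key `show_images`, and the value of `STATIC_SCORE` are hypothetical; the text does not specify them.

```python
from dataclasses import dataclass, field

STATIC_SCORE = 1.0  # hypothetical value of the fixed score for chosen concepts


@dataclass
class UserProfile:
    # Domain knowledge specific attributes: topic -> {concept: interest score}
    interests: dict = field(default_factory=dict)
    # Information presentational attributes (keys are illustrative)
    presentation: dict = field(default_factory=dict)

    def add_concepts(self, topic, concepts):
        """First phase: explicitly chosen concepts receive the static score."""
        scores = self.interests.setdefault(topic, {})
        for concept in concepts:
            scores[concept] = STATIC_SCORE

    def remove_topic(self, topic):
        """Manual update: drop a topic and all concepts under it."""
        self.interests.pop(topic, None)


def from_group_profile(group_profile, presentation=None):
    """Derive a default user profile from a group profile (topic -> concepts)."""
    profile = UserProfile(presentation=dict(presentation or {}))
    for topic, concepts in group_profile.items():
        profile.add_concepts(topic, concepts)
    return profile


# Illustrative group profile standing in for a syllabus.
class_seven_physics = {"optics": ["reflection", "mirror"], "heat": ["temperature"]}
profile = from_group_profile(class_seven_physics, {"show_images": True})
profile.add_concepts("optics", ["lens"])
profile.remove_topic("heat")
```

The second phase would then adjust the scores in `interests` from the monitored browsing behavior rather than from explicit input.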

Fig. 2. Profile editing and monitoring architecture

4.3 Static User Profile Creation

Profile Editor: In Figure 3, we provide an interface that helps the student create her profile by consulting several group profiles.

Fig. 3. User profile creation and updating interface

The interface provides the following facilities:

- Choosing a predefined group profile.
- Adding a new topic into the profile from a predefined group profile.
- Adding a subset of concepts under a topic.

The student can also update her profile statically. The following operations are provided for updating a user profile:

- Deleting a topic and all the concepts under the topic.
- Deleting a subset of concepts under a topic.

- Adding new concepts from a topic.

The concepts chosen by the user are strong indicators of her interest, so these concepts should get higher interest scores. We adopt a fixed scoring scheme to score these concepts:

Score(C) = S, where C ∈ concepts of the static user profile and S = a constant representing the fixed score.

4.4 User Profile Acquisition

We have identified four possible data sources, listed below, that can be used to learn the user's preferences as the user starts using the system.

1. Query history.
2. Usage log of the user.
3. Previous state of the user profile.
4. Content of the documents scanned by the user.

4.4.1 Learning from Query Pattern

The pattern in which the user places her queries reveals much of the user's interest. The system monitors all queries placed by a user and periodically updates the score of each concept according to the frequency with which the concept appears in the queries. Thus we get a weighted list of concepts. From this list we choose the concepts with higher frequencies. We then scan each concept in this filtered list. If a concept from the list already exists in the user profile, its score is increased using the scoring scheme below. If the concept is new to the user profile, it is simply annotated with the calculated score. For each concept appearing in the queries, we find the related concepts that occur in the current user profile and compute

Score = (concept frequency + related concept score) / period

4.4.2 Learning from Browsing Pattern

The monitoring agent monitors the browsing pattern to capture the concepts of interest to the user. The browsing pattern of the user is maintained in the form of a usage log. The usage log is analyzed to obtain the Web Access Graph, which represents the browsing graph of the user for a particular result provided in response to a query.

User Log Analysis: There are some important clues in the usage log that we can exploit:

1. The files accessed by the user in her session.
2. The time that the user spends on a particular document.

Access Graph: We represent the browsing session of the user as a directed graph called the Access Graph (AG). The graph is the trace of the navigation pattern of the user. Each node of this graph represents a browsed page and has the following fields:

Time of access, defined as follows:

$t_a$ = time of access for the page

The idiosyncrasies of the user's access behavior should be kept in mind when calculating $t_a$. The user may explore a link and return without spending a sufficient amount of time on the page. Conversely, the user may keep a page open for a long time while she is busy with some other work. We therefore set two threshold values, which bound the range of access times accepted as reasonable.

The set of links, with three types of labeling:

o explored_fruitful
o explored_unfruitful
o unexplored

Concepts derived from anchor text of links: The links in an accessed page are divided into three sets, and the concepts appearing in them are grouped accordingly.

$\Phi_{\text{explored-fruitful}}$ = set of all concepts appearing in the explored and fruitful links.
$\xi_{\text{explored-unfruitful}}$ = set of all concepts appearing in the explored but unfruitful links.
$\Psi_{\text{unexplored}}$ = set of all concepts appearing in the unexplored links.

$$\omega_{ef}(c) = \text{static score} \cdot \beta \cdot (t_a / \text{length}), \quad c \in \Phi_{\text{explored-fruitful}} \tag{1}$$

$$\omega_{eu}(c) = \text{static score} \cdot \phi \cdot (t_a / \text{length}), \quad c \in \xi_{\text{explored-unfruitful}} \tag{2}$$

$$\omega_{un}(c) = -\,\text{static score} \cdot \phi \cdot (t_a / \text{length}), \quad c \in \Psi_{\text{unexplored}} \tag{3}$$

where $\beta$ and $\phi$ are score emphasizing factors that are tuned empirically, with $\beta > \phi$. The final vector is

$$\Omega_{ln}(c) = w_1 \, \omega_{ef}(c) + w_2 \, \omega_{eu}(c) + w_3 \, \omega_{un}(c) \tag{4}$$

Concepts derived from the content: Here we derive the interest scores of the concepts that are present in a document. We call the concepts that are relevant to the domain concerned Direct Interesting Concepts (DICs). The score of a DIC depends not only on its frequency but also on the concepts related to it that are present in the previous state of the user profile. We call these related concepts Indirect Interesting Concepts (IICs). The score for each concept is derived by the following formulas, the first giving the initial score and the second adding the contribution of the related concepts:

$$\Omega_{DIC} = \text{frequency}(dic) \tag{5}$$

$$\Omega_{DIC} = \Omega_{DIC} + \sum_{i=1}^{n} \left( \frac{1}{d_{DIC\text{-}IIC_i}} \cdot \omega_{IIC_i} \right) \tag{6}$$

where $n$ = number of related concepts, $d_{DIC\text{-}IIC}$ = distance of the relation between the DIC and the IIC, and $\omega_{IIC}$ = weight of the IIC in the previous state of the user profile.

Score Accumulation: To get the final scores of the concepts explored during the browsing of the results, the access graph is transformed into an Access Tree (AT). The depth at which a particular document is accessed plays an important role, as browsing a page at greater depth increases the interestingness value of the page. We preprocess the graph before generating the AT. Certain types of links are removed because they do not contribute much to the score accumulation process. The types of links that can be pruned are:

- Self-referential links.
- Links that form a cycle.

In both cases, the links are converted into simple text. From the remaining graph, we find the path of maximum length between the root page and each individual page and ignore the other links that fall on duplicate paths. The scores of concepts during the browsing of one result are derived by the following formula:

$$\text{Final interest vector} = \text{interest vector at root} + \sum_{i=1}^{n} \left( \text{depth}_i \cdot \text{interest vector at child}_i \right) \tag{7}$$

4.4.3 User Feedback

This process of acquisition of the user profile depends on user feedback. When the user is presented with a set of results, she is explicitly asked to rate each of the results she has gone through as one of {interesting, not interesting, ok}. The final interest vector and the final dislike vector are annotated with the explicit user feedback by the following expression:

Final feedback vector = feedback_score * final interest vector

where feedback_score ∈ {1.5 (interesting), 0.5 (not interesting), 1 (ok)}.

4.4.4 Concept Age Monitoring

We assume that a concept that has been referred to infrequently in the past will be referred to in the near future with lower probability. For this we have introduced the notion of aging. The age of a concept in the user profile increases when the user logs on to the system but does not refer to the concept. Concepts with higher ages represent concepts of lower interest to the user. Here we define an interest decay factor that depends on the age of the concept and the number of sessions the user has logged in:

µ = (age of the concept) / (number of sessions)

5 Conclusion and Future Work

The modeling of a user's interest is a challenging task. The idiosyncrasies in user behavior make the problem an order of magnitude harder. Here we have adopted a hybrid model that combines content-based and access-based approaches.
The presence of domain knowledge makes the process of acquiring the user's interest somewhat simpler and more robust. There is a need to compare the actual user interest with the acquired user profile. As future work, we have to devise a criterion by which we can estimate the time needed for the acquired user interest to converge to the actual user interest. There should also be evaluation criteria to show how close the acquired profile is to the actual user profile.

References

1. M. Balabanovic and Yoav Shoham. FAB: Content Based Collaborative Recommendation. In Communications of the ACM, Vol. 40, No. 3, pages 66-72, March 1997.
2. Tak W. Yan and H. Garcia-Molina. SIFT: A Tool for Wide-Area Information Dissemination. In Proceedings of the 1995 USENIX Technical Conference, pages 177-186, 1995.
3. Fabio A. Asnicar, Carlo Tasso. ifweb: a Prototype of User Model-Based Intelligent Agent for Document Filtering and Navigation in the World Wide Web. In Proceedings of the workshop "Adaptive Systems and User Modeling on the World Wide Web", Sixth International Conference on User Modeling, Chia Laguna, Sardinia, 2-5 June 1997.
4. T. Kuki, S. Jokela, R. Sulonen and M. Turpeinen. Agents in Delivering Personalized Content Based on Semantic Metadata. In Proc. 1999 AAAI Spring Symposium Workshop on Intelligent Agents in Cyberspace, pages 84-93, Stanford, USA, 1999.
5. T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: A Tour Guide for the World Wide Web. In Proc. IJCAI 97, August 1997.
6. M. Pazzani, J. Muramatsu, and D. Billsus. Syskill & Webert: Identifying Interesting Web Sites. In Proc. 19th National Conference on Artificial Intelligence, 1996.
7. Henry Lieberman. Letizia: An Agent that Assists Web Browsing. In Proc. International Conference on Artificial Intelligence, Montreal, Canada, August 1995.
8. Alexander Preschner and Susan Gauch. Ontology Based Personalized Search. In Proc. 11th Intl. Conf. on Tools with Artificial Intelligence, pages 391-398, November 1999.
9. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman Publishing Co.
10. Michael Gruninger and Jintae Lee. Ontology Applications and Design. In Communications of the ACM, Vol. 45, No. 2, pages 39-41, February 2003.