Acquisition of User Profile for Domain Specific Personalized Access 1




Plaban Kumar Bhowmick, Samiran Sarkar, Sudeshna Sarkar, Anupam Basu
Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur, India
{plaban, samiran, sudeshna, anupam}@cse.iitkgp.ernet.in

Abstract. The Internet is a large and ever-growing collection of structured and unstructured data, including audio, video and text. As a consequence, a standard search engine often returns a huge number of references to web pages in response to a user's query. Finding information in this huge pool is tedious and time-consuming. A model of the user's interest is needed in order to identify relevant information personalized to his needs. In this paper, we describe our representation of the user's profile, which is used to identify relevant information for the user. We also describe our system for acquiring and maintaining the user's profile so that the system can adapt to shifts in the user's interest. For each user, the system maintains a separate user profile that should gradually converge to the actual interest of the user.

1 Introduction

The Internet is a rapidly growing source of information spanning many domains. Finding information relevant to a user's query in this huge pool is a challenging task, because a personalized search interface is lacking. In a general-purpose search engine, the results returned for a query are the same for all users, irrespective of their background and interests. For example, in response to the query "reflection", a standard search engine returns results from varying domains: among the first 9 results returned by Google, only three are related to the domain of physics. This is because the keyword "reflection" has different meanings in different contexts and for different categories of users.
We have developed a system that school students can use to query the Internet and retrieve personalized information. The domain of our interest is school-level topics. In this domain, users belonging to different grades and abilities have different requirements. The requirement of a class seven student is somewhat different from that of a tenth standard student, so the same set of documents may not be understandable to students at every level. Traditional search engines do not address this issue. There is a need for a system that personalizes the view of the global database according to the personal preferences of its users. To deliver the appropriate set of documents to the user, the system needs knowledge of the underlying domain, of the concepts of interest to the user, and of her level of knowledge. To achieve this, both the domain knowledge about the different subjects and the user's requirements need a proper representation.

In this paper, we describe the representation of the domain knowledge and its use in modeling and acquiring the user's profile. In Section 3, we describe the different aspects of domain knowledge and its role in acquiring the user's interest. In Section 4, we discuss the model of user interest that helps provide users with relevant information.

1 This research work is funded in part by Media Lab Asia, under the auspices of the Communication Empowerment Laboratory, IIT Kharagpur.

2 Related Work

Retrieving personalized information from the information space of the Internet is a broad area of research. Several works look at various aspects of personalized systems. They differ in the way they represent the user profile and in the adaptivity of the system. We call a personalized system adaptive if it is able to tune itself to the requirements of the user. Among adaptive systems, the algorithm for learning the user's interest varies from system to system. The presence of domain knowledge also distinguishes the systems from one another.

FAB[1] is a content-based recommendation system in which the user profile is represented as a list of keywords and is maintained by relevance feedback.

SIFT[2] provides users with personalized information by modeling user interest. The user profile is represented by a weighted vector of keywords and is specified during the subscription procedure. No adaptation is apparent here.
ifweb[3] is a user-model-based intelligent agent that supports navigation of the WWW and document search according to the needs of the user. User profiles are represented as weighted semantic networks. The nodes and the relations between them are derived from the co-occurrence of related terms in documents. The user profile is updated by relevance feedback.

SmartPush[4] is a personalized news delivery system that depends on a special type of content authoring. Each document is augmented with metadata in an ontological form that describes its content. The user profile is represented by a hierarchy of concepts (an ontology) and is either created explicitly by the user or chosen from a set of default profiles. The user profile and the metadata of each document are matched to decide on the relevance of the document.

WebWatcher[5] assists the user like a tour guide while she browses the World Wide Web, suggesting relevant hyperlinks by analyzing past experience. The user profile is represented as a list of keywords and is provided at the beginning of the tour. Each link is annotated with the interest of the user if she selects the link. However, no adaptation of the user profile is supported.

Syskill & Webert[6] provides both search and recommendation facilities.

Its user profile is a set of unrelated classes, each of which represents the content of an index page. Each class contains a separate profile, which is a list of boolean features.

Letizia[7] is an agent that automates a browsing strategy consisting of a best-first search augmented by heuristics that infer user interest from browsing behavior. The observation process is passive in the sense that it sits idle when the user browses. It provides recommendations on demand.

In [8], the user profile is represented as a hierarchy of concepts adopted from a reference ontology. Ten documents are assigned to each concept, and a super-document vector for the concept is generated from them. The user profile is maintained automatically by observing the surfing behavior of the user. The surfed pages are classified to the appropriate concept node by measuring the cosine similarity[9] between the document vector and the vector assigned to the concept node. The time spent on a page plays an important role in calculating the weights.

All the systems described above address personalization based on an ontology of a general domain. The domain of our interest, school topics, has a well-defined structure, and our ontology has been tuned to that structure. To identify relevant documents for a school student, a system needs not only to identify the important concepts in the domain but also to consider whether a concept is easily understandable to the student given her background. The systems mentioned above do not fulfill this need completely. Our work describes how these issues can be handled by representing the domain knowledge and the user interest in a special way.

3 Domain Knowledge

The knowledge representation database, the ontology[10], is organized into a three-level hierarchical structure, as shown in Figure 1.

Fig. 1.
Structure of Ontology

Topic-Subtopic Level: At the top level, the topics share a parent-child relationship. This provides a way of generalizing from a specific topic to a more general one. The hierarchy of topics is stored as an n-ary tree, with the exception that a node may have multiple parents. This is because a subtopic may be placed under two or more topics. For example, in the domain of biology, animal nutrition and plant nutrition are two subtopics of the topic nutrition.

Concept Level: A topic consists of several concepts, which form the next level of the ontology, and a concept may belong to one or more topics. A set of empirical relations can be defined among the concepts in a domain. We notice that if a concept is significant in a document, the document usually contains a number of references to related concepts. For example, if a document contains material relevant to reflection in optics, it will have references to related concepts such as light, ray, mirror, lens and angle of incidence. The breadth and depth of the ontology are used by the ranking algorithm, because concepts that are directly or remotely connected to the concepts in the query are used in calculating the document scores. In fact, the occurrence of related concepts is taken as a very strong indication of the relevance of the document. Pages that do not contain related concepts are suspect and may be spurious. The relations stored in the ontology are very important for this reason. To keep the system simple, the relations must be broad and general. The chosen list of relations must also cover the most important forms of relation that occur, so that the ranking process has a sufficiently good ontological web.

To capture the strength of a relation, we introduce the notion of distance between two concepts. This distance is not symmetric. The distances have been devised and tuned experimentally for each domain. The types of relations in the context of a domain are listed in Table 1. The concepts in the domain are organized into a directed graph (digraph). The existence of an edge between two concepts in the digraph indicates that the concepts are related. Each edge is assigned a weight that depends on the relation by which the two concepts are related.
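The topic hierarchy with multiple parents and the concept digraph with asymmetric distances can be pictured with a small data structure. The following is only a minimal sketch, not the system's actual storage format: the class `Ontology`, its method names, the second parent "plant physiology", and all distance values are hypothetical, and it assumes that a smaller distance means a stronger relation.

```python
from collections import defaultdict


class Ontology:
    """Toy model of the topic hierarchy and the concept digraph (illustrative only)."""

    def __init__(self):
        self.topic_parents = defaultdict(set)   # topic -> parent topics (possibly several)
        self.topic_concepts = defaultdict(set)  # topic -> concepts under that topic
        self.edges = defaultdict(dict)          # concept -> {related concept: (relation, distance)}

    def add_subtopic(self, parent, child):
        self.topic_parents[child].add(parent)

    def add_concept(self, topic, concept):
        self.topic_concepts[topic].add(concept)

    def relate(self, source, target, relation, distance):
        # Directed edge: the distance from source to target may differ
        # from the distance in the opposite direction.
        self.edges[source][target] = (relation, distance)

    def ancestors(self, topic):
        """All more general topics reachable through parent links."""
        seen, stack = set(), [topic]
        while stack:
            for parent in self.topic_parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def related(self, concept, max_distance):
        """Directly connected concepts whose distance is at most max_distance."""
        return {c: d for c, (_, d) in self.edges[concept].items() if d <= max_distance}


# Hypothetical sample data; relation names are taken from Table 1, distances are made up.
onto = Ontology()
onto.add_subtopic("nutrition", "animal nutrition")
onto.add_subtopic("nutrition", "plant nutrition")
onto.add_subtopic("plant physiology", "plant nutrition")  # assumed second parent
onto.add_concept("optics", "reflection")
onto.relate("reflection", "light", "Has Prerequisite", 1)
onto.relate("reflection", "mirror", "Functionally Related", 2)
onto.relate("reflection", "lens", "Functionally Related", 3)
onto.relate("light", "reflection", "Functionally Related", 4)  # asymmetric on purpose

print(sorted(onto.ancestors("plant nutrition")))
print(onto.related("reflection", max_distance=2))
```

Storing parents as a set per topic keeps the tree-like lookups simple while still allowing a subtopic to sit under more than one topic.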
The weight is an indication of the strength of the relationship.

Table 1. Inter-concept relationships

Relations
  Has Part
  Inherited From
  Has Prerequisite
  Functionally Related
  Part Of
  Procedure
  Is Caused By

The relations in Table 1 provide a way of storing the structure of a domain without storing any information about a particular concept. This structure may be used as a conceptual co-occurrence function that shows which concepts can logically co-occur. These relations make it possible to find the concepts that are close to a particular concept, and this information may be used in many ways.

Keyword Level: Each domain has a set of keywords, each associated with one or more concepts. The keyword list also contains the specificity index of each keyword with respect to each concept it is associated with. The specificity index stores the likelihood that the keyword represents a particular concept. These keywords are used to extract concepts from documents and queries. Associating keywords with concepts has several advantages. First, different keywords with the same meaning are mapped to a common concept, which removes the ambiguity caused by synonyms.

4 Model of User Interest

The interest of the user must be modeled in order to filter web documents according to the user's needs. Students belonging to the same class have a common set of interests defined by the curriculum. We have therefore defined a set of group profiles that represent the syllabus. The model of the user's interest is captured in the form of a user profile that is derived from the group profiles by default. However, the individual interest of a user can vary from the predefined group profiles. We define two types of attributes to model two different aspects of the user's interest.

1. Domain knowledge specific attributes: These attributes capture the interest of the user in terms of the knowledge of the domain. They adopt the same ontological structure as the domain knowledge. Each concept in the user profile is further annotated with a score that reveals the interest value of the concept.

2. Information presentational attributes: The format and the view of a document largely depend on the personal preferences of the user. For example, the user may like to have images in the presented document, or may like to view the document with her personal color preferences. So there is a need to personalize the presentational view of the document. These attributes are used by the transcoding module of the system when the document is presented.

4.1 Creation and Maintenance of the User Profile

The user profile is acquired in two phases.
In the first phase, the user is asked explicitly to provide her initial profile as a goal. The user can also update the profile manually. Static scores are assigned to each of the concepts chosen in this way. The user may not be able to enumerate all her interests initially, so the user's browsing history is used to update her profile. The next phase (user profile acquisition) monitors the browsing behavior of the user and, with the help of the content analysis scheme, gradually discovers the concepts of the user's interest.

4.2 Profile Editing and Monitoring Architecture

In Figure 2, we present an architecture for the creation and automatic updating of the user profile.
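The two phases of Section 4.1 can be pictured with a short sketch. It is an illustration under stated assumptions rather than the system's implementation: the function names, the value of `STATIC_SCORE`, the dictionary layout of a profile, and the additive way in which discovered scores are merged are all hypothetical.

```python
STATIC_SCORE = 10.0  # assumed value of the fixed score for explicitly chosen concepts


def create_initial_profile(group_profile, chosen_topics):
    """Phase 1: copy the chosen topics from a group profile and give each concept a static score."""
    return {
        topic: {concept: STATIC_SCORE for concept in group_profile[topic]}
        for topic in chosen_topics
    }


def merge_discovered(profile, discovered):
    """Phase 2: fold concept scores discovered from browsing into the profile.

    Adding the new score to the old one is an assumption made for this sketch.
    """
    for (topic, concept), score in discovered.items():
        concepts = profile.setdefault(topic, {})
        concepts[concept] = concepts.get(concept, 0.0) + score
    return profile


# Hypothetical group profile representing part of a syllabus.
group_profile = {
    "optics": ["reflection", "refraction"],
    "nutrition": ["photosynthesis"],
}

profile = create_initial_profile(group_profile, ["optics"])
merge_discovered(profile, {
    ("optics", "reflection"): 2.5,         # concept already in the profile
    ("nutrition", "photosynthesis"): 1.0,  # concept discovered while browsing
})
print(profile)
```

Keeping the static phase and the monitoring phase as separate functions mirrors the split between the profile editor and the monitoring agent.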

Fig. 2. Profile editing and monitoring architecture

4.3 Static User Profile Creation

Profile Editor: In Figure 3, we provide an interface that helps the student create her profile by consulting several group profiles.

Fig. 3. User profile creation and updating interface

The interface provides the following facilities:
- Choosing a predefined group profile.
- Adding a new topic into the profile from a predefined group profile.
- Adding a subset of concepts under a topic.

The student can also update her profile statically. The following operations are provided for updating a user profile:
- Deleting a topic and all the concepts under the topic.
- Deleting a subset of concepts under a topic.

- Adding new concepts from a topic.

The concepts chosen by the user are strong indicators of her interest, so these concepts should get higher interest scores. We adopt a fixed scoring scheme to score them:

Score(C) = S, where C ∈ concepts from the static user profile and S is a constant representing the fixed score.

4.4 User Profile Acquisition

We have identified four possible data sources, listed below, that can be used to learn the user's preferences once the user starts using the system.
1. Query history.
2. Usage log of the user.
3. Previous state of the user profile.
4. Content of the documents scanned by the user.

4.4.1 Learning from Query Pattern

The pattern in which the user places her queries reveals much about her interest. The system monitors all queries placed by a user and periodically updates the score of each concept according to how frequently the concept appears in the queries. This yields a weighted list of concepts, from which we choose the concepts with the higher frequencies. We then scan each concept in this filtered list. If the concept already exists in the user profile, its score is increased using the scoring scheme below. If the concept is new to the user profile, it is simply annotated with the calculated score. For each concept appearing in the queries, we find the related concepts that occur in the current user profile and compute:

Score = (concept frequency + related concept score) / period

4.4.2 Learning from Browsing Pattern

The monitoring agent monitors the browsing pattern to capture the concepts of the user's interest. The browsing pattern of the user is maintained in the form of a usage log. The usage log is analyzed to obtain the Web Access Graph, which represents the browsing graph of the user for a particular result provided in response to a query.

User Log Analysis: The usage log offers some important clues that we can exploit:
1. The files accessed by the user in her session.
2. The time that the user spends on a particular document.

Access Graph: We represent the browsing session of the user as a directed graph called the Access Graph (AG). The graph is the trace of the navigation pattern of the user. Each node of this graph represents a browsed page and has the following fields:
- Time of access, defined as follows:

$t_a$ = time of access for the page

The idiosyncrasies of user access behavior should be kept in mind when calculating $t_a$. The user may explore a link and return without spending a sufficient amount of time on it. Conversely, the user may keep a page open for a long time while she is busy with other work. We therefore set two threshold values that bound the range of access times accepted as reasonable.

- The set of links, with three types of labeling:
  o explored_fruitful
  o explored_unfruitful
  o unexplored

Concepts derived from anchor text of links: The set of all links in an accessed page is divided into three sets.

$\Phi_{\text{explored-fruitful}}$ = set of all concepts appearing in the explored and fruitful links.
$\xi_{\text{explored-unfruitful}}$ = set of all concepts appearing in the explored but unfruitful links.
$\Psi_{\text{unexplored}}$ = set of all concepts appearing in the unexplored links.

$$\omega_{ef}(c) = \text{static score} \cdot \beta \cdot (t_a / \text{length}), \quad c \in \Phi_{\text{explored-fruitful}} \tag{1}$$
$$\omega_{eu}(c) = \text{static score} \cdot \phi \cdot (t_a / \text{length}), \quad c \in \xi_{\text{explored-unfruitful}} \tag{2}$$
$$\omega_{un}(c) = -\,\text{static score} \cdot \phi \cdot (t_a / \text{length}), \quad c \in \Psi_{\text{unexplored}} \tag{3}$$

where $\beta$ and $\phi$ are score-emphasizing factors that are tuned empirically, with $\beta > \phi$. The final vector is

$$\Omega_{ln}(c) = w_1 \, \omega_{ef}(c) + w_2 \, \omega_{eu}(c) + w_3 \, \omega_{un}(c) \tag{4}$$

Concepts derived from the content: Here we derive the interest scores of the concepts that are present in a document. We call the concepts that are relevant to the domain concerned Direct Interesting Concepts (DIC). The score of a DIC depends not only on its frequency but also on the related concepts that are present in the previous state of the user profile. We call these related concepts Indirect Interesting Concepts (IIC). The score of each concept is derived by the following formulas:

$$\Omega_{DIC} = \text{frequency}(DIC) \tag{5}$$
$$\Omega_{DIC} = \Omega_{DIC} + \sum_{i=1}^{n} \frac{1}{d_{DIC\text{-}IIC_i}} \, \omega_{IIC_i} \tag{6}$$

where
$n$ = number of related concepts,
$d_{DIC\text{-}IIC_i}$ = distance of the relation between the DIC and the $i$-th IIC,
$\omega_{IIC_i}$ = weight of the $i$-th IIC in the previous state of the user profile.

Score Accumulation: To get the final scores of the concepts explored during the browsing of the results, the access graph is transformed into an Access Tree (AT). The depth at which a particular document is accessed plays an important role here, because browsing a page at a greater depth increases the interestingness value of the page. We preprocess the graph before generating the AT. Certain types of links are removed because they do not contribute much to the score accumulation process. The types of links that can be pruned are:
- Self-referential links.
- Links that form a cycle.

In both cases, the links are converted into simple text. In the remaining graph, between the root page and each individual page, we find the path of maximum length and ignore the other links, which fall on duplicate paths. The scores of the concepts during the browsing of one result are derived by the following formula:

$$\text{Final interest vector} = \text{interest vector at root} + \sum_{i=1}^{n} \text{depth}_i \cdot \text{interest vector at child}_i \tag{7}$$

4.4.3 User Feedback

This part of the acquisition of the user profile depends on user feedback. When the user is presented with a set of results, she is explicitly asked to rate each result she has gone through as interesting, not interesting, or ok. The final interest vector and the final dislike vector are annotated with the explicit user feedback by the following expression:

Final feedback vector = feedback_score * final interest vector

where feedback_score ∈ {1.5 (interesting), 0.5 (not interesting), 1 (ok)}.

4.4.4 Concept Age Monitoring

We assume that a concept that has been referred to infrequently in the past will be referred to in the near future with lower probability. For this purpose we have introduced the notion of aging. The age of a concept in the user profile increases when the user logs on to the system but does not refer to the concept. Concepts with higher ages represent concepts of lower interest to the user. Here we define an interest decay factor that depends on the age of the concept and the number of sessions in which the user has logged in:

µ = (age of the concept) / (number of sessions)

5 Conclusion and Future Work

Modeling a user's interest is a challenging task. The idiosyncrasies in user behavior make the problem an order of magnitude harder. Here we have adopted a hybrid model that combines a content-based and an access-based approach.
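The access-based half of this hybrid model (the link-label scores, the depth-weighted accumulation of equation (7), the feedback factor and the decay factor µ) can be sketched in a few lines. This is a reading aid under stated assumptions, not the authors' implementation: the function names and the default values of `beta` and `phi` are hypothetical, `length` is assumed to be the document length, and since the text does not say how µ is applied to a score, the sketch only computes it.

```python
# Feedback factors as given in Section 4.4.3.
FEEDBACK = {"interesting": 1.5, "ok": 1.0, "not interesting": 0.5}


def link_score(static_score, t_a, length, label, beta=2.0, phi=1.0):
    """Score of one concept found in a link's anchor text, by link label.

    beta and phi are placeholder values chosen only so that beta > phi.
    """
    base = static_score * (t_a / length)
    if label == "explored_fruitful":
        return beta * base
    if label == "explored_unfruitful":
        return phi * base
    if label == "unexplored":
        return -phi * base
    raise ValueError(f"unknown link label: {label}")


def accumulate(root_vector, children):
    """Equation (7): root interest vector plus depth-weighted child vectors.

    children is a list of (depth, interest_vector) pairs taken from the access tree.
    """
    total = dict(root_vector)
    for depth, vector in children:
        for concept, score in vector.items():
            total[concept] = total.get(concept, 0.0) + depth * score
    return total


def apply_feedback(vector, rating):
    """Scale an interest vector by the explicit feedback factor."""
    factor = FEEDBACK[rating]
    return {concept: factor * score for concept, score in vector.items()}


def decay_factor(age, sessions):
    """Interest decay factor: age of the concept divided by the number of sessions."""
    return age / sessions


# Made-up numbers for one browsing session.
fruitful = link_score(10.0, 30.0, 300.0, "explored_fruitful")
unexplored = link_score(10.0, 30.0, 300.0, "unexplored")
session = accumulate(
    {"reflection": 2.0},
    [(1, {"reflection": 1.0, "mirror": 3.0}), (2, {"lens": 0.5})],
)
rated = apply_feedback(session, "interesting")
mu = decay_factor(age=3, sessions=12)
print(fruitful, unexplored, session, rated, mu)
```

Representing interest vectors as plain dictionaries keeps the accumulation step independent of how the access tree itself is stored.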
The presence of domain knowledge makes the process of acquiring the user's interest somewhat simpler and more robust. The acquired user profile still needs to be compared with the actual user interest. As future work, we have to devise a criterion by which we can estimate the time needed for the acquired user interest to converge to the actual user interest. There should also be evaluation criteria that show how close the acquired profile is to the actual user profile.

References

1. M. Balabanovic and Yoav Shoham. Fab: Content-Based, Collaborative Recommendation. In Communications of the ACM, Vol. 40, No. 3, pages 66-72, March 1997.
2. Tak W. Yan and H. Garcia-Molina. SIFT: A Tool for Wide-Area Information Dissemination. In Proceedings of the 1995 USENIX Technical Conference, pages 177-186, 1995.
3. Fabio A. Asnicar and Carlo Tasso. ifWeb: a Prototype of User Model-Based Intelligent Agent for Document Filtering and Navigation in the World Wide Web. In Proceedings of the workshop "Adaptive Systems and User Modeling on the World Wide Web", Sixth International Conference on User Modeling, Chia Laguna, Sardinia, 2-5 June 1997.
4. T. Kurki, S. Jokela, R. Sulonen and M. Turpeinen. Agents in Delivering Personalized Content Based on Semantic Metadata. In Proc. 1999 AAAI Spring Symposium Workshop on Intelligent Agents in Cyberspace, pages 84-93, Stanford, USA, 1999.
5. T. Joachims, D. Freitag and T. Mitchell. WebWatcher: A Tour Guide for the World Wide Web. In Proc. IJCAI 97, August 1997.
6. M. Pazzani, J. Muramatsu and D. Billsus. Syskill & Webert: Identifying Interesting Web Sites. In Proc. 13th National Conference on Artificial Intelligence, 1996.
7. Henry Lieberman. Letizia: An Agent that Assists Web Browsing. In Proc. International Joint Conference on Artificial Intelligence, Montreal, Canada, August 1995.
8. Alexander Pretschner and Susan Gauch. Ontology Based Personalized Search. In Proc. 11th Intl. Conf. on Tools with Artificial Intelligence, pages 391-398, November 1999.
9. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman Publishing Co.
10. Michael Gruninger and Jintae Lee. Ontology Applications and Design. In Communications of the ACM, Vol. 45, No. 2, pages 39-41, February 2002.