Mining Navigation Histories for User Need Recognition

Similar documents
LDA Based Security in Personalized Web Search

International Journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online

Personalization of Web Search With Protected Privacy

Chapter ML:XI. XI. Cluster Analysis

Supporting Privacy Protection in Personalized Web Search

Web Personalization based on Usage Mining

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

How To Cluster On A Search Engine

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

A PREDICTIVE MODEL FOR QUERY OPTIMIZATION TECHNIQUES IN PERSONALIZED WEB SEARCH

Profile Based Personalized Web Search and Download Blocker

ISSN: A Review: Image Retrieval Using Web Multimedia Mining

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113

Identifying the Number of Visitors to improve Website Usability from Educational Institution Web Log Data

Dynamical Clustering of Personalized Web Search Results

An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them

Clustering Technique in Data Mining for Text Documents

Automatic Annotation Wrapper Generation and Mining Web Database Search Result

Remote Usability Evaluation of Mobile Web Applications

PREPROCESSING OF WEB LOGS

Web Log Based Analysis of User s Browsing Behavior

Automated Collaborative Filtering Applications for Online Recruitment Services

2. EXPLICIT AND IMPLICIT FEEDBACK

NOVEL APPROCH FOR OFT BASED WEB DOMAIN PREDICTION

Sentiment analysis on tweets in a financial domain

Importance of Domain Knowledge in Web Recommender Systems

Optimization of Image Search from Photo Sharing Websites Using Personal Data

Research and Development of Data Preprocessing in Web Usage Mining

An Enhanced Framework For Performing Pre- Processing On Web Server Logs

LiDDM: A Data Mining System for Linked Data

Financial Trading System using Combination of Textual and Numerical Data

Understanding Web personalization with Web Usage Mining and its Application: Recommender System

Semantically Enhanced Web Personalization Approaches and Techniques

Enhance Preprocessing Technique Distinct User Identification using Web Log Usage data

Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection

Sustaining Privacy Protection in Personalized Web Search with Temporal Behavior

Customer Classification And Prediction Based On Data Mining Technique

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

A Platform to Support Web Site Adaptation and Monitoring of its Effects: A Case Study

Enriched Links: A Framework For Improving Web Navigation Using Pop-Up Views

Bisecting K-Means for Clustering Web Log data

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Search and Information Retrieval

ISSN: (Online) Volume 2, Issue 3, March 2014 International Journal of Advance Research in Computer Science and Management Studies

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

AN EFFICIENT APPROACH TO PERFORM PRE-PROCESSING

Towards Inferring Web Page Relevance An Eye-Tracking Study

AWERProcedia Information Technology & Computer Science

Recommendations on Web Page Using Domain Knowledge and Web Usage Mining For Personalization

EXTENDING JMETER TO ALLOW FOR WEB STRUCTURE MINING

Supporting Software Development Process Using Evolution Analysis : a Brief Survey

How To Write A Summary Of A Review

Search Result Optimization using Annotators

Information Visualisation and Visual Analytics for Governance and Policy Modelling

BUSINESS ANALYTICS. Overview. Lecture 0. Information Systems and Machine Learning Lab. University of Hildesheim. Germany

Term extraction for user profiling: evaluation by the user

A UPS Framework for Providing Privacy Protection in Personalized Web Search

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Blog Post Extraction Using Title Finding

ANALYZING OF THE EVOLUTION OF WEB PAGES BY USING A DOMAIN BASED WEB CRAWLER

Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms

Web Database Integration

KEYWORD SEARCH IN RELATIONAL DATABASES

A Case Study of Question Answering in Automatic Tourism Service Packaging

WEB SITE OPTIMIZATION THROUGH MINING USER NAVIGATIONAL PATTERNS

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs

Dynamic Data in terms of Data Mining Streams

Comparing Tag Clouds, Term Histograms, and Term Lists for Enhancing Personalized Web Search

Business Lead Generation for Online Real Estate Services: A Case Study

Ontology-Based Filtering Mechanisms for Web Usage Patterns Retrieval

Introduction. A. Bellaachia Page: 1

Viral Marketing in Social Network Using Data Mining

COURSE RECOMMENDER SYSTEM IN E-LEARNING

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Improving Web Page Retrieval using Search Context from Clicked Domain Names

A Survey on Preprocessing of Web Log File in Web Usage Mining to Improve the Quality of Data

Exploiting Tag Clouds for Database Browsing and Querying

Web Mining using Artificial Ant Colonies : A Survey

Optimizing Display Advertisements Based on Historic User Trails

Analysis of Social Media Streams

SELECTING ADS RELEVANT TO LIVE EVENTS TO AN ONLINE AUDIENCE

A Web-based Interactive Data Visualization System for Outlier Subspace Analysis

Web Page Prediction System Based on Web Logs and Web Domain Using Cluster Technique

ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach

How To Solve The Kd Cup 2010 Challenge

Context-aware taxi demand hotspots prediction

Distributed Framework for Data Mining As a Service on Private Cloud

System Behavior Analysis by Machine Learning

Web Mining Functions in an Academic Search Application

Database Marketing, Business Intelligence and Knowledge Discovery

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations

Link Analysis and Site Structure in Information Retrieval

Intinno: A Web Integrated Digital Library and Learning Content Management System

Privacy Protection in Personalized Web Search- A Survey

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

ANALYSIS OF WEB LOGS AND WEB USER IN WEB MINING

Transcription:

Mining Navigation Histories for User Need Recognition Fabio Gasparetti and Alessandro Micarelli and Giuseppe Sansonetti Roma Tre University, Via della Vasca Navale 79, Rome, 00146 Italy {gaspare,micarel,gsansone}@dia.uniroma3.it Abstract. The time spent using a web browser on a wide variety of tasks such as research activities, shopping or planning holidays is relevant. Web pages visited by users contain important hints about their interests, but empirical evaluations show that almost 40-50% of the elements of the web pages can be considered irrelevant w.r.t. the user interests driving the browsing activity. Moreover, pages might cover several different topics. For these reasons they are often ignored in personalized approaches. We propose a novel approach for selectively collecting text information based on any implicit signal that naturally exists through web browsing interactions. Our approach consists of three steps: (1) definition of a DOM-based representation of visited pages, (2) clustering of pages according with a tree edit distance measure and (3) exploiting the acquired evidence about the user behaviour to better filtering out irrelevant information and identify relevant text related to the current needs. A comparative evaluation shows the effectiveness of the proposed approach in retrieving additional web resources related to what the user is currently browsing Keywords: We would like to encourage you to list your keywords within the abstract section 1 Introduction Implicit Feedback techniques monitor the user behavior gathering usage data to build a profile of the user needs and, for this reason, users do not have to explicitly indicate which documents are relevant. Typical sources of usage data are: viewed or edited documents, query histories, emails, purchased items, etc.. Browsing and query histories in particular have been considered in some personalized search systems (e.g. [1]). Search engines toolbars and desktop search tools can easily access that information, which has proven to be very useful to disambiguate query terms and personalize the search results, identifying the current user context [2, 3]. Even though browsing activities are an important source of information to build profiles of the user interests, empirical evaluations show that browsing sessions contain around 40-50% of elements considered irrelevant w.r.t. the user interests driving the browsing activity [4]. HTML pages include noise data, such

2 Fabio Gasparetti and Alessandro Micarelli and Giuseppe Sansonetti as ads and navigation menus. Moreover, pages might cover several different topics. For these reasons they are often ignored in personalized approaches. We propose a novel approach for implicitly recognizing valuable text descriptions of current user needs based on the implicit feedback revealed through web browsing interactions. The remainder of this article describes the process of extraction of relevant cues from usage data collected from browsing sessions. 2 Related Works Early attempts show that query histories and clicked results, namely, title and summary have the chance to recognise the current search context improving the retrieval of relevant information [5, 6]. Further techniques take advantage of those aggregated click-through data extensively collected by popular search engines confirming that hypothesis [5, 7, 8]. But, on the other hand click-through data remain an exclusively advantage of large search engines and, therefore, out of reach of other entities. In [9, 10], the authors propose a preliminary attempt for extracting cues of user information needs based only on the user s most recent activity. In that model, the browsing activity is exploited in order to identify a set of words that characterize the current needs of the user. 3 Identifying User Needs from Browsing Sessions A browsing session is defined as an ordered set of web pages p 1, p 2,... p N that one user visits following the links that bind them. Empirical evidence shows that the external content introduced by hyperlinks sometimes tends to be of high quality and useful. Links convey recommendations and users make judgments about which links to follow according with the potential value of the distal objects w.r.t. their needs. On the web, however, hyperlinks bind documents of varying quality and purposes. Anchors and surrounding text can sometimes introduce noise and degrade potential representations of user current interests. Anchor text is usually vague and imprecise especially if consisting only of a few words or, even worse, these words are just for surfing support. Moreover, if the link is used only to organize information, it conveys no recommendation to the user. Users decide whether or not access the distal content, that is, the page at the other end of the link, analyzing the text snippets associated to links [11]. We assume that if the user decides to follow a link, she is expressing a particular interest that corresponds with her perception of the information source pointed by that link. Because this perception depends on the links anchor, that text can be considered strongly correlated to the current user needs governing the browsing activity. Collecting this information during a browsing session may be valuable for profiling users in personalized systems.

Mining Navigation Histories for User Need Recognition 3 3.1 Web Page Representation A DOM-based tree representation of each HTML page is defined. A pre-processing step involves a syntax checker 1 that cleans up malformed and faulty code. A simplified DOM-based tree is obtained filtering out unnecessary tags and considering the following most relevant ones. A further step aims at generalizing groups of blocks forming a single data region. A group of data records that contains descriptions of a set of similar objects are typically rendered in a contiguous region of a page and formatted using similar tags. The identification of data regions on web pages relies on the approach proposed by Liu et al. [12]. The DOM-based representation is also useful to cluster pages with similar templates. A template corresponds to the set of common layout and format features that appear in a group of pages. A common tree edit distance [13] allows us to compare and cluster together pages presented by the same template. Each cluster is thus represented by a single centroid tree. 3.2 Block Correlation Once each browsed page p is splitted to a set of non-overlapping text fragments, each corresponding to a block id, we begin analyzing pairs of contiguous pages in the browsing history (p i p j ). Given id pi the block containing the link chosen by the user, and the text content t pi of id pi, we perform a search through the content of the pointed page to find one or more correlated text blocks {id pj } by means of a semantic similarity measure [14]. If the similarity between two blocks id pi and id pj is above a given threshold, the tuple: < c(p i ), c(p j ), id pi, id pj > (1) is added to a local knowledge base KB r, where c(p i ) and c(p j ) are the clusters associated with the two pages p i and p j, respectively. The basic assumption of link analysis is that hyperlinks establish relationships between two pages. In our approach, a link from p i to p j indicates that there might be some relationship between one block id pi of page p i and another block id pj of p j. 3.3 Exploiting the Experience The last stage exploits this acquired evidence to retrieve text information related to the current user needs. Given a browsing session {p 1, p 2,..., p N } and the acquired experience KB r, we follow the previous steps in order to obtain the id of the blocks and clusters for each pair of pages (p i p j ), and the potential text extracted from the blocks. The relevance of each term in the extracted text is affected by a boosting factor the is linearly dependent with the number of times the tuple < c(p i ), c(p j ), id pi, id pj > is present in KB r. 1 http://www.w3.org/people/raggett/tidy/

4 Fabio Gasparetti and Alessandro Micarelli and Giuseppe Sansonetti 4 Conclusion The proposed implicit feedback technique extracts information related to the current user needs exploiting the experience implicitly acquired during a browsing activity. We are currently evaluating the proposed approach with a large set of data collected from real scenarios. In this way, it is possible to collect enough information to represent clusters of Web pages with similar templates and provide measures of relatedness between html blocks. Moreover, we will include in the evaluation different approaches to extract information from Web pages, e.g., advertisement removal techniques. Further work includes better techniques to represent the variation of user activity and interests during different sessions to better represent the user contexts (see [15]). References 1. Teevan, J., Dumais, S.T., Horvitz, E.: Personalizing search via automated analysis of interests and activities. In: SIGIR 05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, ACM Press (2005) 449 456 2. Biancalana, C., Gasparetti, F., Micarelli, A., Sansonetti, G.: An approach to social recommendation for context-aware mobile services. ACM Trans. Intell. Syst. Technol. 4(1) (February 2013) 10:1 10:31 3. Biancalana, C., Flamini, A., Gasparetti, F., Micarelli, A., Millevolte, S., Sansonetti, G.: Enhancing traditional local search recommendations with context-awareness. In: Proceedings of the 19th international conference on User modeling, adaption, and personalization. UMAP 11, Berlin, Heidelberg, Springer-Verlag (2011) 335 340 4. Gibson, D., Punera, K., Tomkins, A.: The volume and evolution of web page templates. In: Special interest tracks and posters of the 14th international conference on World Wide Web. WWW 05, New York, NY, USA, ACM (2005) 830 839 5. Sriram, S., Shen, X., Zhai, C.: A session-based search engine. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 04, New York, NY, USA, ACM (2004) 492 493 6. Daoud, M., Tamine-Lechani, L., Boughanem, M., Chebaro, B.: A session based personalized search using an ontological user profile. In: Proceedings of the 2009 ACM symposium on Applied Computing. SAC 09, New York, NY, USA, ACM (2009) 1732 1736 7. Speretta, M., Gauch, S.: Personalized search based on user search histories. In: Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on. (Sept 2005) 622 628 8. Paranjpe, D.: Learning document aboutness from implicit user feedback and document structure. In: Proceedings of the 18th ACM conference on Information and knowledge management. CIKM 09, New York, NY, USA, ACM (2009) 365 374 9. Gasparetti, F., Micarelli, A.: Exploiting web browsing histories to identify user needs. In: Proceedings of the 12th International Conference on Intelligent User Interfaces. IUI 07, New York, NY, USA, ACM (2007) 325 328 10. Gasparetti, F., Micarelli, A., Sansonetti, G.: Exploiting web browsing activities for user needs identification. In: Proceedings of the 2014 International Conference on Computational Science and Computational Intelligence (CSCI 14), IEEE Computer Society, Conference Publishing Services (March 2014)

Mining Navigation Histories for User Need Recognition 5 11. Pirolli, P., Card, S.K.: Information foraging. Psychological Review 106(4) (1999) 643 675 12. Liu, B., Grossman, R., Zhai, Y.: Mining data records in web pages. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. KDD 03, New York, NY, USA, ACM (2003) 601 606 13. Chawathe, S.S.: Comparing hierarchical data in external memory. In: Proceedings of the 25th International Conference on Very Large Data Bases. VLDB 99, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (1999) 90 101 14. Biancalana, C., Gasparetti, F., Micarelli, A., Sansonetti, G.: Social semantic query expansion. ACM Trans. Intell. Syst. Technol. 4(4) (October 2013) 60:1 60:43 15. Biancalana, C., Gasparetti, F., Micarelli, A., Miola, A., Sansonetti, G.: Contextaware movie recommendation based on signal processing and machine learning. In: Proceedings of the 2nd Challenge on Context-Aware Movie Recommendation. CAMRa 11, New York, NY, USA, ACM (2011) 5 10