A PREDICTIVE MODEL FOR QUERY OPTIMIZATION TECHNIQUES IN PERSONALIZED WEB SEARCH

International Journal of Computer Science and System Analysis, Vol. 5, No. 1, January-June 2011, pp. 37-43. Serials Publications, ISSN 0973-7448.

G. Pradeep and S. Prabakaran
II Computer Science and Engineering, Anna University, Tiruchirappalli, Tamilnadu, India
E-mail: spavcp@gmail.com

Abstract: Personalized web search algorithms in current use are not equally effective for different queries, different users, and different search contexts. In this project, a large-scale evaluation framework for personalized search based on query logs was built and used to evaluate five personalized search algorithms (two click-based and three topical-interest-based) on 12 days of Windows Live Search query logs. Analysis of the results shows that personalized web search does not work equally well in all situations: it is a significant improvement over generic web search for some queries, but it has little effect on, or even harms, query performance in other situations. Click entropy is proposed as a simple measure of whether a query should be personalized, and several features are proposed to predict automatically when a query will benefit from a specific personalization algorithm. Experimental results will show that applying a personalization algorithm only to queries selected by this prediction model is better than applying it to all queries. Pages visited by the user are reranked automatically on the basis of the user's click actions.

Keywords: Web search, Reranking factors, User actions, Click entropy, Query optimization.

I. INTRODUCTION

As the amount of information [24] on the Web rapidly increases, it creates many new challenges for Web search. When the same query is submitted by different users, a typical search engine returns the same result, regardless of who submitted the query.
This may not be suitable for users with different information needs. For example, for the query "apple", some users may be interested in documents dealing with apple as a fruit, while other users may want documents related to Apple computers. One way to disambiguate the words in a query is to associate a small set of categories with the query. For example, if the category "cooking" or the category "fruit" is associated with the query "apple", then the user's intention becomes clear. Current search engines such as Google or Yahoo! have hierarchies of categories to help users specify their intentions. The use of hierarchical categories such as the Library of Congress Classification is also common among librarians.

One criticism of search engines is that, when queries are issued, most return the same results to all users. In fact, the vast majority of queries to search engines are short and ambiguous. Different users may have completely different information needs and goals when using precisely the same query. Personalized web search is considered a solution to these problems, since it can provide different search results based on the preferences of users.

A measure of a query with respect to a collection of documents, intended to quantify the query's ambiguity with respect to those documents, was developed by Cronen-Townsend and Croft [4]. This measure, the clarity score, is the relative entropy between a query language model and the corresponding collection language model. The clarity score measures the coherence and specificity of the language used in documents likely to satisfy the query. It was argued that it provides a suitable quantification of the lack of ambiguity of a query with respect to a collection of documents and has potential applications throughout information retrieval. In particular, the clarity score is shown to correlate positively with average precision in evaluations using TREC test collections.
Hence, as one example, the clarity score could serve as a predictor of query performance. Systems would then be able to identify vague information requests and respond to them differently than they would to clear and specific requests.

A reranking algorithm is used to rerank the URLs of the pages a user has visited, based on the user's actions on those pages. User clicks are classified by the action taken, such as saving or bookmarking a page. This procedure is intended to improve the overall performance of personalized web search.

II. RELATED WORK

Current web search engines [5] are built to serve all users, independent of the needs of any individual user. Personalization of web search means carrying out retrieval for each user in a way that incorporates his or her interests. A novel technique was adopted to map a user query to a set of categories that represent the user's search intention. This set of categories can serve as a context to disambiguate the words in the user's query. A user profile and a general profile are learned from the user's search history and from a category hierarchy, respectively. These two profiles are combined to map a user query into a set of categories. Several learning and combining algorithms were evaluated and found to be effective. Among the algorithms for learning a user profile, the Rocchio-based method was chosen for its simplicity, its efficiency, and its ability to adapt. Experimental results indicate that this technique for personalizing web search is both effective and efficient.

Web search engines help users find useful information on the World Wide Web (WWW). However, when different users submit the same query, typical search engines return the same result regardless of who submitted the query [23]. Generally, each user has different information needs for his or her query, so the search results should be adapted to users with different information needs. Experimental results show that search systems that adapt to each user's preferences can be achieved by constructing user profiles based on modified collaborative filtering with detailed analysis of the user's browsing history in one day [23].

One hundred users, one hundred needs.
As more and more topics are being discussed on the web [6] and our vocabulary remains relatively stable, it is increasingly difficult to let the search engine know what we want. Coping with ambiguous queries has long been an important part of research on information retrieval, but it remains a challenging task. Personalized search has recently received significant attention in the web search community as a way to address this challenge, based on the premise that a user's general preference may help the search engine disambiguate the true intention of a query. However, studies have shown that users are reluctant to provide any explicit input on their personal preferences. A study was made of how a search engine can learn a user's preference automatically from her past click history and how it can use that preference to personalize search results. The experiments show that users' preferences can be learned accurately even from little click-history data, and that personalized search based on user preference yields significant improvements over the best existing ranking mechanism in the literature.

The Web is a highly distributed and heterogeneous information environment [10]. The immense number of documents on the Web produces various challenges for search engines, such as storage space, crawling speed, computational speed, and retrieval of the most relevant documents. In this setting, it is important to define the relevance of documents in terms of their popularity and quality. When ranking HTML pages, the quality of a page may be judged by analyzing its content, by measuring its popularity, or by examining its connectivity.

Information retrieval systems [5] (e.g., web search engines) are critical for overcoming information overload. A major deficiency of existing retrieval systems is that they generally lack user modeling and are not adaptive to individual users, resulting in inherently non-optimal retrieval performance.
For example, a tourist and a programmer may use the same word "java" to search for different information, but current search systems would return the same results. The work in [5] studies how to infer a user's interest from the user's search context and how to use the inferred implicit user model for personalized search. It presents a decision-theoretic framework and develops techniques for implicit user modeling in information retrieval. It also develops an intelligent client-side web search agent (UCAIR) that can perform eager implicit feedback, e.g., query expansion based on previous queries and immediate result reranking based on clickthrough information. Experiments on web search show that this search agent can improve search accuracy over the popular Google search engine.

Long-term search history [10] contains rich information about a user's search preferences. A study was made of statistical language modeling methods that mine contextual information from long-term search history and exploit it for more accurate estimates of the query model. The experiments on a web search test collection show that the algorithms are effective in improving retrieval accuracy for both fresh and recurring queries. The best performance is achieved when the combination of related past searches and clickthrough data is used as the main source of search context.

The PC desktop [25] is a very rich repository of personal information that efficiently captures the user's interests. A new approach to automatic personalization of web search was proposed in which user-specific information is extracted from such local desktops, allowing for higher-quality user profiling while sharing less private information with the search engine. More specifically, the approach investigates the opportunities to select personalized query expansion terms for web search using three different desktop-oriented approaches: summarizing the entire desktop data, summarizing only the desktop documents relevant to each user query, and applying natural language processing techniques to extract dispersive lexical compounds from relevant desktop resources. The experiments with the Google API showed that at least the latter two techniques produce a very strong improvement over current web search.

A method for predicting query performance by computing the relative entropy between a query language model and the corresponding collection language model was developed. The resulting clarity score measures the coherence of the language usage in documents whose models are likely to generate the query. It is suggested that clarity scores measure the ambiguity of a query with respect to a collection of documents, and they are shown to correlate positively with average precision in a variety of TREC test sets. Thus, the clarity score may be used to identify ineffective queries, on average, without relevance information.
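As a rough illustration of the clarity score just described, the following Python sketch computes the relative entropy, in bits, between a query language model and a collection language model. The function name `clarity_score`, the representation of both models as plain dictionaries of unigram probabilities, and the toy numbers are assumptions made for this example only. How the query language model is estimated from documents is outside the sketch.

```python
import math


def clarity_score(query_model, collection_model):
    """Relative entropy (in bits) between a query language model and a
    collection language model.

    Both arguments are dicts mapping a word to its probability.
    Hypothetical helper written for illustration only.
    """
    score = 0.0
    for word, p_query in query_model.items():
        if p_query == 0.0:
            continue  # zero-probability terms contribute nothing
        p_coll = collection_model.get(word, 0.0)
        if p_coll == 0.0:
            raise ValueError(f"collection model has no mass for {word!r}")
        score += p_query * math.log2(p_query / p_coll)
    return score


# Toy collection model over a four-word vocabulary (made-up numbers).
collection = {"apple": 0.25, "fruit": 0.25, "computer": 0.25, "pie": 0.25}

# A focused query model concentrates its probability on a few words ...
focused = {"apple": 0.5, "computer": 0.5}
# ... while a vague one is indistinguishable from the collection.
vague = dict(collection)

print(clarity_score(focused, collection))  # 1.0
print(clarity_score(vague, collection))    # 0.0
```

In this toy setting the focused model scores higher than the vague one, which matches the reading of the clarity score as a measure of lack of ambiguity.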
An algorithm was also developed for automatically setting the clarity score threshold between queries predicted to perform poorly and acceptable queries, and it was validated using TREC data. The automatic thresholds were compared with optimum thresholds, and sampling experiments that randomly assign queries to the two classes were used to check how frequently results as good are achieved.

III. EXPERIMENTAL METHODOLOGY

To evaluate the performance of personalized search in a user study, each participant is required to issue a certain number of test queries and determine whether each result is relevant. An advantage of this approach is that the relevance of documents can be explicitly judged by the participants. Unfortunately, the method has drawbacks: constraints on the number of participants and test queries may bias the evaluation of the accuracy and reliability of the personalization algorithm.

3.1. Query Optimization in Personalized Web Search

As discussed in the introduction, short and ambiguous queries such as "apple" can carry different intentions for different users, and personalized web search addresses this by providing different search results based on the preferences of users.

A reranking evaluation framework is constructed by first downloading the search results from the Windows Live Search engine and then using the selected personalization algorithm to rerank them.

Operation steps in the proposed system:

1. Download the top 50 search results from the search engine for the query string.
2. Compute a personalized score for each item using a personalization algorithm and generate a rank list in which the result items are sorted in descending order of personalized score.
3. Combine the two rank lists (the search engine's original ranking and the personalized ranking) and generate the final rank list, which is returned to the user in personalized search.
4. Record the ranks of the clicked URLs in a log entry and use these click events to evaluate the performance of the query.

3.2. Features Used to Predict Query Performance

3.2.1 Click Entropy

Click entropy is a direct indication of query click variation. If all users click only one identical page for a query, ClickEntropy(q) = 0. A small click entropy means that the majority of users agree with each other on a small number of web pages. In such cases, there is no need to do any personalization. A large click entropy indicates that many web pages were clicked for the query. This can mean one of two things. Either a user has to select several pages to satisfy his information need, which means the query is most likely an informational query, or different users make different selections for this query, which means that the query is ambiguous.

3.2.2 Click Diversity

The goal of personalized web search is to return different results to different users according to their preferences. A direct way to identify whether users have different preferences on a query is to check the click diversity of users. Click entropy is one such measure of click diversity. For a given query, suppose there are K users who have issued this query and M documents that were clicked for it. The click frequency is then calculated for each user on each document and represented in a K x M user-document matrix X. Each element x(k, m) = c indicates that user k clicked document m a total of c times. If the user has not clicked the document, then x(k, m) = 0.

3.2.3 Concept Diversity

Another way to identify the diversity of user preferences over a query is to measure the concept/topic diversity of the clicked documents.
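Before turning to concept categories, the click-based measures of Sections 3.2.1 and 3.2.2 can be made concrete. The Python sketch below is a minimal illustration under stated assumptions: the click-log format (tuples of user, query, clicked URL), the function names, and the sample data are all hypothetical. Click entropy is computed with the usual Shannon form, ClickEntropy(q) = sum over pages p of P(p|q) log2(1 / P(p|q)), where P(p|q) is the fraction of clicks on query q that went to page p. The text does not spell this formula out, so it is an assumption, but it agrees with the description: the value is 0 when all clicks fall on one page and grows as clicks spread over more pages.

```python
import math
from collections import Counter, defaultdict


def click_entropy(clicks, query):
    """Shannon entropy (bits) of the clicked-page distribution of a query."""
    pages = Counter(url for _, q, url in clicks if q == query)
    total = sum(pages.values())
    if total == 0:
        return 0.0  # no clicks recorded: treated as no variation (assumption)
    probs = [count / total for count in pages.values()]
    return sum(p * math.log2(1.0 / p) for p in probs)


def user_document_matrix(clicks, query):
    """Build the K x M matrix X, where X[k][m] counts how often
    user k clicked document m for the given query."""
    counts = defaultdict(Counter)
    for user, q, url in clicks:
        if q == query:
            counts[user][url] += 1
    users = sorted(counts)
    docs = sorted({url for per_user in counts.values() for url in per_user})
    matrix = [[counts[u][d] for d in docs] for u in users]
    return users, docs, matrix


# Hypothetical click log: (user, query, clicked URL).
clicks = [
    ("u1", "apple", "apple.example/mac"),
    ("u1", "apple", "apple.example/mac"),
    ("u2", "apple", "fruit.example/apple"),
    ("u2", "apple", "fruit.example/apple"),
    ("u1", "msn", "msn.example"),
    ("u2", "msn", "msn.example"),
]

print(click_entropy(clicks, "apple"))  # 1.0
print(click_entropy(clicks, "msn"))    # 0.0
print(user_document_matrix(clicks, "apple")[2])  # [[2, 0], [0, 2]]
```

In this toy log, everyone clicks the same page for "msn", so its entropy is 0 and personalization has nothing to add, while the clicks for "apple" split evenly between two pages and two users, giving a higher entropy and a matrix whose rows differ.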
Each document can be classified into one or more concept/topic categories. A document-concept matrix is used to represent the categories of documents.

3.2.4 ExRatio

Users usually reformulate ambiguous queries, and a common reformulation is adding terms to the original query. The feature ExRatio is extracted from this information. Num of sessions is the number of sessions in which the query appears. Num of sessions Ex is the number of sessions in which the query appears and at least one extended query also appears. ExRatio is calculated as

ExRatio = Num of sessions Ex / Num of sessions

3.2.5 Isfirstqueryinsession

If a query is the first query of a session, S-Topic cannot work for it.

3.2.6 Hasqueryhistory

This feature indicates whether the query has been issued in the past.

3.2.7 Avgclktimes

This feature is the average number of historical clicks per query for the query string. If users usually click multiple results for a query, the query is more likely to be ambiguous or informational.

3.3. Personalization Algorithms

The personalization algorithms rerank search results by computing a personalized score for each document in the results returned for each user query. Two strategies are implemented as person-level reranking.

3.3.1. Historical Click Based Algorithm

For a query submitted by a user, the web pages that the user frequently clicked in the past are considered more relevant to that user than those seldom clicked. Based on this, a personalized score can be computed for each page. A disadvantage of this approach is that it is not applicable to new queries that the user has never asked. In the data set, about one third of the test queries are issued more than once by the same user, and this approach benefits only those queries.

3.3.2. User-Topical Interest Based Algorithm

In this personalization method, a user's long-term topical interests are represented as a vector, the user profile. When a user submits a query, each returned web page is first mapped to a category vector. Then the similarity between the user profile vector and the page category vector is computed.

Table 1: Basic Statistics of Data Set

Item               All      Training  Test
#days              12       11        1
#users             10,000   10,000    1,792
#queries           55,937   51,334    4,639
#distinct queries  34,203   31,777    3,465
#clicks            93,566   85,642    7,924
#clicks/#queries   1.6727   1.6683    1.7081
#sessions          49,839   45,981    3,865

3.3.3. Group Level Reranking

A K-nearest-neighbor collaborative filtering (CF) algorithm is implemented as a representative of group-based personalization. User similarity is computed from long-term user profiles.

3.4. Performance of Proposed Algorithms

Table 2: Overall Performance of Strategies

Method    All queries        Non-optimal queries  Optimal queries
WEB       69.4669            47.2623              100.0000
P-Click   70.4350 (+1.39%)   49.0051 (+3.69%)     99.9029 (-0.10%)
L-Topic   69.0445 (-0.61%)   47.2570 (-0.01%)     99.0040 (-1.00%)
S-Topic   68.0799 (-2.00%)   46.5008 (-1.61%)     97.7529 (-2.25%)
LS-Topic  69.0578 (-0.59%)   47.2486 (-0.03%)     99.0471 (-0.95%)
G-Click   70.4168 (+1.37%)   48.9728 (+3.62%)     99.9040 (-0.10%)

Table 2 shows that the topical-interest-based strategies perform less well than the click-based strategies and the baseline.

3.4.1. Performance on Repeated Queries

In this framework, 46 per cent of test queries are repeated, and 33 per cent of queries are repeated by the same user. This shows that users often resubmit a query and review results they have already seen. Repeated clicks can be predicted from a user's historical queries and clicks.

Table 3: Performance of Repeated Queries

          All queries           Non-optimal queries
Method    Repeated   Not-rep.   Repeated   Not-rep.
WEB       84.7758    59.9799    46.6285    47.4013
P-Click   87.3162    59.9799    55.9090    47.4013
L-Topic   84.8394    59.2563    48.4746    46.9741
S-Topic   84.4529    57.9335    48.1471    46.1184
LS-Topic  84.8485    59.2722    48.3539    46.9919
G-Click   87.2685    59.9799    55.7377    47.4013

Table 4: Performance of Self Repeated Queries

          All queries               Non-optimal queries
Method    Self-repeated  Not-rep.   Self-repeated  Not-rep.
WEB       85.6337        63.2697    45.7215        47.4858
P-Click   89.1264        63.2997    59.4750        47.4858
L-Topic   85.7578        62.6378    48.4923        47.0778
S-Topic   85.4445        61.4236    47.7202        46.3240
LS-Topic  85.7508        62.6589    48.1993        47.1107
G-Click   89.0627        63.2793    59.1086        47.5025

Figure 1: Performance of Queries Based on User Actions

Figure 1 shows the performance analysis of queries based on user actions.

IV. CONCLUSIONS

The click-based algorithms in this framework work only for repeated queries, but they are stable and simple. The topical-interest-based personalized search algorithms implemented were not as stable as the click-based ones under this framework. They could improve search

accuracy for some queries, but they harmed performance for more queries. Another important conclusion from this framework is that personalization does not work equally well in all situations. Results show that personalized web search yields significant improvements over generic web search for queries with high click entropy. Not all queries should be handled in the same manner. No personalization algorithm outperforms the others for all queries; different methods have different strengths and weaknesses. The main objective, optimizing the query and thereby increasing the effectiveness of personalized web search, is achieved. A promising direction for future work is to predict automatically which algorithm should be used for a given query and to combine the strengths of different personalization methods.

References

[1] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz, "Analysis of a Very Large Web Search Engine Query Log," ACM SIGIR Forum, 33(1), 6-12, 1999.
[2] B. J. Jansen, A. Spink, and T. Saracevic, "Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web," Information Processing and Management, 36(2), 207-227, 2000.
[3] R. Krovetz and W. B. Croft, "Lexical Ambiguity and Information Retrieval," Information Systems, 10(2), 115-141, 1992.
[4] S. Cronen-Townsend and W. B. Croft, "Quantifying Query Ambiguity," Proc. Second Int'l Conf. Human Language Technology Research (HLT '02), pp. 94-98, 2002.
[5] X. Shen, B. Tan, and C. Zhai, "Implicit User Modeling for Personalized Search," Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM '05), pp. 824-831, 2005.
[6] F. Qiu and J. Cho, "Automatic Identification of User Interest for Personalized Search," Proc. 15th Int'l World Wide Web Conf. (WWW '06), pp. 727-736, 2006.
[7] J. Teevan, S. T. Dumais, and E.
Horvitz, "Beyond the Commons: Investigating the Value of Personalizing Web Search," Proc. Workshop New Technologies for Personalized Information Access (PIA), 2005.
[8] J. Pitkow, H. Schütze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, E. Adar, and T. Breuel, "Personalized Search," Comm. ACM, 45(9), 50-55, 2002.
[9] A. Pretschner and S. Gauch, "Ontology Based Personalized Search," Proc. 11th IEEE Int'l Conf. Tools with Artificial Intelligence (ICTAI '99), pp. 391-398, 1999.
[10] B. Tan, X. Shen, and C. Zhai, "Mining Long-Term Search History to Improve Search Accuracy," Proc. 12th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '06), pp. 718-723, 2006.
[11] G. Jeh and J. Widom, "Scaling Personalized Web Search," Proc. 12th Int'l World Wide Web Conf. (WWW '03), pp. 271-279, 2003.
[12] P. Ferragina and A. Gulli, "A Personalized Search Engine Based on Web-Snippet Hierarchical Clustering," Special Interest Tracks and Posters of the 14th Int'l Conf. World Wide Web (WWW '05), pp. 801-810, 2005.
[13] J. Teevan, S. T. Dumais, and E. Horvitz, "Personalizing Search via Automated Analysis of Interests and Activities," Proc. 28th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '05), pp. 449-456, 2005.
[14] J. T. Sun, H. J. Zeng, H. Liu, Y. Lu, and Z. Chen, "CubeSVD: A Novel Approach to Personalized Web Search," Proc. 14th Int'l World Wide Web Conf. (WWW '05), pp. 382-390, 2005.
[15] F. Liu, C. Yu, and W. Meng, "Personalized Web Search by Mapping User Queries to Categories," Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM '02), pp. 558-565, 2002.
[16] P. A. Chirita, W. Nejdl, R. Paiu, and C. Kohlschütter, "Using ODP Metadata to Personalize Search," Proc. 28th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '05), pp. 178-185, 2005.
[17] A. Broder, "A Taxonomy of Web Search," ACM SIGIR Forum, 36(2), 3-10, 2002.
[18] U. Lee, Z. Liu, and J. Cho, "Automatic Identification of User Goals in Web Search," Proc. 14th Int'l World Wide Web Conf. (WWW '05), pp. 391-400, 2005.
[19] X. Shen, B. Tan, and C. Zhai, "Context-Sensitive Information Retrieval Using Implicit Feedback," Proc. 28th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '05), pp. 43-50, 2005.
[20] Windows Live Search, http://www.live.com, 2006.
[21] J. R. Wen, Z. Dou, and R. Song, "Personalized Web Search," Encyclopedia of Database Systems, 2009.
[22] J. M. Carroll and M. B. Rosson, "Paradox of the Active User," Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, 80-111, 1987.
[23] K. Sugiyama, K. Hatano, and M. Yoshikawa, "Adaptive Web Search Based on User Profile Constructed without Any Effort from Users," Proc. 13th Int'l World Wide Web Conf. (WWW '04), pp. 675-684, 2004.
[24] F. Liu, C. Yu, and W. Meng, "Personalized Web Search for Improving Retrieval Effectiveness," IEEE Trans. Knowledge and Data Eng., 16(1), 28-40, 2004.
[25] P. A. Chirita, C. Firan, and W. Nejdl, "Summarizing Local Context to Personalize Global Web Search," Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), 2006.
[26] J. Chaffee and S. Gauch, "Personal Ontologies for Web Navigation," Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM '00), pp. 227-234, 2000.
[27] S. Gauch, J. Chaffee, and A. Pretschner, "Ontology-Based Personalized Search and Browsing," Web Intelligence and Agent Systems, 1(3/4), 219-234, 2003.
[28] J. Trajkova and S. Gauch, "Improving Ontology-Based User Profiles," Proc. Recherche d'Information Assistée par Ordinateur (RIAO '04), pp. 380-389, 2004.
[29] M. Speretta and S. Gauch, "Personalized Search Based on User Search Histories," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence (WI '05), pp. 622-628, 2005.
[30] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," technical report, Computer Science Dept., Stanford Univ., 1998.
[31] T. H. Haveliwala, "Topic-Sensitive PageRank," Proc. 11th Int'l World Wide Web Conf. (WWW), 2002.
[32] T. Sarlós, A. A. Benczúr, K. Csalogány, D. Fogaras, and B. Rácz, "To Randomize or Not to Randomize: Space Optimal Summaries for Hyperlink Analysis," Proc. 15th Int'l World Wide Web Conf. (WWW '06), pp. 297-306, 2006.
[33] F. Tanudjaja and L. Mui, "Persona: A Contextualized and Personalized Web Search," Proc. 35th Hawaii Int'l Conf. System Sciences (HICSS '02), 3, 53, 2002.
[34] J. S. Breese, D. Heckerman, and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering," Proc. 14th Conf. Uncertainty in Artificial Intelligence (UAI '98), pp. 43-52, 1998.
[35] P. A. Chirita, C. S. Firan, and W. Nejdl, "Personalized Query Expansion for the Web," Proc. 30th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '07), pp. 7-14, 2007.
[36] J. Teevan, S. T. Dumais, and D. J. Liebling, "To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent," Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '08), 2008.
[37] S. Cronen-Townsend, Y. Zhou, and W. B. Croft, "Predicting Query Performance," Proc. 25th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '02), pp. 299-306, 2002.