Enterprise Search Solutions Based on Target Corpus Analysis and External Knowledge Repositories


Enterprise Search Solutions Based on Target Corpus Analysis and External Knowledge Repositories

Thesis submitted in partial fulfillment of the requirements for the degree of M.S. by Research in Computer Science and Engineering

by

NIHAR SHARMA

Search and Information Extraction Lab
International Institute of Information Technology
Hyderabad, INDIA

March 2011

Copyright © Nihar Sharma, 2011
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled Enterprise Search Solutions Based on Target Corpus Analysis and External Knowledge Repositories by Nihar Sharma, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Adviser: Prof. Vasudeva Varma

To my parents

Acknowledgments

I most sincerely express my gratitude to Dr. Vasudeva Varma for his guidance and vision, which have led to the realization of this thesis. He has not only been the mentor for this work but also a role model for me in the research domain. I also thank Dr. Prasad Pingali for his ever helpful insights and technical expertise during the early phase of this thesis. I thank my friends Neha, Guru, Rahul & Rahul, Abhijit and Jaideep for their moral support and companionship during this part of my life.

Abstract

A large portion of the data that resides in Enterprises exists in the form of unstructured textual data. Unstructured data refers to collections of natural language text in the form of word documents, HTML pages, plain text files, among others. Structured data, on the other hand, refers to relational databases, XML documents, etc., where there is a well articulated scheme to represent and store the data. Unstructured data is easy for humans to produce and comprehend, whereas it is not trivial for machines and electronic applications to extract information from it. Because unstructured data lacks structure and patterns, full text search has to be carried out in order to target its specific portions for retrieving particular information. Between the Internet and personal machines lie the Intranets and data archives of various Enterprise establishments, which have their own set of guidelines for search. A typical enterprise search system would benefit from the thoroughness of desktop search but would not want to suffer from the inherent slowness of the typical on-the-fly, linear scans used in desktop search. An Internet search approach, on the other hand, may be too generic to produce precise results even though it would be advantageous for speedy indexing and retrieval. The difference and uniqueness in the composition of enterprise data makes Enterprise search an entirely different beast. This thesis concentrates on enhancing the relevance of results in query based document retrieval in Enterprise Search (ES). Approaches to implement effective ES are different from standard document retrieval and web search. A number of such past and on-going works in the ES research area are discussed in this thesis. Our approaches address two important open research areas in ES, namely, user context aware ES and resolving the issues of vocabulary bias and narrowness in ES.

In user context aware ES, we concentrate on a user context aware computerized work environment, to facilitate better understanding of a user's information requirements in ES systems. Issues of vocabulary bias and narrowness arise when there is a difference between the vocabulary used for queries and that used in the content (though the meaning could be the same). In any form of document retrieval, the result-set is strongly query dependent. As a query represents the information need of a user, a misrepresenting or ambiguously formed query dilutes the result quality. We try to reduce the difference between the vocabulary used in constructing the query and that used in enterprise content to resolve this issue. We enhance ES by concentrating on two basic aspects of information retrieval: query expansion and re-ranking. User and work context aware ES is implemented through a user role based query expansion using local & global analysis techniques and by re-ranking the result-set with a role based document classification. External lexicon and thesaurus repositories constructed from open source on-line encyclopedias such as Wikipedia are used as complementary knowledge bases for implementing the query expansion countering the vocabulary bias issues in the enterprise. This is accomplished by extracting and using a Wikipedia concept thesaurus from the article link structure of Wikipedia for query expansion. The results are re-ranked using large manually tagged sets rich in categories from all domains of Wikipedia. These tagged sets are used to train a classifier which classifies search results into various enterprise document classes. Documents belonging to the dominant classes in the result-sets are given higher preference in re-ranking. We also introduce a broader vocabulary range into the result-set with collection enrichment techniques, using Wikipedia article text for pseudo relevance feedback. Both supervised and unsupervised classification techniques are used to determine good feedback documents from the pseudo-relevant set.

We evaluate our approaches on two datasets, namely the IIIT-H corpus and the CERC dataset. Role based personalization in ES required a dataset with role based topics and relevance judgments. As no such dataset existed, we customized the IIIT-H intranet data for the same. The rest of the techniques were evaluated on a standard ES evaluation platform, i.e. the CERC dataset provided by the TREC Enterprise track. Role based personalization shows improvements for both local and global analysis based query expansion techniques as well as for the re-ranking, compared to plain text retrieval. Wikipedia concept thesaurus based query expansion reveals gains in recall figures without diluting the precision of search results. Wikipedia category network based re-ranking shows moderate improvements in the precision figures for the results. Improvements are also observed in Wikipedia article text based pseudo relevance feedback in comparison to blind relevance feedback without enriching the pseudo relevant set. Between unsupervised and supervised classification based approaches, the latter shows better results, with up to 11% improvement in mean average precision (MAP) figures.

Contents

1 Introduction
   Web versus Enterprise Search
   Research Problems in Enterprise Search
      Vocabulary bias and narrowness
      Heterogeneous collection
      User and work context
      Test collections
   Problem Definition
      User role and information need
      Query expansion with Wikipedia concept thesaurus
      Re-ranking with Wikipedia category network
      Pseudo relevance feedback with Wikipedia based collection enrichment
      Common theme of the thesis
   Organization of Thesis
2 Background
   Enterprise Search
   Related Work: User Roles and Search
   Related Work: Query Expansion in Enterprise Search
      Thesaurus based techniques
      Pseudo relevance feedback
   Related Work: Re-ranking
   Related Work: Wikipedia and Enterprise search
      Wikipedia link structure and concept thesaurus
      Wikipedia categories and re-ranking
      Wikipedia as lexicon resource
   Summary
3 Role Based Personalization of Enterprise Search
   Enterprise Roles and Search
   Role Based Tagging and Indexing
      Role tagged dataset
      Role classifier
      Role based co-occurrence thesaurus
      Indexing
   Query Formulation
      Boolean logic
      Parametric search
   Role Based Query Expansion
      Global analysis techniques
      Local analysis techniques
   Role Based Re-ranking
   Experiments and Evaluations
      Data set: IIIT intranet
      Roles
      Search Engine and Query Formulation
      Query Expansion
      Indexing
      Retrieval and Re-ranking
      Results
      Comments
   Summary
4 Query Expansion and Re-ranking with domain independent knowledge
   ES and Wikipedia
   Wikipedia knowledge-base and structure
      Wikipedia corpus
   Wikipedia Concept Vocabulary
      Concepts
      Article names
      Redirects
      Anchor-text
      Vocabulary set
   Wikipedia Concept Thesaurus
      Two way link analysis
      Link co-category analysis
      Relation confidence score
   Concept Representation of Search Query
   Query Expansion
   Enterprise Document Classes
      Mapping document classes to Wikipedia categories
      Training corpus from Wikipedia articles
   Re-ranking search results
   Experiments and Evaluation
      CSIRO dataset
      Query Expansion with Concept Thesaurus
         Baseline Setup
         Concept Thesaurus setup
         Results
         Comments
      Re-ranking with Wikipedia categories
         Classifier training with Wikipedia Categories
         Enterprise dataset and Classification
         Results
         Comments
   Summary
5 Wikipedia as a Vocabulary Resource for Pseudo Relevance Feedback
   Language model for information retrieval
   Selecting good feedback documents from PR Set
      Pseudo relevant set
      Unsupervised approaches
         Clustering documents
         Probabilistic topic models
      Supervised approaches
         Features for enterprise corpus documents
         Features for Wikipedia articles
         Training
   Experiments and Evaluation
   Summary
6 Conclusions
   Unified view of presented approaches
      Role based personalization
      Future of role based personalization
   Wikipedia based approaches
      Concept thesaurus for query expansion
      Future of query expansion
      Re-ranking with category network
      Future of re-ranking
      Pseudo relevance feedback
      Future of PRF
   Summary
Bibliography

List of Figures

3.1 Directory structure of training corpus
3.2 The working of the Indexing Functionality
3.3 Block Diagram showing the flow of the Query Formulation functionality
3.4 Design of the QE module
3.5 An example execution of the global analysis phase on a sample query
4.1 A typical Wikipedia article page
4.2 A small real category network example from Wikipedia
4.3 A few examples of redirects
4.4 Wikipedia Link Structure and Creation of the Vocabulary set and Thesaurus
4.5 Concept Representation and QE Module
4.6 Wikipedia Category section on the article page
4.7 Wikipedia Category page listing articles that fall in this category

List of Tables

3.1 The evaluation results of Role Based Enterprise Search
3.2 Sample query expansion
4.1 The evaluation results for the Wikipedia CT based Query Expansion module
4.2 Classification of 20 News Group data
4.3 Precision figures for the Re-ranking Experiment
5.1 Results of QE for different classification techniques on PRF

Chapter 1

Introduction

A huge surge in information retrieval applications has been observed in the last couple of decades. From its initial days as a helping tool in library sciences, information retrieval has emerged as one of the most essential electronic applications. This is credited largely to the growth of the Internet. Enterprises too started moving to the digital domain, and storage and retrieval of large amounts of digital data became the primary need for early enterprise IT resources. Advances in database management models and storage devices have paved a path for organizing the enormous data. Though modern enterprises are well networked, they are not soundly informed about their own data. Typical text search systems were employed to retrieve information but they became outdated with time. More popular search engines from the web domain have also been tried for enterprise information retrieval but were not as effective as they were expected to be. As we will see, there are many fundamental differences between enterprise and web content that require the enterprise search problem to be tackled independently of generic or web search. A formal explanation of ES has been presented by Hawking in [17], which can be summarized as: ES includes search of the organization's external and internal websites as well as other electronic media such as databases, e-mails, etc. Enterprise Search (ES) is a critical performance factor for an organization in today's information age. The volume of electronic information has reached proportions which are unmanageable by the classic retrieval models. Better algorithms have been designed to retrieve both structured (relational databases) and unstructured information (natural text) for businesses. However, retrieval of unstructured information in an enterprise environment is a non trivial task, as relevance is the prime objective, unlike its counterpart, web search, which focuses on speed [17].
The paradigm has shifted towards semantic search for both web and ES [12]. Search strategies have come a long way since the classical retrieval models. Techniques like query expansion, re-ranking, semantic indexing and so forth have empowered search to grow from a simple text retrieval tool into a powerful multimedia information extraction discipline. ES, being a significantly different environment from web search, draws heavily from semantic search [34]. However, there still is immense scope for improvements in ES, and we try to achieve some of that with the work presented in this thesis. Before we define our problem statement, let's discuss the enterprise environment and its search requirements.

1.1 Web versus Enterprise Search

Enterprises are business driven entities and have a strong sense of purpose. Naturally, productivity is the fundamental reason for incorporating the applications that operate within an enterprise. Search requirements for an enterprise are not so different and are meant to promote productivity. Unlike web search, ES is expected to have a strong inclination towards relevance of results [34][55]. By relevance, we mean those answers which satisfy the user for a given search query. Enterprise search has a relatively narrow definition of relevance. Web search users do not have strong expectations about search results, whereas ES users expect the few most relevant documents in the results [34][17]. As popularity of pages governs the ranking in web search, users also mold their expectations to see popular pages in the results. However, most of the time a correct answer (or answers) is the requirement in ES. Authority over content creation is another factor that makes enterprise data different from the web. While the Internet contains the collective creations of many authors and the composition is very democratic, enterprise content is manufactured to dispense information rather than to advertise and attract the attention of users. User guidelines are narrow and strict for creating documents, and there is strict access control over the collection [34]. Often it is observed that only a small fraction of users are involved in writing compared to the number of users accessing the content. The same effect is seen in the vocabulary of documents, which is very author biased, mostly formal and domain specific. Web pages also have a strongly linked structure through hyper-links among them.

Most of the textual data on the web is presented in the form of HTML pages, which are rendered by web browsers into a human readable form. On the other hand, textual data in enterprises is very heterogeneous in nature. Apart from the structured data (relational databases, XML files), unstructured or semi-structured data carrying text in natural language can be present in a variety of repositories like file systems, HTTP web servers, Lotus Notes, Microsoft Exchange, etc. Moreover, the presence of hyper-links among enterprise documents is very scarce, even in the HTML content [6]. This poses a strong challenge in using popularity based ranking algorithms like PageRank [5][35] and HITS in ES.

1.2 Research Problems in Enterprise Search

The information retrieval community faces many challenges in ES. Apart from the problems tackled in the traditional information retrieval field (for example crawling, query sense disambiguation, query expansion, pseudo relevance feedback, indexing, ranking, etc.), we list some of the main research areas specific to ES [17] which require considerable attention from the research fraternity.

1.2.1 Vocabulary bias and narrowness

In a typical document search, users express a set of words which they think best represent their requirement for the kind of information they seek in the documents, and they expect to find the same in search results. Here, we would like to point out the difference between query and information need. Information need is the abstract notion about the information contained in the content of documents that a user has in his mind when he commences a document search. A query, on the other hand, is a concrete representation of the information need through words. A user tries to articulate his information requirement by putting up the best representational words he can think of as the query. As the search requirements within an enterprise are mostly business driven, it implies that in an enterprise, results satisfying a particular information need for a given context can be assumed to remain a more or less constant set for different users. However, the effect that individual users have over a result-set retrieved for a particular information need is through the queries they form. This brings out a very interesting point about ES, which is about user queries. The result set is strongly query dependent and the query represents the information need. The question is how well the query substitutes the information requirement. For example, consider a user who is looking for documents containing information about automobiles and gives the query cars. A simple query term matching based search will not retrieve documents that contain words like truck, bus, SUV or automobile but not car. So, the query does not completely represent the requirement in this example. The above example brings out another observation about the document content. Earlier we pointed out the nature of enterprise content, stating the factors that govern the composition of data. Vocabulary bias is another factor that affects the results for given search requirements.

In the above example, even if the user's query is automobile, the result set might miss documents that are from that domain but lack the term automobile. The power of a language to express information with so many different words makes it hard to relate an information need with expected results. This means that even if a user produces a comprehensive query, it is the vocabulary of the indexed documents that will determine the result-set.

1.2.2 Heterogeneous collection

Enterprises tend to have heterogeneous collections. Hard efforts are required for the seamless working of ES indexing and crawling systems across different document types like web/HTML content, e-mail threads, relational database entries, presentations, word and PDF type documents, OCR based documents and spreadsheets. Different document types have huge differences in their structure. Consider document length, for example: a database table can have records of equivalent lengths, whereas web content may vary a lot. Then there is the problem of fetching the documents. The protocols vary with the storage systems, server types, etc. For example, fetching public facing directories in a file system may require the FTP protocol, web content may require the HTTP protocol, and the databases may need proper socket and

user account details. Many different types of parsers are also required to resolve the issue of heterogeneous collections. Plain text documents may be indexed trivially, but almost all other formats of text based documents/repositories are stored as binary files in their respective formats. No single parser may be sufficient to extract text from the entire enterprise collection. Even if enterprises are aware of such a nature and distribution of data, they tend to stick with the existing composition for economic reasons, as changing software profiles and retraining people may incur significant additional operating costs on the enterprise. This presents a hard problem of dealing with the nature of documents across enterprises.

1.2.3 User and work context

The enterprise environment is more regulated than the web. Enterprises offer web based portals to employees for working and transacting with the enterprise data. These portals require a login and implement access based security features on the basis of user role, project, geography, etc. The point here is that the enterprise work session is aware of user details, which could be used to help ES systems better understand the information requirements of the user. While security features have matured, there is still a strong requirement for incorporating the user context into the search. Consider an example: a university has two roles, meteorologist and software developer. A strictly work related query like cloud is ambiguous, as it could imply clouds in the sky or cloud computing. Query sense disambiguation is one aspect where user context can be applied. Similarly, user context could be used for context based indexing and ranking of results.

1.2.4 Test collections

There is a strong requirement for a comprehensive test collection and related evaluation data.
Significant efforts have been channeled through the TREC enterprise track 1 with the W3C collection and the CSIRO dataset (public facing web-pages of the Commonwealth Scientific and Industrial Research Organization (CSIRO) website), but those do not comprehensively represent an enterprise composition. Datasets that exist today only focus on document relevance for query sets, but areas like heterogeneous collections (crawling and parsing related research), user context knowledge, enterprise query logs, etc., are absent in the open source collections. There is a good reason for this. Most private institutions have intellectual property, trade secrets, confidential policies, operations flow information and many other valuable assets within their enterprise data. Such information cannot be released to the public domain. This poses an interesting and very challenging research area of creating realistic ES test datasets.

1 Text REtrieval Conference (TREC).
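The vocabulary gap discussed in Section 1.2.1 can be made concrete with a small sketch. This is a toy illustration, not the thesis system: the documents and the expansion term list below are invented for the example.

```python
# A toy illustration of the vocabulary gap (the cars/automobile example):
# plain term matching misses documents that discuss the same topic in
# different words, while a small query expansion recovers them.
# The documents and the expansion terms are invented for this sketch.

docs = {
    "d1": "new SUV and truck models were unveiled at the auto expo",
    "d2": "car insurance premiums rose sharply this year",
    "d3": "the quarterly sales report is due on friday",
}

def search(query_terms, corpus):
    """Return ids of documents containing at least one query term."""
    return {doc_id for doc_id, text in corpus.items()
            if any(term in text.lower().split() for term in query_terms)}

plain = search(["car"], docs)                                # misses d1
expanded = search(["car", "automobile", "truck", "suv"], docs)
```

The expanded query retrieves the SUV/truck document that shares no literal term with cars, which is exactly the effect query expansion trades for.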

1.3 Problem Definition

In this thesis we have worked on two of the areas in ES that we described in the previous section, namely, user and work context aware ES and working out the issues of vocabulary bias and narrowness in ES. Our aim is to enhance the relevance of results through traditional information retrieval approaches like query expansion and re-ranking in ES. We implement user and work context aware ES by taking up a user role based query expansion through local & global analysis techniques and by re-ranking the result-set with a role based document classification. For addressing the vocabulary bias issues in the enterprise, we implement query expansion by bringing in external lexicon and thesaurus repositories constructed from open source on-line encyclopedias like Wikipedia. We also introduce a broader vocabulary range into the result-set with collection enrichment techniques for pseudo relevance feedback. For re-ranking of results, we estimate dominant document classes in the result-set with the help of document classifiers trained on large manually tagged datasets like the Wikipedia categories. We now explain the individual aspects of the thesis in detail.

1.3.1 User role and information need

We pointed out the need to exploit the user and work context because such information is more easily available in an enterprise environment. We present an approach to bring role based personalization into ES. We achieve this by two means, role based query expansion and role based re-ranking. The approach assumes knowledge of user roles in advance. It also assumes the presence of a role tagged subset of the enterprise collection, which we refer to as the training corpus. Query expansion works with both global and local analysis techniques, where global analysis uses WordNet to broaden the query sense and a role based thesaurus constructed from term co-occurrence statistics of the training corpus to introduce role specific terms into the query.
Local analysis techniques use a role based flavor of pseudo relevance feedback where only the top results relevant to the given user role are selected for feedback. Re-ranking of the result-set is achieved by pushing up documents which are more relevant to the given user role. Both the local analysis and re-ranking techniques use the role relevance scores of documents stored in the search index for achieving their purpose. Role relevance scores are determined with the help of a Naive Bayes document classifier trained using documents from the training corpus. Role based personalization of ES is explained in detail in chapter 3, where the respective evaluation is also presented.
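A minimal sketch of the role based re-ranking idea is shown below. It assumes, hypothetically, that each indexed document already carries per-role relevance scores (in the thesis these come from the Naive Bayes classifier trained on the role tagged corpus); the mixing weight alpha and all data are illustrative, not the thesis's actual scoring formula.

```python
# Hypothetical sketch of role based re-ranking: the final score mixes
# the retrieval score with the stored role relevance score for the
# current user's role. alpha and the sample data are illustrative.

def rerank(results, role, alpha=0.7):
    """results: list of (doc_id, retrieval_score, {role: role_score})."""
    def combined(item):
        _, ret_score, role_scores = item
        return alpha * ret_score + (1 - alpha) * role_scores.get(role, 0.0)
    return [doc_id for doc_id, _, _ in
            sorted(results, key=combined, reverse=True)]

results = [
    ("d1", 0.9, {"developer": 0.1, "meteorologist": 0.8}),
    ("d2", 0.8, {"developer": 0.9, "meteorologist": 0.2}),
]
```

The same retrieved list reorders differently for a developer and a meteorologist, which is the personalization effect chapter 3 evaluates.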

1.3.2 Query expansion with Wikipedia concept thesaurus

The issues of document vocabulary and query mismatch are frequently observed in both web and enterprise search. This effect is more prominent in enterprise collections, as the expected relevant results are fewer. There are many ways to go about increasing the terms in the query that represent the same information need. Often dictionaries are employed for this, and synonyms help in increasing the term coverage. But dictionaries (for example WordNet) are limited in scope, as they are manually created by a few experts. One of the most popular open source thesauri, WordNet, covers only the generic language lexicon: nouns, pronouns, adjectives, verbs, etc. Such dictionaries severely lack phrases and named entities. Moreover, they do not provide in-depth coverage of the vocabulary used in popular domains like science, culture, mathematics, engineering and technology, medicine, law, politics, etc. An alternative to dictionaries is encyclopedias, which have a greater grasp of topics. Wikipedia 2 is one such open source encyclopedia, and it is arguably the largest and most frequently updated knowledge repository on the planet. Wikipedia provides concepts from different domains and their respective descriptions in the form of articles. The articles are tagged by categories. Content creation in Wikipedia is open to its users (which is the Internet community) and is continuously verified by an active group of volunteers. While it has often been argued that the quality of content in Wikipedia is below that of published encyclopedias [51], we argue that its scale and topic coverage outshine its quality issues. Wikipedia on its own is meant for human users, but we propose techniques to extract knowledge from its article and category link structure that can be easily used in ES as a thesaurus for query expansion.
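To picture how an encyclopedia-derived thesaurus can drive query expansion, here is a minimal sketch of a term space to concept space and back round trip. The three tables are tiny invented stand-ins for the structures mined from Wikipedia article names, redirects, anchor text and the article link graph; the entries and the confidence threshold are illustrative only.

```python
# Toy term -> concept -> term round trip for query expansion.
# All tables and values below are hypothetical stand-ins.

CONCEPT_VOCABULARY = {            # surface phrase -> concept id
    "heart attack": "Myocardial_infarction",
    "myocardial infarction": "Myocardial_infarction",
}
CONCEPT_THESAURUS = {             # concept -> [(related concept, confidence)]
    "Myocardial_infarction": [("Cardiac_arrest", 0.8), ("Angina", 0.6)],
}
ARTICLE_NAMES = {                 # concept id -> canonical article name
    "Cardiac_arrest": "cardiac arrest",
    "Angina": "angina",
}

def expand(query, min_conf=0.7):
    """Map a query phrase to a concept, pick confidently related
    concepts, and map them back to article-name terms."""
    concept = CONCEPT_VOCABULARY.get(query.lower())
    if concept is None:
        return [query]            # no concept mapping: leave the query alone
    related = [c for c, conf in CONCEPT_THESAURUS.get(concept, [])
               if conf >= min_conf]
    return [query] + [ARTICLE_NAMES[c] for c in related]
```

Treating the whole phrase as the lookup key mirrors the point made below about handling complex queries as phrases rather than individual terms.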
We propose algorithms to build a concept vocabulary and a concept thesaurus from the link structure data of Wikipedia. The concept vocabulary is used to map the query from term space to concept space. Once the query is mapped to concept space, the concept thesaurus gives likely concepts for expansion that are related to the query concepts. Another important aspect of this approach is to treat complex queries as phrases for mapping them to concept space. The expanded query is mapped back to term space by using the article names that represent the concepts. We explain this process in detail and present evaluation results using the CERC collection in chapter 4.

1.3.3 Re-ranking with Wikipedia category network

We propose to use large manually tagged sets rich in categories from all the domains of Wikipedia to re-rank the result-set. This approach assumes that certain enterprise document classes exist in the enterprise collection. Using the class names and their respective sets of concise defining vocabulary, we give an approach to map enterprise classes to Wikipedia categories. Once we get representational categories, we use articles belonging to these categories as the training set of a Bayesian document classifier.

2 Wikipedia, the free English encyclopedia.

Every document in the collection is then given a probability of belonging to each document class. While searching, dominant enterprise classes are identified in the top k results, and documents belonging to these dominant classes are pushed up in the result-set. The details of the re-ranking with Wikipedia category network approach are discussed in chapter 4, along with its evaluation.

1.3.4 Pseudo relevance feedback with Wikipedia based collection enrichment

Relevance feedback based expansion models are frequently used to increase the precision of search results in information retrieval. In the absence of manual feedback, pseudo relevance feedback using the top k search results as the relevant document set is employed for query expansion [46][48]. It has also been observed that enriching the feedback set with an external corpus may enhance search results [37][13]. We propose an approach of using pseudo relevance feedback based on the query likelihood model (a language based retrieval model [39]) where we enrich the feedback set with Wikipedia article content. Our approach uses a mixed set of the top k enterprise documents from the result set fetched for a query and Wikipedia articles which have a high cosine similarity between their content and the query. We incorporate classification techniques to estimate good feedback documents from the pseudo relevant set. We use both supervised and unsupervised classification for estimating good feedback documents. Unsupervised techniques involve K Nearest Neighbor (KNN) clustering of the pseudo relevant set; the clusters are then populated with Wikipedia articles using the article likelihood probability model that we propose. The clusters nearest to the query provide good feedback documents. We use another unsupervised technique based on probabilistic topic modeling [11]. We discover latent topics in the pseudo relevant set. Query terms give activated latent topics, which in turn provide matching documents for feedback.
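The construction of the enriched feedback set described above can be sketched as follows. The value of k, the similarity threshold and all data are hypothetical, and the cosine similarity works on simple bag-of-words counts rather than any weighting scheme from the thesis.

```python
# Illustrative sketch of collection enrichment for pseudo relevance
# feedback: the feedback set mixes the top-k retrieved enterprise
# documents with external (Wikipedia-like) articles whose text is close
# to the query by cosine similarity. k, threshold and data are invented.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def enriched_feedback_set(query, ranked_docs, articles, k=2, threshold=0.3):
    """ranked_docs: list of (doc_id, score); articles: list of texts."""
    q = Counter(query.lower().split())
    feedback = [doc_id for doc_id, _ in ranked_docs[:k]]   # top-k results
    feedback += [a for a in articles
                 if cosine(q, Counter(a.lower().split())) >= threshold]
    return feedback
```

The classification step that follows (supervised or unsupervised) then filters this mixed set down to the good feedback documents.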
The supervised technique involves Bayesian classification of the pseudo relevant set into good and bad feedback documents. We extract a number of features from both enterprise documents and Wikipedia articles to train the classifier. These approaches are explained in chapter 5, along with their evaluation.

1.3.5 Common theme of the thesis

The four techniques that we have introduced in the above sub-sections have a strong underlying common theme, which we will describe now. Every natural language gives the power to write an abstract piece of information in many forms, with many different words. Since unstructured text is created for human understanding, it is often assumed that human intelligence will connect the different styles, forms and vocabulary used for the writing with the inherent information. This is also true, because humans can identify and disambiguate the meaning of terms from their context, but it is a hard problem for search systems, which rely on search indexes carrying information about single terms or sequences of terms and their association with documents. There are of course many ways to get around this problem. Natural

language processing techniques involving sentence level parsing can best relate words to abstract information units, but they are computationally very intensive and require difficult grammars to be implemented manually or statistically through a manually tagged corpus. A simpler idea is to work with queries and try to extract the most collection-relevant meaning from them. As the index only contains the terms that occur in the enterprise collection, we can add terms to the query to relate it to the collection without diluting its original intended meaning too much; or we can add terms to broaden its term coverage without adding much noise to the meaning, so that terms present in the collection are also captured by the query. In all our techniques which implement query expansion, our intent is exactly the same. We derive structures from the target collection (the enterprise corpus) and an external collection (Wikipedia) for adding terms to the query without disturbing its intent. Our re-ranking ideas are also analogous across the different approaches that we have implemented. Basically, we look for documents which belong to the dominant classes present in the top k result-set and then try to push them up in the result ordering. In the role based approach, the dominant class is predetermined as the selected user role, while in the Wikipedia category based class identification, we find dominant classes from the result set and then implement re-ranking.

1.4 Organization of Thesis

We discuss the ongoing work in the research community that is relevant to our work in chapter 2. We present our motivations and the novelties in our approaches. We first give generic approaches to enterprise search and then narrow down to the contemporary research happening in the fields of user roles, query expansion, re-ranking and the use of Wikipedia with enterprise search. We give the techniques for involving role context information in enterprise search in chapter 3.
The role based personalization presented in that chapter involves query expansion and re-ranking based on a role based training corpus, along with the respective evaluation. Chapter 4 presents the approach of using the Wikipedia article and category link structure to create a domain independent concept thesaurus for query expansion in enterprise search. The chapter also explains the re-ranking techniques based on the Wikipedia category network. A modified pseudo relevance feedback technique for enterprise search, obtained by including Wikipedia article content in the feedback set, is explained in chapter 5. The chapter also presents supervised and unsupervised classification techniques to determine good feedback documents out of the pseudo relevant set, along with the experimental setup and evaluation. Chapter 6 is the last chapter of this thesis; it summarizes our work, gives some of the future directions this research can grow into, and concludes the thesis.

Chapter 2

Background

This chapter presents our literature study in the field of information retrieval, with attention on enterprise search (ES). The listed works have been motivational as well as challenging for us. As the focus of our work is on query expansion and re-ranking, this chapter presents research focused on these two aspects.

2.1 Enterprise Search

Among the earlier works, Hawking [17] presents a number of areas that can be explored in enterprise search. The author also lists the scope of many advanced information retrieval techniques for providing solutions to enhance enterprise search. Our focus mainly remains on ad hoc retrieval of documents containing electronic text data. Fagin et al. [15] discuss the differences and common factors between the web and enterprise Intranets. They differentiate web and enterprise search by discussing the strongly linked structure of the web versus business driven enterprise data. The authors argue that techniques specific to web search, like PageRank [5] and HITS [27], can be helpful but not sufficient. The authors also analyze the problems involved in implementing good enterprise search solutions. Semantic Web technologies have been applied in the enterprise context by Fisher et al. in [1] for improving knowledge management using ontologies, but they do not focus on improving search effectiveness. Their work describes the heterogeneous nature of enterprise data. The issues arising because of the scale of distributed data are also discussed. Their paper describes a variety of structural and syntactical differences in enterprise documents and how a number of protocols are employed for data access. However, they argue that the main emphasis in ES is always on relevance and speed of search in spite of the given issues. The authors also present classification techniques for categorizing data into precise domains through the notion of tagging documents with associated metadata according to a predefined schema.
Demartini [12] formally presents the application of semantic techniques in enterprise search. His work has motivated us to go beyond classical information retrieval methods and to build statistical learning systems for implementing effective ES solutions.

2.2 Related Work: User Roles and Search

Demartini [12] explores a co-operation of semantic web and IR systems for ES and presents an open idea about role based personalization of ES, which led us to explore the area in detail. A substantial amount of research is going on in the area of enterprise roles, but we have observed that enterprise roles are mostly studied from the security, access control [3][38][24][36] and enterprise modeling points of view. We did not come across any significant work involving enterprise roles and information retrieval. The lack of such research, in spite of the industrial demand (funded research projects), motivated us to look deeper into role based personalization of ES. We focused on datasets annotated on the basis of roles for extracting term associations, and on role prominence classification of documents for query expansion and re-ranking. As far as document annotation in ES is concerned, there are some significant parallels in academia, and we mention a few contemporary research works. Dmitriev et al. in [14] have presented their work on document annotation in an enterprise environment and suggest that annotations can help enterprise search. Their main contribution is through user annotations of enterprise documents, which they exploit for feedback. They point out that the controlled environment and lack of spam make it easier to collect annotations in enterprise search. Their work suggests two methods, explicit and implicit feedback, to collect annotations. We were encouraged by their arguments to incorporate annotated or tagged document sets for enhancing ES through query expansion and re-ranking; however, we kept our focus on the domain of roles and assumed that an annotated set for enterprise roles exists as a prerequisite to our work.

2.3 Related Work: Query Expansion in Enterprise Search

Disambiguation of search queries and query expansion are our prime research areas.
In [29], Liu et al. describe the inherently ambiguous nature of user queries and present an approach for word sense disambiguation of search terms. Chirita et al. in [8] present views on the ambiguous nature of short search queries and give an approach to query expansion that implicitly personalizes the search output.

Thesaurus based techniques

Zhang et al. present a WordNet based query expansion technique in [54], where they assign weights to candidate query expansion terms selected from WordNet and ConceptNet by a technique they describe as spreading activation. In their work, the weighting task is transformed into a coarse-grained classification problem. The work attempts to identify relations between query difficulty and the effectiveness of expansion. Pinto et al. in [38] propose query expansion by incorporating a semantic relatedness

measure on the concepts of the WordNet thesaurus. Their work tries to find the representational concepts of the query terms and then uses the WordNet database to identify expansion terms.

Pseudo relevance feedback

Pseudo relevance feedback is perhaps the most popular local analysis technique used in query expansion. Salton et al. first presented a compilation of many techniques to implement pseudo relevance feedback in [43]. Their paper included the Rocchio algorithm [41] and the Ide dec-hi variant, along with an evaluation of several variants [20]. Some variants of blind feedback advocated that the results not judged as relevant should be treated as the non-relevant set; however, Schutze et al. [46] and Singhal et al. [48] showed that query expansion works better using only a relevant set formed from the top-k results.

2.4 Related Work: Re-ranking

Re-ranking is another area where we share our interests with many researchers. Re-ranking is applied in order to build a more intuitive and perceptive presentation of search results from the user's point of view. Sieg et al. in [47] present the idea of re-ranking with user profile based ontologies, and Zhuang et al. try to achieve the same using query log analysis [56]. In another work that implements re-ranking, Zhu et al. present an idea for pushing up navigational pages (e.g., for queries intended to find product or personal home pages, service pages, etc.) in the search results for enterprise search [55]. Their approach is to identify navigational pages in an off-line phase and build a separate index for them. They associate terms with these pages and then identify queries that are intended for navigational pages. For such queries, the desired pages are presented among the top results.

2.5 Related Work: Wikipedia and Enterprise search

A lot of areas are being researched for ES involving open knowledge-bases like Wikipedia.
This section lists a few such works which are relevant to our ideas involving Wikipedia structure and content for enhancing ES.

Wikipedia link structure and concept thesaurus

Our motivation arises from one of our previous works on role based personalization in enterprise search [50]. The system used role specific training documents to build a word co-occurrence thesaurus,

which, along with WordNet, was employed for query expansion. We noticed that the concept coverage of manually constructed thesauri like WordNet is mostly limited to single word concepts, and they severely lack named entities. This restricted the outlook of the query expansion module on a multi-word search query to a bag of words perspective. A growing recognition of Wikipedia as a knowledge base among information retrieval research groups across the globe was another motivation for us to proceed in this direction. Medelyan et al. [30] give a comprehensive study of ongoing research work and software development in the area of Wikipedia semantic analysis. Their findings reveal a large number of research groups involved with Wikipedia, thus indicating its increasing popularity. A number of developments in Wikipedia content and link structure analysis are going on which are related to our work. Kotaro et al. in [25] present an approach for creating a thesaurus from large scale web dictionaries using the hyper-link statistics present in their HTML content. They concentrate on Wikipedia and analyze the article and category links for building up a large scale thesaurus. They introduce the notions of path frequencies and inverse backward link frequencies in a directed link graph of Wikipedia concepts for finding the closeness of articles. Our thesaurus generation is similar to their approach, but we analyze both inter-article and article-category links for thesaurus generation. We use their concept of backward link frequencies for finding the conceptual narrowness of articles. Our motivation in creating a concept thesaurus is not to have a dictionary but a knowledge repository that matches Wikipedia concepts to user queries and then finds associated concepts (hence terms) for the query. Ito et al.
in [21] present their approach of constructing a thesaurus using article link co-occurrence, where they relate articles that have more than one distinct path between them within a certain distance in the link graph. They argue that link co-occurrence analysis for generating a dictionary is more scalable than link structure analysis; however, they do not make use of category information. Their work, too, is meant for a generic dictionary, and they do not suggest any implementation involving the thesaurus for document search. The idea of using Wikipedia as a domain independent knowledge base is consolidated by the findings of Milne et al. [32]. They give a comprehensive case study of Wikipedia concept coverage of domain specific topics. The authors have done a comparative analysis of concepts and relations between Wikipedia and Agrovoc (a structured thesaurus of all subject fields in Agriculture, Forestry, Fisheries, Food security and related domains) and establish that Wikipedia can be a cheap and decent substitute for high cost, manually created, domain specific thesauri. However, their work does not implement a method to extract a domain specific thesaurus; they only prove the breadth of Wikipedia concept coverage. This was encouraging for us as we used a general domain independent thesaurus extracted from the Wikipedia

link structure. In spite of the lack of domain specialization, we felt that the topics and concepts belonging to most domains and enterprises would be covered in this thesaurus. Milne et al. in [33] introduce a search application using a Wikipedia powered assistance module, namely Koru, that interactively builds queries and ranks results. They use Wikipedia redirects for building up a synonym set and study the disambiguation pages for word sense disambiguation. Their work primarily focuses on the terms of Wikipedia article names. Our thesaurus construction, however, focuses on concepts (abstract entities described by articles) and their representational vocabulary. We go a step beyond collecting synonyms and focus on the closeness of different related concepts. They do present methods to incorporate the thesaurus in search, especially for query expansion; however, their methods depend on active user involvement during the search cycle. Our technique using the concept thesaurus for query expansion works by first finding concepts that represent the query and then finding expansion terms from the names of concepts related to the query's representational concepts. This work is motivated by a similar technique presented by Zhang et al. [54], where they look for representational concepts of the query terms in the WordNet concept repository and then use the WordNet database to identify expansion terms. The major difference between our technique and this work is the phrase based approach for finding representational Wikipedia concepts.

Wikipedia categories and re-ranking

Kaptein et al. suggest a method to incorporate Wikipedia categories in ad hoc search [23]. They give an approach to automatically generate target categories as surrogates for manually assigned categories. Our work, however, uses Wikipedia categories as substitutes for enterprise document classes instead of queries.
Their idea nonetheless motivated us to look into the Wikipedia category network for re-ranking in ES.

Wikipedia as lexicon resource

Our Wikipedia based feedback approach makes use of language based retrieval models for search and for query expansion through feedback. We use the query likelihood model for document retrieval as suggested by Ponte et al. [39]. This model was further enhanced by Miller et al. [31] and Hiemstra [19]. The feedback approach for query expansion in language model based retrieval was suggested by Zhai et al. in [52], and we use it for the same purpose. Our work has also implemented probabilistic topic modeling for discovering latent topics and for finding document relationships with the topics. This was achieved through latent Dirichlet allocation (LDA) as suggested by Blei et al. in [11]. Diaz et al. [13] have proposed a method to incorporate information from an external text collection for pseudo relevance feedback using a language model technique. By making use of a language relevance

model for query expansion, they showed good improvements in the results. Their aim was to broaden the vocabulary range of the collection and hence of the feedback based expansion. They suggest the use of a high quality corpus which is comparable to the target corpus. We carry this work forward by incorporating methods to ensure quality documents are available in the feedback set. One major difference is that our methods specifically use Wikipedia text as the external corpus because of its distributed and democratic content creation and its open source nature. We also argue that the consistency in the structure of Wikipedia articles (the consistent sectioning of article text) can be utilized to better segregate the training text and therefore achieve better training. Peng et al. in [37] describe the incorporation of alternate lexicon representations through collection enrichment techniques for query expansion. The sparse nature of enterprise data and Intranets, in terms of fewer hyper-links, and the strong biases in vocabulary usage by the users are discussed. However, their approach uses a set of documents relatively larger than, and similar in content to, the target corpus, while our focus is to use a general purpose rich text resource such as Wikipedia to enrich the pseudo relevant set. We also use a classification based approach (both supervised and unsupervised) for selecting good feedback text from the external corpus. Cao et al. argue the inaccuracy of the basic assumption made by pseudo relevance feedback in [7]. They point out that the most frequent terms in feedback documents may not be the best expansion terms. The authors further add that expansion terms discovered by orthodox techniques may frequently be unrelated to the query and harmful for retrieval. He et al. in [18] further extend this work by moving the classification from the term to the document level.
They apply classification techniques to identify the high-quality feedback documents, or in other words, to remove the low-quality ones. They also present a list of features which they use in the classification process. In [28], Li et al. give an approach that uses Wikipedia as an external corpus for improving weak ad hoc queries. Their work also involves pseudo relevance feedback, but they assume the entire pseudo relevant set to be relevant and do not incorporate any classification of good or bad feedback documents.

2.6 Summary

This chapter presented much ongoing research in the field of general information retrieval and ES that is important from our point of view. As our focus is mainly on the query expansion and re-ranking aspects of search, this chapter presented some of the relevant findings and accomplishments achieved by the large research community involved with information retrieval and, more specifically, ES. We have tried to capture the very latest research for our synthesis of ideas. The coming chapters will present our approaches to enhance ES and their respective evaluations.

Chapter 3

Role Based Personalization of Enterprise Search

We present a role based approach for personalizing Enterprise Search (ES) in this chapter. In an enterprise, a role defines a set of guidelines that govern the work profile, information access and the ownership level of an employee [9]. The role of an employee also has an implicit effect on the kind of information required by him or her from the enterprise data. We embed the role based retrieval approach into the two basic aspects of an ES system, which are the query formulation and the indexing process. We discuss the relevance of role based search in section 3.1. Section 3.2 introduces the core techniques, which make use of a role based tagged document set to train a document classifier and create a co-occurrence thesaurus. The same section also discusses the indexing technique for parametric and role based search. Section 3.3 presents query formulation, boolean logic and parametric search. Section 3.4 discusses role based query expansion involving global and local analysis techniques, and section 3.5 puts forward a role relevant re-ranking step. Experiments for evaluating the approaches proposed in this chapter are described in section 3.6 along with the observed results.

3.1 Enterprise Roles and Search

Segregation of employees into well defined and distinct roles is a key aspect of modern enterprises. Role division facilitates distribution of tasks and authority in a systematic manner. Abstractly, roles can be viewed as crisp guidelines that define the job profile of an employee in an enterprise. With the advent of computers, electronic storage and networking, it has become easier to effectively share the collective intellectual wealth in an enterprise. This knowledge sharing is considered to be one of the most important factors for a successful and smoothly functioning organization. The data in large to medium sized enterprises is available through distributed systems in various forms like text, images and videos.
The majority of this information is in text form, which either resides in a structured format (relational databases, spreadsheets, etc.) or in semi-structured or unstructured form [40] involving natural text documents. Enterprises have invested heavily in information access control on the basis of roles [22][44][16].

The Intranet, or the internal web, is one example of an information sharing and access system for enterprises. Unlike the Internet, where content creation is democratic and uncontrolled and information access is without discrimination, enterprise data is proprietary knowledge involving considerable investments (time, money, research) and trade secrets of a company [2]. For example, a software company like Microsoft would not want the source code of the Windows operating system to be freely available in the public domain when its sales are the primary source of revenue. So, the data of an enterprise being of considerable value, its information flow and access are often regulated for various reasons. A high level policy document contemplating lay-off targets, for example, should not be seen by entry level employees. On the other hand, a solution to a particular problem faced frequently in developing the core product should have wider reach in order to avoid such complications. Hence, with roles, information discretion is desired by enterprises. Various role based access strategies are being investigated by the research and development community to address this requirement. But, as we will now see, along with the security perspective of role based access to data, it is vital to have an information retrieval point of view when it comes to roles and information in an enterprise. As we discussed, electronic data is available on distributed systems in enterprises, and Intranets facilitate its access by users (by user we mean people working with the enterprise data). As the size and scale of data outgrew human abilities to browse, enterprises implemented full text search systems to meet user information needs for searching documents using keywords or queries. Queries are nothing but a set of terms that, from an information seeker's point of view, best represent his or her information need.
The same keywords have different meanings in the context of different roles, and users tend to formulate ambiguous, inaccurate and misrepresenting queries. Consider two roles in a database solutions firm: a software developer and a hiring officer. While both may want to query about networking, the developer may expect results related to computer networks, while the hiring officer would expect information about social networking. This problem can be resolved to a reasonable extent with role based personalization of ES. Another example of an ambiguous query could be bush, meant for former US president Bush; bush also means a type of plant, and here the meanings have no relation whatsoever. Enterprise data adheres to strong guidelines governing content creation. It is also regulated and strongly related to the context, business and operations of the enterprise. This presents an interesting scenario in ES, where a number of aspects about the data are known while designing a search system. User roles and role based information needs, as we mentioned earlier, are one such aspect. Another facet is the meta-data, which is much more visible and structured [10] as compared to the Internet. Also, enterprise documents are work related, well formed and, most importantly, lack spam. We considered these properties of enterprise data in order to design a role centric search system involving parametric search, which uses the meta-data, and information retrieval techniques like Query Expansion (QE henceforth) and re-ranking based on tagged content. Our idea is to implement an ES solution which is a role based personalization of the enterprise document retrieval process. We make use of a tagged set of

the enterprise corpus, having a collection of documents manually classified and marked on the basis of roles, in order to train a document classifier and to construct a word co-occurrence statistics based thesaurus for every role. For role based QE, we use the co-occurrence thesaurus along with WordNet for the global analysis of queries, and implement a role based flavor of classical pseudo relevance feedback based QE using co-occurrence statistics of the top-k results as part of local analysis. The role classifier is used to reorder the search results according to the user role.

3.2 Role Based Tagging and Indexing

This section discusses the main building blocks on which our role based search works. Our approach assumes that the role definitions of an enterprise are given; defining them is not within the scope of this work, and we do not have to deal with role engineering or mining. Although work has been going on for automated and semi-automated extraction of roles in an enterprise [49][45][53], for our experiments we chose to create the roles manually. We are also not concerned with implementing any form of access control.

Role tagged dataset

In order to discover the vocabulary associated with various roles, we use a training corpus which contains documents tagged with role labels. We select a subset of the enterprise document collection (the enterprise corpus) and manually classify the documents it contains into various categories which correspond to the information that belongs to, or is used by, the respective roles. Each document in the training corpus is categorized by one label and is kept in the respective directory. The directory structure of the training corpus looks like figure 3.1. In our implementation we do not use nested roles, and our directory hierarchy is also flat.
However, a tree-like structure is also possible; it is only a matter of extra effort by the humans classifying the training corpus.

Role classifier

One of the primary uses of the role tagged subset of the enterprise corpus is to train a document classifier for the indexing process. When the search results are presented to a user, our aim is to produce an ordering of documents which promotes the documents that are more closely associated with the selected role. If every document in the enterprise collection is already classified and tagged with proper role information, the ranking process becomes much easier. But manual classification of a large corpus can be extremely impractical; the training corpus described in the previous sub-section is just a small part of a much larger collection. So, we rely on machine learning techniques to achieve automated classification. We propose to use a Naive Bayes document classifier for tagging documents with role relevant scores. The choice of classifier is just to put forward the concept, and we do not advocate any particular classifier (the choice depends on many factors like the collection size, on-line capabilities, etc.). There are a number of classifiers available and all have their own strengths. We preferred a Naive Bayes

Figure 3.1 Directory structure of training corpus

classifier because of the ease of its implementation. We will briefly explain what happens during document classification. We imagine that there are m document classes, each corresponding to one of the m roles in the enterprise. A document can be modeled as a set of words. The Naive Bayes model assumes an independent probability for the i-th word (w_i) of a document occurring in a particular class C_j. This can be stated as a conditional probability P(w_i | C_j). The probability of a document D given a class C_j is

P(D \mid C_j) = \prod_{w_i \in D} P(w_i \mid C_j)    (3.1)

In order to determine the class of a document D, Bayes' theorem is used as follows:

P(C_j \mid D) = \frac{P(C_j)}{P(D)} \, P(D \mid C_j) = \frac{P(C_j)}{P(D)} \prod_{w_i \in D} P(w_i \mid C_j)    (3.2)

P(w_i | C_j) can be determined from the training corpus with a maximum likelihood estimate:

P(w_i \mid C_j) = \frac{\sum_{d \in C_j} tf_{w_i}(d)}{\sum_{d \in C_j} |d|}    (3.3)

Here we take the ratio of the number of occurrences of w_i in C_j to the total number of terms in C_j; d is a document in the training corpus belonging to class C_j, tf_{w_i}(d) is the term frequency of w_i in d, and |d| is the term count of d. Equation 3.2 gives us the probability of an unseen document belonging to a class. The sum of the probabilities of a document over all classes is 1.
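A minimal sketch of this training and classification procedure, using hypothetical toy data. Note one addition not in the equations above: add-one (Laplace) smoothing, to avoid zero probabilities for words unseen in a class.

```python
import math
from collections import Counter, defaultdict

def train(tagged_docs):
    """tagged_docs: list of (role, [words]). Returns the per-role counts of eq. 3.3."""
    tf = defaultdict(Counter)   # role -> term frequencies over that role's documents
    total = Counter()           # role -> total term count
    for role, words in tagged_docs:
        tf[role].update(words)
        total[role] += len(words)
    vocab = {w for counts in tf.values() for w in counts}
    return tf, total, vocab

def role_scores(doc, tf, total, vocab):
    """Eq. 3.2 with uniform priors, computed in log space and normalized so the
    scores over all roles sum to 1 (the score vector of eq. 3.4)."""
    logp = {}
    for role in tf:
        s = 0.0
        for w in doc:
            # Laplace smoothing -- an addition to the thesis formulation
            s += math.log((tf[role][w] + 1) / (total[role] + len(vocab)))
        logp[role] = s
    m = max(logp.values())
    exp = {r: math.exp(v - m) for r, v in logp.items()}
    z = sum(exp.values())
    return {r: v / z for r, v in exp.items()}

# Hypothetical two-role training set, echoing the developer/hiring-officer example.
docs = [("developer", "network protocol socket code".split()),
        ("hr", "hiring social network interview".split())]
tf, total, vocab = train(docs)
scores = role_scores("socket code review".split(), tf, total, vocab)
```

Here `scores` plays the part of the role relevant score vector of equation 3.4: a query-time or index-time component can read off, for each role, the probability that the document belongs to it.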

Once we train the classifier by estimating P(w_i | C_j) using the training corpus, we classify the remaining documents in the collection and give them role relevant scores. Suppose that in a two role system, for document D_i, the probability that it belongs to Role 1 is 0.7 and that it belongs to Role 2 is 0.3. The role relevant score vector of this document will be

D_i = (\langle Role_1, 0.7 \rangle, \langle Role_2, 0.3 \rangle)    (3.4)

The role relevant score information is then utilized while the documents are indexed. This is explained later, in the indexing section.

Role based co-occurrence thesaurus

We also propose an automatically derived role based thesaurus using word co-occurrence statistics over a collection of documents in an enterprise. We use the training corpus, which contains documents classified based on the existing roles. These documents can be used to automatically induce a role based lexicon (this introduces role specific terms while expanding the query). A window of L consecutive words (we assume L = 8 words) slides over all the documents, and the words that appear in this window are considered to be co-occurring. Stop words are neglected while constructing the thesaurus, and it is constructed in such a way that it stores word to word co-occurrences as well as the frequency of the co-occurrences. To better understand the technique of creating a thesaurus based on the co-occurrence model, consider the example where the following text is being scanned by the thesaurus builder:

The giant panda lives in a few mountain ranges in central China, mainly in Sichuan province, but also in the Shaanxi and Gansu provinces. Due to farming, deforestation and other development, the panda has been driven out of the lowland areas where it once lived.
The text is then stripped of stop-words and punctuation marks, and we end up with the following:

giant panda lives few mountain ranges central China mainly Sichuan province Shaanxi Gansu provinces Due farming deforestation development panda driven out lowland areas where once lived

In the first iteration, shown below with the window in brackets, the sliding window starts at the beginning of the text, and the words giant, panda, lives, few, mountain, ranges, central and China are observed to be co-occurring.

[giant panda lives few mountain ranges central China] mainly Sichuan province Shaanxi Gansu provinces Due farming deforestation development panda driven out lowland areas where once lived

In the next iteration, shown below, the co-occurrence observation window slides one word to the right, making panda, lives, few, mountain, ranges, central, China and mainly the co-occurring words. So after this iteration giant and panda co-occur, and panda and mainly co-occur, but the words giant and mainly do not.

giant [panda lives few mountain ranges central China mainly] Sichuan province Shaanxi Gansu provinces Due farming deforestation development panda driven out lowland areas where once lived

The process is repeated till all of the text in a document is analyzed, and then again for the remaining documents. An important point to note here is that the number of windows in a document D will be |D| − L, i.e., the word count of D minus the window size. After completely analyzing the tagged training set, we compile the final co-occurrence thesaurus on the basis of the probability of finding two words co-occurring in a given window versus the probability of finding the same words co-occurring in all other roles. Let R be the set of all roles such that R = {R_1, R_2, ..., R_n} (here we have n possible roles). The probability that terms t_i and t_j co-occur in a particular role R_k is given by equation 3.5.
P (t i,t j R k ) = ( ) f L (t i,t j ) L d (tf(d) L ) d R k d R k where, d is a document in R k, L is a co-occurrence observation window in d; f L (t i,t j ) = 1 if t i and t j co-occur in windowl, zero otherwise;tf(d) gives the total terms ind; L is the window size. Equation (3.5) 20

Equation 3.5 actually gives the ratio of the number of windows in documents in R_k where t_i and t_j co-occur to the total number of windows. Our approach aims to include in the role based co-occurrence thesaurus those terms which are seen together more for the given role as compared to other roles. The probability that the terms t_i and t_j co-occur in all other roles apart from R_k is given by equation 3.6:

\[ P(t_i, t_j \mid R - R_k) = \frac{\sum_{d \in R - R_k} \sum_{L \in d} f_L(t_i, t_j)}{\sum_{d \in R - R_k} tf(d) - docs(R - R_k) \cdot |L|} \tag{3.6} \]

The symbols in equations 3.5 and 3.6 have the same meaning. R − R_k denotes the set of documents that belong to all roles except R_k, and docs(·) counts the documents in a set. The confidence score of two terms co-occurring for a given role R_k is taken as the difference between the probability of finding them in documents of the given role and the probability of finding them together in other roles:

\[ Confidence_{R_k}(t_i, t_j) = \frac{\sum_{d \in R_k} \sum_{L \in d} f_L(t_i, t_j)}{\sum_{d \in R_k} tf(d) - docs(R_k) \cdot |L|} - \frac{\sum_{d \in R - R_k} \sum_{L \in d} f_L(t_i, t_j)}{\sum_{d \in R - R_k} tf(d) - docs(R - R_k) \cdot |L|} \tag{3.7} \]

Equation 3.7 gives us the metric to rank terms as more co-occurring than others.

3.2.4 Indexing

Most documents have an added structure apart from their actual content. Digital documents have machine recognizable metadata associated with them, which carries information about the date of creation, author, file format, etc. These are also known as the fields of metadata. While creating a parametric index, we create document posting lists for every field. Of course, a field's range of values should be finite to accomplish such an index. Additional fields which are not part of the metadata (e.g., domain name, URL) can also be included in the parametric index while scanning or crawling through the enterprise corpus. A parametric index often helps in narrowing down the search range by selecting certain values for different parameters (author name or date, for example) along with the actual query.
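A minimal sketch of such a parametric index, with posting lists keyed by (field, value) pairs; the field names and document ids below are illustrative:

```python
from collections import defaultdict

class ParametricIndex:
    """Posting lists per (field, value) pair; fields such as author,
    file format or date are assumed to have a finite range of values."""
    def __init__(self):
        self.postings = defaultdict(set)   # (field, value) -> set of doc ids

    def add(self, doc_id, metadata):
        for field, value in metadata.items():
            self.postings[(field, value)].add(doc_id)

    def filter(self, **params):
        """Narrow the search range: docs matching all given parameters."""
        sets = [self.postings[(f, v)] for f, v in params.items()]
        return set.intersection(*sets) if sets else set()

idx = ParametricIndex()
idx.add("d1", {"author": "varma", "format": "pdf"})
idx.add("d2", {"author": "varma", "format": "html"})
matches = idx.filter(author="varma", format="pdf")
```

The set returned by `filter` would then be intersected with the result of the actual text query, which is how the parametric index narrows the search range.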
In addition to the parametric index, it is also vital to have a role based index of the enterprise documents in order to implement role based personalization. We propose a model for indexing in figure 3.2 which takes care of the needs of role based personalization. This model is built with the help of two mappings, namely term-to-document mapping and document-to-role mapping. Term-to-document mapping is the usual indexing done by all search engines, where the term-document associations are stored as term frequency (tf) and inverse document frequency (idf). This mapping is stored for a normal text based search engine, but it does not cater to the requirements of role based personalization. So a new document-to-role mapping is added to the indexing process. The document-to-role mapping

Figure 3.2 The working of the Indexing functionality.

takes care of role based personalization. This mapping stores the relevance of a document to each of the available roles. We use the Naive Bayes document classifier described above, trained on the tagged training corpus. The format is:

Document1: (role1, 0.35), (role2, 0.51), (role3, 0.13), (role4, 0.01)

This mapping reveals how relevant a document is for a role, and the same can be used to re-rank the results.

3.3 Query Formulation

We propose the use of two effective methods for formulating a query. These methods allow the user to form a better structured query rather than just a set of keywords that may ultimately result in retrieving bad documents, much to the dislike of the users. These two methods are Boolean logic and parametric advanced search.

3.3.1 Boolean logic

Boolean logic is a method for query based information retrieval in which search terms are combined with the Boolean operators AND, OR and NOT. These operators are described below.

Figure 3.3 Block diagram showing the flow of the Query Formulation functionality.

AND: This operator is used to retrieve documents which contain all the terms specified before and after the operator.

OR: This operator is used to retrieve documents which contain any of the terms specified with the operator.

NOT: This operator is used to retrieve documents which do not contain the term specified after the operator.

Figure 3.3 explains the flow of the Boolean logic functionality in steps, along with parametric search. We use a set of predefined control keywords for representing the operators. The output of query formulation will look similar to the following:

CONTROL-WORD Query-Term1 CONTROL-WORD Query-Term2

3.3.2 Parametric search

Through advanced parametric search, a number of options are available to the user to give specific information about the desired documents. The options designed for the advanced parametric search

were inspired by the available parameters in popular Internet advanced searches like Google 1, Yahoo 2, Google Scholar 3, etc. We decided on the following options as search parameters:

File Format (file:)
Date (date:)
Author (auth:)
Domain (site:)
URL (url:)

We initially set these options to default values in the advanced search page depending on the selected user role. Once all parameters are finalized by the user, we add the required predefined control keywords to the search query. Figure 3.3 explains the steps involved in the advanced parametric search functionality. The parametric search makes use of the parametric index described in section 3.2.4 for filtering the results.

3.4 Role Based Query Expansion

This section describes a QE module where we incorporate various heuristics into traditional QE techniques in order to implement role based personalization. QE uses two techniques, namely global analysis and local analysis. Role information is assumed to be available in advance to the QE module. Both global and local analysis work on individual query terms and do not regard multi-term queries as a single entity, i.e., both techniques use a bag-of-words approach. Our system provides a role selection option in the user interface, which also takes the user query. Global analysis uses thesauri to find associations with query terms. We use WordNet 4 as a general purpose thesaurus for lexicon coverage and a role based thesaurus (described in section 3.2.3) for giving emphasis to terms belonging to the selected role. Local analysis methods try to make use of documents relevant to the query. This approach gives a general model to gather terms for QE using the relevant and non-relevant sets of documents for the query produced by the relevance feedback process. We use this model for blind relevance feedback to compute role relevant terms co-occurring with query terms by analyzing the top k results retrieved from the first pass of the original query.
These methods are described in more detail in the following subsections.

Figure 3.4 Design of the QE module, showing the global and local analysis techniques working together to expand the query.

Figure 3.5 An example execution of the global analysis phase on a sample query.

3.4.1 Global analysis techniques

We use a two step QE process in global analysis. Most of the time, the information required by the user is expressed differently by the authors (different words with the same meaning might be used). Therefore the first step utilizes a general purpose thesaurus or dictionary of the language in which the query terms are produced. The synonyms, hypernyms and hyponyms of the individual query terms are added to the query. By doing this, we cover a broader range of the lexicon representing the same concepts that are represented by the query terms. Since we work with English, our choice of thesaurus was the English WordNet. The second step utilizes the co-occurrence thesaurus constructed from the training corpus (section 3.2.3) for the selected role. This thesaurus lends the terms most frequently collocated with the terms of the intermediate query from step one. This gives us a final expanded query which is then used for retrieving the results. The addition of new terms to the original query can be weighted; however, we use an OR combination of the original query and the added terms. A small example is shown in figure 3.5 for a better understanding of global analysis (the example is not very accurate, though). In this example, we assume that there is only one document belonging to the selected role in the training corpus. The text of this document is the same as the example given in section 3.2.3, where we described the role based thesaurus (the stop-word removed text used in that example was giant panda lives few mountain ranges central China mainly Sichuan province Shaanxi Gansu provinces Due farming deforestation development panda driven out lowland areas where once lived ).

1 Google web search engine.
2 Yahoo web search engine.
3 Google Scholar (beta).
4 WordNet, a lexical database for the English language.
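A minimal sketch of the two-step global expansion described above, assuming stand-in dictionaries in place of WordNet and the role based co-occurrence thesaurus (the sample entries are illustrative):

```python
def global_expand(query_terms, general_thesaurus, role_thesaurus, role):
    """Step 1: add synonyms/hypernyms/hyponyms from a general thesaurus.
    Step 2: add co-occurring terms for the selected role.
    New terms are OR-combined with the original query."""
    step1 = list(query_terms)
    for term in query_terms:
        step1.extend(general_thesaurus.get(term, []))
    expanded = list(step1)
    for term in step1:
        expanded.extend(role_thesaurus.get(role, {}).get(term, []))
    seen, final = set(), []           # preserve order, drop duplicates
    for t in expanded:
        if t not in seen:
            seen.add(t)
            final.append(t)
    return final

general = {"panda": ["bear"]}                    # stand-in for WordNet
role_th = {"student": {"panda": ["china"]}}      # stand-in role thesaurus
q = global_expand(["panda"], general, role_th, "student")
```

The returned list is then joined with OR to form the final expanded query, as in the text.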

3.4.2 Local analysis techniques

In the local analysis segment, we adjust the query relative to the documents that initially appear as a match to the original query. As stated earlier, this process uses a bag-of-words approach to process the query. We use the top k documents retrieved in the first pass of the query using vector space retrieval and treat these documents as relevant. This idea is also known as pseudo or blind relevance feedback [42] and is a frequently used technique across both commercial and academic information retrieval. We analyze the co-occurring word statistics, and the terms that are observed to occur more frequently with query terms are utilized in expanding the original query. The method can also incorporate proper relevance feedback, as we will see. Suppose that we have relevance judgments for a query Q. Let the set of relevant documents be Rel and the set of non-relevant documents be nRel. We use the same sliding window approach to find term collocations that we described in section 3.2.3. Each term T is represented by a vector of associated or co-occurring terms as

\[ \vec{T} = (\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \langle t_3, w_3 \rangle, \ldots) \]

where t_i is a term co-occurring with T and w_i is the weight of the association. We compute this weight as

\[ w_i = \frac{\sum_{d \in collection} \sum_{L \in d} f_L(T, t_i)}{\sum_{d \in collection} \big( tf(d) - |L| \big)} \tag{3.8} \]

where d is a document in the collection which is analyzed for collocation statistics; L is a co-occurrence observation window in d; f_L(T, t_i) = 1 if t_i and T co-occur in window L, zero otherwise; tf(d) gives the total terms in d; and |L| is the window size. We compute the co-occurrence vector for both Rel and nRel to get T_Rel and T_nRel. A difference in the similarities of these vectors with the query vector gives us a metric to associate the term T with the query.
Let the query vector of query Q be \(\vec{Q} = (\langle q_1, 1 \rangle, \langle q_2, 1 \rangle, \langle q_3, 1 \rangle, \ldots)\) (the weights are uniform); then the score is computed as

\[ Score(Q, T) = CosineSimilarity(\vec{Q}, \vec{T}_{Rel}) - CosineSimilarity(\vec{Q}, \vec{T}_{nRel}) \tag{3.9} \]

In blind relevance feedback we only have a pseudo relevant set and there is no non-relevant set, so we use only \(\vec{T}_{Rel}\) for scoring the association of T with Q:

\[ Score(Q, T) = CosineSimilarity(\vec{Q}, \vec{T}_{Rel}) \tag{3.10} \]

The highest scoring n terms are used to expand the query. The discussion so far in the local analysis segment has focused on generic QE. To implement a role based flavor of pseudo relevance feedback, we compute the weights w_i of the terms t_i

which are the components of \(\vec{T}\), using the role relevant scores. We described the process of attaching role relevant scores during the indexing process using a role classifier in section 3.2.4. Every indexed document has a role relevant score for all the available roles, and the sum of these scores is 1. For \(\vec{T} = (\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \langle t_3, w_3 \rangle, \ldots)\), the weights are computed with a small modification of equation 3.8, as shown in equation 3.11:

\[ w_i = \frac{\sum_{d \in Rel} \sum_{L \in d} f_L(T, t_i) \cdot RS_{R_k}(d)}{\sum_{d \in Rel} \big( tf(d) - |L| \big)} \tag{3.11} \]

where RS_{R_k}(d) is the role relevant score of document d for the role R_k. A cosine similarity score between \(\vec{Q}\) and \(\vec{T}\) then gives us the degree of association by which T co-occurs with query terms in documents relevant to Q for the role R_k.

3.5 Role Based Re-ranking

We present a modified re-ranking approach for rearranging the results obtained from the search engine on the basis of the role. For every document retrieved by the search engine, two scores are fetched. The first is the cosine similarity score of the document; this is actually the search engine score, as we are using a vector space model for retrieval. The second is the role relevant score of the document with respect to the selected role. This is obtained by extracting the role relevant score vector (described in section 3.2.4) from the index for the document and then parsing it to extract the score of the selected role. A linear combination of these two scores is used to compute a final score, presented in equation 3.12:

\[ FinalScore(D) = \alpha \cdot COS(Q, D) + (1 - \alpha) \cdot RS_{R_k}(D) \tag{3.12} \]

In equation 3.12, COS(Q, D) is the cosine similarity between query Q and document D, RS_{R_k}(D) is the role relevant score of D for the role R_k, and α is the role personalization factor ranging from 0 to 1. Role based ranking can be evaluated for the entire or a partial result set, depending on how many results are to be presented, keeping the speed of retrieval in mind. We evaluate this score for the entire result set.
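Equation 3.12's linear combination can be sketched directly; the document scores and role relevance vectors below are illustrative:

```python
def final_score(cos_sim, role_scores, role, alpha=0.5):
    """FinalScore(D) = alpha * COS(Q, D) + (1 - alpha) * RS_Rk(D)."""
    return alpha * cos_sim + (1 - alpha) * role_scores.get(role, 0.0)

# doc -> (search engine cosine score, role relevance vector from the index)
results = {
    "d1": (0.80, {"faculty": 0.10}),
    "d2": (0.60, {"faculty": 0.70}),
}
ranked = sorted(results,
                key=lambda d: final_score(results[d][0], results[d][1], "faculty"),
                reverse=True)
```

With α = 0.5, d2 scores 0.65 against d1's 0.45, so the role relevant document is pushed above the one the engine alone preferred.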
Then the result set is sorted in descending order of the combined score and presented to the user. This approach tries to push up those documents in the result set which are closely related to the selected role. A high α value makes the role relevant documents dominate the result set.

3.6 Experiments and Evaluations

In this chapter, we presented role based personalization of enterprise search by implementing role based query expansion and re-ranking of search results. This section presents the empirical observations of this approach.

3.6.1 Data set: IIIT intranet

For evaluating the proposed approaches, we needed a role tagged enterprise search corpus with role based query sets as well as role based relevance judgments. To the best of our knowledge, no such evaluation dataset existed as an open source resource. Therefore we used the web pages of our own Intranet for evaluating the presented approaches. Since the size of the IIIT-H enterprise collection is relatively small, it was easy to manually create search topics or queries belonging to different roles and to carry out the respective relevance judgments. We used a web document extractor tool to extract all web pages of the IIIT-H Intranet sites. The extracted documents were filtered by removing undesired pages (empty pages or error message pages). Out of these, we prepared a training corpus containing documents tagged with role annotations for document classification (with the Rainbow toolkit 5) and for role based thesaurus generation.

3.6.2 Roles

Six roles for the IIIT-H Intranet test data were selected on the basis of a common university structure:

1. Under Graduate student
2. Post Graduate student
3. Faculty
4. Staff
5. Candidate Faculty
6. Candidate PG Student

We had a final count of 2600 documents for the Faculty role, 150 documents for the Staff role, 400 documents for the Candidate PG role, 261 documents for the Candidate Faculty role and nearly 2000 documents for each of the PG and UG roles in our corpus. The UG and PG roles have many documents in common. A subset of pages from the test data, i.e., the IIIT intranet, was manually classified based on their query independent relevance to each role. This subset was used as the training data for the document classifier so that a role based model for document classification could be constructed.

3.6.3 Search Engine and Query Formulation

In order to select an appropriate search tool, a detailed study of available open source search engines was conducted. We found that Lucene 6 and Nutch 7 are the most popular among the research community.
These search engines are based on the vector space model. Lucene is the full text search API from the Apache

5 Rainbow, Naive Bayes document classifier. mccallum/bow/rainbow/
6 Apache Lucene, text search engine library in Java.
7 Apache Nutch, open-source web-search software, built on Lucene Java.

software foundation, and Nutch is a complete web search solution which includes a web crawler and document parsers.

Boolean Logic

The Lucene query parser implements the OR functionality and also provides support for many other features, such as searching the title of a page, nested Boolean queries, searching for words that occur within a few positions of one another, searching for all terms that start with a specified prefix, searching for terms that are similar to a specified term, etc. Hence we modified the Nutch source code to use the Lucene query parser. The Boolean operators and other query term functionality implemented so far are AND, OR, NOT, Near, Phrase Search and Title Search.

Parametric Advanced Search

For parametric advanced search, we dealt with the extraction and indexing of metadata (like author name, date, file type, etc.) of the crawled documents. The plugins used for this functionality are index-more and the QueryMore functionality of the Lucene API. Using IndexMore we index the author's name, date, type and URL of the document, while the QueryMore plugin provides the functionality to parse the query and extract the advanced operators.

3.6.4 Query Expansion

The global analysis techniques for query expansion (QE) used WordNet and the automatically generated role based thesaurus. WordNet 8 is a semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, and provides definitions and the semantic relations between these synonym sets. We used JWNL 9, a Java based API for accessing WordNet-style relational dictionaries, which provided an interface for working with WordNet. The IIIT-H Intranet training corpus was used to create the role based co-occurrence thesaurus. We split the text of documents into sentences by considering a sequence of words between two punctuation marks as one sentence. A sliding window of length 8 was used over all the sentences, and words present in a window were assumed to be co-occurring.
We used hash maps (dictionary data structures) to store the word co-occurrence information. Each word maps to another hash table which stores the co-occurring words and the frequency of co-occurrence. The co-occurrence thesaurus was stored as hash map objects, and these were persisted using JDBM 10, a transactional persistence engine for Java objects. During QE, for efficiency we load into main memory only the information required (the co-occurrences for a given word).

8 WordNet, a lexical database for the English language.
9 JWNL, Java WordNet library.
10 JDBM, a transactional persistence engine for Java.
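JDBM is Java-specific; as a rough Python analogue of the same layout, the standard shelve module can persist one co-occurrence map per word, so that only the entry for a query term is loaded into main memory at expansion time (the sample window and paths are illustrative):

```python
import os
import shelve
import tempfile
from collections import defaultdict

def build_store(path, windows):
    """Persist word -> {co-occurring word: frequency} maps, one entry per word."""
    maps = defaultdict(lambda: defaultdict(int))
    for window in windows:
        for w1 in set(window):
            for w2 in set(window):
                if w1 != w2:
                    maps[w1][w2] += 1
    with shelve.open(path) as db:
        for word, cooc in maps.items():
            db[word] = dict(cooc)

def cooccurrences(path, word):
    """Load only the map needed for one query term."""
    with shelve.open(path) as db:
        return db.get(word, {})

path = os.path.join(tempfile.mkdtemp(), "cooc")
build_store(path, [["giant", "panda", "lives"]])
panda_cooc = cooccurrences(path, "panda")
```

Keying the store by word mirrors the nested hash-map design: each lookup touches a single persisted entry rather than the whole thesaurus.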

3.6.5 Indexing

Normal term-document association indexing is carried out by the Nutch indexing functionality. For determining document-role associations, we used a supervised naive Bayes document classifier trained with the IIIT-H role based training corpus. We used the Rainbow toolkit for classification. After training, each document fetched while crawling the Intranet was classified and its role relevancy scores were determined. We stored these scores as a field in Lucene's inverted index entry for the document.

3.6.6 Retrieval and Re-ranking

As mentioned earlier in this chapter, we use a linear combination of the search engine score and the role relevancy score of every document for re-ranking, with α as the smoothing parameter or personalization factor. We configured the α value as 0.5. The results are retrieved and ordered on the basis of the search engine score.

3.6.7 Results

The Text REtrieval Conference (TREC) is one of the contributors of data collections and associated evaluation tools for evaluating IR systems. The TREC evaluation package is standardized, as the format of relevance judgment files is uniform for most datasets. We have used extensions of the existing TREC evaluation setup for our experiments in order to accommodate role information. The basic input files for the TREC evaluation package are the QRELS FILE and the BASELINE TOP FILE. The QRELS FILE contains the relevance judgments; these contain both the sets of relevant and irrelevant documents. The BASELINE TOP FILE contains the search results and other required information about them, such as QueryID, DocID, rank and the score given by the search engine. We compare it with the QRELS FILE in our evaluation process. To evaluate role based search, we added an extra field, RoleID, to both the search results and the relevance judgments, and modified the existing TREC evaluation package accordingly. Table 3.1 presents the results we observed after evaluating our approach.
Description of the column headings in table 3.1:

1: Baseline: (TFIDF algorithm) without role information
2: QE with GA: query expansion using the Global Analysis technique
3: QE with LA: query expansion using the Local Analysis technique
4: QF and re-ranking: query formulation with re-ranking
5: All features combined: all the above features, QE + GA + LA + QF + re-ranking

Table 3.1 The evaluation results of Role Based Enterprise Search. Columns: Baseline, QE with GA, QE with LA, QF and Re-ranking, All Features Combined. Rows: Number of Queries, Results Retrieved, Actual Relevant, Relevant Retrieved, MAP, MRR, P at 5.

Description of the row headings in table 3.1:

1: Number of queries for which evaluation is conducted. The 4th column contains only 108, since no results were returned for two queries due to query re-formulation
2: Number of hits returned by the algorithm for all the queries
3: Actual number of relevant results for these queries present in the dataset
4: Number of relevant results identified by the engine for these queries (demonstrates recall)
5: MAP: a technical IR metric that gives a global average of precision; it gives an indicative ranking of systems, but not in absolute percentages
6: MRR: average reciprocal rank of the first relevant document (a value of 1 means that for all queries the first result was relevant)
7: P at 5: probability of relevant documents in the first 5 search results (demonstrates ranking capability)

Comments

As is evident from table 3.1, using role information improves search quality compared to plain text search. However, our approach did not perform best with all the modules combined, as can be seen when the last column is compared with the others. The reason could be that the noise added due to personalization dominates the actual query. Role based QE performs better than query reformulation and re-ranking of results. Within QE, the global analysis technique performed better than the local analysis technique; however, the difference is not significant. Thus it can be concluded that in role based personalization, QE using a role based thesaurus performs better than pseudo relevance feedback. This should be obvious

as pseudo relevant sets, though often helpful in discovering term associations with query terms, do not sufficiently distinguish term associations on the basis of roles. Though query reformulation and re-ranking of search results improve the performance (as seen by comparing the last row values), this technique might also result in fewer hits.

3.7 Summary

This chapter described role based personalization in enterprise search through local and global analysis of the enterprise collection for implementing a role based QE method. We also described a re-ranking approach on the basis of roles. The key features of this chapter were the role tagged training corpus, the construction of a role based co-occurrence thesaurus, and the training of a document classifier using the role tagged corpus for re-ranking. The evaluation of these techniques was also presented, including the experimental setup and the results.

Chapter 4

Query Expansion and Re-ranking with Domain Independent Knowledge

This chapter introduces a QE technique which makes use of an external, domain independent text resource: Wikipedia. Our idea constructs a concept thesaurus by analyzing inter-article links and article-category distributions. We keep our focus on Wikipedia, as it is one of the most elaborate and largest electronic encyclopedias. However, since our techniques are designed to work with document collections having a substantial number of interlinks, the proposed approach can also be generalized to other collections. Our method also builds a concept vocabulary for representing a user query as a combination of Wikipedia concepts. During QE, the concept thesaurus is used to add further terms to the query by finding concepts related to the query's representative concepts. In addition to QE, the chapter also presents a novel approach to re-rank the search results using a slightly different form of pseudo relevance feedback than the one used traditionally for QE. The top k fetched documents are used to identify dominant document classes in the pseudo relevant set rather than QE terms. The dominant classes are identified using a document classifier trained on the external domain independent textual knowledge base, i.e., Wikipedia. Our decision to use Wikipedia is for maintaining the generality offered by a domain independent resource. The English version offers a huge number of concepts (more than 2 million) and their descriptions as articles from every domain. These articles are also categorized by human users into a large number of categories, and the category set (with more than 400,000 categories) provides an in depth coverage of document classes from almost all domains. A minimal set of tagged documents (categorized articles) can be extracted from this set using a partially supervised approach that we propose in this chapter.
Section 4.1 discusses the relevance of using Wikipedia for enhancing ES and describes the structure and formats of Wikipedia. We describe the concept vocabulary and the vocabulary extraction process in section 4.2. Then we discuss the criteria and methodology for constructing the concept thesaurus out of the Wikipedia link structure in section 4.3. QE for ES is described in section 4.4. In section 4.5, we discuss mapping enterprise document classes to category-article distributions for training the classifier. Section 4.6 describes the re-ranking method using the Wikipedia-category based classification of the pseudo relevant set. We present the experimental setups for evaluating the proposed approaches in section 4.7, along with the observed results.

4.1 ES and Wikipedia

ES systems are still far from fully realizing the potential of Wikipedia-like knowledge bases [12]. ES relies on domain specific semantic tools and data such as ontologies [26], tagged document sets, etc., which in most cases are manually constructed and are expensive to create and maintain. Automated mining techniques involve the use of enterprise specific data and respective customizations, thus losing their generality across different verticals. Documents and web pages in an enterprise lack a strong link structure, which, on the other hand, is abundant on the web [17]. While finding the presence of query terms is trivial in enterprise document retrieval, establishing the importance of a particular document over others in the result set is not. As we mentioned in chapter 2, users tend to formulate incomplete and ambiguous queries, so it may be a wise idea to work with query terms before starting the search process. QE is one such approach, and it is often applied in ES. This chapter presents our approach of harnessing semantics from an external domain independent knowledge resource for improving the search process within an enterprise environment, while keeping the focus on text retrieval. QE works with two types of methodologies in the absence of relevance feedback. One way is to work with the top k results by treating them as relevant and extracting expansion terms from them using some metric that associates the expansion terms with the query. The other way is to use a thesaurus-like structure to borrow synonyms or terms semantically related to query terms and phrases. We described how English WordNet (a general purpose dictionary) can be used in expansion in chapter 2. Several implementations also employ thesauri constructed from the enterprise data itself.
We previously described an automatically constructed thesaurus for role based personalization. Manually created thesauri and ontologies are also popular as tools for expansion. The main advantage of a manually created thesaurus (whether enterprise specific or general purpose) is the accuracy of the relations between the terms. However, any such repository is time consuming and very expensive to create and maintain, because such resources are maintained by a handful of individual experts. What if we had a knowledge base which is community driven and whose pool of content creators is large enough to cover all possible knowledge aspects of any topic? Wikipedia is a very good answer to this question. The topic coverage of Wikipedia is arguably more comprehensive than that of any other knowledge resource on the planet. Moreover, the coverage is not limited to functional words like nouns, pronouns and adjectives; Wikipedia includes a very large number of named

Figure 4.1 A typical Wikipedia article page

entities and multi-term topics (topics which have complex names). For example, the number of words or entities in WordNet is around 155 thousand, whereas that in Wikipedia is well above one million. This work presents the initial phase of extracting semantics from knowledge bases, where we create a domain independent (not enterprise specific) thesaurus and also a vocabulary set for expansion experiments. It is our hypothesis that, due to its scale of comprehensiveness, a Wikipedia based thesaurus can cover topics specific to most enterprises. The depth and breadth of topic coverage prompted us to experiment with Wikipedia topics for QE. The key novelties of the techniques suggested in this chapter are the use of Wikipedia as the knowledge resource and the treatment of multi-word queries as singular conceptual entities for extracting synonymous and semantically related concepts for expansion. Going a step beyond conventional thesaurus based expansion techniques, which perceive a search query as a collection of keywords with each word regarded as a separate concept, our procedure treats a search query as a single semantic unit.

Figure 4.2 A small real category network example from Wikipedia. Note that an article may belong to more than one category, and similarly a category can have multiple super categories.

4.1.1 Wikipedia knowledge-base and structure

Wikipedia is a collaborative knowledge-base consisting of millions of pages. The main knowledge content is in the form of article pages or, simply, articles. Articles are informative texts which list and explain millions of topics (one topic per article), written by humans in natural language. Figure 4.1 shows a typical article page, about Management. An article page contains an article name (underlined in blue in figure 4.1). Following the article name is the descriptive text explaining the main theme (represented by the article name), divided neatly into various sections. Every page, including the articles, is uniquely identified by a URL (underlined in green in figure 4.1). Wikipedia has a convention of giving unique names to pages: for example, no page except the one shown in figure 4.1 will have the name Management. The page name also forms part of the URL; a typical URL in the English Wikipedia looks like en.wikipedia.org/wiki/page name, where spaces are replaced by underscores. Wikipedia not only contains informative text in the form of articles; a strong structure of links and category pages is also present to give users a more navigable and comprehensive browsing experience. Articles are richly linked to other articles from within their text. Authors of pages can create these links to point to the description of terms and phrases used in the text of an article. We call these links inter-article links. Links to articles are shown as red underlined terms in figure 4.1. In our example of the Management article, the text contains words like planning and organizing which, when clicked, open the articles about Planning and Organizing respectively. A small and insignificant number of inter-article links are open ended and point to pages which have been removed or not yet created.

In addition to articles, a Wikipedia page can also be a redirect, talk or specials page, a portal page, a category page, etc. We are interested in category pages, which represent the categories of articles. Many articles may contain information about a common domain or may share common features. For example, the articles Asiatic Lion and Tiger both describe animals that are felines and endangered, and both are animals found in India. This commonality is represented by Wikipedia categories. A link from an article to a category page represents that the article belongs to that category. Figure 4.2 gives a small actual article-category link graph and can be referred to for a better understanding of the category network. Category pages also have links to other categories, which represent super-sub category relationships. Category pages give us two advantages. First, they club articles together into sets of topics having some kind of similarity. Second, they also give the categories a nice hierarchical, tree-like structure. At the top there is the Wikipedia root category, whose children are major categories like Culture, Geography, History, Mathematics, Science, People, Society and many more. Each of these categories is further refined by its child categories. As stated, each category is represented by its category page, which contains links to the articles belonging to it and links to its sub and super categories.
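The article-category and super-sub category links can be sketched as a small graph structure; the article and category names below are illustrative:

```python
from collections import defaultdict

class CategoryGraph:
    def __init__(self):
        self.article_cats = defaultdict(set)   # article -> its categories
        self.super_cats = defaultdict(set)     # category -> super categories

    def add_article(self, article, categories):
        self.article_cats[article] |= set(categories)

    def add_subcat(self, category, super_category):
        self.super_cats[category].add(super_category)

    def ancestors(self, category):
        """All super categories reachable upward (the tree-like hierarchy)."""
        seen, stack = set(), [category]
        while stack:
            for sup in self.super_cats[stack.pop()]:
                if sup not in seen:
                    seen.add(sup)
                    stack.append(sup)
        return seen

g = CategoryGraph()
g.add_article("Asiatic Lion", ["Felines", "Endangered animals"])
g.add_subcat("Felines", "Animals")
g.add_subcat("Animals", "Science")
lion_up = g.ancestors("Felines")
```

Because both mappings are sets, an article can belong to several categories and a category can have several super categories, as figure 4.2 notes.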
In addition to articles, a Wikipedia page can also be a redirect, talk page, special page, portal page, category page, etc. We are interested in category pages, which represent the categories of articles. Many articles may contain information about a common domain or may share some common features. For example, the articles Asiatic Lion and Tiger both describe animals that are felines and endangered; they are also animals found in India. This commonality is represented by Wikipedia categories. A link from an article to a category page indicates that the article belongs to that category. Figure 4.2 can be referred to for a better understanding of the category network; it gives a small actual article-category link graph. Category pages also have links to other categories, which represent a super-sub category relationship. Category pages give us two advantages. First, they club together articles into sets of topics having some kind of similarity. Second, they give a nice hierarchical structure to categories in the form of a tree. At the top is the Wikipedia root category, whose children are major categories like Culture, Geography, History, Mathematics, Science, People, Society and many more. Each of these categories is further fine-grained by its children categories. As stated, each category is represented by its category page, which contains links to the articles belonging to it and links to its sub- and super-categories.

4.1.2 Wikipedia corpus

Wikipedia content is freely available in the public domain in the form of periodic snapshot files. The page contents, containing all the internal links, are available for download as one large XML file. Wikipedia link structure information is summarized into database dump files which are also available in the public domain. We have worked with the July 30, 2009 snapshot of English Wikipedia. Our work mainly required the content XML file (enwiki pages-articles.xml), and the database dumps for page information (enwiki page.sql) and page link structure (enwiki pagelinks.sql).

4.2 Wikipedia Concept Vocabulary

We describe the notion of concepts in the Wikipedia universe in this section. The concept thesaurus described in a later section works with these concepts and the relations between them instead of terms. The concept vocabulary is the set of terms, phrases or names that represent a concept. We will now discuss these notions and the process of building the concept vocabulary.

4.2.1 Concepts

A Wikipedia concept is the abstract entity about which an article carries a description [25]. Every article has a name which is not a concept itself, but a set of terms representing the concept in a particular language. For example, when we say water we imply a certain liquid substance. The concept represented by the term water is known to everyone, even animals; however, the meaning of the word water is understood only by people who understand this particular English word. The same is the case with Wikipedia concepts. The English Wikipedia article named Water, the Spanish article Agua and the French article Eau all represent the same concept. But since concepts are abstract notions understood by our psyche, we need terms to represent them, and these terms constitute the concept vocabulary. The Wikipedia structure contains a number of entities that can represent concepts, as we will see in the following paragraphs.
Along with the concept vocabulary, we assign a name to every concept, which is the name of the article describing the concept.

4.2.2 Article names

The articles in Wikipedia have a title and a text description explaining a concept. The concept name is represented by the article title for a given language. The article title in general is an unambiguous and precise phrase (it can also be a single word) that best represents the concept explained by the article. These titles are good candidates to be used as concept entities in a thesaurus of Wikipedia concepts and also as concept vocabulary.

Figure 4.3 A few examples of redirects. Notice that redirects (red underlined) can be alternative names, shorter forms, abbreviations, popular usages, etc.

4.2.3 Redirects

In addition to article names, redirects and anchor texts can be additional representations of concepts. Redirects are alternative concept names that could be used as the title of an article, but instead they just link to it. For example, the article on the First World War in Wikipedia has the title World War I, and a few redirects that connect to it have the titles WW1, First World War, The Great War, etc. Figure 4.3 shows a few screen-shots of various redirects as an example.

4.2.4 Anchor-text

Anchor-texts also represent concepts. Usually, anchor-texts are parts of natural sentences and cover various forms and inflections of concept-representing words. To clarify, inflections of the word molecule, like molecules and molecular, are anchor texts in Wikipedia articles connecting to the Molecule article. Further, the semantic similarity of structurally different phrases is also captured by the anchor-text representation (for example, the anchor texts possessed by the devil, Demonic possession and Control over human form link to the same article and thus have similar meanings), not to mention the inclusion of acronyms (for example, the anchor text WWW links to the article World Wide Web).

4.2.5 Vocabulary set

Concept-representing strings are extracted from the article texts and stored in a dictionary-based data structure along with the target articles they represent. The next step is the resolution of the mapping from representative strings to their target articles. Article titles and redirect titles implicitly have a one-to-one

relationship with concepts because of the Wikipedia page creation policy, but the same anchor text can have more than one target concept. We give a small example to present this case; consider the texts... Blood smear from a P. Falciparum culture... (in article Malaria, anchor word culture) and... the filtrate of a broth culture of the Penicillium mold... (in article Penicillin, anchor word culture). The anchor culture in the first text links to the article Malaria culture, whereas culture in the second text targets Microbiological culture. This example clearly shows the uncertainty in targets in spite of them belonging to a common domain (microbiology). A string-to-target mapping can be represented as

STR = <(T_1, f_1), (T_2, f_2), (T_3, f_3), ...>    (4.1)

where T_i is one of the target articles of string STR and f_i is the respective number of times STR points to T_i in the entire Wikipedia collection. For resolving ambiguous cases, we employ a simple metric where we assign the most frequently pointed-to article as the target article of a representative string:

target(STR) = T_k, where k = argmax_i f_i    (4.2)

and STR = <(T_1, f_1), (T_2, f_2), (T_3, f_3), ...>. For all multi-concept anchors, the article that is linked most often by an anchor text across all Wikipedia inter-article links is ascertained as its final target concept. It may seem a trivial way to resolve the target conflicts for anchor text, but this approach saves us significant computational expense which would otherwise be incurred with a more adaptive but complex approach like tf.idf.

4.3 Wikipedia Concept Thesaurus

We try to capture the relationships between Wikipedia concepts in a concept thesaurus which is not just a set of synonyms. The concept thesaurus encapsulates all logical relations between concepts, marked by links. A typical article text contains many concepts used to describe the concept represented by the article. Many of these concepts are linked to their own articles.
These links are created by authors and are not automatically generated. The links thus signify concepts that users feel are important in the context of the current article. We are talking about relations like fire or combustion to gasoline, water to ocean, or American independence to July 4th, and so forth. We give emphasis to these links and pinpoint our attention on mutually linked articles, which we conceive as related concepts, as the relation is validated through a two-way link created by human intelligence. We also look into linked concepts (which may not be mutually linked) that share a common domain, in other words, that belong to common Wikipedia categories.
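Before relations are extracted, every representative string from section 4.2 must be resolved to a single target concept, following the max-frequency rule of equations 4.1 and 4.2. A minimal Python sketch of this resolution, with hypothetical data-structure names (the thesis implementation itself was in Java over MySQL tables):

```python
from collections import defaultdict

def resolve_targets(links):
    """Resolve each representative string to one target article.

    links: iterable of (anchor_string, target_article) pairs harvested
    from Wikipedia titles, redirects and inter-article anchor texts.
    Returns a dict mapping each string to its most frequently linked
    target article (the rule of equation 4.2).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for string, target in links:
        counts[string.lower()][target] += 1
    return {s: max(targets, key=targets.get) for s, targets in counts.items()}
```

On the culture example of section 4.2, the string would resolve to whichever of Malaria culture or Microbiological culture is linked more often across the collection.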

Figure 4.4 Wikipedia Link Structure and Creation of Vocabulary set and Thesaurus

4.3.1 Two way link analysis

The first step involves examining a concept and all the inter-article links from the article explaining the concept (let's call this the article under examination A; we will use A to represent the concept as well). Among the articles linked from A, the ones which have a link back to A represent mutually related concepts.

4.3.2 Link co-category analysis

In the next phase, we investigate the category pages linked to the key concept (the article being analyzed), in other words, the categories to which the key concept belongs. We call these key categories. From all the articles listed in the key category pages, we extract those which are linked from the key concept (a link from the key concept article to other articles). These singly linked articles which share a common category with the key concept give us the second batch of related articles. An important point to note here is that a small but significant number of Wikipedia categories relate to article status and do not indicate a concept domain (categories like stub article, articles to be deleted, etc.). We have taken care to implement a filter that weeds out such categories from the analysis.

4.3.3 Relation confidence score

The two batches of related articles are merged on the basis of key concepts. This yields a set of Wikipedia concepts and their respective consolidated lists of related concepts, based on the heuristics described in sections 4.3.1 and 4.3.2. An element in this batch can be represented as

RelationList[i] = [K_i : <RC_i1, RC_i2, ...>]    (4.3)

where K_i is a key concept and RC_ij are its related concepts. We now resolve how closely related two articles are. For every K_i in equation 4.3, we analyze the text of the article representing K_i and also the link structure involving the RC_ij's. A relation confidence score is computed for all candidate concepts related to K_i.
A high count of the concept vocabulary strings representing an RC_ij in the text of K_i is considered a factor for higher confidence in that particular relation. On the other hand, articles representing RC_ij's which have a relatively higher number of other articles linking to them represent more generic concepts and are considered less specific to K_i. We combine these two factors in equation 4.4 to compute the relation confidence score:

ConfidenceScore(K_i, RC_ij) = ( Σ_{v ∈ VS[RC_ij]} freq_v(text(K_i)) ) / infreq(RC_ij)    (4.4)

where VS is the vocabulary set and VS[RC_ij] gives all the representative vocabulary strings of concept RC_ij; freq_v() returns the frequency of vocabulary string v; infreq() gives the number of backward

links to an article. Kotaro et al. [25] explain a backward link as a simple relation between articles A1 and A2: A2 has a hyperlink targeting A1. The process, when applied to all key concepts in the Relation List (equation 4.3), produces the Concept Thesaurus. The i-th element in the concept thesaurus can be represented as

ConceptThesaurus[i] = [K_i : <(RC_i1, CS_i1), (RC_i2, CS_i2), ...>]    (4.5)

where CS_ij is the relation confidence score of the j-th concept related to K_i. Algorithm 1 gives a step by step procedure of Wikipedia concept thesaurus construction.

4.4 Concept Representation of Search Query

A user query is a loose representation of information needs expressed through query terms. The first step in making use of the concept thesaurus for enhancing queries is to map them to the Wikipedia concepts that best represent the query. This can also be seen as mapping a query from term space to concept space. User queries frequently lack clarity and word structure (order of words, inflections), whereas much of the concept vocabulary is a set of standard article names (with precise word order); hence steps beyond simple word matching are required for mapping a query to concepts. It is also observed that user queries often combine multiple intents and thus may require more than one concept to represent their scope. This is also the case with ambiguous queries like bank (river bank, financial bank) or application (request, software). Then there are multi-word queries which may consist of separate concepts (bikes cars, for example), phrases (e.g. cooking rice) or complex named entities (e.g. The Last of the Mohicans). The mapping of a multi-word query should capture its sense in its entirety and not just as a collection of individual key terms, thus capturing the phrase behavior of a search query. We present our approach to represent queries with the best N (the number of representations required) Wikipedia concepts through three levels of concept discovery.
We exploit the concept vocabulary and article text for determining the concepts that describe a search query. Along with maintaining a vocabulary-target list, we build a term frequency and inverse document frequency (tf.idf) based index on vocabulary strings (we call this the vocabulary index) and on article content text (we call this the content index). The first level of this process involves a simple case-insensitive string comparison of the query and concept vocabulary strings. At an abstract level, this process is represented by figure 4.4 (first half of the figure, above the relation discovery part). If an exact match is found, we consider it the best possible representation of the query, and the concept name is retained as the level one matching concept (L1MC). Second

Input: Wikipedia article list (AL) with id, title, content information
Output: Wikipedia Concept thesaurus
1: CT ← Φ
2: for all article in AL do
3:   categorySet ← getCategories(article)
4:   linkSet ← getLinkedArticles(article)
5:   vocabList ← getConceptVocab(article)
6:   for all linkedArticle in linkSet do
7:     blnRelated ← false
8:     linkSetTemp ← getLinkedArticles(linkedArticle)
9:     for all link in linkSetTemp do
10:      if link == article then
11:        blnRelated ← true
12:        break for
13:      end if
14:    end for
15:    if not blnRelated then
16:      categorySetTemp ← getCategories(linkedArticle)
17:      if commonCategoryExists(categorySet, categorySetTemp) then
18:        blnRelated ← true
19:      end if
20:    end if
21:    if blnRelated then
22:      conceptCount ← 0
23:      for all vocab in vocabList do
24:        conceptCount ← conceptCount + count(vocab, text(linkedArticle))
25:      end for
26:      inFrequency ← countIncomingLinks(linkedArticle)
27:      score ← conceptCount / inFrequency
28:      addToThesaurus(CT, article, linkedArticle, score)
29:    end if
30:  end for
31: end for

Algorithm 1: Wikipedia Concept Thesaurus extraction. The algorithm implements the CT extraction process. Lines 7 to 14 extract cross-linked articles for every concept. The link co-category analysis is carried out by lines 15 to 20. Lines 21 to 29 determine the relation confidence scores for all related concepts. The output of this algorithm is the final CT.
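The scoring step of Algorithm 1 (lines 22 to 27, corresponding to equation 4.4) can be sketched in Python as follows. The thesis implementation itself was in Java, so this is only an illustrative rendering with hypothetical argument names:

```python
def relation_confidence(key_text, related_vocab, backward_link_count):
    """Relation confidence score of equation 4.4.

    key_text:            text of the key-concept article K_i
    related_vocab:       vocabulary strings representing the related
                         concept RC_ij (from the vocabulary set VS)
    backward_link_count: number of articles linking to RC_ij, i.e.
                         infreq(RC_ij); generic concepts score lower
    """
    text = key_text.lower()
    frequency = sum(text.count(v.lower()) for v in related_vocab)
    return frequency / backward_link_count
```

A concept whose vocabulary occurs often in the key article's text, but which is not linked from many other articles, receives the highest confidence.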

Figure 4.5 Concept Representation and QE Module

Input: CV, CVI, CCI, Q, N
# Concept Vocabulary, Concept Vocabulary Index, Concept Content Index,
# Query and number of representations required, respectively
Output: List of representative concepts
1: RL ← φ # list of representative concepts
2: L1MC ← null # string for exact query vocabulary match
3: L2ML ← φ # list of concepts for query vocabulary match on cosine similarity
4: L3ML ← φ # list of concepts for query content match on cosine similarity
5: T ← cosine similarity threshold (empirically determined)
6: count ← 0
7: L1MC ← getExactMatchConcept(CV, Q)
8: L2ML ← getTopMatch(CVI, Q, N)
9: L3ML ← getTopMatch(CCI, Q, N)
10: if L1MC != null then
11:   addToList(L1MC, RL)
12:   count ← count + 1
13: end if
14: for all title in L2ML do
15:   if cosSimilarity(Q, title) > T then
16:     addToList(title, RL)
17:     count ← count + 1
18:     if count > N then
19:       break for
20:     end if
21:   end if
22: end for
23: QC ← removeStopWords(Q)
24: for all title in L3ML do
25:   if partialTermMatch(QC, title) then
26:     addToList(title, RL)
27:     count ← count + 1
28:     if count > N then
29:       break for
30:     end if
31:   end if
32: end for
33: return RL

Algorithm 2: Determination of query-representational Wikipedia concepts. The algorithm builds the representation list (RL) in lines 10 to 32. L1MC is discovered in line 7. Lines 8 and 9 populate L2ML and L3ML respectively. RL uses L1MC in lines 10 to 13. Lines 14 to 22 describe the addition of elements from L2ML to RL. The algorithm then appends elements from L3ML to RL in lines 24 to 32. The output of the algorithm is a list of Wikipedia concept names representing the query.
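The cosSimilarity check used by Algorithm 2 can be realized with a simple bag-of-words cosine. The sketch below uses raw term frequencies; the actual setup matches against tf.idf-weighted Lucene indexes, so this is a deliberate simplification:

```python
import math
from collections import Counter

def cos_similarity(a: str, b: str) -> float:
    """Cosine similarity between two short texts over raw term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Note that the measure is insensitive to word order, which is why it tolerates the loose word structure of user queries.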

level concept discovery is done through partial matches of vocabulary phrases, based on cosine similarity scores between the vocabulary index and the query. The concept names of the top N matches are stored in the level two match list (L2ML). The third level of discovery fetches the top N matching articles on the basis of their performance on a cosine similarity test between the query and the content index. The names of the concepts represented by the fetched article pages are stored in the level three match list (L3ML). Algorithm 2 systematically presents this approach.

After applying the three levels of query-concept matching, we start building the concept representation list. The fetched concepts are prioritized on the basis of their levels. The first element in the representation list (RL) is L1MC, if it is not null. Next we append the L2ML entries to RL, using a cosine similarity threshold, where we only add those concept names whose cosine similarity score exceeds the threshold. After going through L2ML, the elements from L3ML are used to append RL, with a threshold check similar to the L2ML check. The process of appending RL stops as soon as its size reaches N.

4.5 Query Expansion

After formulating the concept representation list for the user query in the previous step, we retrieve from the concept thesaurus a list of the M most related concepts (we call this a CT-list) for every concept representing the query. These related concepts include senses semantically related to the query. The next step in QE is to merge the N CT-lists. Concepts which repeat across these N CT-lists should be given more importance; hence their relation confidence scores from the different CT-lists are added.
For example, suppose for the query water we have two representational concepts, namely Water and Body of water, and for these concepts the respective CT-lists are

<(Ocean, 0.25), (Liquid, 0.15), (Sea, 0.1)> and <(Water, 0.25), (Sea, 0.15), (river, 0.1)>;

the merged CT-list will be

<(Ocean, 0.25), (Liquid, 0.15), (Sea, 0.25), (Water, 0.25), (river, 0.1)>

The final list is sorted in decreasing order of relation confidence scores and the top N concepts are added to the query. Repetitions and stop words are removed from the reconstructed query. Subsequently, we have the final expanded query containing terms from the original query and from the related concepts. This process is represented by the lower half of figure 4.4 and described in algorithm 3.
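The CT-list merging just described can be sketched as follows (an illustrative Python rendering of the merge-and-sort step; the sample data reuses the water example):

```python
from collections import defaultdict

def merge_ct_lists(ct_lists):
    """Merge the CT-lists of all query-representing concepts, adding
    the relation confidence scores of concepts that repeat across
    lists, then sort by decreasing merged score."""
    merged = defaultdict(float)
    for ct_list in ct_lists:
        for concept, score in ct_list:
            merged[concept] += score
    return sorted(merged.items(), key=lambda item: -item[1])

ranked = merge_ct_lists([
    [("Ocean", 0.25), ("Liquid", 0.15), ("Sea", 0.10)],  # CT-list of Water
    [("Water", 0.25), ("Sea", 0.15), ("river", 0.10)],   # CT-list of Body of water
])
```

Here Sea accumulates 0.10 + 0.15 = 0.25, matching the merged list shown above.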

Input: CT, Q # Concept Thesaurus, search query
Output: query expanded with Wikipedia concepts
1: SL ← φ # list of related concepts
2: N ← number of top related concepts, determined empirically
3: RL ← getConceptRepresentation(Q)
4: for all concept in RL do
5:   TL ← CT.getTopRelatedConcepts(concept, N)
6:   for all relConcept in TL do
7:     if SL.hasConcept(relConcept) then
8:       SL.addScores(relConcept, relConcept.score)
9:     else
10:      SL.append(relConcept, relConcept.score)
11:    end if
12:  end for
13: end for
14: sortListOnHighScores(SL)
15: str ← SL.concatTopConceptNames(N)
16: str ← Q + removeStopWords(str)
17: str ← removeRepetitiveWords(str)
18: return str

Algorithm 3: QE with Concept Thesaurus

Figure 4.6 Wikipedia Category section on the article page

4.6 Enterprise Document Classes

In this section, we explain our approach for finding dominant enterprise document classes in search results using a Wikipedia-category powered classification, and for re-ranking results on the basis of the dominant classes.

4.6.1 Mapping document classes to Wikipedia categories

We discussed the category structure in section 4.1.1. A small example is given in figure 4.2 to help better understand the kinds of links that exist in the article-category and category networks. By article-category network we mean the graph where categories and articles are nodes and links are represented by edges between them. This can be extracted from the hyper-links to categories from articles (e.g. the links in the page given in figure 4.6). Similarly, the category network can be represented with category pages as nodes and links between them as edges. The category graph is extracted from the links present in category pages to other categories (e.g. the links in the category page given in figure 4.7). The difference between the two graphs is that in the former, an edge signifies an article belonging to a particular category, while in the latter, an edge represents a super-sub category relation.

In an enterprise environment, content creation is task specific as well as controlled, and document classes are known. These classes can be mapped to Wikipedia categories using the class names. This is achieved with the help of the Wikipedia concept vocabulary set built from article names (concept names), article redirects (concept synonyms) and anchor text in hyper-links pointing to other articles (concept representations used in a natural sentence). The concepts representing a class return a seed list of categories representing that class.

Figure 4.7 Wikipedia Category page listing articles that fall in this category

We work with a preconceived list of classes which have well defined class names and a small distinct vocabulary belonging to each category. Let the set of document classes be C = {C_1, C_2, ...} where C_i is the i-th enterprise document class. Each class can be represented as a list of associated vocabulary strings where the first element is the class name:

C_i = <CN_i, CV_i1, CV_i2, CV_i3, ...>    (4.6)

Here, CN_i is the class name and CV_ij is the j-th vocabulary string of the i-th class. For every vocabulary set of enterprise classes, we match the elements of the Wikipedia concept vocabulary we constructed in section 4.2. Matching vocabulary strings give us matching articles (the target articles of the matching vocabulary). The prominent categories to which these articles link in the article-category network form a candidate category list representing a particular document class. Using the Wikipedia category graph, we further populate this list with the nearest neighbors of the seed categories, up to a humanly manageable size. The candidate list is then presented to human users for filtering out irrelevant categories. The process yields a finalized category list for every enterprise document class, and the articles related to these categories formulate the training corpus. This step returns us a Wikipedia category list for every document class C_i, which we call WCL_i:

WCL_i = {WC_i1, WC_i2, WC_i3, ..., WC_im}    (4.7)

Here, WC_ij is the j-th Wikipedia category in the category list WCL_i of the i-th document class.
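The mapping from an enterprise class to its seed Wikipedia categories can be sketched as below. The data-structure names are hypothetical; the real system works over the MySQL-backed vocabulary and link tables, and the seed list is then expanded through the category graph and filtered by humans:

```python
def seed_categories(class_vocab, vocab_to_article, article_categories):
    """Return a seed set of Wikipedia categories for one document class.

    class_vocab:        class name plus associated vocabulary strings
                        (the <CN_i, CV_i1, ...> list of equation 4.6)
    vocab_to_article:   concept vocabulary string -> target article
    article_categories: article -> set of categories it belongs to
    """
    seeds = set()
    for term in class_vocab:
        article = vocab_to_article.get(term.lower())
        if article is not None:
            seeds |= article_categories.get(article, set())
    return seeds
```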

4.6.2 Training corpus from Wikipedia articles

The off-line phase of re-ranking involves classifying every document present in the enterprise collection into the top m classes with the help of the Wikipedia article training corpus. For every WCL_i from the previous step, we compile a list of articles which link to some WC_ij ∈ WCL_i. As is clear from figure 4.6, an article may belong to many categories, so we need a measure of how close an article is to the categories in WCL_i. We give more emphasis to the articles which have a higher presence of category names in their text, i.e. the cosine similarity score of the category names in WCL_i and the article text is high. For a given WCL_i, we also rate higher those articles which belong to fewer categories overall, in other words, whose inverse category count is high. A category specific score (CSS) is given to every article belonging to categories in WCL_i, as given in equation 4.8:

CSS_i(A) = ( |cat(A) ∩ WCL_i| / |cat(A)| ) · Σ_{WC_ij ∈ WCL_i} CosSim(name(WC_ij), text(A))    (4.8)

Here, article A belongs to at least one category in the set WCL_i; cat(A) is the set of categories to which article A belongs; CosSim() is the cosine similarity function. The text of the top n articles based on the CSS_i score formulates the training corpus for a Naive Bayes classifier for the class C_i.

Post training, we classify enterprise documents in such a way that each document is given a class-relative score. For example, if there are four document classes overall, the classification process will give each document four scores, one per class. This information is indexed along with the normal search index. So a document D has a classification score vector

D = <CSS_1, CSS_2, ...>    (4.9)

where CSS_i is the classification score of D for the i-th class.

4.6.3 Re-ranking search results

The goal of the re-ranking process is to relocate the documents which belong to prominent classes in the top k retrieved documents.
This method would work best if we had relevance judgments for queries; in the absence of those, we use a pseudo-relevant set of the top k retrieved documents. The aim is to improve precision in search results by observing the prominence of document categories. During a typical query based search (document retrieval), the class specific scores of the top k results are used to evaluate a pseudo relevant score (PRS) for every class:

PRS_i = ( Σ_{D ∈ PRset} CSS_i(D) ) / k    (4.10)

Here, PRset is the pseudo relevant set of top k search results.
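Together with the class prominence score and final score defined next (equations 4.11 and 4.12), the whole re-ranking step can be sketched as follows. The dictionaries and the α value here are illustrative assumptions, not the thesis configuration:

```python
def rerank(results, css, alpha=0.7, k=20):
    """Re-rank retrieved documents by the final score of equation 4.12.

    results: list of (doc_id, retrieval_score) sorted by the engine
    css:     doc_id -> list of class-specific scores (equation 4.9)
    alpha:   weight of the original retrieval score (assumed value)
    k:       size of the pseudo-relevant set
    """
    top_k = [doc for doc, _ in results[:k]]
    n_classes = len(next(iter(css.values())))
    # pseudo relevant score per class (equation 4.10)
    prs = [sum(css[d][i] for d in top_k) / k for i in range(n_classes)]

    def final_score(doc, score):
        cps = sum(css[doc][i] * prs[i] for i in range(n_classes))  # eq. 4.11
        return alpha * score + (1 - alpha) * cps                   # eq. 4.12

    return sorted(results, key=lambda r: -final_score(*r))
```

Documents whose class profile matches the classes dominating the pseudo-relevant set gain a boost over their raw retrieval score.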

For all the documents in the result set, we now calculate a class prominence score, obtained by multiplying the pseudo relevant scores with the respective class specific scores. The expression for the class prominence score (CPS) is given by equation 4.11:

CPS(D) = Σ_{i=1}^{|C|} [ CSS_i(D) · PRS_i ]    (4.11)

Here, C is the set of enterprise document classes, CSS is from equation 4.8 and PRS is evaluated using equation 4.10. A final score (FS) is calculated for every document in the result set through a linear combination of the search engine score and CPS. Equation 4.12 gives the expression for calculating FS:

FS(D) = α · CosSim(Q, D) + (1 − α) · CPS(D)    (4.12)

where Q is the query and CosSim(Q, D) is the vector space retrieval score of document D. The documents are re-ranked in descending order of FS.

4.7 Experiments and Evaluation

We used the TREC-ENT 2007 topics and the CSIRO dataset for evaluating the query expansion module. Both our baseline and concept thesaurus (CT) based QE setups are evaluated on text search, and we did not look into the internal link structure of the evaluation data (because enterprise data, in general, lacks the strong link structure present in the web domain [34]). A large portion of the CSIRO dataset is in the form of HTML documents; in addition, a smaller but significant number of other formats like pdf, doc, xls, etc. are present. We parsed the test corpus into pure text documents for indexing and retrieval.

4.7.1 CSIRO dataset

In 2007, the CSIRO (Commonwealth Scientific and Industrial Research Organisation, Australia) Enterprise Research Collection [4], or CERC dataset, was introduced as the evaluation data for the TREC enterprise track (TREC-ENT 2007) search test collection. The dataset is a set of around 370,000 documents crawled from the public websites of CSIRO. The dataset covers a rich set of diverse domains having a common theme of scientific research and development.
Along with the crawled document set, a set of 50 topics with respective queries and relevance judgments was provided as part of TREC-ENT 2007. TREC-ENT 2008 further introduced a set of 77 new query topics with relevance judgments on the same CERC dataset.

4.7.2 Query Expansion with Concept Thesaurus

We first present the evaluation of the Wikipedia concept thesaurus based query expansion for enterprise search.

Table 4.1 Sample query expansions
1: Query: timber treatment
   Reconstructed Query: timber treatment + Wood preservation + Chromated copper arsenate + Creosote
2: Query: cancer risk
   Reconstructed Query: cancer risk + Cervical cancer + Human papillomavirus + Pap test

Baseline Setup

We used vector space retrieval using the Java based Apache Lucene text search libraries. Text parses of the document set were indexed by the Lucene indexer. For creating the test set, 50 queries were formulated from the 50 topics provided in TREC-ENT 2007, where every query was a Boolean OR combination of the individual terms present in the query.

Concept Thesaurus setup

The Wikipedia link structure is available as database dump files (.sql format). We loaded the link structure table into a MySQL database. The article text was indexed using the Lucene indexer. All the modules for extracting concept vocabulary, concept relations and query representations were implemented in Java. The concept vocabulary set was stored as a MySQL table where vocabulary and target concepts were the columns. The concept thesaurus was stored in MySQL with concept, related concept and relation score (RS) as the fields. We also created a Lucene index of the concept vocabulary for matching vocabulary strings using the cosine similarity measure. We configured the query representation module to extract the top two representing concepts using the vocabulary database and Lucene index. For query expansion, the two most related concepts were added to the query for the first representing concept, and one related concept for the second.

Results

As examples of query expansion, table 4.1 gives a couple of sample expansions created by the concept thesaurus based query expansion module. We recorded precision for the top 10, 20, 50 and 100 retrieved results and the overall recall of both setups to compare their performance.
Table 4.2 gives the respective observations of these measures and shows improvements of 0.7%, 1%, 6% and 9% in precision at the top 10, top 20, top 50 and top 100 search results respectively, averaged over 50 queries. The overall recall has improved by 9%.

Table 4.2 The evaluation results for the Wikipedia CT based Query Expansion module (columns: Baseline, CT based Query Expansion, Percent Increase; rows: Number of Queries, Average Number of Results Retrieved, Recall, Precision (overall), Precision at top 10, top 20, top 50 and top 100)

Comments

The results shown in table 4.2 indicate that using the Wikipedia concept thesaurus for query expansion, and treating the query term group as a unit, improves recall and precision. We observed a 9% increase in the recall figures, with overall recall reaching 0.9 (indicating that, on average, more than 90% of the relevant documents appear in the result set). It is well known in the information retrieval community that an increase in recall with a loss in precision is consistent with most QE techniques. However, in our approach precision did not degrade at the top 10 results and improved for the top 20, top 50 and top 100 results. If we compare the average number of documents retrieved per query, we see that in spite of a relatively much larger number for the QE module, relevant results are not pushed back. As the experiments only involved query level adjustments, we believe that a combination of semantics based indexing and re-ranking techniques can further improve the precision of search results.

Re-ranking with Wikipedia categories

Here, we describe the experimentation and evaluation results for the method of re-ranking search results using the Wikipedia category network based classification described earlier in this chapter. We first give a training and classification validation example on 20 News Groups data using our approach. We then look into the experimental dataset and results.

Table 4.3 Classification of 20 News Group data

20 News Group Class            Documents correctly identified
rec.motorcycles                901
sci.med                        651
comp.sys.mac.hardware          748
talk.politics.misc             753
soc.religion.christian         912
comp.graphics                  697
talk.religion.misc             651
comp.windows.x                 634
comp.sys.ibm.pc.hardware       660
talk.politics.guns             785
alt.atheism                    564
comp.os.ms-windows.misc        287
sci.crypt                      821
sci.space                      745
rec.sport.hockey               785
rec.sport.baseball             865
sci.electronics                823
rec.autos                      822
talk.politics.mideast          881
AVERAGE accuracy               73%

Classifier training with Wikipedia Categories

The first phase of our experiments was designed to verify the use of Wikipedia as a general purpose classification training solution. We trained a Naive Bayes classifier with Wikipedia article text and tested its effectiveness using the 20 News Groups 1 data. We took 19 classes from 20 News Groups and mapped them to about 100 Wikipedia categories (an average of 5 categories per class) using the approach described earlier in this chapter. This mapping gave us around 5200 articles as a tagged training corpus (270 articles per class on average). During automated classification, every document in the test corpus was tagged with the top identified classes. Table 4.3 presents the observations of this experiment. The observations show a classification accuracy of around 73 percent. The training and test sets were completely independent, and the training set had a significantly lower number of documents than the test set; still, we were able to achieve decent classification performance. This experiment provided us the motivation to use Wikipedia as a general purpose training repository for document classification.

1 The 20 Newsgroups data set.
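The Naive Bayes classification used here (the thesis used the Rainbow toolkit) can be illustrated with a minimal multinomial model. This sketch is our own stand-in, not Rainbow's implementation:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Minimal multinomial Naive Bayes text classifier with Laplace
    smoothing, trained from (class_label, document_text) pairs."""

    def train(self, labeled_docs):
        self.word_counts = defaultdict(Counter)  # class -> token counts
        self.class_docs = Counter()              # class -> document count
        self.vocab = set()
        for label, text in labeled_docs:
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.class_docs[label] += 1
            self.vocab.update(tokens)
        self.total_docs = sum(self.class_docs.values())

    def scores(self, text):
        """Class-relative log scores, one per class (cf. equation 4.9)."""
        tokens = text.lower().split()
        v = len(self.vocab)
        result = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            s = math.log(self.class_docs[label] / self.total_docs)
            for t in tokens:
                s += math.log((counts[t] + 1) / (total + v))
            result[label] = s
        return result

    def classify(self, text):
        s = self.scores(text)
        return max(s, key=s.get)
```

Training on Wikipedia article text tagged with document classes, and then scoring every enterprise document, yields the class-relative score vector of equation 4.9.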

Table 4.4 Precision figures for Re-ranking Experiment

                                Initial Result Set    Re-ranked Result Set    Increase
Precision at top 10 results
Precision at top 20 results
Precision at top 50 results
Precision at top 100 results

Enterprise dataset and Classification

We used the TREC Enterprise track dataset, i.e. the CSIRO corpus and the 50 query set of the 2007 track (TREC-ENT 2007), for the re-ranking experiments. Since the data consists almost entirely of public facing web pages of the CSIRO web site, it presents a large document collection with respective URLs. Within the CSIRO web domain there are different name-spaces which indicate different hosts; for example, separate name-spaces are hosted by the marine biology, entomology and food science research departments. These URLs were searched for different name-spaces (the verticals in the CSIRO enterprise) for determining document classes. We came up with 14 document classes for the CSIRO dataset following this process. For these classes, their names and a small vocabulary set of 10 terms/phrases was determined by analyzing the home pages of the respective name-spaces. Once the 14 document classes were mapped to Wikipedia categories and articles, we had the Wikipedia training corpus ready for classification. The Rainbow toolkit was used as the Naive Bayes classifier implementation for this purpose. Every document in the CSIRO dataset was then classified and the information was stored within a Lucene index.

Results

For a given query, we used the top 20 search results to formulate the pseudo relevant set. This set was used to find prominent document classes, using which re-ranking was carried out. Table 4.4 presents the results of the re-ranking process.
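The re-ranking step described above, i.e. finding prominent classes in the top-k pseudo relevant set and boosting documents belonging to them, can be sketched roughly as follows. The data layout, prominence threshold and boost value are assumptions for illustration, not the thesis implementation:

```python
# Sketch of class-prominence re-ranking: each result carries the retrieval
# score and the enterprise class assigned to it at indexing time. Classes that
# occur frequently in the top-k pseudo relevant set are treated as prominent,
# and documents in those classes are boosted.
from collections import Counter

def rerank(results, k=20, boost=0.5):
    """results: list of (doc_id, score, doc_class), highest score first."""
    top_classes = Counter(cls for _, _, cls in results[:k])
    prominent = {c for c, n in top_classes.items() if n >= 2}  # assumed threshold
    rescored = [
        (doc, score + (boost if cls in prominent else 0.0), cls)
        for doc, score, cls in results
    ]
    return sorted(rescored, key=lambda r: r[1], reverse=True)

results = [("d1", 1.0, "marine"), ("d2", 0.9, "food"),
           ("d3", 0.8, "marine"), ("d4", 0.7, "mining")]
print([doc for doc, _, _ in rerank(results, k=3)])
```

In this toy run the "marine" class dominates the top 3 results, so d3 is promoted past d2, which mirrors the effect reported in Table 4.4 of relevant documents being pushed up.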

Comments

Table 4.4 shows the improvements in precision in the top N results, especially for the top 50 and top 100 results. The precision figures indicate that relevant documents are being pushed up by our classification based re-ranking approach.

4.8 Summary

In this chapter, we presented query expansion techniques for enterprise search using Wikipedia content and links. Wikipedia is meant for human use, but it also carries subtle structure within it. We tried to extract a concept vocabulary and a concept thesaurus using the link structure present in articles and categories. The query expansion module maps a query from term space to concept space using the concept vocabulary and then provides expansion terms using the concept thesaurus. We also introduced a re-ranking approach using the Wikipedia category network. The key idea is to map categories to enterprise classes and then use the articles linked to the mapped categories to form a training corpus for Bayesian document classification. The documents fetched in the result-set are re-ranked on the basis of the prominence of enterprise classes in the top k results. We also presented the evaluation and empirical results of these techniques.

Chapter 5

Wikipedia as a Vocabulary Resource for Pseudo Relevance Feedback

An efficient Enterprise Search (ES) system is a vital aspect of a successful business which hosts sizable structured and unstructured electronic data. Enterprise data poses many challenges for text retrieval, one of which is the limited and author biased vocabulary in Enterprise Documents (ED), a consequence of the limited ownership of ED. Thus, on a document level, we see a narrow band of vocabulary representing information, which may not be the case at the corpus level[17]. Pseudo Relevance Feedback (PRF) is a generally useful technique in information retrieval for improving search results, but it may suffer in ES because of this vocabulary bias. In a quest to address this, our method incorporates democratically created text resources such as Wikipedia articles, rich in natural language lexicon, for enriching enterprise vocabulary. Our motivation comes from collection enrichment techniques[13], but we keep our focus on domain independent and open text resources for vocabulary addition. Our system uses a combined search index of Wikipedia Articles (WA) and Enterprise Documents (ED) for retrieving a Pseudo Relevant (PR) set containing both ED and WA. The main idea presented in this chapter is to involve both unsupervised and supervised classification techniques for identifying those ED and WA from the PR set which are closer to the information need of a query and can be considered better feedback documents. The chapter presents our methodology for working on PRF using both search results and Wikipedia vocabulary. As we have used the query likelihood model for retrieval (language model based retrieval)[39] in this work, we briefly discuss language models for information retrieval in section 5.1. In section 5.2 we describe our criteria for forming a candidate pseudo relevant document set in order to implement feedback techniques for query expansion.
Section 5.2.2 discusses unsupervised techniques like clustering and probabilistic topic models to distinguish the important feedback documents from the rest of the pseudo relevant set. Section 5.2.3 presents the same procedure using supervised document classification techniques; we also present the features we use for training and classification in that section. Section 5.3 presents the experimental setup to evaluate the proposed approaches and the observed results.

5.1 Language model for information retrieval

Language models work around the technique of trying to generate the query out of a given document using a document model. Conceptually, language models determine probabilities of possible sequences of terms or strings over some vocabulary. If V is the vocabulary of a language, then the probabilities over all terms t belonging to V sum up to 1, i.e.

\sum_{t \in V} P(t) = 1    (5.1)

We will describe the original method known as the query likelihood model[39]. There are various approaches to language model based information retrieval, and the related literature can be taken up for a deeper understanding of this field. In the query likelihood model, a document model M_d is constructed for every document d in the collection. The idea is to rank the documents for a given query q by the conditional probability P(d|q), which is the likelihood of d given q. Using Bayes' theorem, we get equation 5.2:

P(d|q) = P(q|d) P(d) / P(q)    (5.2)

P(q) is constant for a given query and is ignored. The prior probability of a document, P(d), is again assumed constant for every document and can be safely ignored while computing the score. Thus the model only needs to estimate P(q|d), which is the probability of query q given a document model M_d. The language model tries to emulate query generation using M_d. One way to determine P(q|M_d) (where M_d represents document d) is to compute the maximum likelihood probability (henceforth, MLE). In a unigram based approach, where terms or tokens are considered one at a time and their co-occurrence and sequential information is ignored (also known as the bag-of-words approach), MLE is estimated using the formula given in equation 5.3:

\hat{P}(q|M_d) = \prod_{t \in q} \hat{P}(t|M_d) = \prod_{t \in q} \frac{tf_{t,d}}{|d|}    (5.3)

The hat symbol is used here as we are trying to find an estimate of the probabilities; t is a query term and |d| is the number of tokens in document d. This way, all we need to analyze in a document are the term frequencies and the document size. Equation 5.3, however, has an inherent issue.
If a query term is absent in a document, then \hat{P}(q|M_d) will become zero. Users often issue multi-word queries, and some query terms may be absent from a document; this absence does not necessarily mean that the document is not relevant to the query. To avoid zero probabilities, a smoothing parameter can be introduced into equation 5.3 to compute non-zero probabilities even if some query terms are not observed in the documents. The smoothing step has another use, which is to give some weight to unobserved words. It is also a good idea to introduce probabilities computed from more general collection level statistics for the query terms. This can be summed up by the expression given in equation 5.4:

\hat{P}(t|d) = \lambda \hat{P}(t|M_d) + (1 - \lambda) \hat{P}(t|M_c)    (5.4)

In equation 5.4, M_c is the language model built from the collection. Combining equations 5.3 and 5.4, we get

\hat{P}(q|M_d) = \sum_{t \in q} \left[ \lambda \left( \frac{tf_{t,d}}{|d|} + \alpha \right) + (1 - \lambda) \left( \frac{tf_{t,C}}{|C|} + \beta \right) \right]    (5.5)

Here α and β are small values used for determining non-zero probabilities (Laplace smoothing), and λ is the smoothing parameter controlling the mix of document and collection statistics. Optimal performance of this model depends on accurately determining the value of λ.

We give a small example to show how a simple unigram language model works in document retrieval and scoring. We use λ = 0.5 and small values for α and β. Let there be three documents:

d1: Mango is a juicy fruit
d2: Mango from India is famous
d3: Apples are red and sweet

and let a query q be "india mango". Here |d1| = |d2| = |d3| = 5 and the collection size is |C| = 5 + 5 + 5 = 15. Scoring each document against q using equation 5.5 gives P(q|d2) > P(q|d1) > P(q|d3), so the order of ranked results is d2 > d1 > d3. The language model that we described here was proposed by Ponte et al.[39], and this work was further extended by Miller et al.[31] and Hiemstra[19].

5.2 Selecting good feedback documents from the PR set

We present our techniques to select good feedback documents from the PR set in this section. We start by describing our criteria to formulate the PR set. Then we present both unsupervised and supervised classification techniques for identifying the more worthy documents to be used for QE.
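The worked example above can be reproduced in a few lines. Since the exact values of α and β are not stated in the text, small assumed values are used here; they do not change the outcome, and the ranking d2 > d1 > d3 is preserved:

```python
# Smoothed unigram query-likelihood scoring for the three-document example.
# lambda, alpha and beta follow equation 5.5; alpha and beta are assumed small.

def score(query, doc, collection, lam=0.5, alpha=1e-4, beta=1e-4):
    s = 0.0
    for t in query.split():
        s += lam * (doc.count(t) / len(doc) + alpha)             # document model
        s += (1 - lam) * (collection.count(t) / len(collection) + beta)  # collection model
    return s

docs = {
    "d1": "mango is a juicy fruit".split(),
    "d2": "mango from india is famous".split(),
    "d3": "apples are red and sweet".split(),
}
collection = [t for d in docs.values() for t in d]   # |C| = 15 tokens
query = "india mango"

scores = {name: score(query, d, collection) for name, d in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # ['d2', 'd1', 'd3']
```

d2 contains both query terms, d1 only "mango", and d3 neither, so the collection model alone keeps d3's score non-zero but lowest.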

5.2.1 Pseudo relevant set

PRF assumes that the top retrieved documents are more relevant to a given search query. It works with the top k documents retrieved in the first pass of the query through the search engine and uses a language based relevance model to compute the expanded query. We created separate document indexes for ED and WA, and for a given query, both indexes are searched to create the PR set. We work with a PR set which consists of the top k/2 ranked ED and the top k/2 ranked WA for a given query.

5.2.2 Unsupervised approaches

In the absence of training data, we work with unsupervised classification techniques for identifying good feedback documents. We apply two techniques, namely document clustering and probabilistic topic modeling, on the PR set as part of unsupervised classification. These techniques are explained in the following sub-sections.

Clustering documents

Our first unsupervised method involves clustering the ED present in the PR set into n non-overlapping clusters. The clustering process is performed in the term vector space using the k nearest neighbor method (KNN clustering). Consider the pseudo relevant collection Cl as a set of n clusters into which the ED are classified by KNN:

Cl = {C_1, C_2, ..., C_n}    (5.6)

where the C_i are clusters of ED. We then allot WA to one of these clusters using an article-likelihood probability model. Article-likelihood estimates the probability of an article belonging to a cluster C_i based on the maximum likelihood probability of observing terms from the article in the documents that belong to C_i. Let A_j be a WA in the PR set. The probability of A_j belonging to a particular C_i is given by equation 5.7:

P(A_j|C_i) = \prod_{w \in A_j} P(w|C_i)    (5.7)

where w is a word token in article A_j.
The probability P(w|C_i) is estimated using the smoothed maximum likelihood method given by equation 5.8:

P(w|C_i) = \frac{tf(C_i)}{tf(C_i) + k} P_M(w|C_i) + \frac{k}{tf(C_i) + k} P_M(w|Cl)    (5.8)

where tf is the term frequency, k is the smoothing parameter for estimating non-zero values for the word tokens not present in the documents of a cluster, and P_M is the maximum likelihood probability, calculated by the expressions given in equations 5.9 and 5.10.

P_M(w|C_i) = \frac{tf_w(C_i)}{tf(C_i)}    (5.9)

P_M(w|Cl) = \frac{tf_w(Cl)}{tf(Cl)}    (5.10)

Once we populate the ED clusters with WA, we find the top ranking clusters for the given query using the query-likelihood retrieval described in section 5.1. We accomplish this by treating a cluster as one large document and then applying query-likelihood retrieval for ranking the clusters. The ED and WA belonging to the top m clusters are used for relevance feedback.

Probabilistic topic models

We look into probabilistic topic models in order to associate queries with latent topics extracted from the pseudo relevant document set (containing both ED and WA). We use the Mallet¹ implementation of Latent Dirichlet Allocation (LDA)[11] for determining the Dirichlet prior and hence discovering word-topic and topic-document associations. The topics discovered through LDA are represented by n-grams present in the PR set. The probability p(d|z_i), which is the probability of a document d from the PR collection given a topic z_i, is calculated for every document in the LDA method; this constitutes the topic-document associations. Word-topic associations are discovered by looking into the n-grams that represent a topic. By looking into word-topic associations and comparing with the query terms, we get a set of topics activated for the given query. This in turn yields a set of query activated documents in the PR set, by mapping the active topics to documents through the topic-document associations. Let the set of activated topics be Z_active. For every topic z in Z_active, we compute a topic relevance score (TRS_Q) for a given query q of a document D in the PR set:

TRS_Q(D) = \sum_{z \in Z_active} P(D|z), such that D \in PR set    (5.11)

We rank the PR documents on the basis of the TRS_Q(D) computed in equation 5.11, and we consider all documents with a zero TRS_Q(D) score as irrelevant.

5.2.3 Supervised approaches

This part presents our approach of using supervised learning methods for estimating good feedback documents from the PR set.
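Before moving to the supervised features, the topic relevance scoring of equation 5.11 can be sketched as follows, assuming LDA has already produced the topic-document probabilities P(d|z) and that word-topic matching against the query has already yielded the activated topics (the toy matrix is invented for illustration):

```python
# Sketch of topic-relevance scoring (equation 5.11): sum P(d|z) over the
# topics activated by the query. Documents with zero score are irrelevant.
import numpy as np

def topic_relevance_scores(p_doc_given_topic, active_topics):
    """p_doc_given_topic: array of shape (n_topics, n_docs) holding P(d|z)."""
    if not active_topics:
        return np.zeros(p_doc_given_topic.shape[1])
    return p_doc_given_topic[sorted(active_topics)].sum(axis=0)

# Toy example: 3 topics, 4 pseudo relevant documents.
p_dz = np.array([[0.4, 0.1, 0.3, 0.2],
                 [0.1, 0.6, 0.1, 0.2],
                 [0.2, 0.2, 0.5, 0.05]])
active = {0, 2}                      # topics activated by the query terms

trs = topic_relevance_scores(p_dz, active)
order = np.argsort(-trs)             # documents ranked by TRS_Q
print(trs, order)
```

Here document 2 scores highest because it is strongly associated with both activated topics; in the thesis pipeline these probabilities come from the Mallet LDA output rather than a hand-written matrix.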
We use a number of features extracted from ED and WA, designed to bring out useful aspects of good feedback documents, for training a Naive Bayes classifier to classify the documents in the PR set. Training and classification are done separately for the ED and WA present in the PR set. The features are discussed in the following sections.

¹ MALLET, MAchine Learning for LanguagE Toolkit.

Features for enterprise corpus documents

For ED we look into aspects like the document score from the query likelihood model (described in section 5.1), the query term distribution in the document text, its title and URL, the proximity of query terms, etc. We now describe the features extracted from the ED in the PR set one by one.

Query likelihood score of the document (F_E1):

F_E1 = \hat{P}(q|M_ED) = \sum_{t \in q} \left[ \lambda \left( \frac{tf_{t,ED}}{|ED|} + \alpha \right) + (1 - \lambda) \left( \frac{tf_{t,C}}{|C|} + \beta \right) \right]    (5.12)

This expression was described in equations 5.4 and 5.5 in section 5.1.

Query term distribution uniformity (F_E2): An even spread of query terms across a document suggests a document related to the query information need as a whole, rather than a few sub-sections representing the information need. We focus on the variance of query term frequencies in different sub-divisions of the text (a document is arbitrarily assumed to consist of sub-divisions of l consecutive tokens, which represent local information). F_E2 is defined in equation 5.13:

F_E2 = \log\left( 1 + \sum_{i=1}^{n} \left( tf_Q(sub_i(ED)) - \frac{tf_Q(ED)}{n} \right)^2 p_i \right)    (5.13)

where tf_Q(sub_i(ED)) is the query term frequency in the i-th sub-division of a feedback document, tf_Q(ED) is the query term frequency in the given feedback document, n is the number of sub-divisions in the given feedback document, and p_i is the probability of finding a query term in the i-th sub-division, such that

p_i = \frac{tf_Q(sub_i(ED)) + 1}{tf_Q(ED) + n}    (5.14)

Query term frequency in document title (F_E3):

F_E3 = tf_Q(title(ED))    (5.15)

Query term frequency in document URL (F_E4):

F_E4 = tf_Q(URL(ED))    (5.16)

Query term co-occurrence across the document (F_E5): In order to capture the phrase behavior of a query, we analyze windows of m consecutive words to determine how many such windows in a given document have co-occurring query terms. F_E5 is described in equation 5.17:

F_E5 = \log\left( 1 + \sum_{w \in W} \frac{Co(Q, w)}{|Q|} \right)    (5.17)

where W is the set of all windows of size m in a feedback document, w is the set of words in a given window, and Co(Q, w) is the query term co-occurrence function for a query Q and window w, defined as

Co(Q, w) = 0 if |Q \cap w| = 0, and Co(Q, w) = |Q \cap w| - 1 otherwise.

Features for Wikipedia articles

While extracting features of WA we employ heuristics similar to those used for ED. However, we also make use of the standardized structure of WA: the article text is segregated into standard sections (like Title, Overview, Content, Category, Appendix, etc.), and one of the WA features exploits this structure. The rest of the features are much the same as those extracted for ED. The features for WA in the PR set are as follows.

Query likelihood score of the document (F_W1):

F_W1 = \hat{P}(q|M_WA) = \sum_{t \in q} \left[ \lambda \left( \frac{tf_{t,WA}}{|WA|} + \alpha \right) + (1 - \lambda) \left( \frac{tf_{t,C}}{|C|} + \beta \right) \right]    (5.18)

Query term distribution uniformity in Wikipedia articles (F_W2): We calculate this in a manner similar to F_E2.

F_W2 = \log\left( 1 + \sum_{i=1}^{n} \left( tf_Q(sub_i(WA)) - \frac{tf_Q(WA)}{n} \right)^2 p_i \right)    (5.19)

where

p_i = \frac{tf_Q(sub_i(WA)) + 1}{tf_Q(WA) + n}    (5.20)

Query term frequency in Wikipedia article title and redirect names (F_W3): Every Wikipedia article has a title and may have one or more alternate names that redirect to the article (for example, the article World War 1 has multiple redirect names such as WW1, The Great War, etc.). These titles concisely and precisely represent the information given in the article text.

F_W3 = \frac{ tf_Q(title(WA)) + \sum_{redirects(WA)} tf_Q(redirect(WA)) }{ 1 + |redirects(WA)| }    (5.21)

Query term frequency in names of WA categories (F_W4):

F_W4 = \frac{ \sum_{categories(WA)} tf_Q(category(WA)) }{ |categories(WA)| }    (5.22)

Query term frequency in headings of article sections (F_W5):

F_W5 = \frac{ \sum_{headings(WA)} tf_Q(heading(WA)) }{ |headings(WA)| }    (5.23)

Section wise query term co-occurrence in WA (F_W6): As a standard, WA text is segregated into sections like Title, Overview, Content, Category, Appendix, etc. We find query term co-occurrence in a manner similar to the computation of F_E5, except that we do it for the five above mentioned sections for every WA in the PR set.

Training

Once we extract the features described in the preceding sections, we get a feature vector for every ED and WA in the PR set. The training set for ED is taken from the standard evaluation data set (the CSIRO dataset[4]), and for the WA training set we manually classified the relevant WA present in the top k results for every query. We train two different classifiers for ED and WA. The classifier's role is to distinguish any document as relevant or not, so it is a two-class classification problem. We chose a Naive Bayes classifier because of its ease of implementation, but we do feel that a support vector machine classifier would be of more use in this case. The classification process gives a set of good feedback documents which we use in query expansion.

5.3 Experiments and Evaluation

Here we present the experimentation and empirical observations that evaluate the approaches proposed in this chapter. We evaluated our methods on the 2007 and 2008 TREC Enterprise track (TREC-ENT) topic collections. We used the Indri search platform² to experiment with language (unigram) model based indexing and retrieval. The baseline used was the PRF implementation of Indri involving only the top 50 ranked ED. For all the proposed approaches, we used a PR set of the top 50 ranked ED and top 50 ranked WA for any given query. In cluster based PR document selection, we experimented with 10 clusters and used the top 5 ranked clusters for a given query. For the topic model method, we used the LDA implementation of the Mallet API and worked with 50 topics for every query.
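As an illustration of the co-occurrence function Co(Q, w) that underlies features F_E5 and F_W6, the following sketch slides a window of m tokens over a text and sums the window scores; the aggregation is simplified relative to equation 5.17, and the example strings are invented:

```python
# Sketch of window-based query term co-occurrence (the Co(Q, w) function):
# a window scores |Q ∩ w| - 1 when it contains co-occurring query terms,
# and 0 when it contains none.

def co(query_terms, window):
    overlap = len(query_terms & set(window))
    return 0 if overlap == 0 else overlap - 1

def window_cooccurrence(query, text, m=4):
    q = set(query.lower().split())
    tokens = text.lower().split()
    windows = [tokens[i:i + m] for i in range(max(1, len(tokens) - m + 1))]
    return sum(co(q, w) for w in windows)

print(window_cooccurrence("enterprise search",
                          "enterprise search systems index enterprise data"))
```

A window holding only one query term contributes nothing, so the feature rewards documents where the query behaves like a phrase rather than scattered terms.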
Supervised learning involved a Naive Bayes classifier (from the Mallet³ API). During training, any ED or WA in the PR set that increased mean average precision (MAP) through PRF was considered a good expansion document. We exercised 50 runs of randomly selecting 80% of the queries for training and the remaining 20% for testing. The results of our experiments are shown in Table 5.1, which shows gains in MAP and other metrics for all our methods compared to the selected baseline.

² Indri, Language modeling meets inference networks.
³ MALLET, MAchine Learning for LanguagE Toolkit.

Table 5.1 Results of QE for different classification techniques on PRF

                                        TREC-ENT 2007                   TREC-ENT 2008
                                        MAP            GMAP             infAP           ndcg
PRF (baseline)
Clustering based PRF                    (+6.2%)        (+6.87%)         (+4.9%)         (+1.8%)
Topic Model based PRF                   .4976 (+8.3%)  .3947 (+8.6%)    .3657 (+5.8%)   .5483 (+2.1%)
Naive Bayes PRF (average of 50 runs)    .5117 (+11.4%) .4060 (+11.2%)   .3772 (+9.1%)   .5543 (+3.2%)

Among all the proposed approaches, the maximum gains were observed for the Naive Bayes classifier based PRF (an increase in MAP of 11.4% for the 2007 query set and in inferred AP of 9.1% for the 2008 query set). These figures indicate the effectiveness of our proposed approaches.

5.4 Summary

This chapter presented a pseudo relevance feedback based query expansion technique which used a mix of enterprise documents and Wikipedia articles as the feedback set. We also presented supervised and unsupervised approaches to identify good feedback documents from the pseudo relevant set. The evaluation of these techniques and the respective experimental results were also presented in this chapter.

Chapter 6

Conclusions

We have presented various aspects and goals of effective enterprise search, and have put forward approaches that can help in achieving them. In this thesis, we tried to enhance the relevance of enterprise search results by tackling two aspects of text search: 1) query expansion and 2) re-ranking. We brought in role based personalization to enhance user experience by concentrating on understanding role relevant vocabulary from the collection. We also experimented with open lexicon and concept repositories like WordNet and Wikipedia for extracting semantics from the query and result-set documents. Bayesian learning and classification of text played an important role in all aspects of our approaches. This chapter concludes the thesis and points out the future directions of this research.

6.1 Unified view of presented approaches

From a research point of view, our aim was to improve the query generation and ranking aspects of enterprise based information retrieval. The work took an exploratory course through various state-of-the-art techniques incorporated in two aspects of search, namely query expansion and re-ranking, and focused on ES by bringing in knowledge and lexicon repositories as helping tools. All four techniques mentioned in this thesis (chapters 3 to 5) have involved these aspects. Another important relation between the mentioned approaches was the lessons that were carried forward from one work to the next. First we worked on role based personalization, where we involved WordNet as a domain independent ontology for query expansion. The limited scope of WordNet led us to look for a more comprehensive knowledge base, such as an encyclopedia. This drew our attention to Wikipedia and its link structure for creating a concept thesaurus with much broader topic coverage.
While we were working with the article and category link structures to implement the techniques mentioned in chapter 4, we felt that we were overlooking the content of the articles. So an approach was devised to use the article text in pseudo relevance feedback for the work mentioned in chapter 5.

6.2 Role based personalization

Role based personalization aims to improve existing ES systems with the help of functionalities like query expansion and automated document classification. We further plan to focus on involving ontologies for a better ES system. Our aim is to build a generic ES application which is independent of enterprise structure for reasonable functioning, without losing the power of customization, and which requires minimal human intervention for configuration and personalization.

Future of role based personalization

We still feel that there is great scope for improvement in the general approach. One of the underlying concepts of our ES system is the use of role information and the finding of semantic relations between concepts/terms in a corpus. These relations are used to expand the query; they are also responsible for document classification. For the query expansion part, we have been using co-occurrence statistics in the above mentioned system, but we observed in a few cases that using term co-occurrence frequency based statistics adds noise to the query. Further, the document classification is primarily based on a supervised learning system, and human effort is required to perfect such a system. Our goal is to minimize the human intervention, if not eliminate it completely. We propose to use ontologies for query expansion and document classification. An ontology is a concept map of terminology which, if populated appropriately, covers the semantic relationships expressed through the different and complex assembly relations of terms in human language. Ontologies can be useful for the automated modification of a search query, as an ontology tries to make sense of what the user intends to search, and they help in the automated deconstruction of the documents/pages to be indexed and searched.
If we assume that proper domain and role specific ontologies exist in an enterprise and that the definitions of roles are clear, the domain specific ontologies can be used to enrich the text corpus with terms that are semantically related to existing terms in the given environment. Once the documents have been populated with these relations, we can use existing clustering techniques to identify document classes specific to roles. Ontologies can also be used for query sense disambiguation and expansion. We propose to use role specific document classes to discover role specific terms and hence classify the sections of the ontology on the basis of roles. Such a role specific enterprise ontology can be beneficial for enterprise information retrieval.

6.3 Wikipedia based approaches

Concept thesaurus for query expansion

We presented a technique to extract a Concept Thesaurus (CT) by focusing on the inter-article link graph of Wikipedia, as well as a way to map any search query to these concepts by treating it as a semantic unit instead of a bag of words. The use of an external knowledge resource for query expansion is a tried and tested area in both web and enterprise search. WordNet is one such popular resource, but its topic coverage is limited. With more than 2 million English articles, the Wikipedia knowledge-base far exceeds any available dictionary or thesaurus in terms of number of concepts. The major advantage of knowledge bases like Wikipedia is their comprehensive coverage of multi-word and named-entity concepts, which is not the case with other electronic dictionaries. We have created a domain independent CT from the link graph of a collaborative encyclopedia to explore its capabilities as a better option than traditional dictionary based expansion techniques. Enterprise search, however, demands comprehensive domain knowledge, and we have observed that our CT at least has ample domain coverage for our dataset. Through our approach, the recall figures increased by 9%.

Future of query expansion

As the next step in creating a concept based knowledge repository, we will be working on a more efficient domain specific thesaurus. We are working on extracting sub-graphs of the Wikipedia link structure which will contain the corpus specific concepts. Another area we are planning to work on is document term enrichment with the concept vocabulary. Using techniques similar to the concept representation of queries, we plan to add standard concept names into documents. Our query expansion module introduces terms which are concept names; the word and syntactic structure of these terms might differ from semantically similar terms in the corpus, so such documents may not be detected in a vector space model based search.
Our ideas will introduce into the documents the same vocabulary that is used in reconstructing a query.

Re-ranking with category network

In chapter 4, we also presented a re-ranking method based on identifying Wikipedia categories which represent enterprise document classes. We used these categories and their associated article text as training data for document classification in order to re-rank search results. We further presented empirical evidence supporting our hypothesis that Web 2.0 tagged knowledge bases like Wikipedia can be used as general purpose training repositories.

Future of re-ranking

This is part of an ongoing work in which we are exploring the use of Wikipedia link structure and article text to improve various phases of the search cycle in the enterprise domain.

Pseudo relevance feedback

Pseudo relevance feedback (PRF) is a useful technique to improve search results, but it may suffer in ES because of biased vocabulary issues. To address this, our proposed approach used democratically created text resources such as Wikipedia articles, rich in natural language lexicon, for enriching enterprise vocabulary. We used a combined inverted index of Wikipedia Articles (WA) and Enterprise Documents (ED) for retrieving a pseudo relevant set containing both ED and WA. Both unsupervised and supervised classification techniques were used for identifying good feedback ED and WA from the PR set.

Future of PRF

We proposed to enrich ED with WA vocabulary using classification techniques to strengthen pseudo relevance feedback for query expansion in enterprise search. The observed results showed improved performance of our approach compared to the baseline. We plan to take this forward by involving other democratic text resources and implementing other learning techniques, such as support vector machines, to identify quality feedback document sets for ES.

6.4 Summary

We conclude this thesis hoping that it will turn out to be an informative reference for the information retrieval community interested in enterprise search. Our approach of including open knowledge repositories in enterprise search is a start towards harnessing the plethora of information available through the content and structure of various resources on the web. We hope that this work provides a comprehensive view for beginners designing a relevance driven enterprise search system.

Related Publications

Role based personalization in enterprise search. Vasudeva Varma, Prasad Pingali, and Nihar Sharma. In proceedings of the Indian International Conference on Artificial Intelligence (IICAI-09), Tumkur, India.

Query Processing for Enterprise Search with Wikipedia Link Structure. Nihar Sharma and Vasudeva Varma. In proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-10), Valencia, Spain.

Wikipedia as a Vocabulary Resource for Pseudo Relevance Feedback in Enterprise Search. Nihar Sharma and Vasudeva Varma.

Bibliography

[1] Semantic Enterprise Content Management. Chapter in Practical Handbook of Internet Computing. CRC Press.

[2] Man Abrol, Neil Latarche, Uma Mahadevan, Jianchang Mao, Rajat Mukherjee, Prabhakar Raghavan, Michel Tourn, John Wang, and Grace Zhang. Navigating large-scale semi-structured data in business portals. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

[3] Joao Paulo Andrade Almeida and Giancarlo Guizzardi. A semantic foundation for role-related concepts in enterprise modelling. In Proceedings of the International IEEE Enterprise Distributed Object Computing Conference, pages 31-40, Washington, DC, USA. IEEE Computer Society.

[4] Peter Bailey, Nick Craswell, Ian Soboroff, and Arjen P. de Vries. The CSIRO enterprise search test collection. SIGIR Forum, 41:42-45, December.

[5] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30, April.

[6] A. Z. Broder and A. C. Ciccolo. Towards the next generation of enterprise search technology. IBM Syst. J., 43, July.

[7] Guihong Cao, Jian-Yun Nie, Jianfeng Gao, and Stephen Robertson. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, New York, NY, USA. ACM.

[8] Paul Alexandru Chirita, Claudiu S. Firan, and Wolfgang Nejdl. Personalized query expansion for the web. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07, pages 7-14, New York, NY, USA. ACM.

[9] Edward J. Coyne and John M. Davis. Role Engineering for Enterprise Security Management. Artech House, Inc., Norwood, MA, USA, 1st edition.

86 [10] Uwe Crenze, Stefan Köhler, Kristian Hermsdorf, Gunnar Brand, and Sebastian Kluge. Semantic Descriptions in an Enterprise Search Solution, pages Springer-Verlag, Berlin, Heidelberg, [11] Michael I. Jordan David M. Blei, Andrew Y. Ng. Latent dirichlet allocation. Journal of Machine Learning Research, pages , [12] Gianluca Demartini. Leveraging semantic technologies for enterprise search. In Proceedings of the ACM first Ph.D. workshop in CIKM, PIKM 07, pages 25 32, New York, NY, USA, ACM. [13] Fernando Diaz and Donald Metzler. Improving the estimation of relevance models using large external corpora. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR 06, pages , New York, NY, USA, ACM. [14] Pavel A. Dmitriev, Nadav Eiron, Marcus Fontoura, and Eugene Shekita. Using annotations in enterprise search. In Proceedings of the 15th international conference on World Wide Web, WWW 06, pages , New York, NY, USA, ACM. [15] Ronald Fagin, Ravi Kumar, Kevin S. McCurley, Jasmine Novak, D. Sivakumar, John A. Tomlin, and David P. Williamson. Searching the workplace web. In Proceedings of the 12th international conference on World Wide Web, WWW 03, pages , New York, NY, USA, ACM. [16] David F. Ferraiolo, John F. Barkley, and D. Richard Kuhn. A role-based access control model and reference implementation within a corporate intranet. ACM Trans. Inf. Syst. Secur., 2:34 64, February [17] David Hawking. Challenges in enterprise search. In Proceedings of the 15th Australasian database conference - Volume 27, ADC 04, pages 15 24, Darlinghurst, Australia, Australia, Australian Computer Society, Inc. [18] Ben He and Iadh Ounis. Finding good feedback documents. In Proceeding of the 18th ACM conference on Information and knowledge management, CIKM 09, pages , New York, NY, USA, ACM. [19] Djoerd Hiemstra. A probabilistic justification for using tf.idf term weighting in information retrieval. 
International Journal on Digital Libraries, pages , [20] E Ide. New experiments in relevance feedback. The SMART Retrieval System - Experiments in Automatic Document Processing, pages , [21] Masahiro Ito, Kotaro Nakayama, Takahiro Hara, and Shojiro Nishio. Association thesaurus construction methods based on link co-occurrence analysis for wikipedia. In Proceeding of the 17th 74

87 ACM conference on Information and knowledge management, CIKM 08, pages , New York, NY, USA, ACM. [22] Mohammad Jafari and Mohammad Fathian. Management advantages of object classification in role-based access control (rbac). In Proceedings of the 12th Asian computing science conference on Advances in computer science: computer and network security, ASIAN 07, pages , Berlin, Heidelberg, Springer-Verlag. [23] Rianne Kaptein, Marijn Koolen, and Jaap Kamps. Using wikipedia categories for ad hoc search. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR 09, pages , New York, NY, USA, ACM. [24] Axel Kern, Andreas Schaad, and Jonathan Moffett. An administration concept for the enterprise role-based access control model. In Proceedings of the eighth ACM symposium on Access control models and technologies, SACMAT 03, pages 3 11, New York, NY, USA, ACM. [25] Nakayama Kotaro, Takahiro HARA, and Shojiro NISHIO. A thesaurus construction method from large scale web dictionaries. In Proceedings of the 21st International Conference on Advanced Networking and Applications, AINA 07, pages , Washington, DC, USA, IEEE Computer Society. [26] Juhnyoung Lee and Richard Goodwin. Ontology management for large-scale enterprise systems. Electron. Commer. Rec. Appl., 5:2 15, July [27] Longzhuang Li, Yi Shang, and Wei Zhang. Improvement of hits-based algorithms on web documents. In Proceedings of the 11th international conference on World Wide Web, WWW 02, pages , New York, NY, USA, ACM. [28] Yinghao Li, Wing Pong Robert Luk, Kei Shiu Edward Ho, and Fu Lai Korris Chung. Improving weak ad-hoc queries using wikipedia as external corpus. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR 07, pages , New York, NY, USA, ACM. [29] Shuang Liu, Clement Yu, and Weiyi Meng. Word sense disambiguation in queries. 
In Proceedings of the 14th ACM international conference on Information and knowledge management, CIKM 05, pages , New York, NY, USA, ACM. [30] Olena Medelyan, David Milne, Catherine Legg, and Ian H. Witten. Mining meaning from wikipedia. Int. J. Hum.-Comput. Stud., 67: , September [31] David R. H. Miller, Tim Leek, and Richard M. Schwartz. A hidden markov model information retrieval system. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR 99, pages , New York, NY, USA, ACM. 75

88 [32] David Milne, Olena Medelyan, and Ian H. Witten. Mining domain-specific thesauri from wikipedia: A case study. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI 06, pages , Washington, DC, USA, IEEE Computer Society. [33] David N. Milne, Ian H. Witten, and David M. Nichols. A knowledge-based search engine powered by wikipedia. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, CIKM 07, pages , New York, NY, USA, ACM. [34] Rajat Mukherjee and Jianchang Mao. Enterprise search: Tough stuff. Queue, 2:36 46, April [35] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, [36] Joon S. Park and Junseok Hwang. Role-based access control for collaborative enterprise in peerto-peer computing environments. In Proceedings of the eighth ACM symposium on Access control models and technologies, SACMAT 03, pages 93 99, New York, NY, USA, ACM. [37] Jie Peng, Craig Macdonald, Ben He, and Iadh Ounis. A study of selective collection enrichment for enterprise search. In Proceeding of the 18th ACM conference on Information and knowledge management, CIKM 09, pages , New York, NY, USA, ACM. [38] Francisco Joao Pinto, Antonio Farina Martinez, and Carme Fernandez Perez-Sanjulian. Joining automatic query expansion based on thesaurus and word sense disambiguation using wordnet. Int. J. Comput. Appl. Technol., 33: , January [39] Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR 98, pages , New York, NY, USA, ACM. [40] Prabhakar Raghavan. Structured and unstructured search in enterprises. IEEE Data Eng. Bull., 24(4):15 18, [41] J. J. Rocchio. Relevance feedback in information retrieval. 
The SMART Retrieval System - Experiments in Automatic Document Processing, pages , [42] G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, pages , [43] Gerard Salton. Relevance feedback. The SMART Retrieval System - Experiments in Automatic Document Processing, [44] Andreas Schaad, Jonathan Moffett, and Jeremy Jacob. The role-based access control system of a european bank: a case study and discussion. In Proceedings of the sixth ACM symposium on Access control models and technologies, SACMAT 01, pages 3 9, New York, NY, USA, ACM. 76

89 [45] Jürgen Schlegelmilch and Ulrike Steffens. Role mining with orca. In Proceedings of the tenth ACM symposium on Access control models and technologies, SACMAT 05, pages , New York, NY, USA, ACM. [46] Hinrich Schütze, David A. Hull, and Jan O. Pedersen. A comparison of classifiers and document representations for the routing problem. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR 95, pages , New York, NY, USA, ACM. [47] Ahu Sieg, Bamshad Mobasher, and Robin Burke. Web search personalization with ontological user profiles. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, CIKM 07, pages , New York, NY, USA, ACM. [48] Amit Singhal, Mandar Mitra, and Chris Buckley. Learning routing queries in a query zone. In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR 97, pages 25 32, New York, NY, USA, ACM. [49] Jaideep Vaidya, Vijayalakshmi Atluri, and Qi Guo. The role mining problem: finding a minimal descriptive set of roles. In Proceedings of the 12th ACM symposium on Access control models and technologies, SACMAT 07, pages , New York, NY, USA, ACM. [50] Vasudeva Varma, Prasad Pingali, and Nihar Sharma. Role based personalization in enterprise search. pages , [51] Dennis M. Wilkinson and Bernardo A. Huberman. Cooperation and quality in wikipedia. In Proceedings of the 2007 international symposium on Wikis, WikiSym 07, pages , New York, NY, USA, ACM. [52] Chengxiang Zhai and John Lafferty. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the tenth international conference on Information and knowledge management, CIKM 01, pages , New York, NY, USA, ACM. [53] Dana Zhang, Kotagiri Ramamohanarao, and Tim Ebringer. Role engineering using graph optimisation. 
In Proceedings of the 12th ACM symposium on Access control models and technologies, SACMAT 07, pages , New York, NY, USA, ACM. [54] Jiuling Zhang, Beixing Deng, and Xing Li. Concept based query expansion using wordnet. In Proceedings of the 2009 International e-conference on Advanced Science and Technology, AST 09, pages 52 55, Washington, DC, USA, IEEE Computer Society. [55] Huaiyu Zhu, Sriram Raghavan, Shivakumar Vaithyanathan, and Alexander Löser. Navigating the intranet with high precision. In Proceedings of the 16th international conference on World Wide Web, WWW 07, pages , New York, NY, USA, ACM. 77

90 [56] Ziming Zhuang and Silviu Cucerzan. Re-ranking search results using query logs. In Proceedings of the 15th ACM international conference on Information and knowledge management, CIKM 06, pages , New York, NY, USA, ACM. 78


More information

The University of Jordan

The University of Jordan The University of Jordan Master in Web Intelligence Non Thesis Department of Business Information Technology King Abdullah II School for Information Technology The University of Jordan 1 STUDY PLAN MASTER'S

More information

The Value of Taxonomy Management Research Results

The Value of Taxonomy Management Research Results Taxonomy Strategies November 28, 2012 Copyright 2012 Taxonomy Strategies. All rights reserved. The Value of Taxonomy Management Research Results Joseph A Busch, Principal What does taxonomy do for search?

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 [email protected] Christopher D. Manning Department of

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Exam in course TDT4215 Web Intelligence - Solutions and guidelines -

Exam in course TDT4215 Web Intelligence - Solutions and guidelines - English Student no:... Page 1 of 12 Contact during the exam: Geir Solskinnsbakk Phone: 94218 Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Friday May 21, 2010 Time: 0900-1300 Allowed

More information

Three Methods for ediscovery Document Prioritization:

Three Methods for ediscovery Document Prioritization: Three Methods for ediscovery Document Prioritization: Comparing and Contrasting Keyword Search with Concept Based and Support Vector Based "Technology Assisted Review-Predictive Coding" Platforms Tom Groom,

More information

Fight fire with fire when protecting sensitive data

Fight fire with fire when protecting sensitive data Fight fire with fire when protecting sensitive data White paper by Yaniv Avidan published: January 2016 In an era when both routine and non-routine tasks are automated such as having a diagnostic capsule

More information

REVIEW PAPER ON PERFORMANCE OF RESTFUL WEB SERVICES

REVIEW PAPER ON PERFORMANCE OF RESTFUL WEB SERVICES REVIEW PAPER ON PERFORMANCE OF RESTFUL WEB SERVICES Miss.Monali K.Narse 1,Chaitali S.Suratkar 2, Isha M.Shirbhate 3 1 B.E, I.T, JDIET, Yavatmal, Maharashtra, India, [email protected] 2 Assistant

More information

NLUI Server User s Guide

NLUI Server User s Guide By Vadim Berman Monday, 19 March 2012 Overview NLUI (Natural Language User Interface) Server is designed to run scripted applications driven by natural language interaction. Just like a web server application

More information

UTILIZING COMPOUND TERM PROCESSING TO ADDRESS RECORDS MANAGEMENT CHALLENGES

UTILIZING COMPOUND TERM PROCESSING TO ADDRESS RECORDS MANAGEMENT CHALLENGES UTILIZING COMPOUND TERM PROCESSING TO ADDRESS RECORDS MANAGEMENT CHALLENGES CONCEPT SEARCHING This document discusses some of the inherent challenges in implementing and maintaining a sound records management

More information

D6.1: Service management tools implementation and maturity baseline assessment framework

D6.1: Service management tools implementation and maturity baseline assessment framework D6.1: Service management tools implementation and maturity baseline assessment framework Deliverable Document ID Status Version Author(s) Due FedSM- D6.1 Final 1.1 Tomasz Szepieniec, All M10 (31 June 2013)

More information

K@ A collaborative platform for knowledge management

K@ A collaborative platform for knowledge management White Paper K@ A collaborative platform for knowledge management Quinary SpA www.quinary.com via Pietrasanta 14 20141 Milano Italia t +39 02 3090 1500 f +39 02 3090 1501 Copyright 2004 Quinary SpA Index

More information

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,

More information

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

More information

A Business Process Services Portal

A Business Process Services Portal A Business Process Services Portal IBM Research Report RZ 3782 Cédric Favre 1, Zohar Feldman 3, Beat Gfeller 1, Thomas Gschwind 1, Jana Koehler 1, Jochen M. Küster 1, Oleksandr Maistrenko 1, Alexandru

More information

Removing Web Spam Links from Search Engine Results

Removing Web Spam Links from Search Engine Results Removing Web Spam Links from Search Engine Results Manuel EGELE [email protected], 1 Overview Search Engine Optimization and definition of web spam Motivation Approach Inferring importance of features

More information

Appendix B Data Quality Dimensions

Appendix B Data Quality Dimensions Appendix B Data Quality Dimensions Purpose Dimensions of data quality are fundamental to understanding how to improve data. This appendix summarizes, in chronological order of publication, three foundational

More information

Arti Tyagi Sunita Choudhary

Arti Tyagi Sunita Choudhary Volume 5, Issue 3, March 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Web Usage Mining

More information

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University

More information

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION Brian Lao - bjlao Karthik Jagadeesh - kjag Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND There is a large need for improved access to legal help. For example,

More information

Search engine ranking

Search engine ranking Proceedings of the 7 th International Conference on Applied Informatics Eger, Hungary, January 28 31, 2007. Vol. 2. pp. 417 422. Search engine ranking Mária Princz Faculty of Technical Engineering, University

More information

Building Semantic Content Management Framework

Building Semantic Content Management Framework Building Semantic Content Management Framework Eric Yen Computing Centre, Academia Sinica Outline What is CMS Related Work CMS Evaluation, Selection, and Metrics CMS Applications in Academia Sinica Concluding

More information

2014/02/13 Sphinx Lunch

2014/02/13 Sphinx Lunch 2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue

More information

An Overview of Computational Advertising

An Overview of Computational Advertising An Overview of Computational Advertising Evgeniy Gabrilovich in collaboration with many colleagues throughout the company 1 What is Computational Advertising? New scientific sub-discipline that provides

More information