Exploring Adaptive Window Sizes for Entity Retrieval

Fawaz Alarfaj, Udo Kruschwitz, and Chris Fox
School of Computer Science and Electronic Engineering
University of Essex
Colchester, CO4 3SQ, UK

M. de Rijke et al. (Eds.): ECIR 2014, LNCS 8416, © Springer International Publishing Switzerland 2014

Abstract. As modern search engines increasingly aim to retrieve entities, and not just documents, for a given query, we introduce a new method for enhancing the entity-ranking task. An entity-ranking task is concerned with retrieving a ranked list of entities in response to a specific query. Some successful models have used the idea of association discovery in a window of text rather than in the whole document. However, these studies considered only fixed window sizes. This work proposes a way of generating an adaptive window size for each document by utilising some of the document features. These features include document length, average sentence length, number of entities in the document, and the readability index. Experimental results show that taking these document features into account when determining the window size has a positive effect.

1 Introduction

In an organisational setting, search engines have become mandatory for aiding knowledge workers with their day-to-day information needs. Traditionally, information retrieval systems return a list of documents in response to the user's query, although the information needed is not necessarily in the form of documents. In fact, users more often search for specific things or entities, which include people, organisations, or products. One special type of entity search is concerned with finding people who have specific knowledge. This type of entity search is called expert-finding, i.e. identifying experts who have the relevant skills and knowledge on a given topic [4].

Given a search topic, state-of-the-art expert-finding systems measure the knowledge of each candidate expert using the content of the highest-ranking documents, identifying associations based on co-occurrences between the search topic and the candidate evidence [2]. Evidence of expertise can be considered to be marked by the search terms; furthermore, the number and frequency of such evidence are used to estimate the likelihood that an individual is an expert. There are two main assumptions. First, the more often a candidate appears in a document that contains the topic terms, the more likely they are to be an expert on the subject. Second, the association is stronger when the candidate identifiers are closer to the search terms. With these assumptions in mind, some research has used fixed-size windows to measure the proximity between candidate identifiers and search terms. Zhu et al. tested 31 window sizes on the W3C collection and found the best window size to be around 200 words. According to Zhu et al., small window sizes could lead to high precision but low recall; conversely, large window sizes lead to high recall but low precision [7]. Therefore, other studies consider multiple levels of association in documents by combining multiple fixed window sizes [3].

In this paper, we consider the idea of an adaptive window size, where the size of the window is a function of various document features. We argue that, in general, each document has distinct features that differ from those of other documents in the collection. The proposed idea is to use these features to set the window size in order to improve the overall ranking function. While many document features could be examined, this work focuses on four main features: document length; candidate frequency (i.e., the number of candidates that appear in a document); average sentence length; and readability index. To the best of our knowledge, no existing work has dealt with using document features to determine the optimal window size for the proximity function, apart from our earlier work [1].

It is important to note that the adaptive window size approach could be applied to any proximity search, in particular to entity-oriented search, which generalises expert search. The study is performed in the expert-search domain due to the availability of expert-search benchmarks. The main research question considered is whether an adaptive window size leads to improvements over fixed window size methods.

2 Adaptive Window Size for Proximity Ranking

The window size for the proximity function is determined for each document based on the following features.

Document Length: according to Miao et al.
[5], in large documents it is more likely to find more occurrences of a query topic. It is also more likely that such documents contain irrelevant words (noise). Thus, in order to minimise the negative influence of noise, the window size should become relatively smaller as the document gets bigger.

Candidate Frequency: the number of candidates found in a document. When a document has more occurrences of candidates' evidence, the window size should be relatively larger to accommodate more occurrences.

Average Sentence Length: the window size is adjusted in proportion to the average sentence length (in tokens) in the document.

Readability Index¹: the window size is adjusted using the readability index, where the window gets bigger whenever the index gets smaller.

These features are combined in the following equation:

$$ \mathit{WindowSize} = \frac{\sigma}{4} \left( \log\!\left(\frac{1}{\mathit{DocLength}}\right) \beta_1 + \mathit{CanFreq} \cdot \beta_2 + \mathit{AvgSentSize} \cdot \beta_3 + \mathit{ReadabilityIndex} \cdot \beta_4 \right) \tag{1} $$

¹ The Flesch-Kincaid test is used to calculate the Readability Index in this experiment.
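As a rough illustration of how Eq. (1) could be evaluated for one document, the sketch below reads the equation as σ/4 times a weighted sum of the four feature terms. The function name, the example weights, and the clamping to at least one token are assumptions made for this sketch; the paper determines the β weights empirically and does not report them here.

```python
import math


def adaptive_window_size(doc_length, candidate_freq, avg_sentence_len,
                         readability_index, sigma, betas):
    """Evaluate Eq. (1) for one document and return a window size in tokens.

    All parameter names are illustrative. `betas` holds the four weights
    (beta_1 .. beta_4); the values used below are made up for the example.
    """
    b1, b2, b3, b4 = betas
    weighted_sum = (
        math.log(1.0 / doc_length) * b1   # longer documents shrink the window
        + candidate_freq * b2             # more candidates widen the window
        + avg_sentence_len * b3           # longer sentences widen the window
        + readability_index * b4          # sign of beta_4 sets the direction
    )
    raw = sigma / 4.0 * weighted_sum
    # Assumption of this sketch: a window must cover at least one token.
    return max(1, round(raw))


# Hypothetical weights; a negative beta_4 makes the window grow as the
# readability index gets smaller, as the text describes.
example_betas = (1.0, 2.0, 1.0, -0.5)
short_doc = adaptive_window_size(1000, 12, 20.0, 10.0, 10.0, example_betas)
long_doc = adaptive_window_size(10000, 12, 20.0, 10.0, 10.0, example_betas)
print(short_doc, long_doc)
```

With everything else held equal, the longer document receives the smaller window, which matches the stated intent of the document-length feature.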

The variable σ scales the window size. The weighting factors β, which determine each feature's contribution to the equation, are determined empirically. Once the size of the window has been identified, it is applied to all search terms found in the document, enabling the extraction of the candidates' evidence accompanying each search term. Each piece of evidence is given a weight depending on its proximity to the search term within the window. The proximity weight is calculated using a Gaussian kernel function which, according to previous work [6], produces the best results in this context.

3 Experiments

Improving on our earlier work [1], we have added a new feature, the readability index, to the set of features. Moreover, we have applied this method to additional test collections. In this work, two datasets are used to test the proposed approach, the W3C corpus and the CSIRO corpus, which together cover the four test collections of the TREC Enterprise Track (TREC 2005 to TREC 2008; see Table 1). We used 10 training topics to train our variables, thus having a clear distinction between test and training data.

Table 1. TREC Expert Finding Test Collections

              W3C                      CSIRO
              TREC 2005   TREC 2006    TREC 2007   TREC 2008
Documents     331, , , ,715
Candidates    1,092       1,092        3,000       3,000
Size          5.7 GB      5.7 GB       4.2 GB      4.2 GB
Topics
Qrels

Stopwords and HTML markup were eliminated prior to processing. Lucene 3 was used as the retrieval engine. For evaluation, we applied a range of standard IR measures, but in our discussion we focus on mean average precision (MAP). In this work, we use the two-stage model [2] for the initial candidate ranking as follows:

$$ p(ca \mid q) = \sum_{i=1}^{|D|} p(d_i \mid q) \, p(ca \mid d_i, q) \tag{2} $$

where $p(d_i \mid q)$ is the document relevance to the query, which is calculated by the underlying search engine, and $p(ca \mid d_i, q)$ is calculated using the two assumptions mentioned earlier:

$$ p(ca \mid d, q) = \frac{P_{occu}(ca \mid d) + P_{kernel}(ca \mid d)}{\zeta} \tag{3} $$

where $P_{occu}(ca \mid d)$ represents the first assumption (i.e., the more often the candidate appears in relevant documents, the more likely he/she is an expert), and $P_{kernel}(ca \mid d)$ represents the second assumption (i.e., the closer the candidate appears to relevant terms, the more likely he/she is an expert). In this work, the two probabilities are considered independent, hence the summation. The value of the constant ζ is chosen to ensure that $p(ca \mid d, q)$ is a probability measure. It is computed as follows:

$$ \zeta = \sum_{i=1}^{N} \left( P_{occu}(ca_i \mid d) + P_{kernel}(ca_i \mid d) \right) $$

where N is the total number of candidates in the document d. For the co-occurrence part (i.e., $P_{occu}(ca \mid d)$), a TF-IDF weighting scheme is applied [3]:

$$ P_{occu}(ca \mid d) = \frac{n(ca, d)}{\sum_i n(ca_i, d)} \cdot \log \frac{|D|}{|\{d' : n(ca, d') > 0\}|} \tag{4} $$

where $n(ca, d)$ is the number of times the candidate appears in the document, $\sum_i n(ca_i, d)$ is the number of times any candidate appears in it, $|D|$ is the number of documents in the collection, and $|\{d' : n(ca, d') > 0\}|$ is the number of documents in which the candidate appears. Finally, $P_{kernel}(ca \mid d)$ is defined as follows:

$$ P_{kernel}(ca \mid d) = \frac{k(t, c)}{\sum_{i=1}^{N} k(t, ca_i)} \tag{5} $$

As mentioned above, non-uniform Gaussian kernel functions have been used to calculate the candidate's proximity:

$$ k(t, c) = \begin{cases} \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{u^2}{2\sigma^2}\right), \; u = c - t, & \text{if } |c - t| \le w \\ 0, & \text{otherwise} \end{cases} \tag{6} $$

where c is the candidate position in the document, t is the topic position, and w is the window size for the current document.

For further elucidation, Figure 1 shows a simple illustrative example of how $p(ca \mid q)$ is measured. The example topic returned three relevant documents, which are used to rank three candidates. In this example, each candidate $c_i$ has n, the number of times he/she appears in the document, and k, the result of the kernel function. The two ranking models ($P_{occu}(ca \mid d)$ and $P_{kernel}(ca \mid d)$) are combined to determine the final candidate rank.
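The scoring steps in Eqs. (2) to (6) can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the kernel in Eq. (6) is the standard Gaussian density and is zero outside the window, it takes one kernel value per candidate per document, and all function names, candidate names, positions, counts, and document scores are invented for the example.

```python
import math


def gaussian_kernel(cand_pos, topic_pos, window, sigma):
    """Eq. (6): Gaussian weight inside the window, zero outside it."""
    u = cand_pos - topic_pos
    if abs(u) > window:
        return 0.0
    return math.exp(-u * u / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


def p_occu(counts, cand, num_docs, doc_freq):
    """Eq. (4): TF-IDF style co-occurrence score of a candidate in a document."""
    total = sum(counts.values())
    if total == 0 or counts.get(cand, 0) == 0:
        return 0.0
    return counts[cand] / total * math.log(num_docs / doc_freq[cand])


def p_kernel(kernels, cand):
    """Eq. (5): a candidate's kernel value normalised over all candidates."""
    total = sum(kernels.values())
    return kernels.get(cand, 0.0) / total if total > 0 else 0.0


def p_cand_given_doc(counts, kernels, num_docs, doc_freq):
    """Eq. (3): sum of both scores, divided by zeta so the values sum to 1."""
    cands = set(counts) | set(kernels)
    raw = {c: p_occu(counts, c, num_docs, doc_freq) + p_kernel(kernels, c)
           for c in cands}
    zeta = sum(raw.values())
    return {c: (v / zeta if zeta > 0 else 0.0) for c, v in raw.items()}


def rank_candidates(docs, num_docs, doc_freq):
    """Eq. (2): accumulate p(d|q) * p(ca|d,q) over the retrieved documents."""
    scores = {}
    for doc in docs:
        per_doc = p_cand_given_doc(doc["counts"], doc["kernels"], num_docs, doc_freq)
        for cand, prob in per_doc.items():
            scores[cand] = scores.get(cand, 0.0) + doc["p_d_q"] * prob
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative data only: two retrieved documents and two candidates.
doc_freq = {"alice": 10, "bob": 50}   # documents mentioning each candidate
docs = [
    {"p_d_q": 0.6,
     "counts": {"alice": 3, "bob": 1},
     "kernels": {"alice": gaussian_kernel(105, 100, 50, 20.0),
                 "bob": gaussian_kernel(140, 100, 50, 20.0)}},
    {"p_d_q": 0.4,
     "counts": {"bob": 2},
     "kernels": {"bob": gaussian_kernel(30, 20, 50, 20.0)}},
]
ranking = rank_candidates(docs, num_docs=100, doc_freq=doc_freq)
print(ranking)
```

In this toy run the candidate who is both rarer in the collection and closer to the topic term in the higher-scoring document ends up first, which is the behaviour the two assumptions are meant to produce. The window argument would come from the adaptive window size of each document.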
To test the effect of each document feature separately, we first generate the adaptive window size from a single feature, using the training topics. Figure 2 shows the MAP over a range of sigma values starting at 0. An analysis of variance (ANOVA) test at p < 0.05 suggested a statistically significant difference between the features. This is true for all datasets. It is clear from the figure that the second feature (i.e., the number of candidates in the document) scores the highest in all datasets.

[Figure 1: three documents d1, d2, and d3 retrieved for a query q. For each candidate in each document, the occurrence count n and the kernel value k are turned into P_occu(ca|d) and P_kernel(ca|d), normalised, and summed into the final rank.]

Fig. 1. An example for the system framework

[Figure 2: four panels, one per TREC test collection; y-axis: MAP.]

Fig. 2. The results with an adaptive window using a single feature, where 1 is the document length, 2 is the number of candidates in the document, 3 is the average sentence length, and 4 is the readability index

In order to compare the proposed adaptive-window method to a strong baseline, a fixed window size of 200 words is used, as suggested by [7], with Gaussian proximity functions. For comparison, we also added the highest-scoring result from the TREC Enterprise track. Table 2 shows that the proposed method resulted in an improvement ranging from 10% to 20% over the fixed window baseline. Using a paired t-test on average precision values, we found the difference between our best run and the corresponding baseline to be statistically significant. Significance at p < 0.01 and at p < 0.05 is marked in the table. Significance is reported for MAP only.
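For readers who want to reproduce this kind of comparison, the following sketch shows how average precision, MAP, and a paired t statistic over per-topic average precision values might be computed. It is an assumed, generic evaluation helper rather than the evaluation code used in the paper, and the rankings and per-topic scores in the example are invented.

```python
import math
from statistics import mean, stdev


def average_precision(ranked, relevant):
    """Average precision of one ranked candidate list against a relevant set."""
    hits = 0
    precision_sum = 0.0
    for rank, cand in enumerate(ranked, start=1):
        if cand in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant hit
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over (ranked list, relevant set) pairs, one pair per topic."""
    return mean(average_precision(ranked, relevant) for ranked, relevant in runs)


def paired_t_statistic(system_ap, baseline_ap):
    """Paired t statistic over per-topic AP values (needs non-constant diffs)."""
    diffs = [s - b for s, b in zip(system_ap, baseline_ap)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Invented per-topic average precision values for two systems.
adaptive_ap = [0.5, 0.6, 0.7]
fixed_ap = [0.4, 0.4, 0.6]
t_value = paired_t_statistic(adaptive_ap, fixed_ap)
relative_gain = (mean(adaptive_ap) - mean(fixed_ap)) / mean(fixed_ap)
print(t_value, relative_gain)
```

The t statistic would then be compared against the t distribution with one fewer degree of freedom than the number of topics to obtain a p-value.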

Table 2. Summarised results. MAP is Mean Average Precision and MRR is Mean Reciprocal Rank.

                     W3C                        CSIRO
                     TREC 2005    TREC 2006     TREC 2007    TREC 2008
                     MAP   MRR    MAP   MRR     MAP   MRR    MAP   MRR
Fix200 Gaussian
Adaptive Gaussian
Best TREC

4 Conclusions and Future Work

We proposed an approach to adaptively select the size of the context window for boosting the retrieval scores of the entities that are close to query terms. We argue that the size of the window should not be fixed for all documents; rather, it should depend on the features of the current document. We found that adopting this method results in significant improvements on standard metrics. Moreover, we find that the results of the adaptive window using all four features outperform the results using only a single feature. Among the four features used in this study, the number of candidates appears to be the most important. Going forward, we intend to apply the adaptive window size method to other TREC benchmarks and expert-finding collections. Furthermore, we will investigate whether other document features can be used effectively to determine the window size.

References

1. Alarfaj, F., Kruschwitz, U., Fox, C.: An adaptive window-size approach for expert-finding. In: DIR 2013, Delft, The Netherlands (April 2013)
2. Balog, K., Fang, Y., de Rijke, M., Serdyukov, P., Si, L.: Expertise retrieval. Foundations and Trends in Information Retrieval 6(2-3) (2012)
3. Balog, K., Azzopardi, L., de Rijke, M.: A language modeling framework for expert finding. Information Processing and Management 45(1), 1-19 (2009)
4. Macdonald, C., Ounis, I.: Searching for expertise: Experiments with the voting model. The Computer Journal 52(7) (2009)
5. Miao, J., Huang, J.X., Ye, Z.: Proximity-based Rocchio's model for pseudo relevance. In: SIGIR 2012, Portland, Oregon (2012)
6. Petkova, D., Croft, W.: Proximity-based document representation for named entity retrieval. In: CIKM 2007. ACM, New York (2007)
7. Zhu, J., Song, D., Rüger, S.: Integrating multiple windows and document features for expert finding. JASIST 60(4) (2009)