TEMPER: A Temporal Relevance Feedback Method


Mostafa Keikha, Shima Gerani and Fabio Crestani
{mostafa.keikha, shima.gerani, fabio.crestani}@usi.ch
University of Lugano, Lugano, Switzerland

Abstract. The goal of a blog distillation (blog feed search) method is to rank blogs according to their recurrent relevance to the query. An interesting property of blog distillation, which differentiates it from traditional retrieval tasks, is its dependency on time. In this paper we investigate the effect of time dependency in query expansion. We propose a framework, TEMPER, which selects different terms for different times and ranks blogs according to their relevance to the query over time. By generating multiple expanded queries based on time, we are able to capture the dynamics of the topic both in aspects and in vocabulary usage. We show performance gains over baseline techniques that generate a single expanded query using the top retrieved posts or blogs irrespective of time.

1 Introduction

User-generated content is growing very fast and is becoming one of the most important sources of information on the Web. Blogs are one of the main sources of information in this category. Millions of people write about their experiences and express their opinions in blogs every day. Considering this huge amount of user-generated data and its specific properties, new retrieval methods are needed to address the different types of information needs that blog users may have.

Users' information needs in the blogosphere are different from those of general Web users. Mishne and de Rijke [1] analyzed a blog query log and accordingly divided blog queries into two broad categories, called context queries and concept queries. In context queries, users are looking for contexts of blogs in which a Named Entity occurred, to find out what bloggers say about it, whereas in concept queries they are looking for blogs that deal with one of the searcher's topics of interest.

In this paper we focus on the blog distillation task (also known as blog feed search)¹, where the goal is to answer topics from the second category [2]. Blog distillation is concerned with ranking blogs according to their recurring central interest in the topic of a user's query. In other words, our aim is to discover relevant blogs for each topic² that a user can add to his reader and read in the future [3].

¹ In this paper we use the words feed and blog interchangeably.
² In this paper we use the words topic and query interchangeably.

An important aspect of blog distillation, which differentiates it from other IR tasks, is related to the temporal properties of blogs and topics. Distillation topics are often multifaceted and can be discussed from different perspectives [4]. Vocabulary usage in the documents relevant to a topic can change over time in order to express different aspects (or sub-topics) of the query. These dynamics might create a term mismatch problem over time, such that a query term may not be a good indicator of the query topic in all time intervals. To address this problem, we propose a time-based query expansion method that expands queries with different terms at different times. This contrasts with other query expansion methods applied in blog search, which generate only a single query in the expansion phase [5, 4]. Our experiments on different test collections and different baseline methods indicate that time-based query expansion is effective in improving retrieval performance and can outperform existing techniques.

The rest of the paper is organized as follows. In section 2 we review state-of-the-art methods in blog retrieval. Section 3 describes existing query expansion methods for blog retrieval in more detail. Section 4 explains our time-based query expansion approach. Experimental results over different blog data sets are discussed in section 6. Finally, we conclude the paper and describe future work in section 7.

2 Related Work

The main research on blog distillation started after 2007, when the TREC organizers proposed this task in the blog track [3]. Researchers have applied different methods from areas that are similar to blog distillation, such as ad-hoc search, expert search and resource selection in distributed information retrieval.

The simplest models use ad-hoc search methods for finding blogs relevant to a specific topic. They treat each blog as one long document created by concatenating all of its posts together [6, 4, 7]. These methods ignore any specific property of blogs and mostly use standard IR techniques to rank blogs. Despite their simplicity, these methods perform fairly well in blog retrieval.

Other approaches have been adapted from expert search methods [8, 2]. In these models, each post in a blog is seen as evidence that the blog has an interest in the query topic. In [2], Macdonald et al. use data fusion models to combine this evidence and compute a final relevance score for the blog, while Balog et al. adapt two language modeling approaches of expert finding and show their effectiveness in blog distillation [8].

Resource selection methods from distributed information retrieval have also been applied to blog retrieval [4, 9, 7]. Elsas et al. deal with blog distillation as a resource selection problem [4, 9]. They model each blog as a collection of posts and use a language modeling approach to select the best collection. A similar approach is proposed by Seo and Croft [7], which they call Pseudo Cluster Selection. They create topic-based clusters of posts in each blog and select blogs that have the clusters most similar to the query.

Temporal properties of posts have been considered in different ways in blog retrieval. Nunes et al. define two new measures, called temporal span and temporal dispersion, to evaluate how long and how frequently a blog has been writing about a topic [10]. Similarly, Macdonald and Ounis [2] use a heuristic measure to capture the recurring interests of blogs over time. Some other approaches give higher scores to more recent posts before aggregating them [11, 12]. All these methods and their improvements show the importance and usefulness of temporal information in blog retrieval. However, none of them investigates the effect of time on the vocabulary change for a topic. We employ temporal information as a source for distinguishing between different aspects of a topic and the terms that are used for each aspect. This leads us to a time-based query expansion method in which we generate multiple expanded queries to cover multiple aspects of a topic over time.

Different query expansion possibilities for blog retrieval have been explored by Elsas et al. [4] and Lee et al. [5]. Since we use these methods as our baselines, we discuss them in more detail in the next section.

3 Query Expansion in Blog Retrieval

Query expansion is known to be effective in improving the performance of retrieval systems [13–15]. In general, the idea is to add more terms to an initial query in order to disambiguate the query and solve the possible term mismatch problem between the query and the relevant documents. Automatic query expansion techniques usually assume that the top retrieved documents are relevant to the topic and use their content to generate an expanded query. In some situations, it has been shown that it is better to have multiple expanded queries as opposed to the usual single expanded query, for example in the server-based query expansion technique in distributed information retrieval [16].

An expanded query, while being relevant to the original query, should cover as many aspects of the query as possible. If the expanded query is very specific to some aspect of the original query, we will miss part of the relevant documents in the re-ranking phase. In the blog search context, where queries are more general than normal web search queries [4], the coverage of the expanded query becomes even more important. Thus in this setting, it might be better to have multiple queries where each one covers different aspects of a general query.

Elsas et al. made the first investigation of query expansion techniques for blog search [4]. They show that normal feedback methods (selecting the new terms from the top retrieved posts or the top retrieved blogs) with the usual parameter settings are not effective in blog retrieval. However, they show that expanding the query using an external resource like Wikipedia can improve the performance of the system. In more recent work, Lee et al. [5] propose new methods for selecting appropriate posts as the source of expansion and show that these methods can be effective in retrieval. All these methods can be summarized as follows:

Top Feeds: Uses all the posts of the top retrieved feeds for query expansion. This model has two parameters: the number of selected feeds and the number of terms in the expanded query [4].

Top Posts: Uses the top retrieved posts for query expansion. The number of selected posts and the number of terms to use for expansion are the parameters of this model [4].

FFBS: Uses the top posts in the top retrieved feeds as the source for selecting the new terms. The number of posts selected from each feed is the same for all feeds. This model has three parameters: the number of selected feeds, the number of selected posts in each feed and the number of terms selected for the expansion [5].

WFBS: Works the same as FFBS. The only difference is that the number of posts selected for each feed depends on the feed's rank in the initial list, such that more relevant feeds contribute more to generating the new query. Like FFBS, WFBS has three parameters: the number of selected feeds, the total number of posts to be used in the expansion and the number of selected terms [5].

Among these methods, Top Feeds may expand the query with non-relevant terms, because not all the posts in a top retrieved feed are necessarily relevant to the topic. On the other hand, Top Posts might not have enough coverage of all the subtopics of the query, because the top retrieved posts might be mainly relevant to some dominant aspect of the query. FFBS and WFBS were originally proposed in order to have more coverage than Top Posts while selecting more relevant terms than Top Feeds [5]. However, since it is difficult to summarize all the aspects of the topic in one single expanded query, these methods do not achieve the maximum possible coverage.

4 TEMPER

In this section we describe our novel framework for time-based relevance feedback in blog distillation, called TEMPER. TEMPER assumes that posts at different times talk about different aspects (sub-topics) of a general topic. Therefore, vocabulary usage for the topic is time-dependent, and this dependency can be exploited in a relevance feedback method. Following this intuition, TEMPER selects time-dependent terms for query expansion and generates one query for each time point. We can summarize the TEMPER framework in the following three steps:

1. Time-based representation of blogs and queries
2. Time-based similarity between a blog and a query
3. Ranking blogs according to their overall similarity to the query

In the remainder of this section, we describe our approach to each of these steps.

4.1 Time-Based Representation of Blogs and Queries

Initial Representation of Blogs and Queries

In order to consider time in the TEMPER framework, we first need to represent blogs and queries in the time space. To represent a blog, we distribute its posts based on their publish date. In order to have a daily representation of the blog, we concatenate all the posts that have the same date.

To represent a query, we take advantage of the top retrieved posts for the query. As with the blog representation, we select the top K relevant posts for the query and divide them based on their publish date, concatenating posts with the same date. In order to have a more informative representation of the query, we select the top N terms for each day using the KL-divergence between the term distribution of the day and that of the whole collection [17].

Note that in the initial representation, there can be days that do not have any term distribution associated with them. However, in order to calculate the relevance of a blog to a query, TEMPER needs a representation of the blog and the query for every day. We employ the information available in the initial representation to estimate the term distributions for the remaining days. In the rest of this section, we explain our method for estimating these representations.

Term Distributions Over Time

TEMPER generates a representation for each topic or blog for each day based on the idea that a term at each time position propagates its count to the other time positions through a proximity-based density function. By doing so, we obtain a virtual document for a blog/topic at each specific time position. The term frequencies of such a document are calculated as follows:

$$ tf'(t, d, i) = \sum_{j=1}^{T} tf(t, d, j)\, k(i, j) \tag{1} $$

where i and j indicate time positions (days) in the time space and T denotes the time span of the collection. tf'(t, d, i) is the term frequency of term t in blog/topic d at day i, and it is calculated from the frequency of t in all days. k(i, j) decreases as the distance between i and j increases and can be calculated using the kernel functions that we describe later.

The proposed representation of a document in the time space is similar to the proximity-based methods that generate a virtual document at each position of the document in order to capture the proximity of the words [18, 19]. However, here we aim to capture the temporal proximity of terms. In this paper we employ the Laplace kernel function, which has been shown to be effective in previous work [19], together with the Rectangular (square) kernel function. Both kernels are used in normalized form and are defined below together with their variances.
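As an illustration of the representation step, the following Python sketch groups posts by day, keeps the top N terms per day, and propagates counts over time as in Eq. (1). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `(day_index, terms)` post layout, the small probability floor for unseen terms, and the choice to rank terms by their individual contribution to the KL-divergence are all hypothetical. The `kernel` argument is any callable k(i, j).

```python
import math
from collections import Counter, defaultdict


def daily_counts(posts):
    """Concatenate posts by publish day: {day_index: Counter(term -> count)}.

    `posts` is an iterable of (day_index, list_of_terms); this layout and
    the integer day indices are assumptions of the sketch.
    """
    days = defaultdict(Counter)
    for day, terms in posts:
        days[day].update(terms)
    return dict(days)


def top_terms_by_kl(day_tf, collection_prob, n):
    """Keep the n terms with the largest p(t|day) * log(p(t|day) / p(t|C)).

    Ranking terms by their pointwise KL contribution is one plausible
    reading of the term selection step, assumed here for illustration.
    """
    total = sum(day_tf.values())
    scored = {}
    for term, count in day_tf.items():
        p_day = count / total
        p_coll = collection_prob.get(term, 1e-9)  # floor for unseen terms (assumption)
        scored[term] = p_day * math.log(p_day / p_coll)
    keep = sorted(scored, key=scored.get, reverse=True)[:n]
    return Counter({t: day_tf[t] for t in keep})


def propagate(days, span, kernel):
    """Eq. (1): tf'(t, d, i) = sum_j tf(t, d, j) * k(i, j), for i in 1..span."""
    smoothed = {i: defaultdict(float) for i in range(1, span + 1)}
    for j, tf in days.items():
        for i in range(1, span + 1):
            weight = kernel(i, j)
            if weight == 0.0:
                continue  # day j contributes nothing to day i
            for term, count in tf.items():
                smoothed[i][term] += count * weight
    return smoothed
```

With this layout, a day that has no posts of its own still receives a term distribution from neighbouring days, which is the property the similarity computation relies on.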

1. Laplace Kernel

$$ k(i, j) = \frac{1}{2b} \exp\left[ -\frac{|i - j|}{b} \right], \qquad \text{where } \sigma^2 = 2b^2 \tag{2} $$

2. Rectangular Kernel

$$ k(i, j) = \begin{cases} \dfrac{1}{2a} & \text{if } |i - j| \le a \\ 0 & \text{otherwise} \end{cases}, \qquad \text{where } \sigma^2 = \frac{a^2}{3} \tag{3} $$

4.2 Time-Based Similarity Measure

Given the daily representations of queries and blogs, we can calculate the daily similarity between these two representations and create a daily similarity vector for the blog and the query. The final similarity between the blog and the query is then calculated by summing over the daily similarities:

$$ sim_{temporal}(B, Q) = \sum_{i=1}^{T} sim(B, Q, i) \tag{4} $$

where sim(B, Q, i) is the similarity between the blog and the query representations at day i, and T is the time span of the collection in days. Another popular method in time series similarity calculation is to see each time point as one dimension in the time space and to use the Euclidean length of the daily similarity vector as the final similarity between the two representations [20]:

$$ sim_{temporal}(B, Q) = \sqrt{ \sum_{i=1}^{T} sim(B, Q, i)^2 } \tag{5} $$

We use the cosine similarity as a simple and effective measure for calculating the similarity between the blog and the topic representations at a specific day i:

$$ sim(B, Q, i) = \frac{ \sum_{w} tf'(w, B, i)\, tf'(w, Q, i) }{ \sqrt{ \sum_{w} tf'(w, B, i)^2 } \; \sqrt{ \sum_{w} tf'(w, Q, i)^2 } } \tag{6} $$

The normalized value of the temporal similarity over all blogs is then used as P_temporal:

$$ P_{temporal}(B \mid Q) = \frac{ sim_{temporal}(B, Q) }{ \sum_{B'} sim_{temporal}(B', Q) } \tag{7} $$

Finally, in order to take advantage of all the available evidence regarding blog relevance, we interpolate the temporal score of the blog with its initial relevance score, as given in Eq. (8).
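The kernels and similarity measures of Eqs. (2) to (7) can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: all function names are hypothetical, daily representations are assumed to be dictionaries mapping day index to sparse term-weight dictionaries, and `bandwidths_from_sigma` simply inverts the variance formulas stated with Eqs. (2) and (3).

```python
import math


def laplace_kernel(i, j, b):
    """Eq. (2): k(i, j) = exp(-|i - j| / b) / (2b)."""
    return math.exp(-abs(i - j) / b) / (2.0 * b)


def rectangular_kernel(i, j, a):
    """Eq. (3): k(i, j) = 1 / (2a) if |i - j| <= a, otherwise 0."""
    return 1.0 / (2.0 * a) if abs(i - j) <= a else 0.0


def bandwidths_from_sigma(sigma):
    """Invert sigma^2 = 2 b^2 and sigma^2 = a^2 / 3; returns (b, a)."""
    return sigma / math.sqrt(2.0), sigma * math.sqrt(3.0)


def cosine(u, v):
    """Eq. (6): cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm > 0 else 0.0


def temporal_similarity(blog_days, query_days, span, euclidean=False):
    """Eq. (4) (sum of daily similarities) or Eq. (5) (Euclidean length)."""
    daily = [
        cosine(blog_days.get(i, {}), query_days.get(i, {}))
        for i in range(1, span + 1)
    ]
    if euclidean:
        return math.sqrt(sum(s * s for s in daily))
    return sum(daily)


def normalize(scores):
    """Eq. (7): divide each blog's temporal similarity by the sum over blogs."""
    total = sum(scores.values())
    return {b: (s / total if total > 0 else 0.0) for b, s in scores.items()}
```

Because every daily cosine lies in [0, 1] for non-negative term weights, the Euclidean variant favours blogs with a few very similar days, while the sum rewards similarity spread over many days.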

$$ P(B \mid Q) = \alpha\, P_{initial}(B \mid Q) + (1 - \alpha)\, P_{temporal}(B \mid Q) \tag{8} $$

where α is a parameter that controls the amount of temporal relevance that is considered in the model. We use the Blogger Model method for the initial ranking of the blogs [8]. The only difference from the original Blogger Model is that we set the prior of a blog to be proportional to the log of the number of its posts, as opposed to the uniform prior that was used in the original Blogger Model. This log-based prior has been used and shown to be effective by Elsas et al. [4].

Table 1. Effect of cleaning the data set on Blogger Model, with statistically significant improvements at the 0.05 level marked. Columns: Model, Cleaned, MAP, Bpref. Rows: BloggerModel (not cleaned), BloggerModel (cleaned).

5 Experimental Setup

In this section we explain our experimental setup for evaluating the effectiveness of the proposed framework.

Collection and Topics

We conduct our experiments over three years' worth of TREC blog track data from the blog distillation task, including the TREC 07, TREC 08 and TREC 09 data sets. The TREC 07 and TREC 08 data sets include 45 and 50 assessed queries respectively and use the Blog06 collection. The TREC 09 data set uses Blog08, a new collection of blogs, and has 39 new queries³. We use only the title of the topics as the queries.

The Blog06 collection is a crawl of about one hundred thousand blogs over an 11-week period [22], and includes blog posts (permalinks), feed, and homepage for each blog. Blog08 is a collection of about one million blogs crawled over a year with the same structure as the Blog06 collection [21]. In our experiments we only use the permalinks component of the collection, which consists of approximately 3.2 million documents for Blog06 and about 28.4 million documents for Blog08.

We use the Terrier Information Retrieval system to index the collection with the default stemming and stopword removal. The language modeling approach with Dirichlet smoothing is used to score the posts and retrieve the top posts for each query.

³ Initially there were 50 queries in the TREC 2009 data set, but some of them did not have relevant blogs for the selected facets and were removed from the official query set [21]. We do not use the facets in this paper; however, we use the official query set to be able to compare with the TREC results.
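For readers unfamiliar with the post scoring step, the following Python sketch shows the standard query-likelihood score with Dirichlet smoothing, p(w|d) = (c(w, d) + μ p(w|C)) / (|d| + μ). It is a generic illustration and not Terrier's implementation: the function name, the default μ = 2000 and the probability floor for terms missing from the collection model are assumptions of the sketch.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query_terms, doc_terms, collection_prob, mu=2000.0):
    """Log query likelihood of a post under a Dirichlet-smoothed language model.

    collection_prob maps term -> p(term | collection); mu is the smoothing
    parameter (its default here is an illustrative choice).
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = collection_prob.get(term, 1e-9)  # floor for unseen terms (assumption)
        p = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p)
    return score
```

Posts would then be ranked by this score in decreasing order, and the top K posts passed on to the query representation step.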

Table 2. Evaluation results for the implemented models over the TREC09 data set. Models compared: BloggerModel, TopFeeds, TopPosts, FFBS, WFBS, TEMPER-Rectangular-Sum, TEMPER-Rectangular-Euclidian, TEMPER-Laplace-Sum, TEMPER-Laplace-Euclidian.

Retrieval Baselines

We apply our feedback methods to the results of the Blogger Model method [8]. Therefore, Blogger Model is the first baseline against which we compare the performance of our proposed methods. The second set of baselines consists of the query expansion methods proposed in previous work [4, 5]. In order to have a fair comparison, we implemented these query expansion methods on top of Blogger Model. We tuned the parameters of these models using 10-fold cross validation in order to maximize MAP. The last set of baselines is provided by the TREC organizers as part of the blog facet distillation task. We use these baselines to see the effect of TEMPER in re-ranking the results of other retrieval systems.

Evaluation

We used the blog distillation relevance judgements provided by TREC for evaluation. We report Mean Average Precision (MAP) as well as binary preference (bpref) and precision at 10 documents (P@10). Throughout our experiments we use the Wilcoxon signed-rank matched-pairs test at the 0.05 significance level to test for statistically significant improvements.

6 Experimental Results

In this section we describe the experiments that we conducted in order to evaluate the usefulness of the proposed method. We mainly focus on the results for the TREC09 data set, as it is the most recent data set and has enough temporal information, which is an important feature for our analysis. However, in order to see the effect of the method on the smaller collections, we briefly report the final results on the TREC07 and TREC08 data sets.

Table 1 shows the evaluation results of Blogger Model on the TREC09 data set. Because the blog data is highly noisy, we carry out a cleaning step on the collection in order to improve the overall performance of the system. We use the cleaning method proposed by Parapar et al. [23]. As we can see in Table 1, cleaning the collection is very useful and improves the MAP of the system by about 14%. The results of Blogger Model on the cleaned data are already better than the best TREC09 submission on the title-only queries.

Table 3. Evaluation results for the implemented models over the TREC08 data set. Models compared: BloggerModel, TopPosts, WFBS, TEMPER-Laplace-Euclidian.

Table 4. Evaluation results for the implemented models over the TREC07 data set. Models compared: BloggerModel, TopPosts, WFBS, TEMPER-Laplace-Euclidian.

Table 2 summarizes the retrieval performance of Blogger Model and the baseline query expansion methods along with different settings of TEMPER on the TREC 2009 data set. The best value in each column is in boldface. A dagger (†), a double dagger (‡) and a star (∗) indicate statistically significant improvement over Blogger Model, TopPosts and WFBS respectively. As can be seen from the table, none of the query expansion baselines improves the underlying Blogger Model significantly.

From Table 2 we can see that TEMPER with different settings (using the Rectangular or Laplace kernel, and the sum or Euclidean similarity method) improves significantly over Blogger Model and the query expansion methods. These results show the effectiveness of the time-based representation of blogs and queries and highlight the importance of time-based similarity calculation between blogs and topics.

In Tables 3 and 4 we present similar results over the TREC08 and TREC07 data sets. Over the TREC08 data set, TEMPER improves significantly over Blogger Model and the different query expansion methods. Over the TREC07 data set, TEMPER improves over Blogger Model significantly. However, the performance of TEMPER is comparable with that of the other query expansion methods, and the difference is not statistically significant.

As mentioned in section 5, we also consider the three standard baselines provided by the TREC10 organizers in order to see the effect of our proposed feedback method on retrieval baselines other than Blogger Model. Table 8 shows the results of TEMPER over the TREC baselines. It can be seen that TEMPER improves the baselines in most of the cases. The only baseline that TEMPER does not improve significantly is stdbaseline1⁵.

Tables 5, 6 and 7 show the performance of TEMPER compared to the best title-only TREC runs in 2009, 2008 and 2007 respectively. It can be seen from the tables that TEMPER performs better than the best TREC runs over the TREC09 data set. The results over TREC08 and TREC07 are comparable to the best TREC runs and can be considered the third and second best reported results over the TREC08 and TREC07 data sets respectively.

⁵ Note that the stdbaselines are used as black boxes and we are not yet aware of the underlying methods.

Table 5. Comparison with the best TREC09 title-only submissions. Runs compared: TEMPER-Laplace-Euclidian, TREC09-rank1 (buptpris 2009), TREC09-rank2 (ICTNET), TREC09-rank3 (USI).

Table 6. Comparison with the best TREC08 title-only submissions. Runs compared: TEMPER-Laplace-Euclidian, TREC08-rank2 (CMU-LTI-DIR), TREC08-rank1 (KLE), TREC08-rank3 (UAms).

TEMPER has four parameters: the number of posts selected for expansion, the number of terms selected for each day, the standard deviation (σ) of the kernel functions, and α, the weight of the initial ranking score. Among these parameters, we fix the number of terms for each day at 50, as used in previous work [4]. The standard deviation of the kernel function is estimated using the top retrieved posts for each query. Since the goal of the kernel function is to model the distribution of the distance between two consecutive relevant posts, we treat the distances between the selected posts (the top retrieved posts) as samples of this distribution. We then use the standard deviation of the sample as an estimate of σ.

The other two parameters are tuned using 10-fold cross validation. Figures 1 and 2 show the sensitivity of the system to these parameters. The best performance is obtained by selecting about 150 posts for expansion, while any number above 50 gives a reasonable result. The value of α depends on the underlying retrieval model. TEMPER outperforms Blogger Model for all values of α; the best-performing value is shown in Fig. 2.

7 Conclusion and Future Work

In this paper we investigated blog distillation, where the goal is to rank blogs according to their recurrent relevance to the topic of the query. We focused on the temporal properties of blogs and their application in query expansion for blog retrieval. Following the intuition that the term distribution for a topic might change over time, we proposed a time-based query expansion technique. We showed that it is effective to have multiple expanded queries for different time points and to score the posts of each time using the corresponding expanded query. Our experiments on different blog collections and different baseline methods showed that this method can improve on state-of-the-art query expansion techniques.

Future work will involve more analysis of the temporal properties of blogs and topics. In particular, modeling the evolution of topics over time can help us to better estimate the topics' relevance models. This modeling over time can be seen as a temporal relevance model, which is an unexplored problem in blog retrieval.

Table 7. Comparison with the best TREC07 title-only submissions. Runs compared: TEMPER-Laplace-Euclidian, TREC07-rank1 (CMU), TREC07-rank2 (UGlasgow), TREC07-rank3 (UMass).

Table 8. Evaluation results for the standard baselines on the TREC09 data set, with statistically significant improvements marked. Rows: each of the three standard baselines (stdbaseline) paired with its TEMPER re-ranking (TEMPER-stdBaseline).

Fig. 1. Effect of number of the posts used for expansion on the performance of TEMPER. (x-axis: number of the posts; y-axis: MAP.)

Fig. 2. Effect of alpha on the performance of TEMPER. (x-axis: alpha; y-axis: MAP; curves: TEMPER, Blogger Model.)

8 Acknowledgement

This work was supported by the Swiss National Science Foundation (SNSF) as the XMI project (ProjectNr /1).

References

1. Mishne, G., de Rijke, M.: A study of blog search. In: Proceedings of ECIR (2006)
2. Macdonald, C., Ounis, I.: Key blog distillation: ranking aggregates. In: Proceedings of CIKM (2008)
3. Macdonald, C., Ounis, I., Soboroff, I.: Overview of the TREC-2007 Blog Track. In: Proceedings of TREC (2008)
4. Elsas, J.L., Arguello, J., Callan, J., Carbonell, J.G.: Retrieval and feedback models for blog feed search. In: Proceedings of SIGIR (2008)
5. Lee, Y., Na, S.H., Lee, J.H.: An improved feedback approach using relevant local posts for blog feed retrieval. In: Proceedings of CIKM (2009)
6. Efron, M., Turnbull, D., Ovalle, C.: University of Texas School of Information at TREC. In: Proceedings of TREC (2008)
7. Seo, J., Croft, W.B.: Blog site search using resource selection. In: Proceedings of CIKM (2008)
8. Balog, K., de Rijke, M., Weerkamp, W.: Bloggers as experts: feed distillation using expert retrieval models. In: Proceedings of SIGIR (2008)
9. Arguello, J., Elsas, J., Callan, J., Carbonell, J.: Document representation and query expansion models for blog recommendation. In: Proceedings of ICWSM (2008)
10. Nunes, S., Ribeiro, C., David, G.: FEUP at TREC 2008 blog track: Using temporal evidence for ranking and feed distillation. In: Proceedings of TREC (2009)
11. Ernsting, B., Weerkamp, W., de Rijke, M.: Language modeling approaches to blog post and feed finding. In: Proceedings of TREC (2007)
12. Weerkamp, W., Balog, K., de Rijke, M.: Finding key bloggers, one post at a time. In: Proceedings of ECAI (2008)
13. Cao, G., Nie, J.Y., Gao, J., Robertson, S.: Selecting good expansion terms for pseudo-relevance feedback. In: Proceedings of SIGIR (2008)
14. Lavrenko, V., Croft, W.B.: Relevance based language models. In: Proceedings of SIGIR (2001)
15. Salton, G.: The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA (1971)
16. Shokouhi, M., Azzopardi, L., Thomas, P.: Effective query expansion for federated search. In: Proceedings of SIGIR (2009)
17. Zhai, C., Lafferty, J.D.: Model-based feedback in the language modeling approach to information retrieval. (2001)
18. Lv, Y., Zhai, C.: Positional language models for information retrieval. In: Proceedings of SIGIR (2009)
19. Gerani, S., Carman, M.J., Crestani, F.: Proximity-based opinion retrieval. In: Proceedings of SIGIR (2010)
20. Keogh, E.J., Pazzani, M.J.: Relevance feedback retrieval of time series data. In: Proceedings of SIGIR (1999)
21. Macdonald, C., Ounis, I., Soboroff, I.: Overview of the TREC-2009 Blog Track. In: Proceedings of TREC (2009)
22. Macdonald, C., Ounis, I.: The TREC Blogs06 collection: Creating and analysing a blog test collection. Department of Computer Science, University of Glasgow Tech Report TR (2006)
23. Parapar, J., López-Castro, J., Barreiro, Á.: Blog Posts and Comments Extraction and Impact on Retrieval Effectiveness. In: Proceedings of the Spanish Conference on Information Retrieval (2010)


More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information
