Subordinating to the Majority: Factoid Question Answering over CQA Sites

Journal of Computational Information Systems 9: 16 (2013)

Xin LIAN, Xiaojie YUAN, Haiwei ZHANG
Institute of Computer Science and Technology, Nankai University, Tianjin, China

Abstract

Question Answering communities such as Yahoo! Answers have emerged as popular media for online information seeking and knowledge sharing. When an asker doesn't choose the best answer, the best answer may be chosen by the voters. Unfortunately, the quality of the submitted questions and answers varies widely, so that a large fraction of the content is not usable for answering queries. There is a growing body of research on best answer selection. However, existing approaches require large amounts of training data or manually labeled data, which limits the applicability of supervised approaches to new sites and domains. In this paper we address this problem by exploiting the similarity between answers. The similarity between any two answers is evaluated with the VSM (vector space model). We regard the similarity as the effect the answers have on each other, and the effect is transmitted by iteration. The iteration stops when the computation reaches a stable state. Finally, the rank of answers depends on the iteration result and the votes of other users. The experimental results show that our approach performs better than the baseline approaches.

Keywords: Community Question Answering; Best Answer Selection; Factoid Question; Answers Similarity

1 Introduction

Community Question Answering (CQA) has become a popular medium for online information seeking and knowledge sharing [1]. In the last few years, many CQA systems have been launched, including Yahoo! Answers, BuyAns, and Live QnA. The content of a CQA site consists of the questions and associated answers submitted on the site. Rather than browsing results of search engines, users present detailed information needs and get direct responses authored by humans. Su et al.
[2] analyzed the quality of answers in QA portals and found that the quality of answers varies significantly. In addition, the ability, or inability, to obtain a high-quality answer has a significant impact on user satisfaction.

(Footnote: Project supported by the National Nature Science Foundation of China (No ). Corresponding author. Email address: forwarding82@gmail.com (Xiaojie YUAN). Copyright 2013 Binary Information Press. DOI: /jcis7716. August 15, 2013.)

Many previous approaches can be classified into three categories. 1) Probabilistic approaches [3, 4, 5, 6]: These methods study the content of CQA sites, which includes analysis of the

content and the quality of the questions and answers. Some methods analyze the reputation systems and social norms on these sites. 2) Link-based approaches [7, 8]: These methods assume that good users give good answers. Therefore, their target is to find experts for a given question through user networks. They apply link-analysis algorithms such as PageRank [14] and HITS [15] to identify users with high expertise. 3) Learning-based approaches [9, 10, 11]: These methods extract various features from questions, answers, and the users who posted them, and train a number of classifiers to select the best answer using those features.

In short, existing methods either require large amounts of supervision or focus only on the network properties of the CQA site. Some methods consider the content similarity between questions and answers, but not the content similarity between the answers themselves.

In this paper we present a ranking framework that takes advantage of the similarity between answers to retrieve high-quality answers for factoid questions. For a factoid question, the best answer is generally definite, and the majority of people give similar answers. Our goal is to find the most supported answer using the similarity between answers and the votes of other users. We construct a similarity matrix by computing the similarity between any two answers, which is evaluated with the VSM (vector space model). The score of an answer is the expected score of the answers it is similar to. The effect factor between two answers is described by their similarity, and it is transmitted by iteration. The iteration stops when the computation reaches a stable state. Then the scores of answers are modified with the vote information. The experimental results show that our approach performs better than the baseline approaches.
To our knowledge, this is the first method that requires neither training data nor manually labeled data, which makes it suitable for the large number of question-answer pairs on CQA sites.

The rest of this paper is organized as follows. Section 2 reviews prior work related to our approach. Section 3 details the proposed method, including its algorithms. Section 4 reports on the performance study. Finally, we conclude the paper in Section 5.

2 Related Work

Probabilistic approaches focus on the content of CQA sites. Bian et al. [4] utilized users' interactions to retrieve relevant high-quality content in social media. They explored an algorithm that integrates relevance, user interaction, and community feedback information to find the right factual, well-formed content to answer a user's question. Wang et al. [5] assumed that answers were connected to their questions with various types of latent links, and proposed an analogical reasoning-based approach which measured the analogy between the new question-answer linkages and those of previous relevant knowledge which contained only positive links; the candidate answer which had the most analogous link was assumed to be the best answer.

Link-based methods have been shown to be successful for several tasks in social media. Their target is to discover users' authority through user networks, which is also called expert finding. Jurczyk et al. [7] and Zhang et al. [8] evaluated the link-analysis algorithms PageRank and HITS to rank users based on their authority scores. The difference is that Zhang et al. applied them to a small data set.

Some researchers resorted to machine learning techniques. Jeon et al. [9] extracted a number of non-textual features which cover the contextual information of questions and their answers, and proposed a language modeling-based retrieval model for processing these features in order

to predict the quality of answers collected from a specific CQA service. Agichtein et al. [12] introduced a general classification framework for combining the evidence from different sources of information, which can be tuned automatically for a given social media type and quality definition. Blooma et al. [13] proposed more features, textual and non-textual, and used regression analyzers to generate predictive features for best answer identification. Shah and Pomerantz [11] measured the quality of answers in CQA sites by extracting various features and training a number of classifiers to select the best answer using those features. Their definition of quality is akin to popularity. Bian et al. [10] developed a semi-supervised coupled mutual reinforcement framework for simultaneously calculating content quality and user reputation, which requires relatively few labeled examples to initialize the training process.

Closest to our work, Ko et al. [3] focused on developing a unified framework that not only used multiple resources for validating answer candidates, but also considered evidence of similarity among answer candidates in order to boost the ranking of the correct answer. In another paper [16], they applied a probabilistic graphical model for answer ranking in question answering. This model estimated the joint probability of correctness of all answer candidates, from which the probability of correctness of an individual candidate can be inferred. The joint prediction model can estimate both the correctness of individual answers and their correlations, which enables a list of accurate and comprehensive answers. However, both methods need training data.

3 Prediction Model

In this section we describe how to find the best answer to a factoid question on CQA sites. We start with a more precise definition of the problem of best answer retrieval.
3.1 Problem definition

In QA systems, there is a very large number of questions and answers posted by a diverse community of users. One posted question can attract several answers from a number of different users. For a factoid question, the best answer is generally definite, and there is some similarity between the answers. Our goal is to find the most supported answer using this similarity. The most supported answer is regarded as the best answer.

Definition 1 (Score of answers) The score of an answer $A_i$ (denoted by $score(A_i)$) in an answer set $A$ is the probability of $A_i$ being the best answer.

We abstract the social content in a QA system as a set of question-answer triples:

$$\langle q, A, V \rangle$$

where $q$ is one of the factoid questions in the whole archive of the QA system, $A$ is the answer set for this question, and $V$ is the vote set corresponding to the answers. Each answer has a count of positive votes (thumbs up) and a count of negative votes (thumbs down). For $A_i$, the $i$th answer to question $q$, the vote information is:

$$V_i = \langle upnum, downnum \rangle$$
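As an illustration only, the triple $\langle q, A, V \rangle$ could be held in a small data structure such as the one below. The class names `Vote` and `QuestionRecord`, and the choice to default missing votes to zero counts, are assumptions made for this sketch and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Vote:
    """Vote information V_i = <upnum, downnum> for one answer."""
    upnum: int = 0
    downnum: int = 0


@dataclass
class QuestionRecord:
    """Hypothetical container for one triple <q, A, V>."""
    q: str                                        # the factoid question
    A: List[str]                                  # answer texts A_1 .. A_n
    V: List[Vote] = field(default_factory=list)   # one Vote per answer

    def __post_init__(self) -> None:
        # Assumption: answers without recorded votes start at <0, 0>.
        if not self.V:
            self.V = [Vote() for _ in self.A]
        if len(self.V) != len(self.A):
            raise ValueError("A and V must have the same length")


record = QuestionRecord(
    q="What is the capital of France?",
    A=["Paris", "It is Paris, France", "Lyon"],
    V=[Vote(3, 0), Vote(1, 1), Vote(0, 2)],
)
```

Keeping `A` and `V` as parallel lists mirrors the indexing used in the definitions above, where $V_i$ belongs to $A_i$.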

We will first discuss how to describe the similarity between answers using their text content, and then discuss a support matrix for iterative computation. Finally, we discuss how to integrate vote information for best answer retrieval.

3.2 Support matrix

For question $q$, the similarity between answer $A_i$ and answer $A_j$ is described by the text similarity $sim(A_i, A_j)$, which is based on the VSM (vector space model). We extract nouns as index terms. For answer set $A$, we compute the similarity $sim(A_i, A_j)$ between any two answers $(A_i, A_j, i \neq j)$. Since $sim(A_i, A_j) = sim(A_j, A_i)$ and $sim(A_i, A_i) = 0$, the time complexity is $\frac{n(n-1)}{2}$ similarity computations, where $n$ is the number of answers in $A$. The similarity matrix is defined as follows:

$$M_{sim} = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1n} \\ s_{21} & s_{22} & \cdots & s_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ s_{n1} & s_{n2} & \cdots & s_{nn} \end{pmatrix}, \quad \text{where } s_{ij} = sim(A_i, A_j) \qquad (1)$$

For $A_i$ and $A_j$, $A_i$ may be similar only to $A_j$ while $A_j$ is similar to many answers. The support degree in the two directions is therefore different. We normalize the similarity matrix to describe the support degree from other answers, which gives the support matrix $M_{sup}$:

$$M_{sup} = \begin{pmatrix} t_{11} & t_{12} & \cdots & t_{1n} \\ t_{21} & t_{22} & \cdots & t_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ t_{n1} & t_{n2} & \cdots & t_{nn} \end{pmatrix}, \quad \text{where } t_{ij} = \frac{s_{ij}}{\sum_{k=1}^{n} s_{kj}}, \; s_{ij} \in M_{sim} \qquad (2)$$

$M_{sim}$ focuses on the relationship between two answers, while $M_{sup}$ considers the effect of the other answers. In $M_{sup}$, in general $t_{ij} \neq t_{ji}$.

3.3 Iterative computation

An answer is ranked higher when more answers are similar to it. An answer that is supported by many answers with high scores receives a high rank itself. If no answer is similar to an answer, there is no support for that answer. As in authority-hub analysis and PageRank, BestFinder adopts an iterative method to compute the scores of answers. Initially, it has very little information about the answers. At each iteration BestFinder updates the scores of answers. Finally, it stops when the computation reaches a stable state.
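The construction of $M_{sim}$ and $M_{sup}$ in Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions: the paper indexes nouns, whereas this sketch uses every lowercase alphanumeric token as an index term (no part-of-speech tagger is involved), it uses raw term frequencies with cosine similarity as the VSM measure, and it reads Eq. (2) as dividing each entry by the sum of its column. The function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def term_vector(text: str) -> Counter:
    # Assumption: all alphanumeric tokens serve as index terms (the paper uses nouns).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(answers: List[str]) -> List[List[float]]:
    """M_sim: symmetric, zero diagonal, n(n-1)/2 similarity computations."""
    vectors = [term_vector(a) for a in answers]
    n = len(answers)
    m_sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m_sim[i][j] = m_sim[j][i] = cosine(vectors[i], vectors[j])
    return m_sim


def support_matrix(m_sim: List[List[float]]) -> List[List[float]]:
    """M_sup: t_ij = s_ij / sum_k s_kj (column-normalized reading of Eq. 2)."""
    n = len(m_sim)
    col_sums = [sum(m_sim[k][j] for k in range(n)) for j in range(n)]
    return [
        [m_sim[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0 for j in range(n)]
        for i in range(n)
    ]


answers = ["paris france", "paris", "france capital"]
m_sim = similarity_matrix(answers)
m_sup = support_matrix(m_sim)
```

In this toy input the first answer overlaps with both of the others while the second overlaps only with the first, so `m_sup[0][1]` and `m_sup[1][0]` differ even though `m_sim` is symmetric.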
The score of an answer is the expected score of the answers it is similar to. For answer $A_i$, we compute its score $score(A_i)$ as the average score of the answers that have a nonzero support degree to $A_i$:

$$score(A_i) = \frac{\sum_{j=1}^{m} score(A_j) \cdot M_{sup}[j, i]}{m}, \quad M_{sup}[j, i] > 0, \; j \neq i \qquad (3)$$
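The update in Eq. (3), repeated until the scores stop changing, can be sketched as below. This is an illustrative reading rather than the authors' implementation: the support matrix is passed in as a plain list of lists, every answer starts from the uniform score $1/n$, and the change between iterations is measured as the Euclidean distance between the old and new score vectors. The tolerance `tol` and the cap `max_iter` are assumptions introduced for the sketch, as are the function names.

```python
import math
from typing import List


def update_scores(m_sup: List[List[float]], old: List[float]) -> List[float]:
    """One pass of Eq. (3): average of score(A_j) * M_sup[j][i] over supporters j."""
    n = len(old)
    new = [0.0] * n
    for i in range(n):
        total = 0.0
        m = 0  # number of answers with nonzero support degree to A_i
        for j in range(n):
            if j != i and m_sup[j][i] > 0:
                total += old[j] * m_sup[j][i]
                m += 1
        if m > 0:
            new[i] = total / m
    return new


def iterate_scores(m_sup: List[List[float]],
                   tol: float = 1e-6,
                   max_iter: int = 100) -> List[float]:
    """Repeat the update until the Euclidean change falls below tol (assumed threshold)."""
    n = len(m_sup)
    scores = [1.0 / n] * n  # uniform initial state
    for _ in range(max_iter):
        new = update_scores(m_sup, scores)
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(new, scores)))
        scores = new
        if change < tol:
            break
    return scores


# Hand-made support matrix for three answers (each nonzero column sums to 1).
example_sup = [
    [0.0, 1.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
first_pass = update_scores(example_sup, [1 / 3, 1 / 3, 1 / 3])
stable = iterate_scores(example_sup)
```

Because of the division by the number of supporters, the raw scores in this sketch can shrink from one iteration to the next, so the useful signal is the relative size of the scores rather than their absolute values.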

$m$ is the number of answers that are similar to $A_i$. We choose the initial state in which all answers have a uniform score $s_0$ ($s_0$ is set to $1/n$, where $n$ is the number of answers). In each iteration, BestFinder improves the score of high-quality answers while reducing the score of low-quality answers. It stops iterating when it reaches a stable state. The stability is measured by the change in the scores of all answers. If the scores change only a little after an iteration, BestFinder stops.

Algorithm 1: Iterative computation function
Input: support matrix M_sup[n][n], answers' old score array oldscore[n]
Output: answers' new score array newscore[n]
for i = 0; i < n; i++ do
    newscore[i] = 0;
    nonzeronum = 0;
    for j = 0; j < n; j++ do
        /* count the number of answers which are similar to answer A_i */
        if M_sup[j][i] > 0 then
            newscore[i] = newscore[i] + oldscore[j] * M_sup[j][i];
            nonzeronum = nonzeronum + 1;
        end
    end
    /* for the expected score */
    if nonzeronum > 0 then
        newscore[i] = newscore[i] / nonzeronum;
    end
end

3.4 Answers score

In Yahoo! Answers, after reading the existing answers to a question, a user can evaluate them. If the user considers an answer useful, he or she can add a plus vote to it. Otherwise, a minus vote may be added to the answer. If the asker doesn't choose the best answer after some fixed period of time, the best answer may be chosen by the voters. Therefore, vote information from others is an important factor for answer selection. The answer's score should integrate the answer similarity and the vote information. We introduce two effect factors $\alpha$ and $\beta$ to describe the degree of effect of answer similarity and of others' votes. Then we can define the answer's score as follows:

$$score(A_i) = \alpha \cdot \frac{score(A_i)}{\sum_{k=1}^{n} score(A_k)} + \beta \cdot \frac{V_i^{up} + 1}{V_i^{up} + V_i^{down} + 2}, \quad \alpha + \beta = 1 \qquad (4)$$

Here the scores on the right-hand side are those produced by the iterative computation, and $V_i^{up}$ and $V_i^{down}$ are the numbers of positive and negative votes of $A_i$. In the experiment, we set $\alpha = 0.6$, $\beta = 0.4$.
Some answers have no votes. In this case, the vote support of the answer is 0.5; that is, the probability of being supported is the same as the probability of being opposed. This is why we add a constant 2 to the denominator and a constant 1 to the numerator. For an answer set $A$, if there is no similarity between any two answers, the scores of answers are decided by the vote term alone. Finally, the answer with the maximal score is the best answer.
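The combination of iteration scores and votes described in this section can be sketched as follows. The defaults $\alpha = 0.6$ and $\beta = 0.4$ are the values reported for the experiment; the function names, the representation of votes as `(upnum, downnum)` tuples, and the handling of an all-zero score vector (the similarity term is simply dropped) are assumptions of this sketch.

```python
from typing import List, Tuple


def vote_support(upnum: int, downnum: int) -> float:
    """Smoothed vote ratio: (up + 1) / (up + down + 2); equals 0.5 with no votes."""
    return (upnum + 1) / (upnum + downnum + 2)


def final_scores(iter_scores: List[float],
                 votes: List[Tuple[int, int]],
                 alpha: float = 0.6,
                 beta: float = 0.4) -> List[float]:
    """Weighted sum of the normalized iteration score and the vote support."""
    total = sum(iter_scores)
    result = []
    for s, (up, down) in zip(iter_scores, votes):
        # Assumption: with no similarity at all (total == 0) only the votes count.
        sim_part = s / total if total > 0 else 0.0
        result.append(alpha * sim_part + beta * vote_support(up, down))
    return result


def best_answer(iter_scores: List[float], votes: List[Tuple[int, int]]) -> int:
    """Index of the answer with the maximal final score."""
    scores = final_scores(iter_scores, votes)
    return max(range(len(scores)), key=scores.__getitem__)


example_scores = final_scores([0.2, 0.1, 0.1], [(0, 0), (3, 1), (0, 2)])
example_best = best_answer([0.2, 0.1, 0.1], [(0, 0), (3, 1), (0, 2)])
```

In this made-up example the first answer has no votes, so its vote support is 0.5, and its larger share of the iteration score is enough to keep it ahead of the second answer despite that answer's better vote ratio.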

4 Experimental Evaluation

We now describe the measures used for the evaluation, the dataset, and the experimental results. For evaluation, we consider an answer to be a high-quality answer if the asker chose it as the best answer and gave it a rating of at least 3.

4.1 Datasets

We use the same dataset as [10]. The authors of [10] used the TREC QA benchmarks to crawl QA archives and related user information. This was done by submitting TREC QA queries to the CQA site and retrieving the returned questions, answers and related users. The factoid questions are from seven years of the TREC QA track evaluations (years ). Each TREC query was submitted to the Yahoo! Answers web service, and up to 10 top-ranked related questions were retrieved according to the Yahoo! Answers ranking. The details of the data collection can be found in [10]. There are, in total, users, questions and answers. Note that, although the proportion of factoid questions in Yahoo! Answers may not be large, we use them in order to have an objective metric of correctness, and extrapolate performance to whole QA archives.

4.2 Evaluation metrics

We consider an answer to be a high-quality answer if the asker chose it as the best answer and gave it a rating of at least 3. Therefore, there is only one correct answer for a question. Two metrics were used for the evaluation. One is Accuracy: over a set of questions, Accuracy reports the fraction of answers ranked first that were chosen as the best answer. We used the best answer tagged by the Yahoo! Answers web site as the ground truth. Since Accuracy ignores the exact rank of a correct answer, we also used the Mean Reciprocal Rank (MRR) metric to complement it. The MRR of each individual query is the reciprocal of the rank at which the first relevant answer was returned, or 0 if none of the top N results contained a relevant answer.
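The two metrics described above can be sketched as follows, assuming each question comes with a ranked list of answer identifiers and a single ground-truth best answer. The function names, the optional `top_n` cut-off parameter, and the toy identifiers are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def reciprocal_rank(ranked: Sequence[str],
                    correct: str,
                    top_n: Optional[int] = None) -> float:
    """1 / rank of the correct answer, or 0 if it is not within the top N."""
    candidates = ranked if top_n is None else ranked[:top_n]
    for rank, answer_id in enumerate(candidates, start=1):
        if answer_id == correct:
            return 1.0 / rank
    return 0.0


def accuracy(rankings: List[Sequence[str]], gold: List[str]) -> float:
    """Fraction of questions whose top-ranked answer is the best answer."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if ranked and ranked[0] == g)
    return hits / len(rankings)


def mean_reciprocal_rank(rankings: List[Sequence[str]],
                         gold: List[str],
                         top_n: Optional[int] = None) -> float:
    """Mean of the per-question reciprocal ranks."""
    total = sum(reciprocal_rank(r, g, top_n) for r, g in zip(rankings, gold))
    return total / len(rankings)


# Toy data: three questions, each with a ranking and one best answer.
rankings = [["a", "b", "c"], ["b", "a"], ["c", "a", "b"]]
gold = ["a", "a", "b"]
acc = accuracy(rankings, gold)
mrr = mean_reciprocal_rank(rankings, gold)
mrr_top2 = mean_reciprocal_rank(rankings, gold, top_n=2)
```

On the toy data only the first question has its best answer ranked first, so Accuracy counts one hit out of three, while MRR still gives partial credit to the other two rankings.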
The score for a sequence of queries is the mean of the individual queries' reciprocal ranks. Thus, MRR is calculated as:

$$MRR = \frac{1}{|Q_r|} \sum_{q \in Q_r} \frac{1}{r_q}$$

where $Q_r$ is the set of test queries and $r_q$ is the rank of the first relevant answer for query $q$.

4.3 Methods compared

To our knowledge, ours is the first method that requires neither training data nor manually labeled data. To evaluate answer quality, we compare our method with the following baseline methods:

Baseline BestRatio: Answers are ranked by the best answer ratio of their answerers. The best answer ratio is the fraction of an answerer's answers that were chosen as the best answer. It indicates an answerer's authority.

Baseline Votes: Answers are ranked by the score computed as the difference between the thumbs-up votes and thumbs-down votes received by each answer. This ranking closely approximates

the ranking obtained when a user clicks the "order by votes" option on the Yahoo! Answers site. The details of this method and how to compute MRR under this setting are discussed in [4].

4.4 Experimental results

Figure 1 illustrates the performance of our method and the baselines with a varying number of candidate answers. BestFinder significantly outperformed the baselines for questions with fewer than 5 candidate answers. Baseline Votes is stable with more than 8 candidate answers. BestFinder is less effective with more candidate answers. This is due to the complexity of CQA sites. There may be several correct answers, but the system requires only one best answer. The choice of the best answer depends on the asker, who may bring in subjective factors. In contrast to traditional QA, answerers give more detailed descriptions, which decreases the weight of keywords. In fact, most questions have fewer than 5 answers. Therefore, BestFinder is more effective overall.

Fig. 1: MRR and Accuracy of BestFinder and baselines for varying number of candidate answers

Figure 2 shows the change in answers' scores after each iteration, which is defined as the Euclidean distance between the old and new scores. We can see that BestFinder@number of answers converges at a steady speed. Therefore, BestFinder doesn't require many iterations to reach a stable state.

Fig. 2: Changes of answers' scores after each iteration

5 Conclusions

We presented a framework for unsupervised best answer selection for factoid questions in Community Question Answering. We regard the similarity between any two answers as their effect on

each other, and the effect is transmitted by iteration. The iteration stops when the computation reaches a stable state. Finally, the rank of answers depends on the iteration result and the votes of other users. We have demonstrated the effectiveness of BestFinder in large-scale experiments on a CQA dataset comprising over 100,000 users, 27,000 questions and 200,000 answers. In contrast to supervised methods, BestFinder requires neither training data nor manually labeled data. In addition, our experiments demonstrate significant improvements over the baselines, especially for questions with fewer answers.

References

[1] L. A. Adamic, J. Zhang, E. Bakshy and M. S. Ackerman. Knowledge sharing and yahoo answers: Everyone knows something. In: Proc. of WWW, 2008, pp
[2] Q. Su, D. Pavlov, J. Chow and W. Baker. Internet-scale collection of human-reviewed data. In: Proc. of WWW, 2007, pp
[3] J. Ko, L. Si and E. Nyberg. A probabilistic framework for answer selection in question answering. In: Proc. of NAACL HLT, 2007, pp
[4] J. Bian, Y. Liu, E. Agichtein and H. Zha. Finding the right facts in the crowd: Factoid question answering over social media. In: Proc. of WWW, 2008, pp
[5] X. Wang, X. Tu, D. Feng and L. Zhang. Ranking community answers by modeling question-answer relationships via analogical reasoning. In: Proc. of SIGIR, 2009, pp
[6] J. Liu, S. Wang, Y. Peng, X. Huang and W. Wang. Answer Extraction of Chinese Restricted Domain Question Answering System Based on Ontology. Journal of Computational Information Systems, 2010, 6(1),
[7] P. Jurczyk and E. Agichtein. Discovering authorities in question answer communities by using link analysis. In: Proc. of ACM CIKM, 2007, pp
[8] J. Zhang, M. S. Ackerman and L. Adamic. Expertise networks in online communities: structure and algorithms. In: Proc. of WWW, 2007, pp
[9] J. Jeon, W. Croft, J. Lee and S. Park.
A framework to predict the quality of answers with nontextual features. In: Proc. of SIGIR HLT, 2006, pp
[10] J. Bian, Y. Liu, D. Zhou, E. Agichtein and H. Zha. Learning to recognize reliable users and content in social media with coupled mutual reinforcement. In: Proc. of WWW, 2009, pp
[11] C. Shah and J. Pomerantz. Evaluating and predicting answer quality in community QA. In: Proc. of SIGIR, 2010, pp
[12] E. Agichtein, C. Castillo, D. Donato, A. Gionis and G. Mishne. Finding high-quality content in social media with an application to community-based question answering. In: Proc. of WSDM, 2008, pp
[13] M. Blooma, A. Chua and D. Goh. A predictive framework for retrieving the best answer. In: Proc. of SAC, 2008, pp
[14] L. Page, S. Brin, R. Motwani and T. Winograd. The pagerank citation ranking: Bringing order to the web. In: Technical report, Stanford Digital Library Technologies Project,
[15] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999, 46(5),
[16] J. Ko, L. Si and E. Nyberg. A probabilistic graphical model for joint answer ranking in question answering. In: Proc. of SIGIR, 2007, pp