Ranking Community Answers by Modeling Question-Answer Relationships via Analogical Reasoning

Transcription

1 Ranking Community Answers by Modeling Question-Answer Relationships via Analogial Reasoning Xin-Jing Wang Mirosoft Researh Asia 4F Sigma, 49 Zhihun Road Beijing, P.R.China Xudong Tu,Dan Feng Huazhong Si.&Teh. Univ Luoyu Road, Wu Han Hu Bei, P.R.China Lei Zhang Mirosoft Researh Asia 4F Sigma, 49 Zhihun Road Beijing, P.R.China ABSTRACT The method of finding high-quality answers has signifiant impat on user satisfation in ommunity question answering systems. However, due to the lexial gap between questions and answers as well as spam typially existing in usergenerated ontent, filtering and ranking answers is very hallenging. Previous solutions mainly fous on generating redundant features, or finding textual lues using mahine learning tehniques; none of them ever onsider questions and their answers as relational data but instead model them as independent information. Moreover, they only onsider the answers of the urrent question, and ignore any previous knowledge that would be helpful to bridge the lexial and semanti gap. We assume that answers are onneted to their questions with various types of latent links, i.e. positive links indiating high-quality answers, negative links indiating inorret answers or user-generated spam, and propose an analogial reasoning-based approah whih measures the analogy between the new question-answer linkages and those of previous relevant knowledge whih ontains only positive links; the andidate answer whih has the most analogous link is assumed to be the best answer. We onduted experiments based on 29.8 million Yahoo!Answer question-answer threads and showed the effetiveness of our approah. Categories and Subjet Desriptors H.3.3 [Information Storage and Retrieval]: Information Searh and Retrieval information filtering, searh proess; G.3 [Probability and Statistis]: orrelation and regression analysis; H.3.5 [Information Storage and Retrieval]: Online Information Servies web-based servies General Terms Algorithms, Experimentation This work was performed in MSRA. Permission to make digital or hard opies of all or part of this work for personal or lassroom use is granted without fee provided that opies are not made or distributed for profit or ommerial advantage and that opies bear this notie and the full itation on the first page. To opy otherwise, to republish, to post on servers or to redistribute to lists, requires prior speifi permission and/or a fee. SIGIR 09, July 19 23, 2009, Boston, Massahusetts, USA. Copyright 2009 ACM /09/07...$5.00. Keywords Community Question Answering, Analogial Reasoning, Probabilisti Relational Modeling, Ranking 1. ITRODUCTIO User-generated ontent (UGC) is one of the fastest-growing areas of the use of the Internet. Suh ontent inludes soial question answering, soial book marking, soial networking, soial video sharing, soial photo sharing, et. UGC web sites or portals providing these servies not only onnet users diretly to their information needs, but also hange everyday users from ontent onsumers to ontent reators. One type of UGC portals that has beome very popular in reent years is the ommunity question-answering (CQA) sites, whih have attrated a large number of users both seeking and providing answers to a variety of questions on diverse subjets. For example, by Deember, 2007, one popular CQA site, Yahoo!Answers, had attrated 120 million users worldwide, and had 400 million answers to questions available [21]. A typial harateristi of suh sites is that they allow anyone to post or answer any questions on any subjet, whih intuitively results in high variane in the quality of answers, e.g. UGC data typially has many ontent spam produed for fun or for profit. Thus, the ability, or inability, to obtain a high-quality answer has signifiant impat on user satisfation [28]. Distinguishing high-quality answers from others in CQA sites is not a trivial task, e.g. a lexial gap typially exists between a question and its high-quality answers. The lexial gap in ommunity questions and answers is aused by at least two fators: (1) textual mismath between questions and answers; and (2) user-generated spam or flippant answers. In the first ase, questions and answers are generally short, and the words that appear in a question are not neessarily repeated in its high-quality answers. Moreover, a word itself an be ambiguous or have multiple meanings, e.g., apple an either refer to apple omputer or apple the fruit. Meanwhile, the same onept an be desribed with different words, e.g. ar and automobile. In the seond ase, user-generated spam and flippant answers usually have a negative effet and greatly inrease the number of answers, thereby make it diffiult to identify the highquality ones. Figure 1 gives several examples of the lexial gap between questions and answers. To bridge the lexial gap for better answer ranking, various tehniques have been proposed. Conventional teh- 179

2 niques for filtering answers primarily fous on generating omplementary features provided by highly strutured CQA sites [1, 4, 5, 14, 22], or finding textual lues using mahinelearning tehniques [2, 3, 19, 20], or identifying user authority via graph-based link analysis whih assumes that authoritative users tend to generate high-quality answers [17]. There are two disadvantages of these work. Firstly, only the answers of the new question are taken into onsideration. Intuitively it suffers greatly from the word ambiguity sine questioners and answerers may use different words to desribe the same objets (e.g. ar and automobile ). Seondly, questions and answers are assumed as independent information resoures and their impliit orrelations are ignored. We argue that questions and answers are relational. Reall the traditional QA approahes based on natural language proessing (LP) tehniques [25] whih attempted to disover natural language properties suh as targeted answer format, targeted answer part-of-speeh, et., and used them as lues to figure out right answers. These are examples of impliit semanti lues, whih is more valuable for right answer detetion than the lexial lues suggested by terms. We address these two problems in this study. In the offline stage, a large arhive of questions and their answers are olleted as a knowledge base. We then assume that there are various types of latent linkages between questions and answers, e.g. the high-quality answers are onneted to their questions via semanti links, the spam or flippant answers are via noisy links, while low-quality or inorret answers are through inferior links, et. We denote the links assoiated with high-quality answers as positive links and the rest as negative ones and train a Bayesian logisti regression model for link predition. By this means we are able to expliitly model the impliit q-a relationships. In the online stage, given a new question, we retrieve a set of relevant q-a pairs from the knowledge base with the hope that they would over the voabulary of the new question and its answers. These q-a pairs onstrut the prior knowledge on a similar topi to help bridge the lexial gap. Moreover, to disover the semanti lues, instead of prediting diretly based on the new question and its answers, we measure the analogy of a andidate linkage to the links embedded in the prior knowledge, and the most analogous link indiates the best answer. This is what we all analogial reasoning(ar). Intuitively, taking an ideal ase as an example, if the rawled knowledge base ontains all questions in the world, to identify the best answer for a new question, we an just find the dupliate question in the knowledge base and use its best answer to evaluate a andidate answer s quality. However, sine it is impratial to obtain all available questions as new questions are being submitted every day, we an instead searh for a set of previously answered questions that best math the new question aording to some speified metris and disover analogous relationships between them. Intuitively, the retrieved similar questions and their bestanswers are more likely to have words in ommon with the orret answers to the new question. Moreover, the retrieved set provides the knowledge of how the questions similar to the new question were answered, while the way of answering an be regarded as a kind of positive linkage between a question and its best answer. We rawled 29.8 million Yahoo!Answers questions to evaluate our proposed approah. Eah question has about 15 answers on average. We used 100,000 q-a pairs to train Q: How do you pronoune the Congolese ity Kinshasa? A: Kin as in your family sha as in sharp sa as in sargent. Q: What is the minimum positive real number in Matlab? A: Your IQ. Q: How will the sun enigma affet the earth? A: Only God truly knows the mysteries of the sun, the universe and the heavens. They are His serets! Figure 1: Real examples from Yahoo!Answers, whih suggest the lexial gap between questions and answers. the link predition model and tested the entire proess with about 200,000 q-a pairs. We ompared our method with two widely adopted information retrieval metris and one stateof-the-art method, and ahieved signifiant performane improvement in terms of average preision and mean reiproal rank. These experimental results suggest that taking the struture of relational data into aount is very helpful in the noisy environment of a CQA site. Moreover, the idea of leveraging previous knowledge to bridge the lexial gap and ranking a doument with ommunity intelligene is more effetive than traditional approahes whih only rely on individual intelligene. The paper is organized as follows. In Setion 2, the most related work is disussed, whih overs the state-of-the-art approahes on ommunity-driven answer ranking. In Setion 3, we detail the proposed approah. We evaluate its performane in Setion 4, and disuss several fators whih affet the ranking performane. We onlude our paper in Setion 5 with disussions on future work. 2. RELATED WORK 2.1 Community Answer Quality Ranking Different from traditional QA systems whose goal is to automatially generate an answer for the question of interest [25, 16], the goal of ommunity answer ranking, however, is to identify from a losed list of andidate answers one or more that semantially answers the orresponding question. Sine CQA sites have rih strutures, e.g. question/answer assoiations, user-user interations, user votes, et., they offer more publily available information than traditional QA domains. Jeon et al. [14] extrated a number of non-textual features whih over the ontextual information of questions and their answers, and proposed a language modeling-based retrieval model for proessing these features in order to predit the quality of answers olleted from a speifi CQA servie. Agihtein and his olleagues [1, 23] made great efforts on finding powerful features inluding strutural, textual, and ommunity features, and proposed a general lassifiation framework for ombining these heterogeneous features. Meanwhile, in addition to identifying answer quality, they also evaluated question quality and users satisfation. Blooma et al. [5] proposed more features, textual and nontextual, and used regression analyzers to generate preditive features for the best answer identifiation. Some researhers resorted to mahine learning tehniques. Ko et al. [19] proposed a unified probabilisti answer rank- 180

3 Figure 2: Sketh of the AR-based approah: 1)given a question Q (green blok), some previous positive q-a pairs are retrieved (red dots in dotted blue irle); 2) eah andidate q-a pair is sored aording to how well it fits into the previous knowledge w.r.t. their linkages properties. The magenta blok highlights the highest-sored q-a pair andidate, whih is assumed to ontain the best answer. ing model to simultaneously address the answer relevane and answer similarity problems. The model used logisti regression to estimate the probability that a andidate answer is orret given its relevane to the supporting evidene. A disadvantage of this model is that it onsidered eah andidate answer separately. To solve this problem, the authors improved their solution and proposed a probabilisti graphial model to take into aount the orrelation of andidate answers [20]. It estimated the joint probability of the orretness of all answers, from whih the probability of orretness of an individual answer an be inferred, and the performane was improved. Bian et al. [3] presented the GBRank algorithm whih utilizes users interations to retrieve relevant high-quality ontent in soial media. It explored the mehanism to integrate relevane, user interation, and ommunity feedbak information to find the right fatual, well-formed ontent to answer a user s question. Then they improved the approah by expliitly onsidering the effet of maliious users interations, so that the ranking algorithm is more resilient to vote manipulation or shilling [2]. Instead of diretly solving the answer ranking problem, some researhers proposed to find experts in ommunities with the assumption that authoritative users tend to produe high quality ontent. For example, Jurzyk et al. [17] adapted the HITS [18] algorithm to a CQA portal. They ran this algorithm on the user-answer graph extrated from online forums and showed promising results. Zhang et al. [30] further proposed ExpertiseRank to identify users with high expertise. They found that the expertise network is highly orrelated to answer quality. 2.2 Probabilisti Relational Modeling Probabilisti Relational Models [11] are Bayesian etworks whih simultaneously onsider the onepts of objets, their properties, and relations. Getoor et al. [10] inorporated models of link predition in relational databases, and we adopt the same idea to learn the logisti regression model for link predition. 3. THE APPROACH 3.1 Proess Overview Figure 2 shows the entire proess of our approah. The red dots in the CQA Arhives stand for positive q-a pairs whose answers are good, while the blak rosses represent negative pairs whose answers are not good (not neessarily noisy). Eah q-a pair is represented by a vetor of textual and non-textual features as listed in Table 2. In the offline stage, we learn a Bayesian logisti regression model based on the rawled QA arhive, taking into aount both positive and negative q-a pairs. The task of the model is to estimate how likely a q-a pair ontains a good answer. In the online stage, a supporting set (enlosed by the dotted blue irle in Figure 2) of positive q-a pairs is first retrieved from the CQA arhives using only the new question Q (the green blok) as a query. The supporting set, along with the learnt link predition model, is used for soring and ranking eah new q-a pair, and the top-ranked q-a pair (the magenta blok) ontains the best answer. Some real q-a examples are given to better illustrate the idea. 3.2 Learning the Link Predition Model Modeling latent linkage via logisti regression We train a Bayesian logisti regression (BLR) model with finite-dimensional parameters for latent linkage predition and set multivariate Gaussian priors for the parameters. The advantage of introduing a prior is that it helps to integrate over funtion spaes. Formally, let X ij = [Φ 1(Q i, A j ), Φ 2(Q i, A j ),..., Φ K(Q i, A j )] be a K-dimensional feature vetor of the pair of question Q i and answer A j, where Φ defines the mapping Φ : Q A R K. Let C ij {0, 1} be an indiator of linkage types, where C ij = 1 indiates a positive linkage, i.e. A j is a high-quality answer of Q i, and C ij = 0 otherwise. Let Θ = [Θ 1, Θ 2,..., Θ K] be the parameter vetor of the logisti regression model to be learnt, i.e. P(C ij = 1 X ij, Θ) = exp(θ T X ij ) (1) 181

4 Figure 3: (a) The plate model of the Bayesian logisti regression model. (b) When C = 1, Q and A is positively linked whih shares information embedded in Θ, represented by the undireted edges. where X ij X train = {X ij, 1 i D q, 1 j D a} is a training q-a pair and D q, D a are the number of training questions and answers respetively. Generally we have D a D q. Figure 3(a) shows the plate model. The C node is introdued to expliitly model the fators that link a question Q and an answer A. Sine C represents a link while Q and A represent ontent, this model seamlessly reinfores the ontent and linkage in relational data. If C = 1 is observed, as shown in Figure 3(b), it means that there exists a positive link between the orresponding Q and A, and this link is analogous to links whih onstrut Θ. We use undireted edges to represent these impliations. To ahieve a more disriminative formulation, we use both positive q-a pairs and negative q-a pairs to train this model. Meanwhile, sine noisy answers generally oupy a larger population, to balane the number of positive and negative training data, we randomly sample a similar number of negative and positive points Learning the Gaussian prior P(Θ) The reason that we use a Gaussian prior is threefold: Firstly, instead of diretly using the BLR model s outputs, we evaluate how probably the latent linkage embedded in a new q-a pair belongs to the subpopulation defined by the previous knowledge, with respet to the BLR predition (we detail this approah in Setion 3.4.2). Therefore Gaussian is a good prior andidate. Seondly, Gaussian distribution has a omparatively small parameter spae (mean and variane), whih makes the model easier for ontrol. Thirdly, Gaussian distribution is onjugate to itself, i.e. the posterior is still a Gaussian, whih results in a simple solution. To ensure the preiseness of the prior, we adopt the approah suggested by Silva et al.[26]. Firstly the BLR is fit to the training data using Maximum Likelihood Estimation (MLE), whih gives an initial ˆΘ. Then the ovariane matrix ˆΣ is defined as a smoothed version of the MLE estimated ovariane: (ˆΣ) 1 = (XT ŴX) (2) where is a salar; is the size of the training data set; X = {X ij } is the K feature matrix of the training q-a pairs, either positive or negative. Ŵ is a diagonal matrix with Ŵ ii = ˆp(i) (1 ˆp(i)), where ˆp(i) is the predited probability of a positive link for the ith row of X. The prior for Θ is then the Gaussian (ˆΘ, ˆΣ). 3.3 Previous Knowledge Mining To our knowledge, none of previous CQA answer ranking approahes ever leveraged a supporting set. In fat, suh an idea is advantageous in at least two aspets: 1) bridging the lexial gap: it enlarges the voabulary, whih is more likely to over the words not appearing in a question but in its orret answers or vie versa. In fat, some traditional QA approahes have disovered that the amount of impliit knowledge whih assoiates an answer to a question an be quantitatively estimated by exploiting the redundany in a large data set [6, 24]. 2) bridging the semanti gap: its positive linkages enable the analogy reasoning approah for new link predition. Intuitively, it is more advantageous to rank a doument based on ommunity intelligene than simply on individual intelligene [7, 12]. We use information retrieval tehniques to identify suh a supporting set. However, ommunity q-a pairs retrieval again is not a trivial task. Jeon and his olleagues [15] proposed a translation-based retrieval model using the textual similarity between answers to estimate the relevane of two questions. Xue et al [29], furthermore, ombined the wordto-word translation probabilities of question-to-question retrieval and answer-to-answer retrieval. Lin et al. [22] and Bilotti et al. [4], on the other hand, adopted the traditional information retrieval method but utilized strutural features to ensure retrieval preision. We adopt Lin and Bilotti s way for q-a pair retrieval for simpliity. In partiular, let Q q be a new question and its answer list be {A j q, 1 j M}, the supporting q-a pairs S = {(Q 1 : A 1 ), (Q 2 : A 2 ),..., (Q L : A L )} ontain those whose questions osine similarities to the new question are above a threshold: S = {(Q i : A i ) os(q q, Q i ) > λ, i 1,..., D} (3) where D is the size of our rawled Yahoo!Answer arhive and λ is the threshold. Eah question is represented in the bag-of-word model. The effet of λ is shown in Figure 5. As analyzed before, whether the retrieved questions are semantially similar to the new question is not ritial. This is beause the inferene in our ase is based on the strutures of q-a pairs rather than on the ontents. Moreover, aording to the exhaustive experiments onduted by Dumais et al. [8], when the knowledge base is large enough, the auray of question answering an be promised with simple doument ranking and n-gram extration tehniques. ote that S ontains only positive q-a pairs. Therefore if the linkage of a andidate q-a pair is predited as analogous to the linkages in the subpopulation S, we an say that it is a positive linkage whih indiates a high-quality answer. 3.4 Link Predition The idea of analogial reasoning There is a large literature on analogial reasoning in artifiial intelligene and psyhology, whih ahieved great suess in many domains inluding lustering, predition, and dimensionality redution. Interested readers an refer to Frenh s survey [9]. However, few previous work ever applied analogial reasoning onto IR domain. Silva et al. [27] used it to model latent linkages among the relational objets ontained in university webpages, suh as student, ourse, department, staff, et., and obtained promising result. An analogy is defined as a measure of similarity between strutures of related objets (q-a pairs in our ase). The key aspet is that, typially, it is not so important how eah individual objet of a andidate pair is similar to individual 182

5 objets of the supporting set (i.e. the relevant previous q-a pairs in our ase). Instead, implementations that rely on the similarity between the pairs of objets will be used to predit the existene of the relationships. In other words, similarities between two questions or two answers are not as important as the similarity between two q-a pairs. Silva et al. [27] proposed a Bayesian analogial reasoning (BAR) framework, whih uses a Bayesian model to evaluate if an objet belongs to a onept or a luster, given a few items from that luster. We use an example to better illustrate the idea. Consider a soial network of users, there are several diverse reasons that a user u links to a user v: they are friends, they joined the same ommunities, or they ommented the same artiles, et. If we know the reasons, we an group the users into subpopulations. Unfortunately, these reasons are impliit; yet we are not ompletely in the dark: we are already given some subgroups of users whih are representative of a few most important subpopulation, although it is unlear what reasons are underlying. The task now beomes to identify whih other users belong to these subgroups. Instead of writing some simple query rules to explain the ommon properties of suh subgroups, BAR solves a Bayesian inferene problem to determine the probability that a partiular user pair should be a member of a given subgroup, or say they are linked in an analogous way Answer ranking via analogial reasoning The previous knowledge retrieved is just suh a subpopulation to predit the membership of a new q-a pair. Sine we keep only positive q-a pairs in this supporting set and high-quality answers generally answer a question semantially rather than lexially, it is more likely that a andidate q-a pair ontains a high-quality answer when it is analogous to this supporting q-a set. To measure suh an analogy, we adopt the soring funtion proposed by Silva et al. [27] whih measures the marginal probability that a andidate q-a pair (Q q, A j q) belongs to the subpopulation of previous knowledge S: sore(q q, A j q) = log P(C j q = 1 X j q,s,c S = 1) log P(C j = 1 X j q)) where A j q is the j-th andidate answer of the new question. X j q represents the features of (Q q, A j q). C S is the vetor of link indiators for S, and C S = 1 indiates that all pairs in S is positively linked, i.e. C 1 = 1, C 2 = 1,..., C L = 1. The idea underlying is to measure to what extent (Q q, A j q) would fit into S, or to what extent S explains (Q q, A j q). The more analogous it is to the supporting linkages, the more probability the andidate linkage is positive. Aording to the Bayes Rule, the two probabilities in Eq.(4) an be solved by Eq.(5) and Eq.(6) respetively: P(C j q = 1 X j q,s,c S = 1) = P(C j = 1 X j q, Θ)P(Θ S,C S = 1)dΘ (4) (5) P(Cq j = 1 Xq) j = P(C j = 1 Xq, j Θ)P(Θ)dΘ (6) where Θ is the parameter set. P(C j = 1 X j q, Θ) is given by the BLR model defined in Eq.(1). The solution of these two equations is given in the appendix. The entire proess is shown in Table 1. Table 1: Summary of our AR-based method I. Training Stage: Input: feature matrix X train of all the training q-a pairs Output: BLR model parameters Θ and its prior P(Θ). Algorithm: 1. Train BLR model on X train using Eq.(1). 2. Learn Prior P(Θ) using Eq.(2). II. Testing Stage: Input: a query QA thread: {Q q : A i q} M i=1 Output: sore(q q, A i q), i = 1,..., M Algorithm: 1. Generate the feature matrix X test = {X j q } with X j q the features of the j-th q-a pair (Q q : A j q); 2. Retrieve a set of positive q-a pairs S from the CQA arhive using Eq.(3); 3. Do Bayesian inferene to obtain P(Θ S, C S = 1); 4. For eah (Q q, A j q), estimate the probability of a positive linkage using Eq.(4); 5. Ranking the answers {A j q, j = 1,..., M} in desending order of the sores. The top-ranked answer is assumed as the best answer of Q q. 4. EVALUATIO We rawled 29.8 million questions from the Yahoo!Answers web site; eah question has answers on average. These questions over 1,550 leaf ategories defined by expert and all of them have user-labeled best answers, whih is a good test bed to evaluate our approah. 100,000 randomly seleted q-a pairs were used to train the link predition model, and 16,000 q-a threads were used for testing eah ontains answers on average, whih resulted in about 200,000 testing q-a pairs. A typial harateristi of CQA sites is its rih struture whih offers abundant meta-data. Previous work have shown the effetiveness of ombining strutural features with textual features [1, 14, 23]. We adopted a similar approah and defined about 20 features to represent a q-a pair, as listed in Table 2. Two metris were used for the evaluation. One is Average Preision@K: For a given query, it is the mean fration of relevant answers ranked in the top K results; the higher the preision, the better the performane is. We use the best answer tagged by the Yahoo!Answers web site as the ground truth. Sine average preision ignores the exat rank of a orret answer, we use the Mean Reiproal Rank (MRR) metri for ompensation. The MRR of an individual query is the reiproal of the rank at whih the first relevant answer is returned, or 0 if none of the top K results ontain a relevant answer. The sore for a set of queries is the mean of eah individual query s reiproal ranks: MRR = 1 1 (7) Q r r q q Q r 183

6 Table 2: The BAR Algorithm in CQA Formulation Textual Features of Questions-Answers Pairs Q.(A.) TF Question (Answer) term frequeny, stopwords removed, stemmed ovel word TF Term frequeny of non-ditionary word, e.g. ms #Common-words umber of ommon words in Q and A Statistial Features of Questions-Answers Pairs Q.(A.) raw length umber of words, stopwords not removed Q/A raw length ratio Question raw length / answer raw length Q/A length ratio Q/A length ratio, stopword removed Q/A anti-stop ratio #stopword-inquestion/#stopword-in-answer Common n-gram len. Length of ommon n-grams in Q and A #Answers Answer position umber of answers to a question The position of an answer in the q-a thread User interation / Soial Elements Features #Interesting mark umber of votes mark a question as interesting Good mark by users umber of votes mark an answer as good Bad mark by users umber of votes mark an answer as bad Rating by asker The asker-assigned sore to an answer Thread life yle The time span between Q and its latest A where Q r is the set of test queries; r q is the rank of the first relevant answer for the question q. Three baseline methods were used: the earest eighbor Measure (), the Cosine Distane Metri (COS), and the Bayesian Set Metri (BSets)[12]. The first two diretly measure the similarity between a question and an answer, without using a supporting set; neither do they treat questions and answers as relational data but instead as independent information resoures. The BSets model uses the soring funtion Eq.(8): sore(x) = log p(x S) log p(x) (8) where x represents a new q-a pair and S is the supporting set. It measures how probably x belongs to S. The differene of BSets to our AR-based method is that the former ignores the q-a orrelations, but instead join the features of a question and an answer into a single row vetor. 4.1 Performane Average Preision@K Figure 4 illustrates the performane of our method and the baselines with the Average Preion@K metri when K = 1, 5, 10. Sine eah question has less than 15 answers on average, the average preision at K > 10 is not evaluated. AP@K Our BSets COS K=1 K=5 K=10 Figure 4: Average Preision@K. Our method signifiantly outperformed the baselines. Table 3: MRR for Our Method and Baselines Method MRR Method MRR 0.56 Cosine 0.59 BSets [12] 0.67 Our Method 0.78 The red bar shows the performane of our approah. The blue, yellow, and purple bars are of the BSets, COS and methods respetively. In all ases, our method signifiantly out-performed the baselines. The gap between our method and BSets shows the positive effet of modeling the relationships between questions and answers, while the gap between BSets and, COS shows the power of ommunity intelligene. The superiority of our model to BSets shows that modeling ontent as well as data strutures improves the performane than modeling only ontent; this is beause in CQA sites, questions are very diverse and the retrieved previous knowledge are not neessarily semantially similar to the query question (i.e. they are still noisy). Moreover, from the experimental result that method performed the worst, we an tell that the lexial gap between questions and answers, questions and questions, and q-a pairs annot be ignored in the Yahoo!Answers arhive Mean Reiproal Rank (MRR) Table 3 gives the MRR performane of our method and the baselines. The trend oinides with the one that is suggested by the Average Preision@K measure. Again our method signifiantly out-performed the baseline methods, whih means that generally the best answers rank higher than other answers in our method. 4.2 Effet of Parameters Figure 5 evaluates how our method performs with the two parameters:, the salar in Eq.(2), and λ, the threshold in Eq.(3). MRR metri is used here. Figure 5(a) shows the joint effet of these two parameters on the MRR performane. The best performane was ahieved when λ = 0.8, = 0.6. To better evaluate the individual effet of the parameters, we illustrate the urve of MRR vs. λ and MRR vs. in Figure 5(b) and Figure 5() respetively by fixing the other to its optimal value. As desribed in Setion 3.3, the threshold λ ontrols the relevane of the supporting q-a pair set. Intuitively, if λ is set too high, few q-a pairs will be retrieved. This auses too small a subpopulation of the supporting q-a pairs suh that the linkage information beomes too sparse and thus is inadequate to predit the analogy of a andidate. On the 184

7 MRR 0.4 MRR delta λ / (a) The effet of and λ on the performane with MRR metri (b) The trend of performane on MRR vs. λ. = 0.6. () The trend of performane on MRR vs.. λ = 0.8. Figure 5: The effet of the prior salar in Eq.(2) and the similarity threshold λ in Eq.(3). The Mean Reiproal Rank metri was used. The best performane was ahieved at λ = 0.8, = 0.6. other hand, if λ is set too low, likely too many q-a pairs will be retrieved whih will introdue too muh noise into the subpopulation. Therefore the semanti meaning of the supporting links is obsured. Figure 5(b) exatly reflets suh a ommon sense. The best performane was ahieved when λ was about 0.8. After that the performane dropped quikly. However, before λ was inreased to 0.8, the performane improved slowly and smoothly. This indiates that the noisy q-a pairs orresponding to diverse link semantis were removed gradually and the subpopulation was getting more and more foused, thus strengthened the power of analogial reasoning. is used to sale the ovariane matrix of the prior P(Θ). As shown in Figure 5(), it is a sensitive parameter and is stable in a omparatively narrow range (i.e. about ). This, intuitively, oinides with the property of analogial reasoning approahes [9, 27, 26], where a datadependent prior is quite important for the performane. In our approah, we used the same prior for the entire testing dataset whih ontains q-a pairs from diverse ategories. A better performane an be foreseen if we learn different priors for different ategories. 5. COCLUSIO A typial harateristi of ommunity question answering sites is the high variane of the quality of answers, while a mehanism to automatially detet a high-quality answer has a signifiant impat on users satisfation with suh sites. The typial hallenge, however, lies in the lexial gap aused by textual mismath between questions and answers as well as user-generated spam. Previous work mainly fouses on deteting powerful features, or finding textual lues using mahine learning tehniques, but none of them ever took into aount previous knowledge and the relationships between questions and their answers. Contrarily, we treated questions and their answers as relational data and proposed an analogial reasoning-based method to identify orret answers. We assume that there are various types of linkages whih attah answers to their questions, and used a Bayesian logisti regression model for link predition. Moreover, in order to bridge the lexial gap, we leverage a supporting q-a set whose questions are relevant to the new question and whih ontain only high-quality answers. This supporting set, together with the logisti regression model, is used to evaluate 1) how probably a new q-a pair has the same type of linkages as those in the supporting set, and 2) how strong it is. The andidate answer that has the strongest link to the new question is assumed as the best answer that semantially answers the question. The evaluation on 29.8 million Yahoo!Answers q-a threads showed that our method signifiantly out-performed the baselines both in average preision and in mean reiproal rank, whih suggests that in the noisy environment of a CQA site, leveraging ommunity intelligene as well as taking the struture of relational data into aount are benefiial. The urrent model only uses ontent to reinfore strutures of relational data. In the future work, we would like to investigate how latent linkages reinfore ontent, and vie versa, with the hope of improved performane. 6. ACKOWLEDGEMET We thank Chin-Yew Lin and Xinying Song s help on the q-a data. 7. REFERECES [1] E. Agihtein, C. Castillo, and et. Finding high-quality ontent in soial media. In Pro. of WSDM, [2] J. Bian, Y. Liu, and et. A few bad votes too many? towards robust ranking in soial media. In Pro. of AIRWeb, [3] J. Bian, Y. Liu, and et. Finding the right fats in the rowd: Fatoid question answering over soial media. In Pro. of WWW, [4] M. Bilotti, P. Ogilvie, and et. Strutured retrieval for question answering. In Pro. of SIGIR, [5] M. Blooma, A. Chua, and D. Goh. A preditive framework for retrieving the best answer. In Pro. of SAC, [6] E. Brill, J. Lin, M. Banko, and et. Data-intensive question answeringa. In TREC, [7] J. Chu-Carroll, K. Czuba, and et. In question answering, two heads are better than one. In Pro. of HLT/AACL, [8] S. Dumais, M. Banko, and et. Web question answering: Is more always better? In Pro. of SIGIR,

8 [9] R. Frenh. The omputational modeling of analogy-marking. Trends in ognitive Sienes, 6, [10] L. Getoor,. Friedman, and et. Learning probabilisti models of link struture. JMLR, 3: , [11] L. Getoor,. Friedman, and et. Probabilisti relational models. Introdution to Statistial Relational Learning, [12] Z. Ghahramani and K. Heller. Bayesian sets. In Pro. of IPS, [13] T. Jaakkola and M. Jordan. Bayesian parameter estimation via variational methods. Statistis and Computing, 10:25 37, [14] J. Jeon, W. Croft, and et al.. a framework to predit the quality of answers with non-textual features. In Pro. of SIGIR, [15] J. Jeon, W. Croft, and et. Finding similar questions in large question and answer arhives. In Pro. of CIKM, [16] V. Jijkoun and M. Rijke. Retrieving answers from frequently asked questions pages on the web. In Pro. of CIKM, pages 76 83, [17] P. Jurzyk and E. Agihtein. Disovering authorities in question answer ommunities by using link analysis. In Pro. of CIKM, [18] J. Kleinberg. Authoritative soures in a hyperlinked environment. Journal of the ACM, 46(5): , [19] J. Ko, L. Si, and E. yberg. A probabilisti framework for answer seletion in question answering. In Pro. of AACL/HLT, [20] J. Ko, L. Si, and E. yberg. A probabilisti graphial model for joint answer ranking in question answering. In Pro. of SIGIR, [21] J. Leibenluft. Librarianaŕs worst nightmare: Yahoo!answers, where 120 million users an be wrong. Slate Magazine, [22] J. Lin and B. Katz. Question answering from the web using knowledge annotation and knowledge mining tehniques. In Pro. of CIKM, [23] Y. Liu, J. Bian, and E. Agihtein. Prediting information seeker satisfation in ommunity question answering. In Pro. of SIGIR, pages , [24] B. Magnini and M. e. a. egri. Is it the right answer? exploiting web redundany for answer validation. In Pro. of ACL, pages , [25] D. Molla and J. Viedo. Question answering in restrited domains: An overview. In Pro. of ACL, [26] R. Silva, E. Airoldi, and K. Heller. Small sets of interating proteins suggest latent linkage mehanisms through analogial reasoning. In Gatsby Tehnial Report, GCU TR , [27] R. Silva, K. Heller, and Z. Ghahramani. Analogial reasoning with relational bayesian sets. In Pro. of AISTATS, [28] Q. Su, D. Pavlov, and et. Internet-sale olletion of human-reviewed data. In Pro. of WWW, pages , [29] X. Xue, J. Jeon, and W. Croft. Retrieval models for question and answer arhives. In Pro. of SIGIR, pages , [30] J. Zhang, M. Akerman, and L. Adami. Expertise networks in online ommunities: Struture and algorithms. In Pro. of WWW, pages , APPEDIX Jaakkola et al. [13] suggested a variational approximation solution to the BLR model. Let g(ξ) be the logisti funtion, g(ξ) = (1 + e ξ ) 1, and onsider the ase for the single data point evaluation, Eq.(6), the method lower-bounds the integrand as follows: P(C X, Θ) = g(θ T X) g(ξ) exp{ HC ξ 2 λ(ξ)(h 2 C ξ 2 )} where H C = (2C 1)Θ T X and λ(ξ) = tanh( ξ 2 ). tanh( ) is 4ξ the hyperboli tangent funtion. Thus P(Θ X, C) an be approximated by normalizing P(C X, Θ)P(Θ) Q(C X, Θ)P(Θ) where Q(C X, Θ) is the right-hand side of Eq.(9). Sine this bound assumes a quadrati form as a funtion of Θ and our priors are Gaussian, the approximate posterior will be Gaussian, whih we denote by (µ pos, Σ pos). However, this bound an be loose unless a suitable value for the free parameter ξ is hosen. The key step in the approximation is then to optimize the bound with respet to ξ. Let the Gaussian prior P(Θ) be denoted as (µ, Σ). The proedure redues to an iterative optimization algorithm where for eah step the following updates are made: Σ 1 pos = Σ 1 + 2λ(ξ)XX T µ pos = Σ 1 pos[σ 1 µ + (C 1 2 )X] ξ = (X T Σ posx + (X T µ pos) 2 ) 1 2 To approximate Eq.(5), a sequential update is performed: starting from the prior P(Θ) for the first data point (X, C) in (S,C S = 1), the resulting posterior (µ pos, Σ pos) is treated as the new prior for the next point. The ordering is hosen from an uniform distribution in our implementation. Finally, given the optimized approximate posterior, the preditive integral Eq.(5) an be approximated as: log(q(c ij X ij,s,c S )) = log g(ξ ij) ξij 2 + λ(ξij)ξ2 ij 1 2 µt SΣ 1 S µ T S µt ijσ 1 ij µ T ij log Σ 1 ij Σ 1 S where parameters ( S, Σ S ) are the ones in the approximate posterior Θ (S,C S ) (µ S, Σ S ), and (µ ij, Σ ij) ome from the approximate posterior Θ (S,C S, X ij, C ij ). Parameters (ξ ij, ξ S) ome from the respetive approximate posteriors. And Eq.(6) an be approximated as: log Q(C ij X ij ) = log g(ξ ij) xiij 2 + λ(ξij)ξ2 ij (9) 186