Topical Authority Identification in Community Question Answering

Topical Authority Identification in Community Question Answering Guangyou Zhou, Kang Liu, and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences 95 Zhongguancun East Road, Beijing 100190, China {gyzhou,kliu,jzhao}@nlpr.ia.ac.cn Abstract. In this paper, we address the problem of authority identification in community question answering (CQA). Most of the existing approaches attempt to identify authorities in CQA by means of link analysis techniques. However, these traditional techniques only consider the link structure while ignore the topic information about the users, giving rise to an increasing problem of topic drift. Tosolvetheproblem of topic drift, we propose a topical ranking method, which is an extension of PageRank algorithm to identify authorities in CQA. Compared to the traditional link analysis techniques, our proposed method is more effective because it measures the authority scores by taking into account both the link structure and the topic information. We conduct experiments on real world data set from Yahoo! Answers. Experimental results show that our proposed method significantly outperforms the traditional link analysis techniques and achieves the state-of-the-art performance for authority identification in CQA. Keywords: authority identification; PageRank; community question answering. 1 Introduction Community question answering (CQA) is a particular form of online service for leveraging user-generated content, which has gained increasing popularity in recent years. These online services, such as Yahoo! Answers 1 and Live QnA 2, provide a platform for users to ask and answer questions. Unfortunately, the quality of answers has high variance: ranging from very high to low quality, sometimes abusive content or even spam [1]. Therefore, it is desirable to automatically identify authorities in CQA, so as to route the newly posted questions to the appropriate authorities, who can provide good quality answers to these questions [2,3,4]. Finally, the overall quality of answers can be substantially improved. 1 http://answers.yahoo.com/ 2 http://qna.live.com/ C.-L. Liu, C. Zhang, and L. Wang (Eds.): CCPR 2012, CCIS 321, pp. 622 629, 2012. c Springer-Verlag Berlin Heidelberg 2012

Topical Authority Identification in Community Question Answering 623 Authority Identification in CQA is the task of identifying users who can provide a large number of high quality, complete, and reliable answers [5], which has recently gained a wide interest in NLP and IR communities [2,3,6,7]. These existing approaches measure the authority scores by means of link analysis techniques such as PageRank [8] and HITS [9], or their variants. However, the traditional link analysis techniques only consider the link structure while ignore the topic information about the users, giving rise to an increasing problem of topic drift. To tackle the problem of topic drift, this paper proposes a topical ranking method for authority identification in CQA. Given a set of users, we first automatically distill the topics that users are interested in by analyzing the content of their answered questions and the corresponding answers. Based on the topics distilled, topic-specific question-answer relationships between askers and answerers are constructed. Finally, we measure the authority scores by taking into account both the link structure and the topic information about users. To the best of our knowledge, this is the first extensive and empirical study of identifying authoritative users in CQA by taking into account both the link structure and the topic information about users. To date, little work has been made regarding topic information about users in studies of authority identification in CQA, which remains an under-explored research area. This paper is thus designed to fill the gap. Specially, we make the following contributions: We automatically distill the topics that users are interested in by analyzing the content of their answered questions and the corresponding answers (in Section 2.1). We propose a topical ranking method by taking into account both the link structure and the topic information about users (in Section 2.2). Finally, we conduct experiments on CQA data set. The results show that our proposed method significantly outperforms the traditional link analysis techniques (in Section 3). The rest of this paper is organized as follows. Section 2 presents our proposed method. Section 3 presents the experimental results. Finally, we conclude with ideas for future work in Section 4. 2 Topical Authority Identification 2.1 Topic Distillation Topic distillation aims to automatically identify the topics that users (askers and answerers) are interested in based on the user profiles. 3 In this paper, we use the widely studied topic model Latent Dirichlet Allocation (LDA) [10] to identify the latent topic information from the large scale question-answer collection. 3 Here, the user profiles refer to the questions answered by the users and the corresponding answers.

624 G. Zhou, K. Liu, and J. Zhao LDA is a bayesian probabilistic graphical model, which models each document as a mixture of underlying topics and generates each word from one topic. The generation process of a document is described in Table 1. A document d is associated with a multinomial distribution over K topics, which is denoted as θ d.for each word w di in document d: (1) a topic z di is first sampled from the multinomial distribution θ d, which is generated from the Dirichlet prior parameterized by α; (2) then each word w di is generated from multinomial distribution φ zdi, which is generated from the Dirichlet prior parameterized by β. The two Dirichlet priors for document-topic distributions θ d and topic-word distributions φ z reduce the probability of overfitting training documents and enhance the ability of inferring topic distribution for new documents [11]. Here, we employ Gibbs sampling [12] for parameter estimation due to its faster convergence and better performance [13]. Table 1. The generation process of LDA For each topic z i {1,,K}, sample a multinomial distribution over words, φ zi Dir(β) For each document d: 1. sample a multinomial distribution over topics, θ d Dir(α) 2. For each word w di in document d: * sample a topic z di Multinomial(θ d ) * sample a word w di Multinomial(φ zdi ) To distill the topics that users are interested in using LDA, documents should naturally correspond to questions and answers. However, since the goal is to distill the topics that each user is interested in rather than the topics that each question and the corresponding answers are about, we aggregate the user profiles provided by each individual user into a big document. Thus, each document essentially corresponds to a user. The results of topic distillation are represented in two matrices: DK =[θ] D K,a D K matrix, where D is the number of users, and K is the number of topics. DK ij DK contains the number of times a word in u i s profiles (questions and the corresponding answers) has been assigned to topic z j. WK = [φ] W K,a W K matrix, where W is the number of unique words used in question-answer collection, and K is the number of topics. WK ij WK denotes the number of times unique word w i has been assigned to the specific topic z j. In these two matrices, matrix DK contains the number of times a word in a user (e.g., u i ) profiles has been assigned to a particular topic. We can row normalize it as DK such that DK i 1 =1foreachrowDK i.. Each row of matrix DK is the probability distribution of u i s interest over the K topics, e.g., each element DK ij denotes the probability that u i is interested in topic z j ( P (z j u i )=DK ij ).

Topical Authority Identification in Community Question Answering 625 2.2 PageRank for Authority Identification Based on the topics distilled in subsection 2.1, a directed graph G =(V,E) is formed with the topic-specific question-answer relationships among users. V is a set of nodes representing users (askers and answerers). A directed edge e E where e =(u i,u j ), u i V and u j V, indicates that user u j answers the questions of user u i.eachedgee ij E is associated with an affinity weight f(i j) between u i and u j. The weight is computed as follows: f(i j) = Q(i) A(j) (1) where Q(i) is the set of questions asked by u i, A(j) is the set of questions answered by u j. Two users are connected if their affinity weight is larger than 0 and we let f(i i) = 0 to avoid self transition. 4 The transition probability from u i to u j is then defined by normalizing the corresponding affinity weight as follows: p(i j) = { f(i j) V k=1 f(i k) if f 0 0 otherwise (2) where p(i j) is usually not equal to p(j i). We use the row-normalized matrix M = [ M ij ] V V to describe G with each entry corresponding to the transition probability. M ij = p(i j) (3) In order to make the graph fulfill the property of being aperiodic and M be a stochastic matrix, the rows with all zero elements are replaced by a smoothing vector with all elements set to 1/ V. Basedonthematrix M, the saliency score R(u i )foru i can be deduced from those of all other users linked with it and it can be formulated in a recursive manner as in the PageRank algorithm. R(u i )=λ R(u j ) M ji +(1 λ) 1 V j:u j u i where λ [0, 1] is a damping factor. The damping factor indicates that each vertex has a probability of (1 λ) to perform random jump to another vertex within this graph. The saliency score are obtained by running equation (4) iteratively until convergence. (4) 2.3 Topical PageRank for Authority Identification In equation (4), the second term is set to be the same value 1/ V for all vertices within the graph, which indicates that there are equal probabilities of random jump to all vertices. However, Haveliwala [14] and Nie et al. [15] proposed a 4 In CQA, the users cannot answer their own questions.

626 G. Zhou, K. Liu, and J. Zhao topical PageRank-like algorithm (TPR) and argued that the second term in equation (4) should be set to be non-uniformed. The assumption is that if we assign larger probabilities to some vertices, the final saliency score will prefer these vertices. The idea of TPR is to run PageRank for each topic separately. Each topicspecific PageRank prefers those users with high relevance to the corresponding topic. Formally, for a specific topic z, we will assign a topic-specific preference value p(u z) toeachuseru as its random jump probability u V p(u z) =1. The users who are interested in topic z will be assigned larger probabilities when performing the PageRank. Given a topic z, the TPR-like saliency score are defined as follows: R(u i z) =λ R(u j z) M ji +(1 λ)p(u i z) (5) j:u j u i The setting of preference value p(u i z) in equation (5) will have great influence to TPR. In this paper, we set p( z) =DZ.z, wheredz.z is the zth column of matrix DZ, which is the column normalized form of matrix DZ such that DZ.z 1 = 1. A large R(u z) indicates a user u is a good candidate authority in topic z. For implementation, the initial scores of all users are set to 1 and the iteration algorithm in equation (5) is used to compute the new scores of the users. Usually the convergence of the iteration is achieved when the difference between the scores computed at two successive iterations for any users falls below a given threshold (0.00001 in this paper). After ranking the users by using the TPR or other methods, we select top K users for each topic as topical candidate authorities. 3 Experiments 3.1 Data Set Yahoo! Answers web service supplies an API to allow web users to crawl the existing question answer archives and the corresponding user information from the website [17]. We crawl the data set from Yahoo! Answers, the data set consists of 237,083 resolved questions, and 593,107 answers posted by 286,053 users. Table 2 presents the statistics on the data set. In this paper, for all resolved questions, the information of each question includes: (1) Texts of question and the associated answers, with stop words being excluded 5 and the words being stemmed. 6 (2) User IDs of all questions and answers. (3) Users rating information (e.g., thumbs up, thumbs down, the best answers and so on.). 5 http://truereader.com/manuals/onix/stopwords1.html 6 http://tartarus.org/martin/porterstemmer/

Topical Authority Identification in Community Question Answering 627 Table 2. Yahoo! Answers data set Number of questions 237,083 Number of answers 593,107 Number of best answers 162,733 Number of total users 286,053 Number of askers 180,166 Number of answerers 135,441 Number of both askers and answerers 29,554 Since there is no available benchmark for authority identification for a given topic in CQA, we manually inspect the authority identification results. For each candidate authority u for topic z, we ask two annotators to check whether u is a real authority for the given topic. In this process, the annotators are given the top topic words and user profile. Each identified authority is voted by two annotators with label Yes (the user is a real authority for the given topic) or No (the user is not a real authority for the given topic). If a conflict happens, a third person will make judgement for the final result. The Cohen s Kappa coefficients of the Z topics range from 0.51 to 0.77, showing fair to good agreement. 3.2 Evaluation Metrics To evaluate the performance of authority identification, we use the three widely studied metrics in information retrieval. Mean Average Precision (MAP): This metric is the mean of the average precision scores for each topic. Mean Reciprocal Rank (MRR): This metric is the multiplicative inverse of the rank of the first retrieved authority for each topic. Average Precision@n (Avg. P@n): This metric denotes the average ratio of the relevant authorities in top n identified authorities for each topic. 3.3 Parameter Setting We have several parameters: i.e., Dirichlet hyper-parameters α, β, topicnumber Z, damping factor parameter λ used in PageRank. In this paper, we set Dirichlet priors α =50/Z,andβ =0.05 as Griffiths and Steyvers [12]. We run LDA with 200 iterations of Gibbs sampling. After trying a few different numbers of topics, we empirically set Z = 15. We choose these parameter settings because they give coherent and meaningful topics for our data set. For parameter λ, we conduct an experiment on a small development set to determine the best value among 0.1, 0.2,,0.9 in terms of MAP. This set is also extracted from Yahoo! Answers, and it is not included in the evaluation set. We find that λ =0.2 is the optimal parameter for PR, and TPR.

628 G. Zhou, K. Liu, and J. Zhao Table 3. Comparison of authority identification for different methods # Methods MAP MRR Avg. P@10 1 PR 0.435 0.726 0.331 2 HITS 0.397 0.704 0.266 3 InD 0.369 0.655 0.242 4 ER 0.481 0.773 0.385 5 TPR 0.506 0.781 0.388 3.4 Experimental Results Comparison with different methods To demonstrate the effectiveness of our proposed TPR method, comparisons against some previous work are also included: PageRank (PR): This method finds the authorities with only link structure taken into account [8]. HITS:Jurczyk and Agichtein [3] proposed to find authorities in CQA and estimated the ranking scores by using HITS algorithm. InDegree(InD):This method identifies the authorities based on the number of best answers described in Bouguessa et al. [2] ExpertiseRank (ER): Zhang et al. [7] proposed a PageRank-like algorithm called ExpertiseRanking to rank authorities in an expertise network considering how many users involved in asking and answering questions. Table 3 presents the comparison of authority identification for different methods. From this table, we can find that our proposed method significantly outperforms all previous works (row 1, row 2, row 3, and row 4 vs. row 5). 7 The results show the effectiveness of the propose method by considering the topic information users. 4 Conclusion and Future Work In this paper, we propose a topical rank method for authority identification in CQA. Compared to the traditional link analysis techniques, our proposed method is more effective because it finds the authorities by taking into account both the link structure and the topic information about users. We conduct experiments on real world data set from Yahoo! Answers. Experimental results show that our proposed method significantly outperforms the traditional link analysis techniques and achieves the state-of-the-art performance. Acknowledgements. This work was supported by the National Natural Science Foundation of China (No. 61070106), the National Basic Research Program of China (No. 2012CB316300), Tsinghua National Laboratory for Information 7 We perform a significant t-test. The comparisons between our method and previous works are significant at p<0.05.

Topical Authority Identification in Community Question Answering 629 Science and Technology (TNList) Cross-discipline Foundation and the Opening Project of Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (No. 5026035403). We thank the anonymous reviewers for their insightful comments. References 1. Agichtein, E., Castillo, C., Donato, D.: Finding High-Quality Content in Social Media. In: Proceedings of WSDM, pp. 183 193 2. Bouguessa, M., Dumoulin, B., Wang, S.: Identifying authoritative actors in question-answering forums-the case of Yahoo! Answers. In: Proceedings of KDD, pp. 866 874 3. Jurczyk, P., Agichtein, E.: Discovering authorities in question answer communities by using link analysis. In: Proceedings of CIKM, pp. 919 922 4. Liu, J., Song, Y.-I., Lin, C.-Y.: Competition-based user expertise score estimation. In: Proceedings of SIGIR, pp. 425 434 5. Pal, A., Konstan, J.: Expert identification in community question answering: exploring question selection bias. In: Proceedings of CIKM, pp. 1505 1508 6. Kao, W., Liu, D., Wang, S.: Expert finding in question-answering websites: a novel hybrid approach. In: Proceedings of SAC, pp. 867 871 7. Zhang, J., Ackerman, M., Adamic, L.: Expertise networks in online commmunities: structure and algorithm. In: Proceedings of WWW 8. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. Stanford Digtital Library Technologies Project 9. Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46(5), 604 632 10. Blei, D., Ng, A., Jordan, M.: Latent dirichlet allocation. Journal of Machine Learning Research 3, 993 1022 11. Guo, J., Xu, S., Bao, S., Yu, Y.: Tapping on the potential of Q&A community by recommending answer providers. In: Proceedings of CIKM, pp. 921 930 12. Griffiths, T., Steyvers, M.: Finding scientific topics. The National Academy of Sciences 101, 5228 5235 13. Porteous, I., Newman, D., Ihler, A., Asuncion, A., Smyth, P., Welling, M.: Fast collapsed gibbs sampling for latent dirichlet allocation. In: Proceedings of KDD, pp. 569 577 14. Haveliwala. T. H.: Topic-sensitive pagerank. In: Proceedings of WWW 15. Nie, L., Davison, B.D., Qi, X.: Topic link analysis for web search. In: Proceedings of SIGIR 16. Li, B., King, I.: Routing questions to appropriate answerers in community question answering services. In: Proceedings of CIKM, pp. 1585 1588 17. Zhou, G., Cai, L., Zhao, J., Liu, K.: Phrase-based translation model for question retrieval in community question answer archives. In: Proceedings of ACL, pp. 653 662