Improving Question Retrieval in Community Question Answering with Label Ranking

Wei Wang, Baichuan Li
Department of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong
{wangwei, bcli}@cse.cuhk.edu.hk

Irwin King
AT&T Labs Research
Florham Park, NJ 07932 USA
irwin@research.att.com

Abstract

Community question answering (CQA) services, which provide a platform for people with diverse backgrounds to share information and knowledge, have become an increasingly popular research topic in recent years. Question retrieval (QR) in CQA automatically finds the most relevant historical questions that have already been resolved by other users. Current QR approaches typically focus on building diverse retrieval models but fail to analyze user intention. User intention, such as seeking facts, interacting with others, or asking for reasons, reflects what the user really wants to know, and hence we integrate user intention analysis into QR. First, we classify questions into several types according to users' intentions. It is worth noting that each question can be assigned to more than one type, since askers may have several intentions. Another practical issue is that there naturally exist preferences among the possible question types: a more relevant type should be ranked higher than less relevant ones. Therefore, we propose to perform question classification with a label ranking method. Label ranking is a machine learning approach that predicts a ranking over all possible labels. Second, based on the result of question classification, we integrate user intention with translation-based language models to explore whether user intention helps to improve retrieval performance. We conduct extensive experiments on Yahoo! Answers data, and the results demonstrate that our improved question retrieval model can indeed enhance the performance of the traditional question retrieval model.

I. INTRODUCTION

Community question answering (CQA) services have drawn increasing attention in recent years. They provide a free environment where people can voluntarily ask and answer questions. Unlike traditional automatic information retrieval, a CQA site allows peer-to-peer interaction, and thereby answers from a large community tend to be more personalized and specific. Besides, these portals provide a question retrieval mechanism that returns a list of resolved historical questions. Therefore, if the returned questions exactly reflect the users' intentions, the time needed to find answers can be reduced considerably.

Many methods have been applied to question retrieval recently [2], such as the vector space model [13], the language model [9], the Okapi model [13], and the translation model [1]. In addition, [13] compares the performance of these four retrieval models. Some researchers also propose to classify questions into several categories [17], using a multi-class SVM classifier to detect users' intentions. Instead of a single label, they allow each question to be assigned up to 5 labels of equal importance. However, the significance of the labels of a question differs, and hence the number of labels is not fixed.

TABLE I
AN EXAMPLE OF QUESTION RETRIEVAL

Question                                                | Type preference
Can culture difference make or break a relationship?   | 1 2 3
Whether culture difference can make a relationship?    | 1 2 3
Why culture difference can make a relationship?        | 2 1 3
Any opinions about culture difference and relationship? | 3 2 1
In this sense, it is more reasonable to assign a ranked list of labels to each query: labels that are more relevant to the question should be ranked higher than labels that are less relevant. Table I gives an example, where 1, 2 and 3 denote three question types, Yes/No, Reason and Opinions, respectively. Different questions show different preferences among these three types. Traditional multi-label classification cannot capture this, since it gives the same significance to every assigned type, and therefore cannot find the most similar questions effectively.

There are two possible solutions to this problem. One is to assign each label a score, so that a ranked list reflects the preference among the labels. However, this is impracticable in our question retrieval setting because of the difficulty of scoring: when we ask a user to assign a score to each label based on the relevance between the label and the question, they usually cannot assess it objectively. On the contrary, it is much easier for them to tell which label is better than another. In view of this, label ranking is applied to solve this particular multi-label classification problem. Label ranking is a machine learning approach that learns a mapping from instances to rankings over a finite set of predefined types. It concerns not only the types a question belongs to, but also their relative positions. By assigning higher weights to some types, label ranking helps us find more similar questions on top of traditional question retrieval models.

Our contributions are three-fold:
1) We divide questions into several types to detect users' intentions by performing multi-label classification.
2) We are the first to use label ranking to analyze the different types of questions.
3) We utilize multi-label classification to improve the performance of question retrieval in CQA services.

The rest of this paper is organized as follows. In the next section, we discuss related work. In Section III and Section IV, we present the overall process of our proposed method, the details of label ranking, and the improved question retrieval method. In Section V we describe and analyze the data set, and in Section VI we compare the experimental results of our improved question retrieval method. Finally, Section VII concludes the paper.

II. RELATED WORK

Various retrieval methods have been applied to or proposed for question retrieval in CQA services. Jeon et al. [14], [15] compare the performance of four popular retrieval methods, the vector space model, the Okapi model, the language model, and the translation model. The experimental results show that the translation model outperforms the other models. Subsequently, Xue et al. [21] propose a translation-based language model, which combines the translation model and the language model for question retrieval. Similar results are observed, and the translation-based language model is demonstrated to perform the best. Recently, some studies exploit the category property of questions in CQA portals and propose category-based retrieval methods [4], [9], [20]; these methods perform better than traditional retrieval models that do not consider category information of the questions.

Label ranking [19], [8] is a complex prediction task that aims to learn a mapping from instances to rankings over a finite set of predefined labels. It can be viewed as a generalization of classification. Since the result of label ranking is a ranked list of labels, by properly setting a threshold we can obtain the result of multi-label classification, and the first label in the ranking is the prediction of single-label classification. Many approaches for label ranking have been proposed recently. The log-linear model for label ranking [8] assumes that each instance in the training data is associated with a list of preferences over the label set and learns a ranking function that induces a total order over the entire set of labels. In ranking by pairwise comparison, a binary preference model is learned for each pair of labels [11].

III. QUESTION CLASSIFICATION

Figure 1 describes the process of our proposed improved question retrieval model. When a user issues a query, it is classified into a ranked list of types by question classification. Based on this ranked list, a general score that incorporates the question retrieval score describes the relevance between two questions.

[Fig. 1. Question and Answer Services]

A. Question Analysis

Question classification aims to map plain questions into several types and thereby add constraints to question retrieval. We follow the question taxonomy proposed in [17], which contains 11 categories, as shown in Table II. The Confucius system proposed in [17] uses this classification system in real applications. Each question type describes an expected answer which reflects the asker's intention. Therefore, a targeted question retrieval should focus on searching for similar questions within particular types.
TABLE II
TYPES OF QUESTIONS

Type | Purpose          | Expected answers
REC  | Recommendation   | Suggested item and the reason
FAC  | Seeking facts    | Data, location, or name
YNO  | Yes/No decision  | Yes/No and reason
HOW  | How-to question  | The instructions
WHY  | Seeking reason   | The explanation
TSL  | Translation      | Translated text in target language
DEF  | Definition       | The definition of a given entity
OPN  | Seeking opinion  | Opinions of other users
TRA  | Transportation   | Navigation instructions
INT  | Interactive      | Discussion thread
MAT  | Math problem     | Solution with steps

As the introduction describes, a key problem in our proposed model is the preference among different categories. The labels describe the possible intentions of a particular question, and the sequence in the ranking list reflects the similarity between the question and each type: higher-ranked types are more relevant than lower-ranked ones. The more related a type is to a query, the more likely that type contains questions relevant to the query. Therefore, what we want to compute is not only the set of types (or labels) corresponding to a question, but also their order in the type list.

Question Classification: Given a set of instances X = {x_i}_{i=1}^{l+p}, where x_i ∈ R^m, the corresponding labels are Y = {y_i}_{i=1}^{l+p}. Each y_i is a permutation of all labels from L. We use y_i(j) to denote a single type, where j ∈ {1, ..., n}, and the order of the permutation denotes the preference among labels. Here n is the number of classes and l + p is the number of instances; l denotes the number of instances with complete rankings and p denotes the number with partial or incomplete rankings, which are introduced in Section III-B.

Question information analysis is a process of keyword retrieval, since the extracted keywords are usually viewed as the important question topics and used in document retrieval. Document and question retrieval are already rather mature in many research works.

The keywords extracted by retrieval services are usually content words, such as nouns, verbs, or phrases. However, questions posted on CQA services are relatively complete sentences that contain not only these content words but also many function words, such as what, how, and where. These function words give us information about the askers' latent intentions, but they also increase the difficulty of question retrieval. In the following subsections, we use several machine learning algorithms to solve question classification in CQA.

B. Instance Based Label Ranking

Label ranking [7], [3] is the task of inferring a total order over a predefined set of types for each unlabeled instance; it focuses not only on the multiple labels in the learning process, but also on the relative order among them. It can be viewed as a natural generalization of traditional classification. In our question classification problem, each new question is associated with a list of question types, which motivates us to use a label ranking algorithm to solve this problem. Among the existing label ranking methods, such as [8], [11], [19], we choose instance-based learning to train the classifier. Instance-based learning is a family of algorithms that compares each new instance with the training instances stored in memory. After obtaining a real-time judgment, the new instance can also be added to the training set. Compared with other machine learning algorithms, the advantages of instance-based algorithms are twofold. First, the training process does not need much training data, which is expensive to obtain in CQA services. Second, it is highly suited to continuously changing data sets: whereas other machine learning algorithms have to retrain the entire model when a new instance is introduced, an instance-based algorithm only needs to add it to the training set.

To evaluate the predictive performance of the mapping function, a suitable loss function on Y is necessary. Diverse methods can be used to calculate this distance; here we select a popular one in statistics, Kendall's tau rank distance [16], [18]. Suppose y and z are two rankings; we define the distance between them as

  D(y, z) = #{(i, j) : (y(i) − y(j))(z(i) − z(j)) < 0}.   (1)

D(y, z) equals 0 if the two rankings are identical, and n(n − 1)/2 (where n is the list size) if one ranking is the reverse of the other. The Kendall tau distance is often normalized to [0, 1] so that it can be interpreted as a correlation measure: D(y, z) = 0 if and only if every pair (i, j) is in the same order in both rankings, and D(y, z) = 1 if and only if every pair is in the opposite order. Kendall's tau distance can be viewed as the number of steps between connected vertices of the permutation polytope. A good property of Kendall's tau distance is right invariance: the distance does not depend on an arbitrary relabeling of the n labels.

The Mallows model [7] has been used to solve the label ranking problem; it belongs to the exponential family. Given the model parameters z and θ, the probability of a ranking y is

  P(y | θ, z) = (1 / φ(θ, z)) exp(−θ D(y, z)),   (2)

where φ(θ, z) = Σ_{y ∈ Y} exp(−θ D(y, z)) is a normalization constant, z is the mode or center ranking of the distribution, and θ ≥ 0 is the dispersion degree, i.e., how closely the data cluster around the center.
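To make Eqs. (1) and (2) concrete, the following is a minimal Python sketch (our own illustration, not the authors' code) that computes the Kendall tau distance between two rankings over the same label set and the corresponding Mallows probability. The closed-form normalization φ(θ) = Π_{j=1}^{n} (1 − e^{−jθ})/(1 − e^{−θ}) used below is the standard result for the Mallows model with Kendall's distance.

```python
from itertools import combinations
from math import exp

def kendall_tau_distance(y, z):
    """Eq. (1): number of label pairs ordered differently in rankings y and z.
    y and z map each label to its rank position (lower = preferred)."""
    labels = list(y.keys())
    return sum(
        1
        for i, j in combinations(labels, 2)
        if (y[i] - y[j]) * (z[i] - z[j]) < 0
    )

def mallows_probability(y, z, theta):
    """Eq. (2): P(y | theta, z) under the Mallows model with center ranking z.
    Normalization: phi(theta) = prod_{j=1..n} (1 - exp(-j*theta)) / (1 - exp(-theta))."""
    n = len(z)
    phi = 1.0
    for j in range(1, n + 1):
        phi *= (1 - exp(-j * theta)) / (1 - exp(-theta))
    return exp(-theta * kendall_tau_distance(y, z)) / phi

# Toy example with three of the types of Table II (rank 1 = most preferred):
center = {"YNO": 1, "WHY": 2, "OPN": 3}
query = {"YNO": 2, "WHY": 1, "OPN": 3}          # one adjacent swap
print(kendall_tau_distance(query, center))       # 1
print(mallows_probability(query, center, 1.0))   # probability decays with distance
```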
A simple situation of our ranking problem is full ranking: given a set of instances x_1, ..., x_k ∈ X, every ranking in the corresponding label set is a full permutation. For each new instance, we first find a set of similar instances and assume that these neighbors have, at least approximately, the same distribution. By further assuming they are independent, the probability of the observations given the parameters z and θ becomes

  P(y_1, ..., y_k | θ, z) = Π_{i=1}^{k} P(y_i | θ, z)
                          = Π_{i=1}^{k} exp(−θ D(y_i, z)) / φ(θ, z)
                          = exp(−θ Σ_{i=1}^{k} D(y_i, z)) / ( Π_{j=1}^{n} (1 − exp(−jθ)) / (1 − exp(−θ)) )^k,   (3)

where z is the parameter to be estimated and output as the predicted label ranking. We can use maximum likelihood estimation (MLE) to solve this problem.

However, a more common situation is incomplete observations, i.e., a partial order over the ranking labels: we cannot observe the full preference order. For each incomplete ranking y, we use E(y) to denote all permutations that do not conflict with y. This often happens in CQA services, since it is unreasonable to assign a full ranking of all labels to each question; we usually only care about the most probable labels in the whole permutation. Again assuming independence of the observations, the probability of the incomplete rankings given neighbors {x_i, y_i}_{i=1}^{k} with parameters z and θ becomes

  P(y_1, ..., y_k | θ, z) = Π_{i=1}^{k} P(E(y_i) | θ, z)
                          = Π_{i=1}^{k} Σ_{y ∈ E(y_i)} exp(−θ D(y, z)) / ( Π_{j=1}^{n} (1 − exp(−jθ)) / (1 − exp(−θ)) )^k.   (4)

Because the parameters z and θ are difficult to derive in closed form in this case, we cannot estimate them with plain MLE. Therefore, we resort to the expectation-maximization (EM) algorithm to find them iteratively.
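For the full-ranking case of Eq. (3), the ML estimate of the center z reduces, for a fixed θ, to the ranking that minimizes the total Kendall distance to the neighbors' rankings. The following Python sketch (our own illustration, feasible only for small label sets) computes it by brute force over all permutations.

```python
from itertools import permutations, combinations

def kendall_tau_distance_seq(y, z):
    """Kendall tau distance between two rankings given as sequences of labels."""
    pos_y = {label: i for i, label in enumerate(y)}
    pos_z = {label: i for i, label in enumerate(z)}
    return sum(
        1
        for a, b in combinations(pos_y, 2)
        if (pos_y[a] - pos_y[b]) * (pos_z[a] - pos_z[b]) < 0
    )

def ml_center_ranking(observed_rankings, labels):
    """ML estimate of the Mallows center z in Eq. (3) for complete rankings:
    the permutation minimizing the total Kendall distance to the observations."""
    return min(
        permutations(labels),
        key=lambda z: sum(kendall_tau_distance_seq(y, z) for y in observed_rankings),
    )

# Neighbors' complete rankings over three types; the estimated center puts OPN first.
neighbors = [("OPN", "YNO", "WHY"), ("OPN", "WHY", "YNO"), ("YNO", "OPN", "WHY")]
print(ml_center_ranking(neighbors, ["OPN", "YNO", "WHY"]))
```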

Starting from an initial center ranking z ∈ Y, the label information of the incomplete data is estimated by comparing the distance between each possible extension and the center z (E-step). In the M-step, the new center ẑ of the distribution is computed. The two steps are repeated until the center hardly changes, and the final center is output as the predicted ranking for the query x.

Algorithm 1 Label ranking for a new query
Inputs: query x ∈ X, training data T, integer k or similarity threshold φ
Output: label ranking estimation for x
 1: find the k nearest neighbors of x in T (or all instances whose similarity exceeds φ)
 2: collect the neighbors' rankings σ = {σ_1, ..., σ_k}
 3: calculate an initial center π̂ from σ
 4: repeat
 5:   for every ranking σ_i in σ do
 6:     if σ_i is incomplete then
 7:       replace it with its most probable consistent extension given π̂
 8:     end if
 9:   end for
10:   recalculate π̂ from the completed σ
11: until π̂ no longer changes
12: return π̂

C. Some Other Learning Algorithms

Besides the label ranking algorithm, we have tried two other machine learning algorithms: Naïve Bayes (NB) [5] and Support Vector Machines (SVM). NB often performs competitively with more sophisticated classification methods. For a data set with n classes, we first calculate the prior probability of each class. Since instances are well clustered, it is reasonable to assume that the more instances of one class lie in the vicinity of a new case x, the more likely it is that x belongs to that class. The final classification is then produced by Bayes' rule, and all labels are ordered with respect to the predicted class probabilities for each instance.

Besides NB, we also apply the SVM algorithm [10] to train a classifier. As introduced in Section III-A, each question can be assigned more than one class; therefore, we choose the one-against-rest strategy. It builds L binary classifiers {f_1, f_2, ..., f_L}, where L is the number of classes. When training classifier f_i, questions belonging to class i are treated as positive instances and all other questions as negative instances. For each new instance, we calculate the scores of the L classifiers and order the labels by these scores. LIBSVM is used to solve the multi-label classification and provide the predicted label information for new questions.

IV. IMPROVED QUESTION RETRIEVAL USING CLASSIFICATION

In our improved question retrieval model, the similarity of two questions involves two parts: intention relevance and content relevance. The intention relevance describes their types. By applying Eq. (1), we define the intention relevance score INT_{q,u} between two questions as

  INT_{q,u} = 1 − D(q, u),   (5)

where D(q, u) is the normalized Kendall tau distance between the label rankings of q and u. The content relevance refers to their content words, such as nouns, verbs, and adjectives, and can be calculated by question retrieval models.

A. Question Retrieval Models

Various retrieval models have been proposed recently. In this paper, we adopt the state-of-the-art model, the translation-based language model [21], which combines the advantages of the language model [14] and the translation model [15]. Given a query q and any question u, the score CON_{q,u} is calculated as

  CON_{q,u} = Π_{w ∈ q} [ (1 − λ)( β Σ_{w′ ∈ u} T(w | w′) P(w′ | u) + (1 − β) P(w | u) ) + λ P(w | C) ],   (6)

where P(w | u) = f_{w,u} / Σ_{w′ ∈ u} f_{w′,u} and P(w | C) = f_{w,C} / Σ_{w′ ∈ C} f_{w′,C}. T(w | w′) denotes the probability that word w is a translation of word w′, and β controls the trade-off between the language model and the translation model.
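As an illustration of Eq. (6), here is a small Python sketch (our own, not from the paper) that scores a candidate question against a query with the translation-based language model; the toy translation table T and the parameter values are placeholders for illustration only.

```python
from collections import Counter

def trans_lm_score(query, candidate, collection, T, lam=0.2, beta=0.8):
    """Eq. (6): translation-based language model score CON_{q,u}.
    query, candidate: lists of tokens; collection: list of token lists;
    T[(w, w_prime)]: probability that w is a translation of w_prime."""
    f_u = Counter(candidate)
    u_len = sum(f_u.values())
    f_c = Counter(tok for doc in collection for tok in doc)
    c_len = sum(f_c.values())

    score = 1.0
    for w in query:
        p_w_u = f_u[w] / u_len                       # P(w | u)
        p_w_c = f_c[w] / c_len                       # P(w | C), collection smoothing
        p_trans = sum(T.get((w, wp), 0.0) * (f_u[wp] / u_len) for wp in f_u)
        score *= (1 - lam) * (beta * p_trans + (1 - beta) * p_w_u) + lam * p_w_c
    return score

# Toy usage with a two-question "collection" and a tiny translation table.
collection = [["culture", "difference", "relationship"], ["why", "culture", "matters"]]
T = {("break", "relationship"): 0.1, ("culture", "culture"): 0.9}
print(trans_lm_score(["culture", "break"], collection[0], collection, T))
```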
B. Improved Question Retrieval Model

Besides the content relevance, the intention relevance describes the types of the two questions. As Table II shows, questions from the same category tend to reflect similar intentions of the corresponding askers. By analyzing the question type, we can detect what the user really wants to know, and questions can thus be mapped from a bag of words into a semantic space. After being processed by the classifiers, each question is given a list of labels whose order describes the degree of relevance, and we can then calculate the intention similarity between every question pair. After obtaining the content relevance CON_{q,u} and the intention relevance INT_{q,u}, we calculate the general relevance of the query q and any historical question u. Because of their different value scales, CON_{q,u} and INT_{q,u} are first normalized, and the general score is

  S_{q,u} = γ CON_{q,u} + (1 − γ) INT_{q,u},   (7)

where γ is the combination weight that balances the content relevance and the intention relevance.
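The combination in Eq. (7), together with the min-max normalization of CON described later in Section VI-B, can be sketched in a few lines of Python; this is an illustrative outline under our own assumptions, not the authors' implementation.

```python
import math

def combined_scores(con_scores, int_scores, gamma=0.5):
    """Eq. (7): S = gamma * CON' + (1 - gamma) * INT, where CON' is the
    min-max normalized log(1 + CON) (cf. Section VI-B) and INT is already in [0, 1]."""
    logs = [math.log1p(c) for c in con_scores]
    v_min, v_max = min(logs), max(logs)
    span = (v_max - v_min) or 1.0                      # guard against identical scores
    con_norm = [(v - v_min) / span for v in logs]
    return [gamma * c + (1 - gamma) * i for c, i in zip(con_norm, int_scores)]

# Rank three candidate questions for one query (toy numbers).
con = [0.0066, 0.0031, 0.0104]       # content relevance from Eq. (6)
intent = [0.9, 0.4, 0.6]             # intention relevance from Eq. (5)
ranked = sorted(enumerate(combined_scores(con, intent, gamma=0.6)),
                key=lambda t: t[1], reverse=True)
print(ranked)                        # candidate indices ordered by general score S
```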

V. EXPERIMENTAL SETUP

In this section, we describe our experimental data set and design to show the performance of our improved question retrieval model.

A. Experimental Objective

Traditional retrieval models mainly focus on finding related historical questions for a query in a question pool. However, questions can be naturally divided into several types, such as recommendation, opinion, etc., and different types reflect different user intentions. Questions of the same type are likely to have the highest similarity. Therefore, we try to separate the pool of questions into several sub-groups, with the aim of increasing the efficiency and accuracy of traditional question retrieval models.

B. Experimental Design

As described in Section V-A, we design three experiments to demonstrate our idea.
1) We analyze our data set crawled from the Yahoo! Answers website. The classification system of the Confucius system [17] is applied in our work. We assume that each query can be assigned multiple labels and that the number of labels is not fixed. These results are shown in Section V-C.
2) In the classification part, we apply three machine learning algorithms (Label Ranking, Naïve Bayes, and Support Vector Machines) to the questions. To compare the three algorithms, we divide the whole question set into a training set and a test set of equal size. The classifier with the best performance is then used in the third experiment. These results are shown in Section VI-A.3.
3) We compare the results of our improved question retrieval model with the traditional question retrieval model. After normalization, the combined score is treated as the question's general score. These results are shown in Section VI-B.3.

C. Data Set

The questions on Yahoo! Answers are divided into 26 general categories by the website. Based on this, our crawling job consists of three sub-steps. First, we find the most popular queries: the Yahoo! Answers API provides a "get by category" function that finds the popular queries in each category and returns their URL addresses. Second, based on those URL addresses, we use the "get question" function to obtain the details of the corresponding questions. Finally, the question subjects returned in the second step are used as the search items of the question-search function, which gives us a list of related questions for each query as the baseline of our improved retrieval model. In our experiment, we collected the top 50 interesting queries for each category and the top 50 related questions for each query.

Another important issue is the quality of the queries. Sometimes when we search for a query in the CQA service, the system reports that it cannot find any results, i.e., no related questions can be found in the historical question pool. In our experiment, we count the number of queries that have more than 10 related questions; the statistics are shown in Fig. 2.

[Fig. 2. Number of Questions in Each Category]

As Section III-A briefly introduced, we use the taxonomy in Table II for the classification job. To verify it, 50 questions from each category are manually assigned to question types according to the classification system. To obtain reliable labels, three experienced annotators take charge of the labeling task. The statistics are shown in Table III. We can see that nearly every question is assigned to more than one type; one question even has 6 labels. On average, each question has 3 or 4 labels.
This supports our assumption that each question should be assigned multiple labels and that the number of labels varies across queries.

TABLE III
THE NUMBER OF LABELS PER QUESTION

Number of labels    | 1  | 2   | 3   | 4   | 5  | 6
Number of questions | 32 | 178 | 314 | 147 | 26 | 1

We then conduct another statistical analysis to show the distribution of types over all questions. This result is viewed as the ground truth of our question classification. For the multi-type questions, each question is assigned a list of labels that describes the preference among them, and we use the Borda count to aggregate this distribution. The Borda count is a relatively simple rank-based voting method: given a ranking list over n question types, the top-ranked type gets n votes, the second gets n − 1 votes, and so on. In this way, we can calculate the total votes for each type over all ranking lists. The statistics are listed in Table IV. Note that the Opinion type contains the most questions, followed by the Interactive type, while Translation and Transportation occupy the smallest shares.
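The Borda aggregation described above can be sketched as follows; this is a small illustrative Python snippet of our own, not the authors' code, turning the annotators' ranked label lists into per-type vote totals of the kind reported in Table IV.

```python
from collections import Counter

def borda_votes(ranked_label_lists, n_types=11):
    """Borda count: in a ranking over n types, the top type gets n votes,
    the next n-1, and so on; votes are summed over all ranking lists."""
    votes = Counter()
    for ranking in ranked_label_lists:
        for position, label in enumerate(ranking):
            votes[label] += n_types - position
    return votes

# Toy example with three annotated questions (partial rankings, most preferred first).
annotations = [["OPN", "INT", "YNO"], ["HOW", "REC"], ["OPN", "WHY", "INT", "YNO"]]
print(borda_votes(annotations))   # e.g. OPN collects 11 + 11 = 22 votes
```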

TABLE IV
QUESTION DISTRIBUTION

Category | REC  | FAC  | YNO  | HOW  | WHY  | TSL | DEF | OPN  | TRA | INT  | MAT
Number   | 2140 | 1128 | 1806 | 2555 | 2328 | 47  | 148 | 5145 | 10  | 2740 | 226

D. Question Features

To avoid bias from different classifiers, we use the same question feature system in all three classification experiments. Since the purpose of this paper is to detect the askers' intentions, we use two groups of features:
1) Bag of words. Each query or question consists of single words and symbols. The word dictionary is collected from the whole data set crawled from the Yahoo! Answers website, and each query is represented by a term-frequency vector.
2) Special interrogatives and particular words. To emphasize the significance of some special words in our detection task, we also extract various interrogative and other particular words. For example, sentences starting with do, can, did, does, etc. probably belong to the Yes/No type, and sentences including most or best are likely asking for recommendations.

VI. EXPERIMENTAL RESULTS AND COMMENTS

In this section, we show the experimental results of the question classification and the improved question retrieval. Questions are labeled manually by three experts and are divided into a training set and a test set of equal size.

A. Question Classification

To evaluate the classification performance, we compare three algorithms: Label Ranking, Naïve Bayes, and Support Vector Machines.

1) Parameter Selection: Both label ranking and Naïve Bayes are instance-based algorithms, so we need to select neighbors with similar features for each new instance. As Section V-D introduces, the question features can be divided into three parts: the particular words s_1, the bag of words s_2, and the word frequencies s_3. For each feature pair, we use cosine similarity to measure the similarity between two instances. s_2 describes the content of a question and s_3 describes the frequency of the words in s_2; therefore, two questions are likely to be similar only when they score high on both s_2 and s_3. The final similarity measure is s = s_1 + s_2 · s_3, and s is normalized to [0, 1]. Another important issue is the number of neighbors in each iteration. The basic criterion is that the higher the similarity, the better an instance serves as a neighbor. In our experiment, we use a threshold score: it is initially set to 0.5, and if no instances satisfy the requirement, the threshold is divided by 2, and so on, until some neighbors are found.

For the support vector machine, we use LIBSVM [6] to train the classifier. To make the results comparable with label ranking, we want a permutation sorted by the probabilities of the different types. Since there are 11 types in our classification system, 11 models are trained. For each new instance we thus obtain 11 probabilities, one per type, and the list of types sorted by these probabilities. We use the RBF kernel exp(−γ‖u − v‖²) to train the classifier, with γ set to 1/k, where k is the number of features. The final results are listed in Table V.
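The neighbor selection just described can be sketched as follows. This is our own illustrative Python outline: the feature field names and the rescaling of s to [0, 1] (dividing by 2, since the raw value lies in [0, 2]) are assumptions, not details taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity(x, y):
    """s = s1 + s2 * s3 over the three feature groups, rescaled to [0, 1].
    Each instance is a dict with 'special', 'bow' and 'tf' feature vectors (assumed names)."""
    s1 = cosine(x["special"], y["special"])
    s2 = cosine(x["bow"], y["bow"])
    s3 = cosine(x["tf"], y["tf"])
    return (s1 + s2 * s3) / 2.0          # assumed rescaling: raw value lies in [0, 2]

def select_neighbors(query, training_set, threshold=0.5, min_threshold=1e-6):
    """Keep halving the threshold until at least one neighbor qualifies."""
    while threshold >= min_threshold:
        neighbors = [t for t in training_set if similarity(query, t) >= threshold]
        if neighbors:
            return neighbors
        threshold /= 2.0
    return list(training_set)            # degenerate case: nothing is similar at all
```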
2) Evaluation: To avoid ambiguity, the discounted cumulative gain (DCG) [12] is used to measure our classification results instead of Kendall's tau coefficient. DCG is a widely used measure for web search ranking algorithms. The premise of DCG is that highly relevant types ranked lower should be penalized. By calculating the graded relevance of a list, DCG measures the gain of a type based on its position in the result list. Suppose q_1 = {2 5 7 9} is the ideal ranking labeled by an expert and q_2 = {2 9 1 7 8 3 4 10 11 6 5} is the predicted result, and we want to measure the quality of q_2. Based on DCG, the relevance scores of q_1 are {0 1 0 0 0.5 0 0.75 0 0.25 0 0}, and this vector is viewed as the ground truth. From it we obtain the relevance scores of q_2 = {1 0.25 0 0.5 0 0 0 0.35 0 0 0.75}. The DCG score is then calculated as

  DCG = rel_1 + Σ_{i=2}^{n} rel_i / log_2 i,   (8)

where n is the number of types, which is 11 in our problem. Based on Eq. (8), we can compute the DCG scores of q_1 and q_2. Since DCG scores are influenced by the length of the list, a normalization is necessary. The normalized discounted cumulative gain (nDCG) of a query q is computed as

  nDCG_q = DCG_q / IDCG_q,   (9)

where IDCG is the DCG score of the ideal ranking, i.e., q_1 in our example. Note that for a perfect prediction the nDCG score equals 1; otherwise it lies in [0, 1], so a higher value means a more accurate prediction.
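A short Python sketch of Eqs. (8) and (9), written by us for illustration: it computes DCG for a predicted ranking given per-type relevance scores and normalizes by the DCG of the ideal ordering. The gain values in the toy example are assumptions for illustration, not the paper's exact ground truth.

```python
import math

def dcg(relevances):
    """Eq. (8): DCG = rel_1 + sum_{i>=2} rel_i / log2(i)."""
    return relevances[0] + sum(r / math.log2(i)
                               for i, r in enumerate(relevances[1:], start=2))

def ndcg(predicted_ranking, relevance_by_type):
    """Eq. (9): DCG of the predicted ranking divided by the ideal DCG."""
    gains = [relevance_by_type.get(t, 0.0) for t in predicted_ranking]
    ideal = sorted(relevance_by_type.values(), reverse=True)
    ideal += [0.0] * (len(gains) - len(ideal))     # pad to the same length
    return dcg(gains) / dcg(ideal)

# Toy example: the expert prefers type 2, then 5, 7, 9 (assumed graded gains).
relevance = {2: 1.0, 5: 0.75, 7: 0.5, 9: 0.25}
predicted = [2, 9, 1, 7, 8, 3, 4, 10, 11, 6, 5]
print(round(ndcg(predicted, relevance), 4))
```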

3) Classification Results: After training LR, NB, and SVM, we calculate the nDCG scores according to Eq. (8) and Eq. (9). The average values are listed in Table V.

TABLE V
COMPARISON RESULTS OF CLASSIFICATION MODELS

Algorithm | LR     | SVM    | NB
nDCG      | 0.6824 | 0.5589 | 0.6390

We can see that, among the three classifiers, LR shows the best performance and SVM performs the worst; that is, label ranking predicts the most accurate rankings for the unlabeled data. Since our data set is randomly crawled from Yahoo! Answers and covers every category, it is safe to conclude that LR is well suited to our problem of detecting users' intentions.

B. Classification Enhanced Question Retrieval

Based on the experimental results, we choose label ranking, the algorithm with the best classification performance, to improve question retrieval.

1) Parameter Selection: As Section IV-A introduces, the translation-based language model has two parameters, λ and β, whose values lie between 0 and 1. λ is the smoothing parameter of the original language model and β controls the weights of the language model and the translation model. Given the classification score INT_{q,u} and the retrieval score CON_{q,u}, we combine them into a general score for each related question. As Section IV-B introduces, the parameter γ determines the weight of the two parts. Because the two scores are on different scales, min-max normalization is used to obtain the new CON′_{q,u}:

  CON′_{q,u} = (log(1 + CON_{q,u}) − v_min) / (v_max − v_min),

where CON′_{q,u} and CON_{q,u} denote the new and old values, and v_max and v_min denote the maximum and minimum values of log(1 + CON_{q,u}), respectively.

2) Evaluation: Several question retrieval metrics have been proposed to measure the quality of related questions. We choose three of them: mean average precision (MAP), mean reciprocal rank (MRR), and precision at 10 (P@10).
MAP: the mean of the average precision over queries, where the average precision of a single query is the mean of the precision scores at each relevant item in the result list.
MRR: the mean of the reciprocal rank of the highest-ranked relevant document.
P@10: the mean precision over the first ten retrieved questions.
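These three metrics can be computed as in the following Python sketch (our own illustration); each ranked result list is represented by a list of 0/1 relevance judgments.

```python
def average_precision(rels):
    """Mean of the precision values at each relevant position in one ranked list."""
    hits, precisions = 0, []
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def map_mrr_p10(result_lists):
    """MAP, MRR and P@10 averaged over all queries; each list holds 0/1 relevance flags."""
    ap = [average_precision(r) for r in result_lists]
    rr = [next((1.0 / (i + 1) for i, x in enumerate(r) if x), 0.0) for r in result_lists]
    p10 = [sum(r[:10]) / 10.0 for r in result_lists]
    n = len(result_lists)
    return sum(ap) / n, sum(rr) / n, sum(p10) / n

# Two toy queries with 0/1 relevance judgments over their top results.
print(map_mrr_p10([[1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
                   [0, 1, 1, 1, 0, 0, 0, 0, 0, 0]]))
```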
3) Improved Question Retrieval Results: In Section V-C, we analyzed the distribution of questions over categories. Since the question retrieval measures MAP, MRR and P@10 require manual relevance labeling, we choose the 4 categories with the most eligible queries according to Fig. 2, namely Beauty & Style, Entertainment & Music, Family & Relationship and Pregnancy & Parenting, to show the performance. Tables VI, VII, VIII and IX show the comparison results, and Fig. 3 shows the MAP variation trend of the 4 categories.

[Fig. 3. MAP change with γ for the 4 categories: (a) B&S, (b) E&M, (c) F&R, (d) P&P; each panel compares the improved question retrieval with the baseline.]

For the Beauty & Style category, we can see that MAP decreases slightly at the beginning as γ increases, but as the weight of classification keeps increasing, the general performance improves, with the MAP peak at γ = 0.8. The P@10 value does not change much, and its maximum occurs at γ = 0.9. The MRR behaves in exactly the opposite way: its peak occurs in the early stages, at γ = 0 and 0.1. This suggests that the improved question retrieval model does generally improve the traditional question retrieval performance, although MRR decreases slightly.

For the Entertainment & Music category, the MAP trend resembles an arch, with the highest value at γ = 0.3. The maxima of P@10 and MRR occur at γ = 0 and γ = 0.8, 0.9, respectively. For the Family & Relationship category, MAP, P@10 and MRR show a similar trend: all of them grow as γ increases. Finally, for the Pregnancy & Parenting category, the maximum MAP occurs at γ = 0.2, and both P@10 and MRR perform better at the beginning of the γ range.

From the results of these 4 categories, we conclude that our proposed question retrieval model can indeed improve MAP, although the best performance of each category occurs at a different γ. For P@10 and MRR, the trends are less pronounced, but the general performance is also improved after adding intention analysis. Moreover, all of these models outperform the baseline dramatically.

VII. CONCLUSION

Question retrieval plays an important role in CQA services. In this paper, we propose an improved question retrieval model which detects users' intentions and combines them with traditional question retrieval. By classifying questions into several groups, the search scope is narrowed and users are better satisfied. In the classification part, we assume that each user may have more than one intention and that each question is assigned a list of labels, where the preference among the labels reflects their degrees of relevance. To predict the possible labels and the preferences among them, we propose to use a label ranking method, which shows the best performance compared with Naïve Bayes and Support Vector Machines. We combine our label ranking method with the translation-based language model. Experimental results reveal that our proposed model can indeed improve the translation-based language model significantly.

TABLE VI
IMPROVED QUESTION RETRIEVAL WITH LABEL RANKING FOR THE BEAUTY & STYLE CATEGORY

Metrics | γ=0    | γ=0.1  | γ=0.2  | γ=0.3  | γ=0.4  | γ=0.5  | γ=0.6  | γ=0.7  | γ=0.8  | γ=0.9  | Baseline
MAP     | 0.3007 | 0.3006 | 0.3003 | 0.2997 | 0.2998 | 0.3004 | 0.3015 | 0.3017 | 0.3051 | 0.305  | 0.2862
P@10    | 0.4763 | 0.4763 | 0.4763 | 0.4763 | 0.4763 | 0.4763 | 0.4737 | 0.4711 | 0.4737 | 0.4789 | 0.4263
MRR     | 0.8117 | 0.8117 | 0.8073 | 0.8073 | 0.8073 | 0.7941 | 0.7941 | 0.7941 | 0.808  | 0.8103 | 0.8082

TABLE VII
IMPROVED QUESTION RETRIEVAL WITH LABEL RANKING FOR THE ENTERTAINMENT & MUSIC CATEGORY

Metrics | γ=0    | γ=0.1  | γ=0.2  | γ=0.3  | γ=0.4  | γ=0.5  | γ=0.6  | γ=0.7  | γ=0.8  | γ=0.9  | Baseline
MAP     | 0.449  | 0.4564 | 0.4582 | 0.4584 | 0.4583 | 0.4583 | 0.4578 | 0.4566 | 0.4562 | 0.4536 | 0.4398
P@10    | 0.6583 | 0.6556 | 0.6556 | 0.6556 | 0.6556 | 0.6556 | 0.6528 | 0.65   | 0.65   | 0.6361 | 0.5667
MRR     | 0.8481 | 0.8319 | 0.8366 | 0.8366 | 0.8366 | 0.8366 | 0.8366 | 0.8366 | 0.8505 | 0.8505 | 0.8431

TABLE VIII
IMPROVED QUESTION RETRIEVAL WITH LABEL RANKING FOR THE FAMILY & RELATIONSHIP CATEGORY

Metrics | γ=0    | γ=0.1  | γ=0.2  | γ=0.3  | γ=0.4  | γ=0.5  | γ=0.6  | γ=0.7  | γ=0.8  | γ=0.9  | Baseline
MAP     | 0.2868 | 0.2869 | 0.2876 | 0.2879 | 0.2878 | 0.2884 | 0.2894 | 0.2921 | 0.2918 | 0.2942 | 0.2839
P@10    | 0.4682 | 0.4682 | 0.4705 | 0.4727 | 0.4727 | 0.4727 | 0.4705 | 0.4682 | 0.475  | 0.475  | 0.3841
MRR     | 0.6569 | 0.6569 | 0.6569 | 0.6569 | 0.6607 | 0.6607 | 0.6645 | 0.6634 | 0.6588 | 0.6702 | 0.6244

TABLE IX
IMPROVED QUESTION RETRIEVAL WITH LABEL RANKING FOR THE PREGNANCY & PARENTING CATEGORY

Metrics | γ=0    | γ=0.1  | γ=0.2  | γ=0.3  | γ=0.4  | γ=0.5  | γ=0.6  | γ=0.7  | γ=0.8  | γ=0.9  | Baseline
MAP     | 0.5294 | 0.5297 | 0.5306 | 0.5305 | 0.5303 | 0.5304 | 0.5279 | 0.5273 | 0.5261 | 0.5233 | 0.5071
P@10    | 0.6622 | 0.6622 | 0.6622 | 0.6622 | 0.6595 | 0.6595 | 0.6595 | 0.6568 | 0.6541 | 0.6486 | 0.6324
MRR     | 0.7854 | 0.7854 | 0.7854 | 0.7854 | 0.7854 | 0.7854 | 0.7853 | 0.7853 | 0.7718 | 0.7703 | 0.7839

REFERENCES

[1] A. Berger, R. Caruana, D. Cohn, D. Freitag, and V. Mittal. Bridging the lexical chasm: statistical approaches to answer-finding. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 192-199. ACM, 2000.
[2] J. Bian, Y. Liu, E. Agichtein, and H. Zha. Finding the right facts in the crowd: factoid question answering over social media. In Proceedings of the 17th International Conference on World Wide Web, pages 467-476. ACM, 2008.
[3] K. Brinker. Active learning of label ranking functions. In Proceedings of the Twenty-First International Conference on Machine Learning, page 17. ACM, 2004.
[4] X. Cao, G. Cong, B. Cui, and C. S. Jensen. A generalized framework of exploring category information for question retrieval in community question answer archives. In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 201-210, New York, NY, USA, 2010. ACM.
[5] B. Cestnik. Estimating probabilities: A crucial task in machine learning. In Proceedings of the Ninth European Conference on Artificial Intelligence, pages 147-149, 1990.
[6] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[7] W. Cheng, J. Huhn, and E. Hullermeier. Decision tree and instance-based learning for label ranking. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 161-168. ACM, 2009.
[8] O. Dekel, C. Manning, and Y. Singer. Log-linear models for label ranking. Advances in Neural Information Processing Systems, 16, 2003.
[9] H. Duan, Y. Cao, C. Lin, and Y. Yu. Searching questions by identifying question topic and question focus. In Proceedings of ACL, 2008.
[10] M. Hearst, S. Dumais, E. Osman, J. Platt, and B. Scholkopf. Support vector machines. IEEE Intelligent Systems and Their Applications, 13(4):18-28, 1998.
[11] E. Hullermeier, J. Furnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16-17):1897-1916, 2008.
[12] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, 2002.
[13] J. Jeon, W. Croft, and J. Lee. Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 84-90. ACM, 2005.
[14] J. Jeon, W. B. Croft, and J. H. Lee. Finding semantically similar questions based on their answers. In SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 617-618, New York, NY, USA, 2005. ACM.
[15] J. Jeon, W. B. Croft, and J. H. Lee. Finding similar questions in large question and answer archives. In CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 84-90, New York, NY, USA, 2005. ACM.
[16] M. Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81, 1938.
[17] X. Si, E. Chang, and M. Sun. Confucius and its intelligent disciples: Integrating social with search. Proceedings of the VLDB Endowment, 3(2), 2010.
[18] C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 100(3):441-471, 1987.
[19] S. Vembu and T. Gartner. Label ranking algorithms: A survey. In Preference Learning. Springer, 2009.
[20] K. Wang, Z. Ming, and T.-S. Chua. A syntactic tree matching approach to finding similar questions in community-based QA services. In SIGIR '09: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 187-194, New York, NY, USA, 2009. ACM.
[21] X. Xue, J. Jeon, and W. B. Croft. Retrieval models for question and answer archives. In SIGIR '08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 475-482, New York, NY, USA, 2008. ACM.