Improving Question Retrieval in Community Question Answering with Label Ranking

Wei Wang, Baichuan Li
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong

Irwin King
AT&T Labs Research, Florham Park, NJ, USA

Abstract

Community question answering (CQA) services, which provide a platform for people with diverse backgrounds to share information and knowledge, have become an increasingly popular research topic. Question retrieval (QR) in CQA automatically finds the most relevant historical questions that have already been answered well by other users. Current QR approaches typically concentrate on building diverse retrieval models but fail to analyze user intention. User intention, such as seeking facts, interacting with others, or seeking reasons, reflects what the user really wants to know; hence we integrate user intention analysis into QR. First, we classify questions into several types according to users' intentions. Each question can be assigned to more than one type, since askers may have several intentions. Moreover, there is naturally a preference among the possible question types: a more relevant type should be ranked higher than a less relevant one. We therefore propose to use label ranking for question classification. Label ranking is a machine learning task that aims to predict a ranking over all possible labels. Second, based on the result of question classification, we integrate user intention with translation-based language models to explore whether user intention helps to improve retrieval performance. We conduct extensive experiments on Yahoo! Answers data, and the results demonstrate that the proposed question retrieval model improves on the performance of the traditional question retrieval model.

I. INTRODUCTION

Community question answering (CQA) services have drawn increasing attention in recent years. They provide a free environment where people can voluntarily ask and answer questions. Unlike traditional automatic information retrieval, a CQA site allows peer-to-peer interaction, so answers from a large community can be more personalized and specific. In addition, these portals provide a question retrieval mechanism that returns a list of resolved historical questions. If the returned questions accurately reflect the user's intentions, the time needed to find answers decreases considerably.

Many methods have recently been applied to question retrieval [2], such as the vector space model [13], the language model [9], the Okapi model [13], and the translation model [1]. In addition, [13] compares the performance of these four retrieval models. Some researchers also propose to classify questions into several categories [17], using a multi-class SVM classifier to detect users' intentions. Instead of a single label, they allow each question to be assigned up to 5 labels with equal importance. However, the labels of a question differ in significance, and hence the number of labels is not fixed. In this sense, it is more reasonable to assign a ranked list of labels to each query, in which labels more relevant to the question are ranked higher than less relevant ones. Table I gives an example, where 1, 2 and 3 denote three question types: Yes/No, Reason and Opinion, respectively. Different questions show different preferences among those three types.

TABLE I
AN EXAMPLE OF QUESTION RETRIEVAL
Can culture difference make or break a relationship?
Whether culture difference can make a relationship?
Why culture difference can make a relationship?
Any opinions about culture difference and relationship?

Traditional multi-type classification cannot solve this problem because it gives the same significance to all types, and therefore cannot find the most similar questions effectively. There are two possible solutions. One is to assign each label a score, so that a ranked list shows the preference among the labels. However, this is impracticable in our question retrieval problem because scoring is difficult: when users are asked to assign a score to each label based on its relevance to the question, they can rarely assess it objectively. In contrast, it is much easier for them to tell which of two labels is better. In view of this, label ranking is applied to solve this particular multi-label classification problem. Label ranking is a machine learning task that aims to learn a mapping from instances to rankings over a finite set of predefined types. It concerns not only the types a question belongs to, but also their relative positions. By assigning a higher weight to some types, label ranking can help us find more similar questions on top of traditional question retrieval models.

Our contributions are three-fold:
1) We divide questions into several types to detect users' intentions through multi-label classification.
2) We are the first to use label ranking to analyze different types of questions.
3) We use multi-label classification to improve the performance of question retrieval in CQA services.

The rest of this paper is organized as follows. Section II discusses related work. Sections III and IV present the overall process of the proposed method, the details of label ranking, and the improved question retrieval method. Section V describes and analyzes the data set, and Section VI compares experimental results for the improved question retrieval method. Finally, Section VII concludes the paper.

II. RELATED WORK

Various retrieval methods have been applied to or proposed for question retrieval in CQA services. Jeon et al. [14], [15] compare four popular retrieval methods for question retrieval: the vector space model, the Okapi model, the language model, and the translation model. Their experimental results show that the translation model outperforms the others. Subsequently, Xue et al. [21] propose a translation-based language model that combines the translation model and the language model for question retrieval; similar results are observed, and the translation-based language model is shown to perform best. Recently, some studies exploit the category information of questions in CQA portals to propose category-based retrieval methods [4], [9], [20], which perform better than traditional retrieval models that ignore category information.

Label ranking [19], [8] is a complex prediction task that aims to learn a mapping from instances to rankings over a finite set of predefined labels. It can be viewed as a general classification method: since its result is a ranked list of labels, properly setting a threshold yields a multi-label classification, and the first label in the ranked list is the single-label prediction.

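The relationship between a label ranking and the two classification outputs can be illustrated with a minimal Python sketch. It assumes that the threshold is a cut-off on rank position, which is only one possible reading of the text; the function name `labels_from_ranking` and the example labels are hypothetical.

```python
from typing import List, Tuple


def labels_from_ranking(ranking: List[str], cutoff: int) -> Tuple[str, List[str]]:
    """Derive single-label and multi-label predictions from a label ranking.

    `ranking` lists labels from most to least relevant. `cutoff` is an
    assumed rank threshold: the first `cutoff` labels form the multi-label set.
    """
    if not ranking:
        raise ValueError("ranking must contain at least one label")
    if not 1 <= cutoff <= len(ranking):
        raise ValueError("cutoff must lie between 1 and len(ranking)")
    single_label = ranking[0]          # top-ranked label
    multi_labels = ranking[:cutoff]    # labels above the threshold
    return single_label, multi_labels


# Hypothetical ranking for one question, most relevant type first.
single, multi = labels_from_ranking(["opinion", "reason", "yes_no"], cutoff=2)
```

Keeping the full ranking and deriving both outputs from it means the same predictor can serve single-label, multi-label and ranking evaluations.
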
Many approaches to label ranking have been proposed recently. Log-linear models for label ranking [8] assume that each instance in the training data is associated with a list of preferences over the label set, and learn a ranking function that induces a total order over the entire set of labels. In ranking by pairwise comparison, a binary preference model is learned for each pair of labels [11].

III. QUESTION CLASSIFICATION

Fig. 1 describes the process of the proposed improved question retrieval model. When a user issues a query, question classification maps it to a ranked list of types. Based on this ranked list, a general score that includes the question retrieval score describes the relevance of two questions.

Fig. 1. Question and Answer Services

A. Question Analysis

Question classification aims to map plain questions to several types and thereby add constraints to question retrieval. We follow the question taxonomy proposed in [17], which contains 11 categories, as shown in Table II. The Confucius system proposed in [17] uses this classification system in real applications. Each question type describes an expected answer, which reflects the asker's intention. A targeted question retrieval should therefore concentrate on searching for similar questions within particular types.

TABLE II
TYPES OF QUESTIONS

Type  Purpose          Expected answers
REC   Recommendation   Suggested item and the reason
FAC   Seeking facts    Data, location, or name
YNO   Yes/No decision  Yes/No and reason
HOW   How-to question  The instructions
WHY   Seeking reason   The explanation
TSL   Translation      Translated text in target language
DEF   Definition       The definition of given entity
OPN   Seeking opinion  Opinions of other users
TRA   Transportation   Navigation instructions
INT   Interactive      Discussion thread
MAT   Math problem     Solution with steps

As described in the introduction, an important issue in the proposed model is the preference among different categories.
The labels describe the possible intentions of a particular question. The order of the ranked list reflects the similarity between the question and each type: higher-ranked types are more relevant than lower-ranked ones. The more related a type is to a query, the more likely it is that the type contains questions relevant to that query. Therefore, we want to compute not only the types (or labels) corresponding to a question, but also their order in the type list.

Question Classification: We are given a set of instances $X = \{x_i\}_{i=1}^{l+p}$, where $x_i \in \mathbb{R}^m$, with corresponding labels $Y = \{y_i\}_{i=1}^{l+p}$. Each $y_i$ stands for a permutation of all labels from $L$. We use $y_i(j)$, with $j \in \{1, \ldots, n\}$, to denote one single type; it indicates the preference among the different labels. Here $n$ is the number of classes and $l + p$ is the number of instances, where $l$ is the number of labeled instances and $p$ is the number of instances with a partial order or incomplete data, which will be introduced in Section III-B.

Question information analysis is a process of keyword retrieval, since the extracted keywords are regarded as the important question topics and used in document retrieval. Document or question retrieval is rather mature in much prior work. The keywords extracted by retrieval services are usually notional words, such as nouns, verbs or some phrases. However, questions posted in CQA services are relatively complete sentences that include not only notional words but also many function words, such as what, how and where. These function words give us information about the asker's latent intentions, but they also increase the difficulty of question retrieval. In the next subsections, we use several machine learning algorithms to address question classification in CQA.

B. Instance Based Label Ranking

Label ranking [7], [3] is the task of inferring a total order over a predefined set of types for each unlabeled instance; it focuses not only on the multiple labels in the learning process, but also on the relative order among them. It can be viewed as a natural generalization of traditional classification. In our question classification problem, each new question is assigned a list of question types, which motivates us to use a label ranking algorithm. Among the label ranking methods, such as [8], [11], [19], we choose an instance-based learning algorithm to train the classifier. Instance-based learning is a family of algorithms that compare each new instance with the training data stored in memory. After a real-time judgement is made, the new instance can also be stored in the training set. Compared with other machine learning algorithms, instance-based algorithms have two advantages. First, training does not need much training data, which is expensive to obtain in CQA services. Second, they are well suited to continuously changing data sets.
Whereas some other machine learning algorithms have to retrain the entire model on the whole data set when a new instance is introduced, an instance-based algorithm only needs to add the instance to the training set.

To evaluate the predictive performance of the mapping function, a suitable loss function on $Y$ is necessary. Various measures can be used to compute this distance; here we select a popular one from statistics, based on Kendall's tau rank coefficient [16], [18]. Suppose $y$ and $z$ are two rankings; we define the distance between them as

$$D(y, z) = \#\{(i, j) : i < j,\ (y(i) - y(j))(z(i) - z(j)) < 0\}. \tag{1}$$

$D(y, z)$ equals 0 if the two lists are the same and $n(n - 1)/2$ (where $n$ is the list size) if one list is the reverse of the other. The Kendall tau distance is often normalized to $[0, 1]$, since it can then be interpreted as a correlation measure. After normalization, $D(y, z) = 0$ if and only if every pair of labels $i$ and $j$ is in the same order in both rankings, and $D(y, z) = 1$ if and only if every pair is in the opposite order. The Kendall tau distance can be viewed as the number of moves between connected vertices of the polytope. A useful property of this distance is right invariance: the distance does not depend on an arbitrary relabeling of the $n$ labels.

The Mallows model [7], which belongs to the exponential family, has been used to solve the label ranking problem. Given the model parameters $z$ and $\theta$, the probability of $y$ is

$$P(y \mid \theta, z) = \frac{\exp(-\theta D(y, z))}{\phi(\theta, z)}, \tag{2}$$

where $\phi(\theta, z) = \sum_{y \in Y} \exp(-\theta D(y, z))$ is a normalization constant, $z$ is the mode or center ranking of the distribution, and $\theta \geq 0$ is the dispersion, that is, how closely the data cluster around the center.

The simplest case of our ranking problem is full ranking: given a set of instances $x_1, \ldots, x_k \in X$ and a complete ranking set $Y$, all the rankings in $Y$ are full permutations.

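To make Eq. (1) and Eq. (2) concrete, the following Python sketch computes the Kendall distance and a Mallows probability. It is a hedged illustration rather than the authors' implementation: rankings are assumed to be position vectors (`y[i]` is the rank of label `i`, using ranks 1 to n), the function names are hypothetical, and the normalization constant is obtained by brute-force enumeration of all permutations, which is feasible only for a small number of labels.

```python
import math
from itertools import combinations, permutations
from typing import Sequence


def kendall_distance(y: Sequence[int], z: Sequence[int]) -> int:
    """Eq. (1): count label pairs (i, j), i < j, that y and z order differently."""
    if len(y) != len(z):
        raise ValueError("rankings must have the same length")
    return sum(
        1
        for i, j in combinations(range(len(y)), 2)
        if (y[i] - y[j]) * (z[i] - z[j]) < 0
    )


def normalized_kendall_distance(y: Sequence[int], z: Sequence[int]) -> float:
    """Kendall distance divided by its maximum n(n - 1) / 2."""
    n = len(y)
    if n < 2:
        return 0.0
    return kendall_distance(y, z) / (n * (n - 1) / 2)


def mallows_probability(y: Sequence[int], z: Sequence[int], theta: float) -> float:
    """Eq. (2): P(y | theta, z) with center ranking z and dispersion theta >= 0."""
    if theta < 0:
        raise ValueError("theta must be non-negative")
    # Normalization constant: sum over every full ranking of the n labels
    # (assumes ranks are the integers 1..n).
    phi = sum(
        math.exp(-theta * kendall_distance(p, z))
        for p in permutations(range(1, len(z) + 1))
    )
    return math.exp(-theta * kendall_distance(y, z)) / phi


center = [1, 2, 3]     # hypothetical center ranking z
reverse = [3, 2, 1]    # the reversed ranking
```

With `theta = 0` every ranking is equally likely; larger values of `theta` concentrate the probability mass on rankings close to the center.
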
For each new instance, we first find a set of similar instances and assume that these neighbors have, at least approximately, the same distribution. By further assuming that they are independent, the probability of the observations given the parameters $z$ and $\theta$ becomes

$$P(y \mid \theta, z) = \prod_{i=1}^{k} P(y_i \mid \theta, z) = \prod_{i=1}^{k} \frac{\exp(-\theta D(y_i, z))}{\phi(\theta, z)} = \frac{\exp\left(-\theta \sum_{i=1}^{k} D(y_i, z)\right)}{\left(\prod_{j=1}^{n} \frac{1 - \exp(-j\theta)}{1 - \exp(-\theta)}\right)^{k}}, \tag{3}$$

where $z$ is the parameter to be estimated and output as the predicted ranking. We can use maximum likelihood estimation (MLE) to solve this problem.

However, a more common situation is that of incomplete observations, i.e., partial orders over the labels, where we cannot obtain the whole preference order. For each incomplete ranking $y$, we use $E(y)$ to denote the set of all permutations that do not conflict with $y$. This often happens in CQA services, since it is unreasonable to assign a full ranking of labels to each question; we usually focus on the most probable labels within the whole ranking. By further assuming independence of the observations, the probability of $y$ given the neighbors $\{(x_i, y_i)\}_{i=1}^{k}$ with parameters $z$ and $\theta$ becomes

$$P(y \mid \theta, z) = \prod_{i=1}^{k} P(E(y_i) \mid \theta, z) = \prod_{i=1}^{k} \sum_{y' \in E(y_i)} P(y' \mid \theta, z) = \frac{\prod_{i=1}^{k} \sum_{y' \in E(y_i)} \exp(-\theta D(y', z))}{\left(\prod_{j=1}^{n} \frac{1 - \exp(-j\theta)}{1 - \exp(-\theta)}\right)^{k}}. \tag{4}$$

Because the parameters $z$ and $\theta$ are difficult to derive in this case, we cannot use MLE directly to estimate them. We therefore resort to the expectation-maximization (EM) algorithm to find them iteratively. Starting from an initial center ranking $z \in Y$, the label information of the incomplete data is estimated by comparing the distance between each possible extension and the center $z$ (E-step). In the M-step, the new center $\hat{z}$ of the distribution is computed. The two steps are repeated until the center almost no longer changes. The final center is output as the predicted ranking for the query $x$.

Algorithm 1 Label ranking
Inputs: query x ∈ X, training data T, integer k or threshold φ
Outputs: label ranking estimate for x
 1: find the k nearest neighbors of x in T, or the instances in T whose similarity to x is larger than φ
 2: get the neighbors' rankings σ = {σ_1, ..., σ_k}
 3: calculate the initial center estimate π̂ from σ
 4: for every ranking σ_i in σ do
 5:   if σ_i is incomplete then
 6:     replace it with the most probable ranking given π̂
 7:   end if
 8: end for
 9: calculate π from the new σ
10: if π ≠ π̂ then
11:   π̂ ← π
12:   go to step 4
13: else
14:   return π̂
15: end if

C. Some Other Learning Algorithms

Besides the label ranking algorithm, we have tried two other machine learning algorithms: Naïve Bayes (NB) [5] and Support Vector Machine (SVM). NB often outperforms more sophisticated classification methods. Considering a data set with $n$ classes, we first calculate the prior probability of each class. Since instances are well clustered, it is reasonable to assume that the more instances of one class lie in the vicinity of a new case $x$, the more likely it is that $x$ belongs to that class. From this view, the final classification is produced by Bayes' rule. Then, for each instance, all labels are ordered by their predicted class probabilities.

Besides NB, we also apply the SVM algorithm [10] to train the classifier. As briefly introduced in Section III-A, each question can be assigned more than one class; we therefore choose the one-against-rest strategy to train the classifier. It requires building $L$ binary classifiers, $\{f_1, f_2, \ldots, f_L\}$, where $L$ is the number of classes.
When training classifier $f_i$, questions that belong to class $i$ are treated as positive instances and all other instances as negative instances. For each new instance, we calculate the scores of the $L$ classifiers and order the labels by these scores. LIBSVM is applied to solve the multi-label classification and to give the predicted label information for new questions.

IV. IMPROVED QUESTION RETRIEVAL USING CLASSIFICATION

In our improved question retrieval model, the similarity of two questions involves two parts: intention relevance and content relevance. The intention relevance describes their types. Applying Eq. (1), we define the intention relevance score $INT_{q,u}$ between two questions as

$$INT_{q,u} = 1 - D(q, u).$$

The content relevance refers to their concrete words, such as nouns, verbs and adjectives, and can be calculated by question retrieval models.

A. Question Retrieval Models

Various retrieval models have been proposed recently. In this paper, we use the state-of-the-art model, the translation-based language model [21], which combines the advantages of the language model [14] and the translation model [15]. Given a query $q$ and any question $u$, the score $CON_{q,u}$ is calculated as

$$CON_{q,u} = \prod_{w \in q} \Big( (1 - \lambda)\big(\beta \sum_{w' \in u} T(w, w') P(w' \mid u) + (1 - \beta) P(w \mid u)\big) + \lambda P(w \mid C) \Big), \tag{5}$$

where

$$P(w \mid u) = \frac{f_{w,u}}{\sum_{w' \in u} f_{w',u}} \quad \text{and} \quad P(w \mid C) = \frac{f_{w,C}}{\sum_{w' \in C} f_{w',C}}. \tag{6}$$

Here $f_{w,u}$ is the frequency of word $w$ in question $u$ and $f_{w,C}$ is its frequency in the collection $C$. $T(w, w')$ denotes the probability that word $w$ is the translation of word $w'$, and $\beta$ controls the trade-off between the language model and the translation model.

B. Improved Question Retrieval Model

Besides content relevance, intention relevance describes the types of the two questions. As Table II shows, questions from the same category reflect similar intentions of their askers. By analyzing the question type, we can detect what the user really wants to know, and questions can therefore be mapped from a bag of words into a semantic space.
After classification, each question is given a list of labels whose order describes their degree of relevance. We can then calculate the intention similarity between every pair of questions. After obtaining the content relevance $CON_{q,u}$ and the intention relevance $INT_{q,u}$, we calculate the general relevance of the query $q$ and any historical question $u$. Because their value scales differ, $CON_{q,u}$ and $INT_{q,u}$ are normalized, and the general score is

$$S_{q,u} = \gamma\, CON_{q,u} + (1 - \gamma)\, INT_{q,u}, \tag{7}$$

where $\gamma$ is the combination weight that balances content relevance and intention relevance.

V. EXPERIMENTAL SETUP

In this section, we analyze our experimental data set and describe the experimental design used to show the performance of the improved question retrieval model.

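Before the experimental details, the scoring rule of Eq. (7) can be made concrete with a small Python sketch. It assumes that both score lists are rescaled with plain min-max normalization over the candidate questions of a single query; the helper names and the example numbers are hypothetical and are not taken from the experiments.

```python
from typing import List


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_scores(con: List[float], intent: List[float], gamma: float) -> List[float]:
    """Eq. (7): S = gamma * CON + (1 - gamma) * INT on normalized scores."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(con) != len(intent):
        raise ValueError("score lists must have the same length")
    con_n, int_n = min_max(con), min_max(intent)
    return [gamma * c + (1.0 - gamma) * i for c, i in zip(con_n, int_n)]


# Hypothetical content and intention scores for three candidate questions.
con_scores = [0.2, 0.8, 0.5]
int_scores = [0.9, 0.1, 0.5]
scores = combined_scores(con_scores, int_scores, gamma=0.8)
# Candidate indices ordered from most to least relevant.
order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
```

Setting `gamma` to 1 ignores the intention score and leaves only the content score, so it is the natural reference point when varying the weight.
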

A. Experimental Objective

Traditional retrieval models mainly focus on finding relevant historical questions for a query in a question pool. However, questions can be naturally divided into several types, such as recommendation and opinion. Different types reflect different user intentions, and questions of the same type are likely to have the highest similarity. We therefore separate the pool of questions into several subgroups, with the aim of increasing the efficiency and accuracy of traditional question retrieval models.

B. Experimental Design

Following the objective in Section V-A, we design three experiments to demonstrate our idea.
1) We analyze our data set crawled from the Yahoo! Answers website. The classification system of the Confucius system [17] is applied in our work. We assume that each query is assigned multiple labels and that the number of labels is not fixed. These results are shown in Section V-C.
2) In the classification part, we apply three machine learning algorithms (Label Ranking, Naïve Bayes, and Support Vector Machine). To compare the three algorithms, we divide the whole question set into two parts of equal size, a training set and a test set. The classifier with the best performance is then used in the third experiment. These results are shown in Section VI-A.3.
3) We compare the results of our improved question retrieval model with those of the traditional question retrieval model. After normalization, the combined score is taken as the question's general score. These results are shown in Section VI-B.3.

C. Data Set

The questions on Yahoo! Answers are divided into 26 general categories by the website. Based on this, our crawling job is divided into three steps. First, we find the most popular queries: the Yahoo! Answers API provides the get by category function, which finds the popular queries in each category and returns their URLs. Second, based on those URLs, we use get question to fetch the details of the corresponding questions. Finally, the question subjects returned in the second step are used as the search terms in the question search function. We thus obtain a list of related questions for each query, which serves as the baseline of our improved retrieval model. In our experiment, we collected the top 50 interesting queries for each category and the top 50 related questions for each query.

Another important issue is the quality of queries. Sometimes, when we search for a query in a CQA service, the system reports that it cannot find any results, meaning that no relevant questions exist in the historical question pool. In our experiment, we count the queries that have more than 10 related questions. The statistics are shown in Fig. 2.

Fig. 2. Number of Questions in Each Category

As briefly introduced in Section III-A, we use the system in Table II for classification. To verify it, 50 questions from each category are manually assigned to question types according to the classification system. To obtain reliable labels, three experienced persons carry out the labeling task. The statistics are shown in Table III. Nearly every question is assigned to more than one type, and one question has 6 labels. On average, each question has 3 or 4 labels. This supports our assumption that each question should be assigned multiple labels and that the number of labels varies across queries.

TABLE III
THE NUMBER OF LABELS

We then conduct another statistical analysis to show the average number of types for each question. This result is regarded as the ground truth of our question classification.
Considering the multi-type questions, each question is assigned a list of labels that describes the preference among them. We use the Borda count to calculate this distribution. The Borda count is a relatively simple ranked voting method: given a ranked list of n question types, the top-ranked type gets n votes, the second gets n − 1 votes, and so on. In this way, we can calculate the overall votes from each ranking list. The statistics are listed in Table IV. The opinion type contains the most questions, followed by the interactive type. Questions asking for translation or transportation account for the smallest shares.

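The Borda count described above can be sketched in a few lines of Python. This is only an illustration: it assumes that n is the length of each question's own ranked list, the function name `borda_votes` is hypothetical, and the three example rankings are invented rather than taken from the labeled data.

```python
from collections import Counter
from typing import Iterable, List


def borda_votes(rankings: Iterable[List[str]]) -> Counter:
    """Accumulate Borda votes: in a list of n types, rank r (0-based) earns n - r votes."""
    votes: Counter = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            votes[label] += n - position
    return votes


# Invented ranked label lists for three questions (type codes from Table II).
example_rankings = [
    ["OPN", "INT", "WHY"],
    ["OPN", "REC"],
    ["INT", "OPN", "YNO", "FAC"],
]
votes = borda_votes(example_rankings)
```

Summing the votes per type over all labeled questions gives a distribution in which types that are ranked highly, and ranked often, receive the largest totals.
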

TABLE IV
QUESTION DISTRIBUTION
category  REC  FAC  YNO  HOW  WHY  TSL
number
category  DEF  OPN  TRA  INT  MAT
number

D. Question Feature

To avoid bias from different classifiers, we use the same question feature system in the three classification experiments. Since the purpose of this paper is to detect askers' intentions, we use two groups of features:
1) Bag of words. Each query or question consists of single words and symbols. The word dictionary is built from the whole data set we crawled from the Yahoo! Answers website. In this sense, each query is represented by a term frequency vector.
2) Special interrogatives and particular words. To emphasize the significance of some special words in our detection task, we also extract various interrogatives and particular words. For example, sentences starting with words such as do, can, did or does probably belong to the Yes/No type. Sentences including most or best may be asking for recommendations.

VI. EXPERIMENTAL RESULT AND COMMENTS

In this section, we show the experimental results of question classification and of the improved question retrieval. Questions are labeled manually by three experts and divided into a training set and a test set of equal size.

A. Question Classification

To evaluate the classification performance, we compare three algorithms: Label Ranking, Naïve Bayes, and Support Vector Machines.

1) Parameter Selection: Both label ranking and Naïve Bayes are instance-based algorithms; therefore, for each new instance we select neighbors that share similar features. As Section V-D introduces, question features can be divided into three parts: particular words $s_1$, the bag of words $s_2$, and the frequency of words $s_3$. For each feature pair, we use cosine similarity to measure the similarity between two instances. $s_2$ describes the content of a question and $s_3$ describes the frequency with which those words appear in $s_2$.
Therefore, two questions are likely to be similar only when they score highly in both $s_2$ and $s_3$. The final similarity measure is $s = s_1 + s_2 s_3$, and $s$ is finally normalized to lie between 0 and 1.

Another important issue is the number of neighbors in each iteration. The basic criterion is that the higher the similarity, the better the candidate neighbor. In our experiment, we use a similarity threshold as the criterion. The threshold is initially set to 0.5; if no instance meets it, the threshold is halved, and so on, until some neighbors are found.

For the support vector machine, we use LibSVM [6] to train the classifier. To compare with label ranking, we want a permutation sorted by the probabilities of the different types. Since there are 11 types in our classification system, 11 models are trained to predict the types. For each new instance, we thus obtain 11 probabilities, one per type, and the list of types sorted by them. We use the RBF kernel $e^{-\gamma \|u - v\|^2}$ to train the classifier, with $\gamma$ set to $1/k$, where $k$ is the number of features. The final result is listed in Table V.

2) Evaluation: To avoid ambiguity, the discounted cumulative gain (DCG) [12] is used instead of Kendall's tau coefficient to measure our classification results. DCG is a widely used measure for web search engine algorithms. Its premise is that highly relevant types ranked lower should be penalized. Using the graded relevance of a list, DCG measures the gain of a type based on its position in the result list. Suppose q_1 = { } is the ideal ranking labeled by an expert and q_2 = { } is the predicted result, and we want to measure the quality of q_2. Based on the theory of DCG, the relevance score of q_1 is { }, which is regarded as the ground truth. We can then obtain the relevance score of q_2 = { }. The DCG score can be calculated by Eq. (8).
$$DCG = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2 i}, \tag{8}$$

where $n$ is the number of types; in our problem $n$ is predefined as 11. Based on Eq. (8), we obtain the DCG scores of q_1 and q_2, respectively. Because DCG scores are influenced by the length of the query, normalization is necessary. The normalized discounted cumulative gain (nDCG) of a query $q$ is computed as

$$nDCG_q = \frac{DCG_q}{IDCG_q}, \tag{9}$$

where IDCG is the DCG score of the ideal ranking, i.e., q_1 in our example. For a perfect prediction, the nDCG score equals 1; otherwise, its value lies in [0, 1]. A higher value thus means a more accurate prediction.

3) Classification Result: After training LR, NB and SVM, we calculate the nDCG score according to Eq. (8) and Eq. (9). The average value is listed in Table V.

TABLE V
COMPARISON RESULTS OF CLASSIFICATION MODELS
Algorithm  LR  SVM  NB
NDCG

Among the three classifiers, LR shows the best performance and SVM the worst. That means label ranking predicts the most accurate rankings for the unlabeled data. Since our data set is randomly crawled from Yahoo! Answers and covers every category, it is safe to conclude that LR is quite suitable for our problem of detecting users' intentions.

B. Classification Enhanced Question Retrieval

Based on the experimental results, we choose label ranking, the best-performing classification algorithm, to improve question retrieval.

1) Parameter Selection: As Section IV-A introduces, the translation-based language model has two parameters, $\lambda$ and $\beta$, whose values are set between 0 and 1. $\lambda$ is a smoothing parameter from the original language model, and $\beta$ controls the weights of the language model and the translation model. Given the classification score $INT_{q,u}$ and the question score $CON_{q,u}$, we combine them into a general score for each related question. As Section IV-B introduces, the parameter $\gamma$ determines the weights of the two parts. Because the two scores are measured on different scales, min-max normalization is used to obtain the new $CON'_{q,u}$:

$$CON'_{q,u} = \frac{\log(1 + CON_{q,u}) - v_{\min}}{v_{\max} - v_{\min}},$$

where $CON'_{q,u}$ and $CON_{q,u}$ denote the new and old values, and $v_{\max}$ and $v_{\min}$ denote the maximum and minimum values of $\log(1 + CON_{q,u})$, respectively.

2) Evaluation: Several question retrieval metrics have been proposed to measure the quality of related questions. We choose three of them: mean average precision (MAP), mean reciprocal rank (MRR), and Precision at 10.
MAP: the mean of the average precision, where the average precision of a single query is the mean of the precision scores at each relevant item returned in the result list.
MRR: the mean of the reciprocal rank of the highest-ranking relevant document.
Precision at 10: the mean of the precision of the first ten questions retrieved.

3) Improved Question Retrieval Results: In Section V-C, we analyze the distribution of each category. Since the question retrieval measures, MAP, MRR and Precision at 10, require manual labeling, we choose the 4 categories which have the most eligible queries based on Fig.
2, which are Beauty & Style, Entertainment & Music, Family & Relationship, and Pregnancy & Parenting, to show the performance. Tables VI, VII, VIII, and IX show the comparison results. We also show the MAP variation trend of the 4 categories in Fig. 3.

Fig. 3. MAP change based on 4 categories: (a) B&S, (b) E&M, (c) F&R, (d) P&P. Each panel plots MAP value against gamma value for improved question retrieval and the baseline.

For the Beauty & Style category, MAP decreases slightly as γ first increases; however, as the weight of classification continues to increase, the general performance gets better. The peak of MAP occurs when γ = 0.8. The P@10 value does not change noticeably, and its maximum occurs when γ = 0.9. The MRR situation is exactly the opposite: its peak value occurs in the early stages, when γ = 0 and 0.1. This suggests that the improved question retrieval model does improve traditional question retrieval performance in general, although MRR decreases slightly.

For the Entertainment & Music category, the MAP trend resembles an arch. The highest value occurs when γ = 0.3. The maximum values of P@10 and MRR occur when γ = 0 and γ = 0.8, 0.9, respectively.

For the Family & Relationship category, MAP, P@10, and MRR show a similar trend: all of these measures grow with increasing γ.

Finally, for the Pregnancy & Parenting category, the maximum MAP occurs when γ = 0.2. Both P@10 and MRR perform better at the beginning of increasing γ.

From the results on these 4 categories, we can conclude that our proposed question retrieval model can indeed improve MAP performance, though the best performance of each category occurs at a different value of γ.
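The normalization of Section VI-B and the three retrieval measures can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: all function names are hypothetical, the linear interpolation γ·INT + (1 − γ)·CON′ is an assumed form of the combination (Section IV-B defines the actual one), and the toy scores and relevance labels are invented for illustration.

```python
import math
from typing import Dict, List, Sequence


def min_max_log(con_scores: Sequence[float]) -> List[float]:
    """Min-max normalize log(1 + CON) over one candidate list."""
    logged = [math.log1p(s) for s in con_scores]
    v_min, v_max = min(logged), max(logged)
    if v_max == v_min:  # all candidates tie; avoid division by zero
        return [0.0 for _ in logged]
    return [(v - v_min) / (v_max - v_min) for v in logged]


def combine(int_scores: Sequence[float], con_scores: Sequence[float],
            gamma: float) -> List[float]:
    """ASSUMPTION: linear interpolation weighted by gamma (hypothetical form)."""
    con_norm = min_max_log(con_scores)
    return [gamma * i + (1.0 - gamma) * c for i, c in zip(int_scores, con_norm)]


def average_precision(relevant: Sequence[bool]) -> float:
    """Mean of the precision values at each relevant item in the ranked list."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(relevant: Sequence[bool]) -> float:
    """1 / rank of the highest-ranking relevant item, or 0 if there is none."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def precision_at_k(relevant: Sequence[bool], k: int = 10) -> float:
    """Fraction of the first k retrieved items that are relevant."""
    return sum(1 for rel in relevant[:k] if rel) / k


def evaluate(runs: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """Average the per-query measures over all queries (one ranked list each)."""
    n = len(runs)
    return {
        "MAP": sum(average_precision(r) for r in runs) / n,
        "MRR": sum(reciprocal_rank(r) for r in runs) / n,
        "P@10": sum(precision_at_k(r, 10) for r in runs) / n,
    }


if __name__ == "__main__":
    # Toy candidates for a single query; values are invented for illustration.
    int_scores = [0.9, 0.2, 0.6]
    con_scores = [3.0, 10.0, 0.5]
    labels = [True, False, True]
    for gamma in (0.0, 0.5, 0.9):
        scores = combine(int_scores, con_scores, gamma)
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        print(gamma, evaluate([[labels[j] for j in order]]))
```

Sweeping γ in this way and re-ranking the candidates for each value mirrors how the per-category trends are read off: each γ yields one ranked list per query, and the three measures are averaged over the queries of a category.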
For P@10 and MRR, the trends are not obvious, but the general performance improves after adding intention analysis. Besides, all those models outperform the baseline dramatically.

TABLE VI
IMPROVED QUESTION RETRIEVAL WITH LABEL RANKING FOR BEAUTY & STYLE CATEGORY
Metrics    γ = 0    γ = 0.1    γ = 0.2    γ = 0.3    γ = 0.4    γ = 0.5    γ = 0.6    γ = 0.7    γ = 0.8    γ = 0.9    Baseline
MAP
MRR

TABLE VII
IMPROVED QUESTION RETRIEVAL WITH LABEL RANKING FOR ENTERTAINMENT & MUSIC CATEGORY
Metrics    γ = 0    γ = 0.1    γ = 0.2    γ = 0.3    γ = 0.4    γ = 0.5    γ = 0.6    γ = 0.7    γ = 0.8    γ = 0.9    Baseline
MAP
MRR

TABLE VIII
IMPROVED QUESTION RETRIEVAL WITH LABEL RANKING FOR FAMILY & RELATIONSHIP CATEGORY
Metrics    γ = 0    γ = 0.1    γ = 0.2    γ = 0.3    γ = 0.4    γ = 0.5    γ = 0.6    γ = 0.7    γ = 0.8    γ = 0.9    Baseline
MAP
MRR

TABLE IX
IMPROVED QUESTION RETRIEVAL WITH LABEL RANKING FOR PREGNANCY & PARENTING CATEGORY
Metrics    γ = 0    γ = 0.1    γ = 0.2    γ = 0.3    γ = 0.4    γ = 0.5    γ = 0.6    γ = 0.7    γ = 0.8    γ = 0.9    Baseline
MAP
MRR

VII. CONCLUSION

Question retrieval plays an important role in CQA services. In this paper, we propose an improved question retrieval model that augments traditional question retrieval with detection of users' intentions. By classifying questions into several groups, the search scope is narrowed and users obtain more satisfying results. In the classification part, we assume that each user may have more than one intention and that each question is assigned a list of labels. The preference order among those labels reflects their degree of relevance. To predict the possible labels and the preference among them, we propose to use a label ranking method, which shows the best performance compared with Naïve Bayes and Support Vector Machine. We combine our label ranking method with the translation-based language model. Experimental results reveal that our proposed model can indeed improve the translation-based language model significantly.


More information

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three

More information

An Evaluation of Machine Learning-based Methods for Detection of Phishing Sites

An Evaluation of Machine Learning-based Methods for Detection of Phishing Sites An Evaluation of Machine Learning-based Methods for Detection of Phishing Sites Daisuke Miyamoto, Hiroaki Hazeyama, and Youki Kadobayashi Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma,

More information

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment 2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. ahmad@abdulkader.org

More information

A Classification-based Approach to Question Answering in Discussion Boards

A Classification-based Approach to Question Answering in Discussion Boards A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong and Brian D. Davison Department of Computer Science and Engineering Lehigh University Bethlehem, PA 18015 USA {lih307,davison}@cse.lehigh.edu

More information

ENHANCED WEB IMAGE RE-RANKING USING SEMANTIC SIGNATURES

ENHANCED WEB IMAGE RE-RANKING USING SEMANTIC SIGNATURES International Journal of Computer Engineering & Technology (IJCET) Volume 7, Issue 2, March-April 2016, pp. 24 29, Article ID: IJCET_07_02_003 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=7&itype=2

More information

Fast Matching of Binary Features

Fast Matching of Binary Features Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been

More information

Introduction to Statistical Machine Learning

Introduction to Statistical Machine Learning CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,

More information

Less naive Bayes spam detection

Less naive Bayes spam detection Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems

More information

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD. Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.

More information

Mining Signatures in Healthcare Data Based on Event Sequences and its Applications

Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Siddhanth Gokarapu 1, J. Laxmi Narayana 2 1 Student, Computer Science & Engineering-Department, JNTU Hyderabad India 1

More information

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

More information

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

TRTML - A Tripleset Recommendation Tool based on Supervised Learning Algorithms

TRTML - A Tripleset Recommendation Tool based on Supervised Learning Algorithms TRTML - A Tripleset Recommendation Tool based on Supervised Learning Algorithms Alexander Arturo Mera Caraballo 1, Narciso Moura Arruda Júnior 2, Bernardo Pereira Nunes 1, Giseli Rabello Lopes 1, Marco

More information

System Behavior Analysis by Machine Learning

System Behavior Analysis by Machine Learning CSC456 OS Survey Yuncheng Li raingomm@gmail.com December 6, 2012 Table of contents 1 Motivation Background 2 3 4 Table of Contents Motivation Background 1 Motivation Background 2 3 4 Scenarios Motivation

More information

Forecasting stock markets with Twitter

Forecasting stock markets with Twitter Forecasting stock markets with Twitter Argimiro Arratia argimiro@lsi.upc.edu Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013,

More information

Automatically Selecting Answer Templates to Respond to Customer Emails

Automatically Selecting Answer Templates to Respond to Customer Emails Automatically Selecting Answer Templates to Respond to Customer Emails Rahul Malik, L. Venkata Subramaniam+ and Saroj Kaushik Dept. of Computer Science and Engineering, Indian Institute of Technology,

More information

On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services

On the Feasibility of Answer Suggestion for Advice-seeking Community Questions about Government Services 21st International Congress on Modelling and Simulation, Gold Coast, Australia, 29 Nov to 4 Dec 2015 www.mssanz.org.au/modsim2015 On the Feasibility of Answer Suggestion for Advice-seeking Community Questions

More information

Web Page Categorization based on Document Structure

Web Page Categorization based on Document Structure 1 Web Page Categorization based on Document Structure Arul Prakash Asirvatham arul@gdit.iiit.net Kranthi Kumar. Ravi kranthi@gdit.iiit.net Centre for Visual Information Technology International Institute

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Dynamics of Genre and Domain Intents

Dynamics of Genre and Domain Intents Dynamics of Genre and Domain Intents Shanu Sushmita, Benjamin Piwowarski, and Mounia Lalmas University of Glasgow {shanu,bpiwowar,mounia}@dcs.gla.ac.uk Abstract. As the type of content available on the

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

Construction Algorithms for Index Model Based on Web Page Classification

Construction Algorithms for Index Model Based on Web Page Classification Journal of Computational Information Systems 10: 2 (2014) 655 664 Available at http://www.jofcis.com Construction Algorithms for Index Model Based on Web Page Classification Yangjie ZHANG 1,2,, Chungang

More information

Client Perspective Based Documentation Related Over Query Outcomes from Numerous Web Databases

Client Perspective Based Documentation Related Over Query Outcomes from Numerous Web Databases Beyond Limits...Volume: 2 Issue: 2 International Journal Of Advance Innovations, Thoughts & Ideas Client Perspective Based Documentation Related Over Query Outcomes from Numerous Web Databases B. Santhosh

More information

Design call center management system of e-commerce based on BP neural network and multifractal

Design call center management system of e-commerce based on BP neural network and multifractal Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(6):951-956 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Design call center management system of e-commerce

More information

SEARCHING QUESTION AND ANSWER ARCHIVES

SEARCHING QUESTION AND ANSWER ARCHIVES SEARCHING QUESTION AND ANSWER ARCHIVES A Dissertation Presented by JIWOON JEON Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for

More information

Support Vector Machines for Dynamic Biometric Handwriting Classification

Support Vector Machines for Dynamic Biometric Handwriting Classification Support Vector Machines for Dynamic Biometric Handwriting Classification Tobias Scheidat, Marcus Leich, Mark Alexander, and Claus Vielhauer Abstract Biometric user authentication is a recent topic in the

More information