# Analyzing and Predicting Question Quality in Community Question Answering Services

## Transcription

Table 2: Rule base for the ground truth setting (columns: NTA, RM).

Table 3: Summary of questions in four levels. Count by level: 53,806; 62,192; 69,836; 52,715.

...ing results are not congruent with different seeds. In spite of this, the size of each cluster varies sharply, from fewer than 10 to more than 50,000. Having consulted domain experts, we resort to expert-based reasoning. The Pearson correlations between each pair of variables are calculated: NT and NA are correlated (0.500), but neither NT nor NA shows much correlation with RM. Therefore, we first normalize and average the values of NT and NA and then convert the result into an integer on a scale from 1 to 4 (NTA hereafter, with 4 the highest quality), using three equidistant cutting points of 0.75 (top 25%), 0.50 and 0.25 so that each band receives roughly the same number of questions. RM is transformed into a 1-to-4 scale in the same way. The two scaled values are then combined according to the rule base (see Table 2), which comes from consensus among the authors and domain experts. In the end, every question is labeled from level 1 to level 4, with level 4 holding the highest-quality questions. Table 3 summarizes the questions by level; these labels are taken as the ground truth.

4. STUDY ONE: FACTORS AFFECTING QUESTION QUALITY

In CQA portals, askers post questions on different topics, so askers and topics are probably the main sources of varied question quality. However, we know little about how much askers and topics each contribute to question quality. Here, we are concerned with which factors have the major impact on question quality, and we use the subcategories under Entertainment & Music as the topics.
We do not select different categories as topics for two reasons: first, we observe that the majority of users only ask questions in very few categories, so choosing subcategories as topics is more representative; second, different subcategories also reflect various topics: for instance, music, movies, and polls and surveys are three distinctive aspects of entertainment.

Study one is designed as follows. We first select the two most popular subcategories⁴ (namely, Music and Movies; see Table 1) as two representative topics and check their distributions of question quality. Next, we track askers with at least five questions in both of these subcategories and test the quality of those questions.

⁴ The subcategory Polls & Surveys is not chosen since it is used to elicit public opinion, and we observe that questions in this subcategory usually receive many more answers than others.

Figure 2: Distributions of question quality in three topics (axes: Level, Count). (a) Music; (b) Movies.

Table 4: Summary of question quality for different askers (columns: User; Music Mean, Std; Movies Mean, Std).

Figure 2 presents the histograms of question quality for Music and Movies. The two distributions are close: the number of questions increases as question quality decreases from level four to level one, and the proportions of questions at each level are similar. The difference is that the proportion of level-two questions is larger in Movies. This observation tells us that topics alone cannot distinguish good questions from bad ones.

To investigate the influence of askers, we select a total of 22 askers who have asked at least 5 questions in each of the two subcategories. The mean and standard deviation of their question quality are reported in Table 4. Our observations are: 1) Different askers show different question quality on the same topic.
For instance, the question quality of user 8 is much higher than that of user 16; 2) The question quality of the same asker differs greatly across topics. E.g., user 14 asks many good questions about Movies, but his/her question quality in Music is poor. Therefore, we find that it is the interaction between askers and topics that plays the most important role in distinguishing good questions from bad ones.

To sum up, study one examines the effects of askers and topics on question quality. We observe that topics by themselves cannot determine question quality, and that the interaction between askers and topics is the most important factor affecting it. This observation motivates us to design a novel algorithm to predict question quality in the next study.

5. STUDY TWO: PREDICTION OF QUESTION QUALITY

Study one has uncovered the main factors of question quality, but its analysis takes place after questions are resolved. In study two, we take on an even more challenging prediction task: estimating question quality right after a question is posted, before any answerer has replied. Motivated by the result of study one, we model the relationships among questions, topics and askers as a bipartite graph. Figure 3 shows one example, where $u_1$, $u_2$, and $u_3$ ask five questions ($q_1, \ldots, q_5$) in three topics ($t_1$, $t_2$, and $t_3$). Each edge linking an asker and a question indicates that the question was asked by that asker, and each rectangle denotes a topic. In the example, we know that $u_1$ asks $q_1$ and $q_3$, and that $q_2$ is in topic $t_1$. Here topics are represented by subcategories or categories in CQA portals. The ideas of our algorithm are straightforward:

1. Within the same topic, questions with similar structures and expressions will have similar quality, and users with similar profiles will have similar asking expertise.
2. Across different topics, a user's ability to ask good questions is not the same, and that ability is constant within a particular period.
3. Each question's quality is estimated from the qualities of similar questions and from the asker's ability to ask good questions in that topic. Meanwhile, each asker's ability to ask good questions in a topic is estimated from the quality of his/her questions and from similar askers' asking abilities in that topic.
Based on these ideas, we propose a graph-based semi-supervised learning (SSL) algorithm called Mutual Reinforcement Label Propagation (MRLP) to predict question quality in CQA services. Before introducing MRLP, we first give the formal definitions of question quality and users' asking expertise.

Definition 5.1 (Question quality). Question $q_i$'s quality is represented by $\hat{q}_i$, which refers to its ability to attract user attention, get answering attempts and receive the best answer efficiently. It ranges from 0 to 1. The higher the value, the higher the quality of the question.

Definition 5.2 (Asking expertise). User $u_j$'s asking expertise in topic $t$ is represented by $\hat{u}_j$, which reflects the user's ability to ask high-quality questions within that topic. $\hat{u}_j$ ranges from 0 to 1. It is worth noting that $\hat{u}_j$ models the effect of the interaction between the asker and the topic.

5.1 MRLP

Figure 3: A toy example. Left: askers; Right: questions in various topics.

Suppose there are $m$ askers who ask $n$ questions in $t$ topics. Let $U_1, U_2, \ldots, U_t$ denote the vectors ($m \times 1$) of askers' asking expertise in these topics, and let $Q$ ($n \times 1$) denote the vector of question quality. We define an $m \times n$ matrix $E$, where $e_{ij} = 1$ ($i \in [1, m]$, $j \in [1, n]$) means $u_i$ asks $q_j$; otherwise $e_{ij} = 0$. From $E$ we get the row-normalized matrix $E'$:

$$E'_{ij} = \frac{e_{ij}}{\sum_{k=1}^{n} e_{ik}}. \quad (1)$$

For the question part of the bipartite graph, we create edges between any two questions within the same topic. The weight of the edge linking $q_i$ and $q_j$ is represented by $w(q_i, q_j)$, which is calculated from the cosine similarity between the features of the two questions, $x_i$ and $x_j$:

$$w(q_i, q_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{\lambda_q^2}\right), \quad (2)$$

where $\lambda_q$ is a weighting parameter. $w(q_i, q_j)$ is set to 0 if $q_i$ and $q_j$ belong to two different topics. In addition, we define $w(q_i, q_i) = 0$. Then, we define an $n \times n$ probabilistic transition matrix $N$:

$$N_{ij} = P(q_i \to q_j) = \frac{w(q_i, q_j)}{\sum_{k=1}^{n} w(q_i, q_k)}, \quad (3)$$

where $N_{ij}$ is the probability of transiting from $q_i$ to $q_j$.
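As a rough illustration of Eqs. (1)-(3), the sketch below builds $E'$ and $N$ with NumPy. It assumes that Eq. (2) is a Gaussian kernel on the squared Euclidean distance between feature vectors; the helper names (`row_normalize`, `question_transition_matrix`), the toy ask matrix, the feature values and the topic labels are all hypothetical and not taken from the paper.

```python
import numpy as np


def row_normalize(mat):
    """Divide each row by its sum (Eq. 1 and Eq. 3); all-zero rows stay zero."""
    mat = np.asarray(mat, dtype=float)
    sums = mat.sum(axis=1, keepdims=True)
    return np.divide(mat, sums, out=np.zeros_like(mat), where=sums > 0)


def question_transition_matrix(features, topics, lam):
    """Build N: Gaussian weights (Eq. 2) restricted to same-topic pairs, then Eq. 3."""
    x = np.asarray(features, dtype=float)
    topics = np.asarray(topics)
    diff = x[:, None, :] - x[None, :, :]          # pairwise feature differences
    sq_dist = (diff ** 2).sum(axis=2)             # squared Euclidean distances
    w = np.exp(-sq_dist / lam ** 2)               # Eq. (2), assumed form
    w[topics[:, None] != topics[None, :]] = 0.0   # no edges across topics
    np.fill_diagonal(w, 0.0)                      # w(q_i, q_i) = 0
    return row_normalize(w)                       # Eq. (3)


# Hypothetical toy data: 3 askers, 5 questions, 2 topics.
E = np.array([[1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 0, 0, 1]])
E_prime = row_normalize(E)                        # Eq. (1)

features = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.9], [2.0, 2.0], [2.1, 2.0]])
topics = ["t1", "t1", "t1", "t2", "t2"]
N = question_transition_matrix(features, topics, lam=1.0)
```

With these toy values, each row of `E_prime` and `N` sums to 1, and a question puts more transition mass on a same-topic neighbour with closer features. The asker-side matrix $M$ could be built the same way from asker features with $\lambda_a$.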
Similarly, for the asker part of the graph, we create edges between any two askers who have asked questions in the same topic(s), with $\lambda_a$ as the weighting parameter in Eq. (2). In addition, we define an $m \times m$ probabilistic transition matrix $M$ in the same way as $N$ in Eq. (3).

For topic $t$, given some known labels of $U$ and/or $Q$, we describe MRLP in Alg. 1. The equation at line 3 estimates users' asking expertise from their neighbors and from the qualities of their questions. Correspondingly, the equation at line 4 calculates the quality of the questions on topic $t$ from their neighbors and from their askers' asking expertise. Repeating MRLP $t$ times, once per topic, all questions' qualities and all askers' asking expertise are estimated.

Now, we prove the convergence of MRLP. Suppose there are $l$ labeled and $u$ unlabeled data for questions' qualities, together with $x$ labeled and $y$ unlabeled data for askers' asking expertise, i.e., $Q = [\hat{q}_1, \ldots, \hat{q}_l, \hat{q}_{l+1}, \ldots, \hat{q}_{l+u}]^T$ and $U = [\hat{u}_1, \ldots, \hat{u}_x, \hat{u}_{x+1}, \ldots, \hat{u}_{x+y}]^T$. Thus, we can split $E'$, $E^T$, $M$ and $N$ into four parts:

$$E' = \begin{bmatrix} E'_{xl} & E'_{xu} \\ E'_{yl} & E'_{yu} \end{bmatrix}, \quad E^T = \begin{bmatrix} E^T_{xl} & E^T_{yl} \\ E^T_{xu} & E^T_{yu} \end{bmatrix}, \quad M = \begin{bmatrix} M_{xx} & M_{xy} \\ M_{yx} & M_{yy} \end{bmatrix}, \quad N = \begin{bmatrix} N_{ll} & N_{lu} \\ N_{ul} & N_{uu} \end{bmatrix}.$$

Thus, we get

$$\begin{bmatrix} U_x \\ U_y \end{bmatrix}^{c+1} = \alpha \begin{bmatrix} M_{xx} & M_{xy} \\ M_{yx} & M_{yy} \end{bmatrix} \begin{bmatrix} U_x \\ U_y \end{bmatrix}^{c} + (1-\alpha) \begin{bmatrix} E'_{xl} & E'_{xu} \\ E'_{yl} & E'_{yu} \end{bmatrix} \begin{bmatrix} Q_l \\ Q_u \end{bmatrix}^{c}$$

and

$$\begin{bmatrix} Q_l \\ Q_u \end{bmatrix}^{c+1} = \beta \begin{bmatrix} N_{ll} & N_{lu} \\ N_{ul} & N_{uu} \end{bmatrix} \begin{bmatrix} Q_l \\ Q_u \end{bmatrix}^{c} + (1-\beta) \begin{bmatrix} E^T_{xl} & E^T_{yl} \\ E^T_{xu} & E^T_{yu} \end{bmatrix} \begin{bmatrix} U_x \\ U_y \end{bmatrix}^{c}.$$

Since $U_x$ and $Q_l$ are clamped to manual labels in each iteration, we now only consider $U_y$ and $Q_u$. From the above two equations we get:

$$\begin{bmatrix} U_y \\ Q_u \end{bmatrix}^{c+1} = \begin{bmatrix} \alpha M_{yy} & (1-\alpha) E'_{yu} \\ (1-\beta) E^T_{yu} & \beta N_{uu} \end{bmatrix} \begin{bmatrix} U_y \\ Q_u \end{bmatrix}^{c} + \begin{bmatrix} \alpha M_{yx} U_x + (1-\alpha) E'_{yl} Q_l \\ \beta N_{ul} Q_l + (1-\beta) E^T_{xu} U_x \end{bmatrix}.$$

Let

$$A = \begin{bmatrix} \alpha M_{yy} & (1-\alpha) E'_{yu} \\ (1-\beta) E^T_{yu} & \beta N_{uu} \end{bmatrix}, \quad b = \begin{bmatrix} \alpha M_{yx} U_x + (1-\alpha) E'_{yl} Q_l \\ \beta N_{ul} Q_l + (1-\beta) E^T_{xu} U_x \end{bmatrix},$$

we get

$$\begin{bmatrix} U_y \\ Q_u \end{bmatrix}^{n} = A^n \begin{bmatrix} U_y \\ Q_u \end{bmatrix}^{0} + \left( \sum_{i=1}^{n} A^{i-1} \right) b,$$

where $[U_y; Q_u]^0$ are the initial values for unlabeled askers and questions. The following proof is similar to the one in Chapter 2 of [25]. Since $M$, $N$, $E'$ and $E^T$ are row normalized (each row of $E^T$ contains exactly one 1, and the other entries are 0), and $M_{yy}$, $N_{uu}$, $E'_{yu}$ and $E^T_{yu}$ are sub-matrices of them, we have

$$\exists \gamma < 1, \quad \sum_{j=1}^{y+u} A_{ij} \le \gamma, \quad \forall i = 1, \ldots, y+u.$$

So

$$\sum_j A^n_{ij} = \sum_j \sum_k A^{n-1}_{ik} A_{kj} = \sum_k A^{n-1}_{ik} \sum_j A_{kj} \le \sum_k A^{n-1}_{ik} \gamma \le \gamma^n.$$

Therefore the sum of each row of $A^n$ converges to zero, and thus $A^n [U_y; Q_u]^0 \to 0$. Finally we get

$$\begin{bmatrix} U_y \\ Q_u \end{bmatrix} = (I - A)^{-1} b,$$

which are fixed values.

Algorithm 1: MRLP-ST

Input: user asking expertise vector $U^0$, question quality vector $Q^0$, $E$, transition matrices $M$ and $N$, weighting coefficients $\alpha$ and $\beta$, some manual labels of $U^0$ and/or $Q^0$.

1. Set $c = 0$.
2. while not converged do
3. Propagate user expertise: $U^{c+1} = \alpha M U^c + (1-\alpha) E' Q^c$.
4. Propagate question quality: $Q^{c+1} = \beta N Q^c + (1-\beta) E^T U^{c+1}$, where $E^T$ is the transpose of $E$.
5. Clamp the labeled data of $U^{c+1}$ and $Q^{c+1}$.
6. Set $c = c + 1$.
7. end while

5.2 Experimental setup

To verify the effectiveness of MRLP in predicting question quality, we experiment with the data described in Section 3. For each of the topics Music and Movies, we choose the questions of those askers who asked at least 10 questions in that topic. Since our goal is to distinguish high-quality questions from low-quality ones, we follow the common binary classification setting of previous work [19, 15, 1]. Thus, we take questions of level 3 and level 4 as high-quality ones and the other questions as low-quality ones. Table 5 summarizes the data.

Table 5: Summary of data in study two

| | Music | Movies |
|---|---|---|
| # Questions | 7,373 | 1,076 |
| # High-Quality Questions | 3, | |
| # Low-Quality Questions | 3, | |
| # Askers | | |

To get prediction performance at different training levels, we vary the training rate from 10% to 90% in our experiments. For each rate we select the corresponding proportion of the earliest-posted questions as training data and use the others as testing data.

Selected features

Referring to the work of [1] and [2], we adopt the features in Table 6 to construct graphs and train classifiers. They are divided into question-related and asker-related features. Question-related features are extracted from the question text, including subject and content; asker-related features come from askers' profiles. For features such as POS entropy, we use the tool OpenNLP to conduct tokenization, detect sentences and annotate part-of-speech tags. In addition, we utilize the Microsoft Office Word Primary Interop Reference to detect typos. We also report the information gain of each feature in Table 6. All features' information gains are small, which means these features are not very salient indicators of question quality. In addition, asker-related features are more crucial than question-related features since their information gains are higher. Among question-related features, space density and subject length are the most important ones.

Methods compared

We compare MRLP with the following methods:

Logistic Regression: Shah et al. [19] apply a logistic regression model to predict answer quality in Yahoo! Answers. Here we adopt the same approach to predict question quality with question-related features only (LR_Q), and with both question-related and asker-related features (LR_QA). These two methods are treated as baselines.
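The iteration in Algorithm 1 (MRLP-ST) above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `mrlp`, the stopping rule (`tol`, `max_iter`) and the toy matrices are assumptions, and the row-normalized $E'$ from Eq. (1) is used in the expertise update while the plain transpose $E^T$ is used in the quality update. The values $\alpha = \beta = 0.2$ are the Music setting quoted in the caption of Table 7.

```python
import numpy as np


def mrlp(U0, Q0, E_prime, E_T, M, N, alpha, beta,
         labeled_users, labeled_questions, tol=1e-9, max_iter=10_000):
    """Alternate the two propagation steps of Algorithm 1 until scores settle.

    labeled_users / labeled_questions are index lists of clamped entries.
    tol and max_iter are assumptions; the source only says "while not converged".
    """
    U0 = np.asarray(U0, dtype=float)
    Q0 = np.asarray(Q0, dtype=float)
    lu = np.asarray(labeled_users, dtype=int)
    lq = np.asarray(labeled_questions, dtype=int)
    U, Q = U0.copy(), Q0.copy()
    for _ in range(max_iter):
        # Line 3: expertise from similar askers and from own questions.
        U_new = alpha * (M @ U) + (1 - alpha) * (E_prime @ Q)
        # Line 4: quality from similar questions and from the asker's expertise.
        Q_new = beta * (N @ Q) + (1 - beta) * (E_T @ U_new)
        # Line 5: clamp the manually labeled entries.
        U_new[lu] = U0[lu]
        Q_new[lq] = Q0[lq]
        change = max(np.abs(U_new - U).max(), np.abs(Q_new - Q).max())
        U, Q = U_new, Q_new
        if change < tol:
            break
    return U, Q


# Hypothetical toy topic: asker 0 asked questions 0 and 1, asker 1 asked question 2.
E = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
E_prime = E / E.sum(axis=1, keepdims=True)
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])
N = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

# Questions 0 and 2 carry labels (high and low quality); question 1 is unlabeled.
U, Q = mrlp(U0=[0.5, 0.5], Q0=[1.0, 0.5, 0.0], E_prime=E_prime, E_T=E.T,
            M=M, N=N, alpha=0.2, beta=0.2,
            labeled_users=[], labeled_questions=[0, 2])
```

On this toy input the iteration settles at approximately $U = (0.6875, 0.1375)$ and a quality of 0.65 for the unlabeled question, which is the solution of the fixed-point equations used in the convergence proof. The labeled questions keep their clamped values throughout.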

Table 7: Different methods' performance with question-related features only versus both question-related and user-related features (Music: α = 0.2, β = 0.2; Movies: α = 0.8, β = 0.1). Columns: accuracy under each training rate (%) for Music and Movies. Methods: LR_Q, LR_QA, HF_Q, HF_QA, SGBT_Q, SGBT_QA, MRLP.

Figure 4: Sensitivity versus training rate (%) across various methods in Music.

... topics. Even when the training rate is set to 90%, more than 35% of questions are still not correctly classified. The reason is that question text and asker profile features are not salient features of question quality, as shown in Table 6. Since all features' information gains are less than 0.05, it is very hard to make satisfying predictions using these features.

Question-related features vs. asker-related features

Comparing LR_Q, HF_Q, and SGBT_Q with LR_QA, HF_QA and SGBT_QA in Table 7, we find that in Music, prediction accuracy with asker-related features is substantially higher than that of the same methods without them. However, accuracy seems to decrease when asker-related features are used in Movies; the smaller number of askers in Movies may explain this special case. Figures 4, 5, 6, and 7 give more details. Specifically, using asker-related features increases the Sensitivity of SGBT and the Specificity of LR and HF in Music, and enhances the Sensitivity of LR and HF in Movies. However, it decreases the Sensitivity of HF and the Specificity of SGBT in Music, and the Specificity of LR and HF in Movies.

Mixture vs. separation of user-related features

Comparing LR_QA, HF_QA and SGBT_QA with MRLP, all of which use question-related and user-related features, MRLP performs the best on Accuracy. Looking at the Sensitivity in Fig. 4 and Fig. 6 and the Specificity in Fig. 5 and Fig. 7, MRLP is more balanced between Sensitivity and Specificity than the other algorithms.
For instance, LR_Q has the highest Specificity for Movies but the lowest Sensitivity, which means it

Figure 5: Specificity versus training rate (%) across various methods in Music.

Figure 6: Sensitivity versus training rate (%) across various methods in Movies.

Figure 7: Specificity versus training rate (%) across various methods in Movies.
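For reference, the three measures compared in Table 7 and Figures 4 to 7 can be computed from a confusion matrix as in the sketch below. It assumes that high-quality questions (levels 3 and 4) are the positive class, encoded as 1; the function name `binary_metrics` and the example labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for 0/1 labels (1 = high quality)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty class) are reported as NaN rather than raising.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),   # share of high-quality questions found
        "specificity": ratio(tn, tn + fp),   # share of low-quality questions found
    }


# Hypothetical predictions for eight test questions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

A classifier that labels almost everything as low quality would score high on specificity and low on sensitivity, which is the kind of imbalance the comparison above describes for LR_Q on Movies.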

### Analyzing and Predicting Question Quality in Community Question Answering Services

WWW 2012 CQA'12 Workshop Analyzing and Predicting Question Quality in Community Question Answering Services Baichuan Li 1, Tan Jin 1, Michael R. Lyu 1, Irwin King 2,1, and Barley Mak 1 1 The Chinese University of Hong

### Subordinating to the Majority: Factoid Question Answering over CQA Sites

Journal of Computational Information Systems 9: 16 (2013) 6409 6416 Available at http://www.jofcis.com Subordinating to the Majority: Factoid Question Answering over CQA Sites Xin LIAN, Xiaojie YUAN, Haiwei

### Learning to Recognize Reliable Users and Content in Social Media with Coupled Mutual Reinforcement

Learning to Recognize Reliable Users and Content in Social Media with Coupled Mutual Reinforcement Jiang Bian College of Computing Georgia Institute of Technology jbian3@mail.gatech.edu Eugene Agichtein

### Joint Relevance and Answer Quality Learning for Question Routing in Community QA

Joint Relevance and Answer Quality Learning for Question Routing in Community QA Guangyou Zhou, Kang Liu, and Jun Zhao National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy

### Incorporating Participant Reputation in Community-driven Question Answering Systems

Incorporating Participant Reputation in Community-driven Question Answering Systems Liangjie Hong, Zaihan Yang and Brian D. Davison Department of Computer Science and Engineering Lehigh University, Bethlehem,

### LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

### COMMUNITY QUESTION ANSWERING (CQA) services, Improving Question Retrieval in Community Question Answering with Label Ranking

Improving Question Retrieval in Community Question Answering with Label Ranking Wei Wang, Baichuan Li Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong

### Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data

Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data Jaalatchum. D 1, Thambidurai. P 1, Department of CSE, PKIET, Karaikal, India Abstract - Clustering is a group of objects

### Question Routing by Modeling User Expertise and Activity in cqa services

Question Routing by Modeling User Expertise and Activity in cqa services Liang-Cheng Lai and Hung-Yu Kao Department of Computer Science and Information Engineering National Cheng Kung University, Tainan,

### Topical Authority Identification in Community Question Answering

Topical Authority Identification in Community Question Answering Guangyou Zhou, Kang Liu, and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences 95

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

### An Evaluation of Classification Models for Question Topic Categorization

An Evaluation of Classification Models for Question Topic Categorization Bo Qu, Gao Cong, Cuiping Li, Aixin Sun, Hong Chen Renmin University, Beijing, China {qb8542,licuiping,chong}@ruc.edu.cn Nanyang

### K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analysis Chapter 3 PPDM Class Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan, Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

### Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analysis What is Cluster Analysis? K-Means Clustering Hierarchical Clustering Cluster Validity Example: Microarray data analysis 6 Summary

### MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan, Steinbach, Kumar Introduction to Data Mining 4/8/4 What is

### Detecting Promotion Campaigns in Community Question Answering

Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015) Detecting Promotion Campaigns in Community Question Answering Xin Li, Yiqun Liu, Min Zhang, Shaoping

### 2.7 Applications of Derivatives to Business

80 CHAPTER 2 Applications of the Derivative 2.7 Applications of Derivatives to Business and Economics Cost = C() In recent years, economic decision making has become more and more mathematically oriented.

### Improving Question Retrieval in Community Question Answering Using World Knowledge

Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Improving Question Retrieval in Community Question Answering Using World Knowledge Guangyou Zhou, Yang Liu, Fang

### Distributed Regression For Heterogeneous Data Sets 1

Distributed Regression For Heterogeneous Data Sets 1 Yan Xing, Michael G. Madden, Jim Duggan, Gerard Lyons Department of Information Technology National University of Ireland, Galway Ireland {yan.xing,

### For supervised classification we have a variety of measures to evaluate how good our model is Accuracy, precision, recall

Cluster Validation Cluster Validity For supervised classification we have a variety of measures to evaluate how good our model is Accuracy, precision, recall For cluster analysis, the analogous question is

Will my Question be Answered? Predicting Question Answerability in Community Question-Answering Sites Gideon Dror, Yoelle Maarek and Idan Szpektor Yahoo! Labs, MATAM, Haifa 31905, Israel {gideondr,yoelle,idan}@yahoo-inc.com

### PULLING OUT OPINION TARGETS AND OPINION WORDS FROM REVIEWS BASED ON THE WORD ALIGNMENT MODEL AND USING TOPICAL WORD TRIGGER MODEL

Journal homepage: www.mjret.in ISSN:2348-6953 PULLING OUT OPINION TARGETS AND OPINION WORDS FROM REVIEWS BASED ON THE WORD ALIGNMENT MODEL AND USING TOPICAL WORD TRIGGER MODEL Utkarsha Vibhute, Prof. Soumitra

### A Classification-based Approach to Question Answering in Discussion Boards

A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong and Brian D. Davison Department of Computer Science and Engineering Lehigh University Bethlehem, PA 18015 USA {lih307,davison}@cse.lehigh.edu

### Finding the Right Facts in the Crowd: Factoid Question Answering over Social Media

Finding the Right Facts in the Crowd: Factoid Question Answering over Social Media ABSTRACT Jiang Bian College of Computing Georgia Institute of Technology Atlanta, GA 30332 jbian@cc.gatech.edu Eugene

Routing Questions for Collaborative Answering in Community Question Answering Shuo Chang Dept. of Computer Science University of Minnesota Email: schang@cs.umn.edu Aditya Pal IBM Research Email: apal@us.ibm.com

### Blog Post Extraction Using Title Finding

Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

### A Web Page Classification Algorithm Based on Feature Selection

Journal of Information & Computational Science 12:4 (2015) 1549 1556 March 1, 2015 Available at http://www.joics.com A Web Page Classification Algorithm Based on Feature Selection Hongfang Zhou a,, Jie

### Study of Euclidean and Manhattan Distance Metrics using Simple K-Means Clustering

Study of and Distance Metrics using Simple K-Means Clustering Deepak Sinwar #1, Rahul Kaushik * # Assistant Professor, * M.Tech Scholar Department of Computer Science & Engineering BRCM College of Engineering

### Method of Fault Detection in Cloud Computing Systems

, pp.205-212 http://dx.doi.org/10.14257/ijgdc.2014.7.3.21 Method of Fault Detection in Cloud Computing Systems Ying Jiang, Jie Huang, Jiaman Ding and Yingli Liu Yunnan Key Lab of Computer Technology Application,

### Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines

, 22-24 October, 2014, San Francisco, USA Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing

### MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of

### Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

### Design call center management system of e-commerce based on BP neural network and multifractal

Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(6):951-956 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Design call center management system of e-commerce

### Learning with Local and Global Consistency

Learning with Local and Global Consistency Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 7276 Tuebingen, Germany

### The Big Picture. Correlation. Scatter Plots. Data

The Big Picture Correlation Bret Hanlon and Bret Larget Department of Statistics University of Wisconsin Madison December 6, We have just completed a lengthy series of lectures on ANOVA where we considered

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### Incorporate Credibility into Context for the Best Social Media Answers

PACLIC 24 Proceedings 535 Incorporate Credibility into Context for the Best Social Media Answers Qi Su a,b, Helen Kai-yun Chen a, and Chu-Ren Huang a a Department of Chinese & Bilingual Studies, The Hong

### Learning to Suggest Questions in Online Forums

Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence Learning to Suggest Questions in Online Forums Tom Chao Zhou 1, Chin-Yew Lin 2,IrwinKing 3, Michael R. Lyu 1, Young-In Song 2

### Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

, pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and

JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 28, 83-97 (2012) Aggregate Two-Way Co-Clustering of Ads and User Data for Online Advertisements * Department of Computer Science and Information Engineering

### Face Recognition in Low-resolution Images by Using Local Zernike Moments

Proceedings of the International Conference on Machine Vision and Machine Learning Prague, Czech Republic, August 14-15, 2014 Paper No. 15 Face Recognition in Low-resolution Images by Using Local Zernike

### Comparing IPL2 and Yahoo! Answers: A Case Study of Digital Reference and Community Based Question Answering

Comparing and : A Case Study of Digital Reference and Community Based Answering Dan Wu 1 and Daqing He 1 School of Information Management, Wuhan University School of Information Sciences, University of

### REVIEW ON QUERY CLUSTERING ALGORITHMS FOR SEARCH ENGINE OPTIMIZATION

Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A REVIEW ON QUERY CLUSTERING

### A Novel Method to Detect Fraud Ranking For Mobile Apps

A Novel Method to Detect Fraud Ranking For Mobile Apps Geedikapally Rajesh Kumar M.Tech Student Department of CSE St.Peter's Engineering College. Abstract: Ranking fraud in the mobile App market refers

### Question Quality in Community Question Answering Forums: A Survey

Question Quality in Community Question Answering Forums: A Survey ABSTRACT Antoaneta Baltadzhieva Tilburg University P.O. Box 90153 Tilburg, Netherlands a baltadzhieva@yahoo.de Community Question Answering

### The Graph of a Linear Equation

4.1 The Graph of a Linear Equation 4.1 OBJECTIVES 1. Find three ordered pairs for an equation in two variables 2. Graph a line from three points 3. Graph a line by the intercept method 4. Graph a line that

### Automatic Web Page Classification

Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

### Sentence Ordering based on Cluster Adjacency in Multi-Document Summarization

Sentence Ordering based on Cluster Adjacency in Multi-Document Summarization Ji Donghong, Nie Yu Institute for Infocomm Research Singapore, 119613 {dhji, ynie}@i2r.a-star.edu.sg ABSTRACT In this paper,

### An Implementation of Leaf Recognition System using Leaf Vein and Shape

An Implementation of Leaf Recognition System using Leaf Vein and Shape Kue-Bum Lee and Kwang-Seok Hong College of Information and Communication Engineering, Sungkyunkwan University, 300, Chunchun-dong,

### Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms What does it mean clustering? Applications Types of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

### INTERNATIONAL JOURNAL OF ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY An International online open access peer reviewed journal

INTERNATIONAL JOURNAL OF ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY An International online open access peer reviewed journal Research Article ISSN 2277 9140 ABSTRACT Web page categorization based

### Booming Up the Long Tails: Discovering Potentially Contributive Users in Community-Based Question Answering Services

Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media Booming Up the Long Tails: Discovering Potentially Contributive Users in Community-Based Question Answering Services

### Random forest algorithm in big data environment

Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

### 4. GPCRs PREDICTION USING GREY INCIDENCE DEGREE MEASURE AND PRINCIPAL COMPONENT ANALYIS

4. GPCRs PREDICTION USING GREY INCIDENCE DEGREE MEASURE AND PRINCIPAL COMPONENT ANALYIS The GPCRs sequences are made up of amino acid polypeptide chains. We can also call them sub units. The number and

### Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

### PRODUCT REVIEW RANKING SUMMARIZATION

PRODUCT REVIEW RANKING SUMMARIZATION N.P.Vadivukkarasi, Research Scholar, Department of Computer Science, Kongu Arts and Science College, Erode. Dr. B. Jayanthi M.C.A., M.Phil., Ph.D., Associate Professor,

### Quality-Aware Collaborative Question Answering: Methods and Evaluation

Quality-Aware Collaborative Question Answering: Methods and Evaluation ABSTRACT Maggy Anastasia Suryanto School of Computer Engineering Nanyang Technological University magg0002@ntu.edu.sg Aixin Sun School

### Drug Store Sales Prediction

Drug Store Sales Prediction Chenghao Wang, Yang Li Abstract - In this paper we tried to apply machine learning algorithm into a real world problem drug store sales forecasting. Given store information,

### Six Sigma applied in inventory management Biao Hu 1,a, Yun Tian 2,b

Advanced Engineering Forum Online: 2011-09-09 ISSN: 2234-991X, Vol. 1, pp 355-359 doi:10.4028/www.scientific.net/aef.1.355 2011 Trans Tech Publications, Switzerland Six Sigma applied in inventory management

### Research and Implementation of Real-time Automatic Web Page Classification System

3rd International Conference on Material, Mechanical and Manufacturing Engineering (IC3ME 2015) Research and Implementation of Real-time Automatic Web Page Classification System Weihong Han 1, a *, Weihui

### HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

### Personalizing Image Search from the Photo Sharing Websites

Personalizing Image Search from the Photo Sharing Websites Swetha.P.C, Department of CSE, Atria IT, Bangalore swethapc.reddy@gmail.com Aishwarya.P Professor, Dept.of CSE, Atria IT, Bangalore aishwarya_p27@yahoo.co.in

### Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

### Predicting Movie Revenue from IMDb Data

Predicting Movie Revenue from IMDb Data Steven Yoo, Robert Kanter, David Cummings TA: Andrew Maas 1. Introduction Given the information known about a movie in the week of its release, can we predict the

### Image Content-Based Email Spam Image Filtering

Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

### Educational Social Network Group Profiling: An Analysis of Differentiation-Based Methods

Educational Social Network Group Profiling: An Analysis of Differentiation-Based Methods João Emanoel Ambrósio Gomes 1, Ricardo Bastos Cavalcante Prudêncio 1 1 Centro de Informática Universidade Federal

### Data Mining Clustering. Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining What is Cluster Analysis? Finding groups of objects such that the objects

### 1st Semester 2007/2008

Departamento de Engenharia Informática Instituto Superior Técnico 1st Semester 2007/2008 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

### Apriori algorithm for economic data mining in sports industry

COMPUTER MODELLING & NEW TECHNOLOGIES 2014 18(1C) 451-455 Apriori algorithm for economic data mining in sports industry Abstract Yaguang Xiang* Sports Institute, West Anhui University, Liu'an, 3701,Anhui,

### Social Prediction in Mobile Networks: Can we infer users' emotions and social ties?

Social Prediction in Mobile Networks: Can we infer users' emotions and social ties? Jie Tang Tsinghua University, China 1 Collaborate with John Hopcroft, Jon Kleinberg (Cornell) Jinghai Rao (Nokia), Jimeng

### A Semi-supervised Ensemble Approach for Mining Data Streams

JOURNAL OF COMPUTERS, VOL. 8, NO. 11, NOVEMBER 2013 2873 A Semi-supervised Ensemble Approach for Mining Data Streams Jing Liu 1,2, Guo-sheng Xu 1,2, Da Xiao 1,2, Li-ze Gu 1,2, Xin-xin Niu 1,2 1.Information

### Chapter 16, Part C Investment Portfolio. Risk is often measured by variance. For the binary gamble $L = [z, -z; 1/2, 1/2]$, recall that expected value is

Chapter 16, Part C Investment Portfolio Risk is often measured by variance. For the binary gamble $L = [z, -z; 1/2, 1/2]$, recall that expected value is $Ez = \tfrac{1}{2} z + \tfrac{1}{2}(-z) = 0$. For this binary gamble, z represents

### Exploiting Bilingual Translation for Question Retrieval in Community-Based Question Answering

Exploiting Bilingual Translation for Question Retrieval in Community-Based Question Answering Guangyou Zhou, Kang Liu and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese

### Construction Algorithms for Index Model Based on Web Page Classification

Journal of Computational Information Systems 10: 2 (2014) 655-664 Available at http://www.jofcis.com Construction Algorithms for Index Model Based on Web Page Classification Yangjie ZHANG 1,2, Chungang

### A semi-supervised Spam mail detector

A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach

### An Introduction to Data Mining

An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

### The Artificial Prediction Market

The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

### Domain Classification of Technical Terms Using the Web

Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470-2482 Domain Classification of Technical Terms Using

### Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

### Research on News Video Multi-topic Extraction and Summarization

International Journal of New Technology and Research (IJNTR) ISSN:2454-4116, Volume-2, Issue-3, March 2016 Pages 37-39 Research on News Video Multi-topic Extraction and Summarization Di Li, Hua Huo Abstract

### Study on Human Performance Reliability in Green Construction Engineering

Study on Human Performance Reliability in Green Construction Engineering Xiaoping Bai a, Cheng Qian b School of management, Xi'an University of Architecture and Technology, Xi'an 710055, China a xxpp8899@126.com,

### A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

### Teaching in School of Electronic, Information and Electrical Engineering

Introduction to Teaching in School of Electronic, Information and Electrical Engineering Shanghai Jiao Tong University Outline Organization of SEIEE Faculty Enrollments Undergraduate Programs Sample Curricula

### E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

### Probabilistic topic models for sentiment analysis on the Web

University of Exeter Department of Computer Science Probabilistic topic models for sentiment analysis on the Web Chenghua Lin September 2011 Submitted by Chenghua Lin, to the University of Exeter as

### Approaches to Exploring Category Information for Question Retrieval in Community Question-Answer Archives

Approaches to Exploring Category Information for Question Retrieval in Community Question-Answer Archives 7 XIN CAO and GAO CONG, Nanyang Technological University BIN CUI, Peking University CHRISTIAN S.

### Model for Voter Scoring and Best Answer Selection in Community Q&A Services

Model for Voter Scoring and Best Answer Selection in Community Q&A Services Chong Tong Lee *, Eduarda Mendes Rodrigues 2, Gabriella Kazai 3, Nataša Milić-Frayling 4, Aleksandar Ignjatović *5 * School of

### Topic and Trend Detection in Text Collections using Latent Dirichlet Allocation

Topic and Trend Detection in Text Collections using Latent Dirichlet Allocation Levent Bolelli 1, Şeyda Ertekin 2, and C. Lee Giles 3 1 Google Inc., 76 9th Ave., 4th floor, New York, NY 10011, USA 2

### A Comparative Study on Sentiment Classification and Ranking on Product Reviews

A Comparative Study on Sentiment Classification and Ranking on Product Reviews C.EMELDA Research Scholar, PG and Research Department of Computer Science, Nehru Memorial College, Putthanampatti, Bharathidasan

### Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

21st International Congress on Modelling and Simulation, Gold Coast, Australia, 29 Nov to 4 Dec 2015 www.mssanz.org.au/modsim2015 On the Feasibility of Answer Suggestion for Advice-seeking Community Questions

### Cross-validation for detecting and preventing overfitting

Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures.

### Ranking Community Answers by Modeling Question-Answer Relationships via Analogical Reasoning

Ranking Community Answers by Modeling Question-Answer Relationships via Analogical Reasoning Xin-Jing Wang Microsoft Research Asia 4F Sigma, 49 Zhichun Road Beijing, P.R.China xjwang@microsoft.com Xudong

### Web-Search Ranking with Initialized Gradient Boosted Regression Trees

JMLR: Workshop and Conference Proceedings 14 (2011) 77-89 Yahoo! Learning to Rank Challenge Web-Search Ranking with Initialized Gradient Boosted Regression Trees Ananth Mohan Zheng Chen Kilian Weinberger

### MATH 564 Project Report. Analysis of Desktop Virtualization Capacity with. Linear Regression Model

MATH 564 Project Report Analysis of Desktop Virtualization Capacity with Linear Regression Model Hongwei Jin CWID:A20288745 Dec. 1st, 2012 1. Problem Describe a) Background Information At the beginning,

### SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute