# Extracting Problem and Resolution Information from Online Discussion Forums

Save this PDF as:

Size: px
Start display at page:

## Transcription

5 Since discussion forums primarily center around a specific domain, with discussion threads on topics in that domain, it is also possible to leverage a domain specific ontology or dictionary, to help identify parts of threads with useful information. For instance, we could leverage the Linux Kernel Glossary 4 or the SAP Glossary 5 to generate features for classification based on the presence or absence of these glossary terms, when extracting information from each of these discussion forums respectively. In the next section, we outline our approach to extract problem and resolution related information. 3.3 Problem Formulation We approach the problem of extracting problem and resolution-related information from discussion forums by studying the basic structure of the threads and noting their general composition and characteristics. The formal description of the problem is as below. A discussion forum thread T, is composed of message posts M 1 through M n where n is the length of the discussion thread. This can be represented as, T =< M 1, M 2, M 3, M 4, M 5,...M n > Each message post M i where i takes values in the range of 1 to n, and is generally composed of a greeting, message description and a note of thanks. The part of the message that is important with respect to extraction of problem and resolution-related information is the message description. The greetings may help determine which post the message author is replying to in the discussion thread. Thus, a message post can be represented as M i, where, M i =< m 1, m 2, m 3, m 4, m 5...m j > where m j is a sentence and the values of j are in the range of 1 to the total number of meaningful sentences that add up to form the message. Of these meaningful sentences, only some are actually relevant to the problem or resolution. Thus, the problem or resolution-related information is a subset of the sentences in M i. Now, let M i be the set that is composed of all the problem or resolution-related information in a message post, such that M i M i and m k M i is the sentence that has potentially relevant and useful problem or resolutionrelated information. Also, every sentence in a message post is made up of multiple text snippets or ngrams, of which some maybe relevant to the problem or resolution and some may not. This can be defined as, m k =< mp 1, mp 2, mp 3, mp 4, mp 5...mp q > where q is the total number of ngrams of a predetermined length in m k. Out of all these parts, only some may be relevant to the problem or resolution. Thus we have, m k m k, and mp r m k is the text snippet that has relevant or useful information. Our task can now be defined as follows: Sentence-level classification: For each T, find every m k that belongs to M i and classify it as problem- or resolution-related. In most cases, each m k that belongs to a M i will be either problem- or resolutionrelated. However, there could be instances where M i is a combination of both problem- and resolution-related sentences m k. Phrase-level classification: For each T, find every mp r that belongs to m k and classify it as problem- or resolution-related. We leverage contextual information resulting from specific discussion flows within discussion threads for extracting useful information from them. The sequential order of message posts in a thread, along with the fact that, in most cases, a message post is dependent on the previous message post allows us to create effective features for classification. In fact, the availability of contextual information is key to modeling the problem as a sequence labeling task, where the aim is to identify sections of useful information in the thread, in the context of other information that is present in the thread. We now proceed to describe the CRF model and its application to classify as problem or resolution-related, each m k in case of sentence-level classification, and each mp r in case of phrase-level classification. 4 Sequence Labeling using Conditional Random Fields There are several established techniques to perform sequence labeling of text data. In this paper, we propose to use a supervised approach using CRFs for sequence labeling of the discussion threads. CRFs are discriminative models, basically an extension of logistic regression and a special case of log-linear models. They compute probability of a label sequence given an observation sequence, assuming that the current label (state) depends only upon the previous label (state) and the observation, as given below: P (Y X, W ) = exp( j k w jf j (y k 1, y k, X) (1) Z(X)) where Y = {y 1, y 2,..., y m } denote the label sequence and X = {x 1, x 2,..., x n } denote the input sequence, f j (.) denote j th feature function over its arguments and w j denote the weight for the j th feature function. Z(X) is a normalization factor over all label sequences, defined as follows: Z(X) = exp( w j f j (y k 1, y k, X)) y Y T j k where Y T is the set of all label sequences. Inference to find the most likely state sequence given the observation sequence, given as below, can be performed

8 Message class train test number of threads number of labeled sentences PROBLEM RESOLUTION STEP Table 1: Distribution of the message classes in the training and test sets Solutions, SAP NetWeaver, Business Intelligence, ABAP Development, etc. The discussions on various topics, within the forum, are organized into threads and can be arbitrarily long, often spanning many pages. To train the CRF, each thread of discussion is provided as an instance, after manually labeling it at a sentence-level. Manual labeling was done using the Stanford Manual Annotation Tool 5. Table 1 gives the number threads and labeled instances in the training and tests sets that were used, along with the distribution of the labels. 5.2 CRF implementation - MALLET To conduct our experiments, we use the CRF implementation in the MALLET toolkit [12] which is trained with limited-memory quasi-newton. MALLET is a Java-based package for machine learning applications to text, including sequence tagging. For training our data set, each labeled example is represented by a line, which has the format: feature 1 feature 2... feature n label While testing, for each data point, the features constructed are entered in a single line, which is then passed as input to the trained model. Furthermore, the CRF model used is trained using feature induction. Feature induction for CRFs includes automatically creating a set of useful features and feature conjunctions. McCallum [11] describes an implementation, where feature induction works by iteratively considering sets of candidate singleton and conjunction features that are created from the initially defined set of singleton features as well as the set of current model features. The loglikelihood gain on the training data is then measured for each feature individually. The candidates included in the current set of model features are only the ones that cause the highest gain. Our experiments indicated that using feature induction improved performance over just using all defined singleton features. 5.3 Evaluation metric Precision: Precision is a measure of the accuracy provided that a specific class has been predicted. Precision for class i : tar.gz Precision Recall F1 PROBLEM RESOLUTION STEP Figure 4: Phrase-level model: Baseline - Results T P i P recision i = T P i + F P i where T P i and F P i are the number of True Positives and False Positives respectively, for class i. Recall: Recall is a measure of the ability of a prediction model to select instances of a certain class from a data set. Recall for class i : where F N i class i. F measure: T P i Recall i = T P i + F N i is the number of False Negatives for The F measure can be interpreted as a weighted average of the precision and recall. It is defined as: F 1 = 2 6 Results and discussion precision recall precision + recall This section describes the experiments done on the forum data using both phrase-level and sentence-level annotations. We provide empirical evidence that using sentence-level annotations and labeling sentences by performing sequence tagging using CRFs provides better results than the phraselevel model. For sequence tagging, we use the MAL- LET implementation of CRFs and limited-memory quasi- Newton training (L-BFGS) [12]. In addition, we utilized MALLET s feature induction capability. Treating the problem as a sequence tagging task allows us to utilize contextual information present in preceding messages in the thread during training. We annotate the sentences in each thread in the data set with multiple annotations described in Section 4.1. We distinguish two levels of annotation, viz., phrase-level and sentence-level. The phrase-level annotation involves assigning labels only to logical chunks of a sentence which generally span a couple of words. On the other hand, sentence-level annotations involve labeling the entire sentence. To start with, we annotated a subset of data with all 17 annotations described in Section 3.3 in a sentence-level model. We trained a linear chain CRF on 977 sentences and

9 Features used PROBLEM RESOLUTION STEP Precision Recall F1 Precision Recall F1 Baseline Baseline + Normalization Baseline + POS tags Baseline + Normalization + POS tags Figure 5: Sentence-level model: Accuracy values for different features Problem Resolution Step Predicted Negative Predicted Positive Actual Negative TN: 803 FP: 80 Actual Positive FN: 49 TP: 125 Predicted Negative Predicted Positive Actual Negative TN: 749 FP: 169 Actual Positive FN: 44 TP: 95 Figure 6: Confusion Matrix for PROBLEM and RESOLU- TION STEP tested the obtained model on 1057 sentences. Results for this experiment gave an average accuracy of 34.7%. The precision and recall values were high for some labels and very low for others. For example, high recall values were observed for labels like RESOLUTION STEP (77.5%), GREETING (73.6%) and MESSAGE AUTHOR (49.1%). In order to simplify the labeling task we then reduced the number of labels to just three: PROBLEM, RESOLU- TION STEP and IGNORE. All the experiments described henceforth are on data annotated with just these three labels. For the experiments for phrase-level classification, we annotated only the main parts (phrases) of the sentences with the PROBLEM and RESOLUTION STEP labels. The model was then trained on 395 phrases and tested on 168. Even though the classification task could achieve respectable precision, since the recall is quite low, the F1 values are not acceptable, as given in Figure 4. For the experiments for sentence-level classification, the training and test set with the 17 labels were collapsed to contain only the main labels, viz. PROBLEM and RES- OLUTION STEP, with the rest of it labeled as IGNORE. The baseline for this classification task used just the words in the sentence as features. Accuracy values for this experiment is given in Figure 5. As can be observed from this and the values in Figure 4, there is a substantial improvement in precision, recall and F1 measures for the sentence-level model when compared to the phrase-level model. The second experiment at the sentence-level classification used thread length normalization feature in addition to the baseline. Length normalization essentially quantizes the position of the sentence within the thread and encodes it as BEGIN, MIDDLE or END in the input list of features for the sentence. From Figure 5, it can be seen that, adding length normalization tags improve the accuracy values for both PROBLEM and RESOLUTION STEP. The next experiment was done by Part-of-Speech (POS) tagging the threads using only noun, verb, adverb and adjective forms. The POS tagging was done using Stanford NLP Group s Log-linear POS Tagger [19]. Figure 5 shows that using POS tags as feature improves the classification accuracy for both the labels. In the final experiment for sentence-level classification, we combine all the three features i.e., words, length normalization tags and POS tags in the input feature vector. This leads to improvement in the accuracy values for both the labels when compared to the previous experiments, as seen in the precision, recall and F1 values given in Figure 5. A confusion matrix presented in Figure 6 allows us to study the True Negatives, False Positives, False Negatives, and True Positives for the final experiment using the combination of all three features. In the classification of instances as PROBLEM, we can observe that the False Negatives are 49. Out of this 49, 23 are instances of PROBLEM that were incorrectly classified as RESOLUTION STEP. In this case, the number of False Positives are 80 with 51 instances of IGNORE getting classified as PROBLEM and 27 instances of RESOLUTION STEP being classified as PROBLEM. Similarly, in classification of instances of RESOLU- TION STEP, the total number of False Negatives are 44. Out of the 44, 27 are instances of RESOLUTION STEP being incorrectly classified as PROBLEM. Furthermore, in this case the False Positive rate is 169 out of which 66 are instances of IGNORE being classified as RESOLUTION STEP and 23 instances of PROBLEM being classified as

10 RESOLUTION STEP. In general, the False Positives seem to be high, thus degrading the precision. This could be because of the category IGNORE. This category contains instances which may contain irrelevant background information about the problem or personal experience of the authors, greetings, certain irrelevant suggestions etc. Since the category is not clearly distinguished by specific features from the instances labeled as PROBLEM or RESOLUTION STEP, it mainly contributes to the increase in number of False Positives. From the experiments described above, the best results are obtained when using all of the three features in the input vector to train a linear CRF with the feature induction provided by MALLET [12]. The sentence-level model performs better compared to the phrase-level model as it is able to gain better contextual information from the entire sentence leading to increased accuracy during sequence tagging. 7 Conclusions Online discussion forums are web communities where users collaborate to solve problems, and discuss and exchange ideas. A large fraction of discussions are problem solving threads where many members of the forum collaborate to solve an issue faced by a member. Archived discussion forum threads document a history of multiple problems solved in the past along with resolution details. They provide an excellent source of information for users seeking solutions to the same problem in the future. In this paper, we studied the basic structure and information flow pattern in a discussion thread and proposed a method using CRFs to extract problem and resolutionrelated information by modeling the task as a sequence labeling problem. The extracted information enables easy access to solution steps for problems, thus providing improved information access in help desk scenarios. We presented results from classification experiments at a sentence as well as phrase-level. Our experiments showed improved accuracy while using a combination of features, such as words, POS tags and normalization based on the discussion thread structure, in the sentence-level model. Sequence labeling of problem, resolution and ignore, using CRFs, gave us F1 scores of 66% for problem related information and 58.8% for resolution-related information. This is an improvement over the baseline of 58.2% and 49.5% respectively, where the model is trained using only the words. Future work to improve the extraction, is to include the meta-data associated with discussion threads like author specific information, and use of domain specific ontologies. Also, studying the usefulness of our method in a real help desk scenario will be an interesting direction of work. References [1] T. Baldwin, D. Martinez, and R. B. Penman. Automatic thread classification for linux user forum information access. In Proceedings of the 12th Australasian Document Computing Symposium, pages 72 79, [2] M. Cakir, F. Xhafa, N. Zhou, and G. Stahl. Threadbased analysis of patterns of collaborative interaction in chat. Artificial Intelligence in Education: Supporting Learning through Intelligent and Socially Informed Technology, [3] V. R. Carvalho and W. W. Cohen. Learning to extract signature and reply lines from . In Proceedings Conference on and Anti-Spam, [4] G. Cong, L. Wang, C.-Y. Lin, Y.-I. Song, and Y. Sun. Finding question-answer pairs from online forums. In Proceedings of SIGIR, International Conference on Research and Development in Information Retrieval, [5] S. Ding, G. Cong, C.-Y. Lin, and X. Zhu. Using conditional random fields to extract contexts and answers of questions from online forums. In Proceedings ACL-HLT, Association for Computational Linguistics, pages , [6] J. L. Elsas and J. G. Carbonell. It pays to be picky: an evaluation of thread retrieval in online forums. In Proceedings of SIGIR, International Conference on Research and Development in Information Retrieval, [7] J. L. Elsas and N. Glance. Shopping for top forums: Discovering online discussion for product research. In 1st Workshop on Social Media Analytics, [8] N. Glance and M. Siegler. Deriving marketing intelligence from online discussion. In Proceedings of the 11th ACM SIGKDD, International conference on Knowledge Discovery in Data Mining, [9] A. C. Graesser, M. A. Gernsbacher, and S. R. Goldman. Handbook of Discourse Processes. Lawrence Erlbaum Associates, [10] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, International Conference on Machine Learning, [11] A. McCallum. Eciently inducing features of conditional random elds and Preethi. In Conference on Uncertainty in Articial Intelligence, [12] A. K. McCallum. Mallet: A machine learning for language toolkit [13] A. Medem, M.-I. Akodjenou, and R. Teixeira. Troubleminer: Mining network trouble tickets. In Proceedings 1st IFIP/IEEE international workshop on Management of the Future Internet, 2009.

11 [14] F. Peng and A. McCallum. accurate information extraction from research papers using conditional random fields. In Proceedings of HLT-NAACL, North American Association for Computational Linguistics, [15] L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages , [16] J. Seo, W. B. Croft, and D. A. Smith. Online community search using thread structure. In Proceedings of CIKM, International Conference on Information and Knowledge Management, [17] Q. Shao, Y. Chen, S. Tao, X. Yan, and N. Anerousis. easyticket: A ticket routing recommendation engine for enterprise problem resolution. In Proccedings of VLDB, International Conference on Very Large Data Bases, [18] L. Shrestha and K. McKeown. Detection of questionanswer pairs in conversations. In Proceedings International Conference On Computational Linguistics, [19] K. Toutanova and C. D. Manning. Enriching the knowledge sources used in a maximum entropy partof-speech tagger. In Proceedings Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, [20] N. Wanas, M. El-saban, H. Ashour, and W. Ammar. Automatic scoring of online discussion posts. In Proceedings of the 2nd ACM Workshop on Information Credibility on the Web, [21] X. Wei, A. Sailer, R. Mahindru, and G. Kar. automatic structuring of it problem ticket data for enhanced problem resolution. In IFIP/IEEE International Symposium on Integrated Network Management, 2007.

### Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

### Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers

Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers Sonal Gupta Christopher Manning Natural Language Processing Group Department of Computer Science Stanford University Columbia

### Automatic Web Page Classification

Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

### Web Content Mining. Dr. Ahmed Rafea

Web Content Mining Dr. Ahmed Rafea Outline Introduction The Web: Opportunities & Challenges Techniques Applications Introduction The Web is perhaps the single largest data source in the world. Web mining

### Phrases. Topics for Today. Phrases. POS Tagging. ! Text transformation. ! Text processing issues

Topics for Today! Text transformation Word occurrence statistics Tokenizing Stopping and stemming Phrases Document structure Link analysis Information extraction Internationalization Phrases! Many queries

### Automated Concept Extraction to aid Legal ediscovery Review

Automated Concept Extraction to aid Legal ediscovery Review Prasad M Deshpande IBM Research India prasdesh@in.ibm.com Sachindra Joshi IBM Research India jsachind@in.ibm.com Thomas Hampp IBM Software Germany

### Citation Context Sentiment Analysis for Structured Summarization of Research Papers

Citation Context Sentiment Analysis for Structured Summarization of Research Papers Niket Tandon 1,3 and Ashish Jain 2,3 ntandon@mpi-inf.mpg.de, ashish.iiith@gmail.com 1 Max Planck Institute for Informatics,

### Web Content Summarization Using Social Bookmarking Service

ISSN 1346-5597 NII Technical Report Web Content Summarization Using Social Bookmarking Service Jaehui Park, Tomohiro Fukuhara, Ikki Ohmukai, and Hideaki Takeda NII-2008-006E Apr. 2008 Jaehui Park School

### Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

### Micro blogs Oriented Word Segmentation System

Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,

### Guest Editors Introduction: Machine Learning in Speech and Language Technologies

Guest Editors Introduction: Machine Learning in Speech and Language Technologies Pascale Fung (pascale@ee.ust.hk) Department of Electrical and Electronic Engineering Hong Kong University of Science and

### Conditional Random Fields: An Introduction

Conditional Random Fields: An Introduction Hanna M. Wallach February 24, 2004 1 Labeling Sequential Data The task of assigning label sequences to a set of observation sequences arises in many fields, including

### SmartDispatch: Enabling Efficient Ticket Dispatch in an IT Service Environment

SmartDispatch: Enabling Efficient Ticket Dispatch in an IT Service Environment Shivali Agarwal IBM Research Bengaluru, India shivaaga@in.ibm.com Renuka Sindhgatta IBM Research Bengaluru, India renuka.sr@in.ibm.com

### Projektgruppe. Categorization of text documents via classification

Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

### Robust Sentiment Detection on Twitter from Biased and Noisy Data

Robust Sentiment Detection on Twitter from Biased and Noisy Data Luciano Barbosa AT&T Labs - Research lbarbosa@research.att.com Junlan Feng AT&T Labs - Research junlan@research.att.com Abstract In this

### Automated Content Analysis of Discussion Transcripts

Automated Content Analysis of Discussion Transcripts Vitomir Kovanović v.kovanovic@ed.ac.uk Dragan Gašević dgasevic@acm.org School of Informatics, University of Edinburgh Edinburgh, United Kingdom v.kovanovic@ed.ac.uk

### Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering

21st International Congress on Modelling and Simulation, Gold Coast, Australia, 29 Nov to 4 Dec 2015 www.mssanz.org.au/modsim2015 On the Feasibility of Answer Suggestion for Advice-seeking Community Questions

### Bayesian Spam Filtering

Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

### Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

### Summarizing Online Forum Discussions Can Dialog Acts of Individual Messages Help?

Summarizing Online Forum Discussions Can Dialog Acts of Individual Messages Help? Sumit Bhatia 1, Prakhar Biyani 2 and Prasenjit Mitra 2 1 IBM Almaden Research Centre, 650 Harry Road, San Jose, CA 95123,

### Resolving Common Analytical Tasks in Text Databases

Resolving Common Analytical Tasks in Text Databases The work is funded by the Federal Ministry of Economic Affairs and Energy (BMWi) under grant agreement 01MD15010B. Database Systems and Text-based Information

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### An Introduction to Data Mining

An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

### Predictive Coding Defensibility and the Transparent Predictive Coding Workflow

Predictive Coding Defensibility and the Transparent Predictive Coding Workflow Who should read this paper Predictive coding is one of the most promising technologies to reduce the high cost of review by

### Context Aware Predictive Analytics: Motivation, Potential, Challenges

Context Aware Predictive Analytics: Motivation, Potential, Challenges Mykola Pechenizkiy Seminar 31 October 2011 University of Bournemouth, England http://www.win.tue.nl/~mpechen/projects/capa Outline

### Hexaware E-book on Predictive Analytics

Hexaware E-book on Predictive Analytics Business Intelligence & Analytics Actionable Intelligence Enabled Published on : Feb 7, 2012 Hexaware E-book on Predictive Analytics What is Data mining? Data mining,

### The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

### INTERNATIONAL JOURNAL OF ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY An International online open access peer reviewed journal

INTERNATIONAL JOURNAL OF ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY An International online open access peer reviewed journal Research Article ISSN 2277 9140 ABSTRACT Web page categorization based

### Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged

### Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

### Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University

### Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

### CHAPTER 1 INTRODUCTION

1 CHAPTER 1 INTRODUCTION Exploration is a process of discovery. In the database exploration process, an analyst executes a sequence of transformations over a collection of data structures to discover useful

### A Classification-based Approach to Question Answering in Discussion Boards

A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong and Brian D. Davison Department of Computer Science and Engineering Lehigh University Bethlehem, PA 18015 USA {lih307,davison}@cse.lehigh.edu

### Mining Signatures in Healthcare Data Based on Event Sequences and its Applications

Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Siddhanth Gokarapu 1, J. Laxmi Narayana 2 1 Student, Computer Science & Engineering-Department, JNTU Hyderabad India 1

### Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware

Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware Cumhur Doruk Bozagac Bilkent University, Computer Science and Engineering Department, 06532 Ankara, Turkey

### Hotel Review Search. CS410 Class Project. Abstract: Introduction: Shruthi Sudhanva (sudhanv2) Sinduja Subramaniam (ssbrmnm2) Guided by Hongning Wang

Hotel Review Search CS410 Class Project Shruthi Sudhanva (sudhanv2) Sinduja Subramaniam (ssbrmnm2) Guided by Hongning Wang Abstract: Hotel reviews are available in abundance today. This project focuses

### 131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

### Social Media Analytics

Social Media Analytics Raghu Krishnapuram and Jitendra Ajmera IBM Research - India 2011 IBM Corporation Convergence of Social and Analytic Technologies Transform the Way the World Operates Socially Synergistic

### Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

### Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Pranali Chilekar 1, Swati Ubale 2, Pragati Sonkambale 3, Reema Panarkar 4, Gopal Upadhye 5 1 2 3 4 5

### Research of Postal Data mining system based on big data

3rd International Conference on Mechatronics, Robotics and Automation (ICMRA 2015) Research of Postal Data mining system based on big data Xia Hu 1, Yanfeng Jin 1, Fan Wang 1 1 Shi Jiazhuang Post & Telecommunication

### Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

### Musical Pitch Identification

Musical Pitch Identification Manoj Deshpande Introduction Musical pitch identification is the fundamental problem that serves as a building block for many music applications such as music transcription,

### Solution Documentation for Custom Development

Version: 1.0 August 2008 Solution Documentation for Custom Development Active Global Support SAP AG 2008 SAP AGS SAP Standard Solution Documentation for Custom Page 1 of 53 1 MANAGEMENT SUMMARY... 4 2

### A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

### Paper Classification for Recommendation on Research Support System Papits

IJCSNS International Journal of Computer Science and Network Security, VOL.6 No.5A, May 006 17 Paper Classification for Recommendation on Research Support System Papits Tadachika Ozono, and Toramatsu Shintani,

### Word Completion and Prediction in Hebrew

Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

### Some Research Challenges for Big Data Analytics of Intelligent Security

Some Research Challenges for Big Data Analytics of Intelligent Security Yuh-Jong Hu hu at cs.nccu.edu.tw Emerging Network Technology (ENT) Lab. Department of Computer Science National Chengchi University,

### Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Predictive Coding Defensibility and the Transparent Predictive Coding Workflow

WHITE PAPER: PREDICTIVE CODING DEFENSIBILITY........................................ Predictive Coding Defensibility and the Transparent Predictive Coding Workflow Who should read this paper Predictive

### CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,

### Three Methods for ediscovery Document Prioritization:

Three Methods for ediscovery Document Prioritization: Comparing and Contrasting Keyword Search with Concept Based and Support Vector Based "Technology Assisted Review-Predictive Coding" Platforms Tom Groom,

### Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm

### An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them

An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them Vangelis Karkaletsis and Constantine D. Spyropoulos NCSR Demokritos, Institute of Informatics & Telecommunications,

### Subordinating to the Majority: Factoid Question Answering over CQA Sites

Journal of Computational Information Systems 9: 16 (2013) 6409 6416 Available at http://www.jofcis.com Subordinating to the Majority: Factoid Question Answering over CQA Sites Xin LIAN, Xiaojie YUAN, Haiwei

### MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

### SPATIAL DATA CLASSIFICATION AND DATA MINING

, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

### Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

### The Scientific Data Mining Process

Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

### Machine Learning Log File Analysis

Machine Learning Log File Analysis Research Proposal Kieran Matherson ID: 1154908 Supervisor: Richard Nelson 13 March, 2015 Abstract The need for analysis of systems log files is increasing as systems

### A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks Text Analytics World, Boston, 2013 Lars Hard, CTO Agenda Difficult text analytics tasks Feature extraction Bio-inspired

### T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

### Using Artificial Intelligence to Manage Big Data for Litigation

FEBRUARY 3 5, 2015 / THE HILTON NEW YORK Using Artificial Intelligence to Manage Big Data for Litigation Understanding Artificial Intelligence to Make better decisions Improve the process Allay the fear

### I, ONLINE KNOWLEDGE TRANSFER FREELANCER PORTAL M. & A.

ONLINE KNOWLEDGE TRANSFER FREELANCER PORTAL M. Micheal Fernandas* & A. Anitha** * PG Scholar, Department of Master of Computer Applications, Dhanalakshmi Srinivasan Engineering College, Perambalur, Tamilnadu

### Spam detection with data mining method:

Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,

### A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

### Sentiment Analysis on Big Data

SPAN White Paper!? Sentiment Analysis on Big Data Machine Learning Approach Several sources on the web provide deep insight about people s opinions on the products and services of various companies. Social

### III. DATA SETS. Training the Matching Model

A Machine-Learning Approach to Discovering Company Home Pages Wojciech Gryc Oxford Internet Institute University of Oxford Oxford, UK OX1 3JS Email: wojciech.gryc@oii.ox.ac.uk Prem Melville IBM T.J. Watson

### Term extraction for user profiling: evaluation by the user

Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

### Blog Post Extraction Using Title Finding

Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

### A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

### Sentiment analysis on tweets in a financial domain

Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International

### Developing a Collaborative MOOC Learning Environment utilizing Video Sharing with Discussion Summarization as Added-Value

, pp. 397-408 http://dx.doi.org/10.14257/ijmue.2014.9.11.38 Developing a Collaborative MOOC Learning Environment utilizing Video Sharing with Discussion Summarization as Added-Value Mohannad Al-Mousa 1

### A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

### International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

### Machine Learning with MATLAB David Willingham Application Engineer

Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the

### MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of

### PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### A Comparative Study on Sentiment Classification and Ranking on Product Reviews

A Comparative Study on Sentiment Classification and Ranking on Product Reviews C.EMELDA Research Scholar, PG and Research Department of Computer Science, Nehru Memorial College, Putthanampatti, Bharathidasan

### Mining Navigation Histories for User Need Recognition

Mining Navigation Histories for User Need Recognition Fabio Gasparetti and Alessandro Micarelli and Giuseppe Sansonetti Roma Tre University, Via della Vasca Navale 79, Rome, 00146 Italy {gaspare,micarel,gsansone}@dia.uniroma3.it

### Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project

Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project Ahmet Suerdem Istanbul Bilgi University; LSE Methodology Dept. Science in the media project is funded

### Highly Efficient ediscovery Using Adaptive Search Criteria and Successive Tagging [TREC 2010]

1. Introduction Highly Efficient ediscovery Using Adaptive Search Criteria and Successive Tagging [TREC 2010] by Ron S. Gutfinger 12/3/2010 1.1. Abstract The most costly component in ediscovery is the

### II. RELATED WORK. Sentiment Mining

Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Improving Web Image

### Combining Global and Personal Anti-Spam Filtering

Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized

### Distinguishing Opinion from News Katherine Busch

Distinguishing Opinion from News Katherine Busch Abstract Newspapers have separate sections for opinion articles and news articles. The goal of this project is to classify articles as opinion versus news

### Email Set Up Instructions

Email Set Up Instructions Email Support...1 Important Fist Steps for all Users...2 DO YOU KNOW YOUR USERNAME AND PASSWORD?...2 Install the Despaminator CompanyV Certificate...2 What is Your Mail Client?...2

### Clustering Connectionist and Statistical Language Processing

Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised

### Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing