Classification of Virtual Investing-Related Community Postings




Balaji Rajagopalan*, Oakland University, rajagopa@oakland.edu
Matthew Wimble, Oakland University, mwwimble@oakland.edu
Prabhudev Konana*, University of Texas at Austin, pkonana@utexas.edu
Chan-Gun Lee, University of Texas at Austin, cglee@utexas.edu

ABSTRACT

The rapid growth of online investing and virtual investing-related communities (VICs) has a wide-ranging impact on research, practice, and policy. Given the enormous volume of postings on VICs, automated classification of messages to extract relevance is critical. Classification is complicated by three factors: (a) the amount of irrelevant or "noise" messages (e.g., spam, insults), (b) the highly unstructured nature of the text (e.g., abbreviations), and (c) the wide variation in relevancy for a given firm. We develop and validate an approach based on a variety of classifiers to identify: (1) "noisy" messages that bear no relevance to the topic, (2) messages that contain no sentiment about the investment but are relevant to the topic, and (3) messages that contain sentiment and are relevant. Preliminary results show sufficient promise to classify messages.

Keywords: Sentiment extraction, virtual communities, text mining, message boards, online investing, genetic algorithms

Proceedings of the Tenth Americas Conference on Information Systems, New York, New York, August 2004

INTRODUCTION

Over 20 million investors trade online in the U.S. The rapid growth is attributed to low costs, convenience, easy access to information, and control (Konana and Balasubramanian, 2004). The increase in do-it-yourself investors is also associated with intense use of virtual investing-related communities (VICs) such as those on Yahoo! and Morningstar. VICs provide platforms to seek, disseminate, and discuss stock-related information. As a first step in research to understand how information is created and diffused within VICs, this study seeks to develop and validate a mechanism to automatically classify VIC message postings. The challenges are numerous: the anonymity and ease of posting information are conducive to a significant volume of irrelevant messages, or noise. Noise may come from insults, unsolicited advertisements (i.e., spam), and digressions. Further, it is critical to explicate the sentiment (the thought, view, or attitude) expressed in the message. Das and Chen (2001) were among the first to attempt to extract the emotive content of such messages. In this paper, we build upon their work and test a methodology to extract the relevance and sentiment of messages within the context of VICs.

Classifying the emotive content of messages posted on VICs poses several problems. The first is the nature of the classification itself. Messages can be noise, perfectly relevant, or ambiguous. The subjective nature of some messages may lead to disagreement among readers as to whether a given message is truthful, important, or reliable. For instance, it is common to find postings such as "XYZ sucks" (XYZ refers to some stock symbol) without any elaboration. While such a message appears to be noise, it also seems to provide some useful information. Such problems have been encountered in previous text classification research.
Foltz et al. (1999) used classifiers for automated grading of student projects where there was disagreement among three human graders. Second, messages generally do not observe proper grammatical rules and spelling, and therefore readability measures on their own may be less effective. Users abbreviate many words (e.g., "u" for "you", "L8er" for "later") and generally ignore spelling errors. Third, the problem is further compounded by semantic differences based on context. For example, in drug companies much of the conversation centers on potential products that are in the pipeline of research and development, whereas in the energy sector the term "pipeline" takes on a very different connotation. Each industry and company has rather different combinations of words that are often used within relevant conversations. Thus, finding a common set of words for classification is challenging.

METHODOLOGY

We adopt the general multi-algorithmic approach of Das and Chen (2001) in our study. However, we differentiate our classification method in several ways. First, we attempt to develop classifiers that are more generic, with applicability to a broad range of virtual investing-related postings, as opposed to developing classifiers for individual stocks. Second, we propose to classify the messages into a set of three categories: Noise, Relevant (also referred to as No Signal), and Signal. As a comparison, Das and Chen (2001) focused on Signal and Noise only. We consider a message to be noise if its content is spam or completely unrelated to the message board topic. We consider a message to be relevant (No Signal) if the content relates to the stock in particular and/or the market in general with implications for the stock. We consider a message to be a signal carrier if and only if it is relevant and a discernible sentiment (positive or negative) is expressed toward the stock.
Through the process of manually examining a random sample of several hundred messages, we discovered the need for the third category, Relevant (but no signal content). Third, we design and develop a classifier based on readability analysis, with theoretical foundations in reading and writing, to classify the messages. Fourth, we apply evolutionary computing methods (genetic algorithms) to induce classification rule sets. Fifth, Das and Chen (2001) did not classify every message; a significant portion of their messages were grouped into an unclassified category, whereas our approach classifies all messages.

The methodology to extract relevance was carried out in four steps: sample selection and preparation, classifier development, testing and validation, and application. Sample selection involved random selection of 482 messages across several message boards. Sample preparation was carried out by manually coding each message into one of the three categories. Two graduate students were briefed on the criteria for classifying messages into the three categories. Inter-rater reliability (> 80%) of their classification indicated a high degree of consensus. The small number of messages that were classified differently by the students was revisited and a consensus reached regarding their categorization.

The second step, classifier development, involved designing five classifiers based on different theoretical underpinnings. As mentioned earlier, this multi-algorithmic approach is consistent with earlier attempts (Das and Chen, 2001). The five relevance extraction models, namely the Lexicon-based Classifier (LBC), Readability-based Classifier (RBC), Weighted Lexicon Classifier (WLC), Vector Distance Classifier (VDC), and Differential Weights Lexicon Classifier (DWLC), are described in

the next section. A sixth classifier, combining the outputs of each of the five classifiers, was developed along the lines proposed by Das and Chen (2001). The third step involved testing the classifiers on a subset of the sample that was quarantined and not used for inducing the rule sets. Classification rates were then examined and the classifiers refined to improve relevance extraction accuracy. The final step involved applying the classification method to a larger data set and reporting the categorization distribution.

Relevance Extraction Models

In this section we detail the five classification mechanisms implemented for this study. For three of the classifiers we tried two different approaches to analyzing the same inputs. We also detail a sixth classifier that combines the output of the five classifiers to categorize the messages.

Lexicon-based Classifier (LBC)

LBCs have been effectively used in earlier studies (Das and Chen, 2001). To design an LBC, we first developed a set of frequently occurring keywords for each of the three categories: Noise (C_1), Relevant (C_2), and Signal (C_3). An LBC categorizes a message m_l, where l = {1, 2, ..., M}, by matching the message content against the set of keywords for each category and classifying the message as belonging to the category with the highest number of matches. We implemented two versions of the LBC. The first implementation used simple counts of keywords, such that message m_l belongs to category C_i if C_i has the maximum n(k_i), where n(k_i) represents the number of keyword matches for the i-th category. In instances where two or more categories tie for the maximum, we choose the lowest index i. Formally, Category(m_l) = C_i, where i is the index for which sum_j Count(m_l, Key_ij) = max_{k=1,2,3} { sum_j Count(m_l, Key_kj) }, where Key_ij is keyword j in the keyword list for category i, and Count(m_l, Key_ij) returns the number of occurrences of Key_ij in the message m_l.
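As an illustration, the Count operation can be sketched in Python. This is a minimal sketch; the paper does not specify its tokenization rules, so case-insensitive, whole-word matching is an assumption here.

```python
import re

def count(message: str, keyword: str) -> int:
    """Number of occurrences of `keyword` in `message`.

    Tokenizes on runs of letters, digits, and apostrophes, then matches
    whole tokens case-insensitively (an assumption; the paper does not
    describe its matching rules)."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return tokens.count(keyword.lower())

print(count("I am too tired, too", "too"))  # -> 2
```

An LBC would apply this operation to every keyword in each category's list and sum the results per category.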
For example, Count("I am too tired, too", "too") returns 2. In the second implementation, we induced a decision set to fit an if-then decision tree (Figure 1) using a genetic algorithm to optimize the total number correctly classified, encoding the solution set as a 10-element array, with the elements being (a) the three variables to be used, encoded as ordinal numbers {1, 2, 3}; (b) the decision points, each represented by an integer bounded by the maximum and minimum values of the set; and (c) the results of the dependent if-then rules, encoded as integers {0, 1, 2} representing the message category.

Figure 1: Induced decision set for LBC

Readability-based Classifier (RBC)

This classifier is based upon research on readability analysis. The initial rationale was that the RBC could detect noise if noisy messages were written more hastily than relevant messages; the degree of thought or haste would be expressed in readability or grade-level measurements. We then induced a decision set in the same manner described in the LBC section using word count, percentage of unique words, and number of unique words. Two decision sets performed equally well on the training set, and we elected to try both sets independently (Figures 2 and 3).
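A sketch of the RBC's approach follows: compute the three readability inputs, then run them through a threshold decision set. The thresholds below are invented for illustration only; the induced thresholds appear in Figures 2 and 3 of the paper.

```python
def readability_features(message: str) -> dict:
    """Word count, number of unique words, and percentage of unique
    words: the three inputs used by the RBC's induced decision sets."""
    words = message.lower().split()
    unique = set(words)
    return {
        "wc": len(words),
        "uw": len(unique),
        "uw_pct": 100.0 * len(unique) / len(words) if words else 0.0,
    }

def rbc_classify(message: str) -> int:
    """Apply an illustrative threshold decision set.

    Returns 0 (Noise), 1 (Relevant / No Signal), or 2 (Signal).
    The threshold values here are hypothetical, not the induced ones."""
    f = readability_features(message)
    if f["wc"] > 60:                        # longer, more developed message
        return 2 if f["uw_pct"] > 47.0 else 1
    return 1 if f["uw"] > 5 else 0          # very short, repetitive -> Noise
```

A genetic algorithm would search over which features and thresholds to use at each node, keeping the tree that classifies the most training messages correctly.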

Figure 2: Induced decision set #1 for RBC
Figure 3: Induced decision set #2 for RBC

Weighted Lexicon Classifier (WLC)

This classifier, as its name suggests, is a variation of the LBC. The WLC overcomes a limitation of the LBC, which relies on absolute counts to categorize. In the WLC this bias is adjusted for by taking the ratio of the number of keyword matches n(k_i) to the message size. To eliminate the keyword-size bias, the WLC bases its classification on max_i [ n(k_i) / N_l ], where N_l represents the word count for message m_l. More formally, Category(m_l) = C_i, where i is the index for which sum_j Count(m_l, Key_ij) / N_l = max_{k=1,2,3} { sum_j Count(m_l, Key_kj) / N_l }. We then induced a decision set in the same manner described in the LBC section using the percentage measures (Figure 4).

Figure 4: Induced decision set for WLC

Vector Distance Classifier (VDC)

We implement the VDC as described in Das and Chen (2001). This method treats each message as a word vector in D-dimensional space, where D represents the size of the keyword list. The proximity between a message m_l and a grammar rule G_j is computed from the angle between Vector(m_l) and Vector(G_j), where Vector(V) is the D-dimensional word vector for V. Message m_l belongs to the category of the grammar rule G_k for which the computed angle is minimal, which means that the proximity is maximal. Formally, Category(m_l) = the category of G_k, where G_k is a grammar rule and k is the index for which Cos(Vector(m_l), Vector(G_k)) = max_{j=1..Sizeof(Grammar)} { Cos(Vector(m_l), Vector(G_j)) }, where Cos(V_1, V_2) = (V_1 . V_2) / (|V_1| |V_2|).

Differential Weights Lexicon Classifier (DWLC)

This classifier represents another variation of the LBC. By assigning differential weights to each word in the lexicon (the keyword list for each category), this mechanism recognizes the varying importance of each keyword in classification and overcomes the equal-weight bias of the LBC.
In the DWLC, message m_l belongs to category C_i if it has the

maximum sum_j Weight_ij n(k_ij), where Weight_ij represents the weight for the j-th keyword in the i-th category and n(k_ij) counts the number of matches for the j-th keyword in the i-th category. More formally, Category(m_l) = C_i, where i is the index for which sum_j Weight_ij Count(m_l, Key_ij) = max_{k=1,2,3} { sum_j Weight_kj Count(m_l, Key_kj) }. We then induced a decision tree (Figure 5) in the same manner described in the LBC section using the weighted measures.

Figure 5: Decision set for DWLC

Combined Majority Voting Classifier (CVC)

A sixth classifier is designed by combining the outputs of the five classifiers using a simple majority voting mechanism. If each classifier categorizes (votes for) message m_l as belonging to some category C_i, this combination classifier simply relies on the number of votes each category receives to determine which category m_l belongs to. For example, if three of the five classifiers vote for m_l belonging to C_1, then by the simple majority principle message m_l is categorized as belonging to C_1.

RESULTS AND DISCUSSION

Classification results for the LBC, WLC, and DWLC classifiers are presented based on two approaches: induced decision trees (Figure 6) and word counts (Figure 7). We also present the results for the 2.7 million messages (Figure 8).

Figure 6: Classifier performance (decision trees approach)
Figure 7: Classifier performance (word count approach)
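The CVC's simple majority rule described above can be sketched as follows. Breaking ties toward the lowest category index is an assumption; the paper does not state how ties among the five votes are resolved.

```python
from collections import Counter

def majority_vote(votes):
    """Pick the category receiving the most votes from the five
    classifiers. Ties go to the lowest category index (an assumption;
    the paper does not specify its tie-breaking rule)."""
    tally = Counter(votes)
    top = max(tally.values())
    return min(cat for cat, n in tally.items() if n == top)

# Three of five classifiers vote for category 1 (Relevant / No Signal):
print(majority_vote([0, 1, 1, 2, 1]))  # -> 1
```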

Figure 8: Extracting relevance: classification results for 2.7M messages

Performance Analysis

Overall, initial results show that the decision tree approach offers better performance than the word count approach when we have a limited keyword set. For noise classification, the RBC, DWLC, and VDC performed well. The RBC was designed as a noise detection mechanism, and the results indicate its utility for that purpose. The DWLC performed significantly better than the LBC or WLC, which suggests that some words, such as profanity, are stronger indicators of noise than others. Signal classification rates were highest for the RBC, but the results are misleading when only the correct percentage is considered, since the RBC failed to classify any relevant-but-no-signal messages. Classification accuracy for No Signal messages was less than 30% for all classifiers besides the RBC. Poor performance on No Signal messages is probably due to the difficulty of differentiating No Signal from Signal messages. The combined classifier performed well in identifying signals but did relatively poorly on No Signal. Several design factors pose significant challenges to our approach: a limited keyword list, inherent message ambiguity, the increased complexity of the three-category problem, and the forced classification of all messages. Based on this exploratory study, potential improvements to the approach include word-list expansion, altering the voting mechanism, and more sophisticated word-matching techniques.

Conclusions

This study aimed at developing a classifier to automate extracting relevance from free-text messages on stock bulletin boards. Preliminary results indicate sufficient promise for the proposed approach. We are further refining our technique by improving the word set and integrating well-known algorithms for matching similar words, namely soundex indexing and edit distance.
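The edit-distance component mentioned in the conclusion can be sketched with the standard dynamic-programming (Levenshtein) formulation. This is a generic implementation, not the authors' code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`.
    Uses the classic row-by-row dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]

# Useful for matching message-board spellings to lexicon entries:
print(edit_distance("l8er", "later"))  # -> 2
```

A word-matching step might accept a keyword match when the distance falls below a small threshold relative to word length.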
ACKNOWLEDGMENTS

* This material is based upon work supported by the National Science Foundation (NSF) under Grant No. 21917 to Balaji Rajagopalan and NSF Grant No. 218988 to Prabhudev Konana. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

REFERENCES

1. Das, S. R. and Chen, M. (2001) Yahoo! for Amazon: Sentiment parsing from small talk on the web, Proceedings of the 8th Asia Pacific Finance Association Annual Conference.
2. Foltz, P. W., Laham, D. and Landauer, T. K. (1999) Automated essay scoring: Applications to educational technology, Proceedings of EdMedia '99.
3. Holland, J. (2000) Building blocks, cohort genetic algorithms, and hyperplane-defined functions, Evolutionary Computation, 8, 4, 373-391.
4. Konana, P. and Balasubramanian, S. (2004, forthcoming) The social-economic-psychological model of technology adoption and usage: An application to online investing, Decision Support Systems.
5. Wysocki, P. D. (1999) Cheap talk on the web: The determinants of postings on stock message boards, Working Paper No. 9825, University of Michigan.