An Insight Of Sentiment Analysis In The Financial News

Available online at www.globalilluminators.org

Global Illuminators
FULL PAPER PROCEEDING
Multidisciplinary Studies
Full Paper Proceeding ICMRP-2014, Vol. 1, 278-291
ISBN: 978-969-9948-08-4

Sepideh Foroozan Yazdnai 1, Masrah Azrifah Azmi Murad 2, Nurfadhlina binti Mohd Sharef 3, Yashwant Prasad Singh 4, Ahmed Razman bin Abdul Latiff 5
1,2,3,5 University Putra Malaysia (UPM)
4 Manav Rachna College of Engineering, Sector-43, Surajkund Road, Faridabad, India

Abstract

With the expansion of Web 2.0 and the advent of social networks, blogs, and online news sources, analysts have to process enormous amounts of real-time, unstructured data. Predicting stock market trends and sentiment from financial news is one such task. Financial news can be of various types, such as recent earnings statements, information about the latest products, a company's declaration of profits, and similar items. These sources usually contain key factors that affect the stock market in different ways, for instance through stock returns, price volatility, and future firm earnings. There is therefore a vital need for approaches that find sentiment and polarity in these text corpora. This is where sentiment analysis tools and techniques can be employed to capture the main concept of a text by extracting important keywords from the financial news. Despite the large number of recent publications on sentiment mining in financial news, many problems remain. For example, a whole news article may not be useful for analysis or mining, because most stock market news includes a comparison of several companies or even of parts of the economy.
Hence, improved techniques appear necessary for separating and determining the sentiment and polarity of words, phrases, and sentences, so that proper expressions can be extracted as features for high-accuracy sentiment analysis. This paper reviews current sentiment analysis techniques involving machine learning and text mining in the financial domain, with the aim of predicting the stock market from financial news.

© 2014 The Authors. Published by Global Illuminators. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Scientific & Review committee of ICMRP-2014.

Keywords: Sentiment Analysis and Classification; Financial News; Machine Learning; Stock Market; Text Mining.

*All correspondence related to this article should be directed to Sepideh Foroozan Yazdnai, University Putra Malaysia (UPM). Email: foroozan.sepideh@gmail.com

Introduction

In recent years, a huge amount of information has become accessible in text format for investment and research analysis. Investors and researchers can easily access the information they want through a variety of channels on the Internet. According to the Efficient Market Hypothesis (Fama, 1965), all available information is reflected in market prices. Hence news, and financial news in particular, plays an essential role when investors judge stock prices, because it gathers vital information about a firm's fundamentals and the expectations of other market participants. Financial news consists of qualitative and quantitative information of various types and from diverse sources, such as corporate disclosures and news articles.

Most prior research has used text mining techniques to analyze incoming news. Traditionally, text categorization seeks to classify documents by topic. The structure of a topic-based classification can therefore be user- and application-dependent, which leads to classifications that differ from one domain to another (Pang & Lee, 2008). Various researchers have investigated predicting stock prices by text mining of financial news; the reported directional accuracy of the forecasts varies from 45% to 60% (Mittermayer, 2004; Schumaker, Zhang, Huang, & Rochelle, 2009), which is far from ideal.

Over the last decade, researchers have become interested in automatically detecting sentiment in text. Sentiment analysis is a kind of subjectivity analysis that looks for standpoints in text and distinguishes polarity, or semantic orientation, by analyzing words and phrases. Unlike traditional classification, sentiment classification usually has only a few classes (such as positive and negative) that generalize across many domains. Moreover, templates for sentiment-oriented information extraction sometimes generalize across domains, so that the set of fields (such as holder, type, and strength) for each sentiment extraction is similar regardless of the topic.
This review paper mainly presents machine learning methods for sentiment analysis of financial news.

RELATED WORK

Koppel and Shtrimberg (2006) proposed a model based on lexical features that could distinguish good news from bad news with an accuracy of about 70%. They also suggested a new method for generating labeled data for sentiment analysis from price changes: the price of a stock at the market opening after a news item was released was compared with its price at the market close on the day before the item was published. For a news item to be labeled as a positive example, its price change had to exceed a given threshold and also be in excess of the change in the overall S&P (Standard & Poor's) 500 index; a story was labeled positive if the stock price climbed 10% or more and negative if it fell 7.8% or more. The authors used all words that appeared at least sixty times in the corpus, eliminating function words except for a few relevant ones such as below, up, above, and down. The results showed that there were no markers for positive stories; these were instead characterized by the absence of negative markers. As a consequence, recall for positive stories was high but precision was much lower. The methodology selected the 100 features with the highest information gain in the training set and used a linear SVM, Naïve Bayes, and a decision tree to learn a model.

Généreux, Poibeau, and Koppel (2011) proposed a model that builds on the work of Koppel and Shtrimberg (2006). The work investigated the subjective use of language in financial news about publicly traded companies and also validated an automated labeling system. Unlike the earlier study, it compared several types of feature selection. The researchers worked with short financial news items about firms, using their vocabulary to make the future direction of the market explicit. The study also investigated how sentiment-bearing vocabulary can be extracted automatically from financial news for classification.

A framework called AZFinText (Arizona Financial Text System) was proposed to examine discrete stock price prediction using text processing techniques and Support Vector Machine (SVM) regression; it partitioned articles by similar industry and compared its results against quantitative funds and human stock pricing experts (Schumaker et al., 2009). In this research, each financial news article is represented using four textual methods: Bag of Words, Noun Phrases, Named Entities, and Proper Nouns. The extracted features are limited to those with three or more occurrences in a document, to avoid choosing terms that occur rarely. The result was a directional prediction accuracy of 71.18%.

Zhai, Cohen, and Atreya (2011) focused on computer analysis of publicly available news reports in order to give traders recommendations on buying and selling stocks. Two approaches were used to produce the sentiment labels for the training and testing data.
The first was a manual approach in which an expert read the articles and classified their sentiment; the second was an automatic approach based on market movements. The features were unigrams and bigrams, and the words in article headlines and in article bodies were used as two separate feature sets. These sentiment words could be used as input to trading systems for predicting the daily market trend.

Alvim, Vilela, Motta, and Milidiú (2010) improved the sentiment classification accuracy of a classical bag-of-words approach by using natural language pre-processing methods. The features were provided by part-of-speech tagging, text chunking, and negation. Support Vector Machines and Naïve Bayes were used for sentiment classification. The results showed that Support Vector Machines classified sentiment significantly better than Naïve Bayes.

Within the financial domain, attention has also turned to more subjective sources of financial information such as financial blogs (O'Hare et al., 2009). In that study, the authors produced 1500 document-level annotations. Since most individual blog posts discuss more than one topic, they employed text-extraction approaches to extract the parts of a document most relevant to a given topic. The text-extraction approaches considered were N-paragraph, N-sentence, and N-word. For example, N-word takes a given number, N, of words on either side of any topic word, and the other two cases work analogously. All of them achieved improvements over the document level.

Some of these studies (Koppel & Shtrimberg, 2006; Généreux et al., 2011; Zhai et al., 2011) have focused more on the relationship between sentiment analysis and stock market movement, while others (Schumaker et al., 2009; Alvim et al., 2010; O'Hare et al., 2009) have concentrated on feature extraction. Although Généreux et al. (2011) and Schumaker et al. (2009) use some feature selection methods to choose proper features, none of these studies seriously investigates reducing the dimension of the feature space in order to improve classification. Table I lists the techniques used in these related works, which include supervised machine learning and, in particular, Support Vector Machine (SVM) regression. It is also clear that sophisticated and appropriate feature extraction and selection are required for efficient, high-accuracy sentiment classification.

SENTIMENT ANALYSIS

SENTIMENT ANALYSIS IN BRIEF

Sentiment analysis seeks to recognize and analyze text containing sentiments, opinions, and biases. Esuli and Belzoni (2005) identified three specific subtasks that make up sentiment analysis: subjectivity, polarity, and polarity strength.

Subjectivity: Identifying subjectivity involves deciding whether a piece of text is factual or subjective. Subjectivity classification determines whether the sentences in a text convey an opinion, based on the words and form used by the author. Subjectivity may be detected from the number of sentiment-bearing features, such as adjectives, within a sentence, although sentences may sometimes carry sentiment without any specific and obvious sign (Pang & Lee, 2008).
Polarity: The polarity task involves deciding whether a given opinionated sentence carries a positive or a negative standpoint. Opinion mining is a recent subdiscipline of information retrieval and computational linguistics that helps determine the subjectivity expressed within a document. For example, SentiWordNet [1] is a lexical resource for sentiment and opinion mining that assigns each WordNet synset three sentiment scores: positivity, negativity, and objectivity (Mayne, 2010).

Polarity Strength: In the finance domain, polarity strength can be important because it indicates the intensity of an opinion, which can reflect the author's confidence about the subject or event. As mentioned above, SentiWordNet provides a quantitative strength indicating how positive or negative a synset may be; however, it may not be a strong enough resource in this domain. Other resources, such as a financial gazetteer, can help identify the strength of the author's opinion through features such as up or down, and by how much (Mayne, 2010).

Sentiment analysis combines several fields: Natural Language Processing (NLP), Machine Learning (ML), and Pattern Recognition. Each of these fields raises challenges that need to be considered when working on sentiment analysis.

[1] http://sentiwordnet.isti.cnr.it/
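As a rough illustration of the polarity and polarity strength subtasks, the following Python sketch scores a sentence with a tiny hand-made lexicon. The lexicon entries, their numeric values, and the function names are assumptions made for this example; they are not actual SentiWordNet scores, and SentiWordNet assigns scores to synsets rather than to surface words.

```python
# Hypothetical mini-lexicon: word -> (positivity, negativity).
# The numbers are invented for illustration and are NOT real SentiWordNet values.
TOY_LEXICON = {
    "gain": (0.75, 0.0),
    "strong": (0.5, 0.0),
    "loss": (0.0, 0.75),
    "weak": (0.0, 0.5),
}


def word_scores(word):
    """Return (positivity, negativity, objectivity) for one word; the three sum to 1."""
    pos, neg = TOY_LEXICON.get(word.lower(), (0.0, 0.0))
    return pos, neg, 1.0 - pos - neg


def sentence_sentiment(sentence):
    """Return (polarity, strength) computed from the summed word scores."""
    # Very simple tokenization: split on whitespace and strip common punctuation.
    tokens = [t.strip(".,;:!?") for t in sentence.lower().split()]
    pos_total = sum(word_scores(t)[0] for t in tokens)
    neg_total = sum(word_scores(t)[1] for t in tokens)
    net = pos_total - neg_total
    if net > 0:
        polarity = "positive"
    elif net < 0:
        polarity = "negative"
    else:
        polarity = "neutral"
    # The absolute net score serves as a crude polarity strength.
    return polarity, abs(net)
```

With this toy lexicon, `sentence_sentiment("Strong gain despite weak demand")` returns `("positive", 0.75)`. A word-level sum like this ignores negation and domain context, which is one reason a general-purpose lexicon may be too weak for financial text.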

A. Natural Language Processing (NLP)

NLP is a combination of computer science, artificial intelligence, and linguistics that is concerned with the interactions between computers and human (natural) languages. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input. Natural Language Processing can usually be applied to text in its raw or marked-up format. This means that a text corpus may either be simply in human-readable form, or it may be annotated with meta-information about the text itself, such as the case or gender of a word.

B. Machine Learning (ML)

Machine learning is a broad sub-field of Artificial Intelligence. An intelligent machine can adapt to its environment without intervention by a user and can optimize its problem-solving performance using example data. Machine learning is typically used in applications such as web services, virus detection, and sentiment analysis. Its central issue is the capability of a system to learn from experience. The purpose of a machine learning problem is to predict or estimate the unknown value of an attribute $y$ (the output) of a system using the known values of other attributes $x = (x_1, x_2, \ldots, x_n)$, which are referred to as input or predictor variables. The classifier takes the form $\hat{y} = f(x_1, x_2, \ldots, x_n) = f(x)$, which maps a set of inputs to a value $\hat{y}$ for the output variable. The goal is to design an accurate target function $f(x)$, as illustrated by the simple definition of machine learning in Equation (1) (Hamel, 2009):

A data universe $X$
A sample set $S$, where $S \subset X$
Some target function (labeling process) $f : X \to \{+, -\}$
A labeled training set $T$, where $T = \{(x, y) \mid x \in S \text{ and } y = f(x)\}$    (1)

Learning problems are classified into supervised learning, unsupervised learning, and semi-supervised learning.
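To make the notation of Equation (1) concrete, here is a minimal Python sketch that builds a data universe X, a sample set S, a labeling process f, and the training set T, and then predicts a label for an unseen x. The three binary features, the labeling rule `target`, the chosen sample, and the 1-nearest-neighbour predictor are all assumptions for illustration; the source does not prescribe any of them.

```python
from itertools import product

# Data universe X: all binary vectors of three hypothetical features,
# e.g. presence of the words "up", "down", "profit" in a headline.
X = list(product([0, 1], repeat=3))


def target(x):
    """Hypothetical labeling process f: X -> {'+', '-'}."""
    up, down, profit = x
    return "+" if (up or profit) and not down else "-"


# Sample set S (a subset of X) and training set T = {(x, y) | x in S and y = f(x)}.
S = [(1, 0, 0), (0, 1, 0), (1, 0, 1), (0, 1, 1), (0, 0, 0)]
T = [(x, target(x)) for x in S]


def hamming(a, b):
    """Number of positions in which two feature vectors differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


def predict(x):
    """Estimate y_hat as the label of the closest training example (1-nearest neighbour).

    Ties are resolved in favour of the earliest example in T.
    """
    _, label = min(T, key=lambda pair: hamming(pair[0], x))
    return label
```

The learner only ever sees T, not `target` itself, so its prediction for an unseen vector may or may not agree with the true label. That gap between the learned function and the target function is what an accurate model tries to minimize.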
The goal of supervised learning is to build a function that learns the mapping between input and output and can predict the output of the system for new input. If the outputs are continuous, regression methods are used, whereas categorical outputs call for classification methods. Clustering is the most important unsupervised learning task: it finds structure in a set of unlabeled data. The algorithm is expected to assign the input data to clusters even though the data are not labeled with predefined classes. Semi-supervised learning uses both labeled and unlabeled data for training, in tasks that would otherwise be handled by unsupervised or supervised learning. It is a particular form of classification. Obtaining labeled instances is often time-consuming and expensive, while unlabeled data can be collected easily. Semi-supervised learning builds better classifiers by using a large amount of unlabeled data together with the labeled data.

C. Sentiment Analysis Techniques

The techniques generally used for sentiment classification fall into two main groups (Vohra & Teraiya, 2012): machine learning algorithms and lexicon-based techniques. A few studies have also combined the two and achieved somewhat better performance. Machine learning based approaches apply text classification techniques. Lexicon-based approaches use a sentiment dictionary of opinion words and match them against the data to determine polarity. Sentiment scores are assigned to the opinion words in the dictionary to indicate their polarity, such as positive, negative, or neutral (Liu, 2012).

Machine learning based techniques: These techniques work with two sets of documents, a training set and a test set. An automatic classifier uses the training set to learn the distinguishing features of the corpus, and the test set is used to examine how well the classifier performs. A number of machine learning methods have been applied to sentiment analysis in previous studies, such as Naïve Bayes (NB), Maximum Entropy (ME), Support Vector Machines (SVM), Decision Trees, and a few others.

Naïve Bayes (NB) is a simple probabilistic classifier based on applying Bayes' theorem with a strong independence assumption.

Maximum Entropy (ME) is a natural extension of Bayesian theory; it is also a technique for estimating probability distributions. ME has been used for a variety of natural language tasks such as POS tagging and document classification.

Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. Given labeled training data, the algorithm outputs an optimal hyperplane, which categorizes new data.
Decision Tree (DT) is a machine learning model with a flowchart-like tree structure. Each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. Decision tree induction is the learning of decision trees from class-labeled training tuples (Han & Kamber, 2006).

Feature selection is one of the main tasks of supervised machine learning. Some of the common and efficient feature selection techniques in text processing are listed below.

Terms and their frequency: Terms, or features, generally include word n-grams such as unigrams and bigrams, together with their frequency or presence. In some cases, word position is also considered an important feature. These features have proved highly effective in sentiment classification. For example, Pang, Lee, and Vaithyanathan (2002) show that unigrams give better results than bigrams in the movie review domain.

Part of speech (POS): This feature was used in many studies that applied adjectives as indicator features. With POS tagging, each term in a document is assigned a label that identifies its grammatical role in context (Liu, 2012).

Opinion words and phrases: Opinion or sentiment words are words that are commonly used to express positive or negative sentiment. For example, beautiful, good, and excellent are positive opinion words, and words such as bad and terrible are negative opinion words. WordNet, for instance, is used to determine the positive or negative polarity of opinion words. Opinion words can be adjectives, adverbs, verbs, nouns, phrases, and idioms (Liu, 2012).

Negations: Negation words are significant because they can change the polarity of a sentence. For example, "I don't like this laptop" is a negative sentence. However, negation words do not signal negation in every occurrence; for example, "not only" does not make a sentence negative (Liu, 2012).

Lexicon based techniques: A sentiment or opinion lexicon is a list of expressions and phrases used to state people's subjective emotions and attitudes, and it serves as a tool for sentiment mining. Lexicon-based techniques are a subset of unsupervised learning because they require no prior training. Unsupervised techniques perform classification by comparing the features of a given document against the sentiment lexicon. There are three main approaches to collecting and constructing sentiment word lists: the manual approach, the dictionary-based approach, and the corpus-based approach (Liu, 2012).

Manual approach: This approach is very time-consuming, so it is not typically used alone. It can, however, be used alongside automated approaches as a final check to fix the mistakes they introduce.

Dictionary-based approach: Dictionary-based techniques rely on bootstrapping from a small list of seed sentiment words and an online dictionary or lexicon such as WordNet. The first stage is to collect, manually, a small set of opinion words with known orientation. The second stage is to grow the collection by searching the dictionary for their synonyms and antonyms. In the next stage, the newly detected words are added to the seed set. Finally, the next iteration starts, and the process continues until no new words are found.
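The bootstrapping stages of the dictionary-based approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `TOY_DICTIONARY` is a hypothetical stand-in for WordNet, the function name `expand_lexicon` is invented here, and the rule that synonyms keep the orientation while antonyms flip it is the usual simplification rather than a detail given in the source.

```python
# Hypothetical toy dictionary standing in for WordNet: word -> (synonyms, antonyms).
TOY_DICTIONARY = {
    "good": ({"fine", "positive"}, {"bad"}),
    "fine": ({"good"}, {"poor"}),
    "bad": ({"poor", "negative"}, {"good"}),
    "poor": ({"bad"}, {"fine"}),
}


def expand_lexicon(seeds, dictionary):
    """Grow a {word: orientation} lexicon from seed words until no new word is found.

    Orientation is +1 (positive) or -1 (negative). Synonyms inherit the
    orientation of the known word; antonyms receive the opposite one.
    """
    lexicon = dict(seeds)          # stage 1: manually chosen seed words
    frontier = list(seeds)
    while frontier:                # stop once an iteration adds nothing new
        new_words = []
        for word in frontier:
            synonyms, antonyms = dictionary.get(word, (set(), set()))
            # stage 2: look up synonyms (same sign) and antonyms (flipped sign)
            candidates = [(s, 1) for s in sorted(synonyms)]
            candidates += [(a, -1) for a in sorted(antonyms)]
            for candidate, sign in candidates:
                if candidate not in lexicon:
                    # stage 3: add the newly detected word to the set
                    lexicon[candidate] = sign * lexicon[word]
                    new_words.append(candidate)
        frontier = new_words       # stage 4: iterate on the new words only
    return lexicon
```

Starting from the single seed `{"good": 1}`, this toy dictionary yields three positive words (good, fine, positive) and three negative words (bad, poor, negative). The orientations are fixed by dictionary relations alone, independent of any particular text.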
The dictionary-based approach and similar techniques have a major shortcoming: they cannot find opinion words whose orientation is specific to a domain or context. For example, "quiet" is usually negative for a speakerphone but positive for a car (Liu, 2012).

Corpus-based approach: This approach relies on syntactic patterns in a large corpus. Corpus-based techniques can identify opinion and sentiment words with relatively high accuracy. The major weakness of this method is its need for a huge amount of labeled training data. Its major advantage over the dictionary-based approach is that it finds domain-specific opinion words and their orientations.

Efficient Market Hypotheses

For less than a century, financial economists have formally discussed the idea of informationally efficient markets and the significance of the Efficient Market Hypothesis (EMH). Fama (1965) defines an efficient market as one in which large numbers of rational profit-maximizers actively compete with each other to predict the future market values of individual securities, and in which important current information is almost freely available to all participants. The EMH asserts that consistently achieving positive returns in financial markets is impossible because relevant information is already reflected in the existing stock market.

Financial Sentiment

An investor's sentiment often originates from a variety of news and data sources, relying on some and discounting others in certain situations (Gillam, 2006). Although this process is inherently subjective, the concept of sentiment in the financial domain differs somewhat from that in other domains because of the causal relationships between key indicators (such as net income, tax, and sales) and a corporation. Sentiment derives from a company's good prospects and future profits, not just from the attitude expressed by opinion holders. For example, tax is generally a linguistically neutral topic; however, a rise in taxes or regulation could be very harmful to an investor's interests, even when the news writer reports the event objectively. These relationships are not necessarily captured by traditional linguistic sentiment, but they can have a great impact on a market participant's sentiment towards stocks.

DISCUSSION AND FUTURE WORK

Sentiment analysis is generally performed on text data, which usually contains many features that are not easy to identify. Existing studies have introduced several features, such as unigrams and bigrams, but not all features are important for analyzing sentiment. In addition, a document may in practice include sentiment words whose sentiment differs between fields, so ambiguity and context dependency can lead to classification problems. For example, the sentence "There was a decline" is negative for finance but positive for crime. Furthermore, many statements about entities, especially in the financial domain, are factual in nature yet still carry sentiment. We therefore need to identify text modeling techniques for feature extraction from financial news, and to design and develop effective dimension reduction techniques that eliminate the unnecessary (insignificant) extracted features.
In addition, we intend to develop sentiment analysis for financial news based on more complex methods, such as kernel methods, to achieve higher accuracy.

Conclusion

This study reviewed some of the main research on the key concepts of sentiment analysis and its applications in the financial domain, such as news, blogs, and micro-blogs. The study focused on machine learning based approaches to sentiment analysis, such as Support Vector Machine (SVM) and Naïve Bayes (NB), and on other text processing methods relevant to sentiment analysis.

References

Alvim, L., Vilela, P., Motta, E., & Milidiú, R. L. (2010). Sentiment of Financial News: A Natural Language Processing Approach, 1-3.

Esuli, A., & Belzoni, V. G. B. (2005). Determining the Semantic Orientation of Terms through Gloss Classification. In Proceedings of the 14th ACM international conference on information and knowledge management, New York, NY, USA (pp. 617-624).

Fama, E. (1965). Random Walks in Stock Market Prices. Financial Analysts Journal, 21(5), 55-59.

Généreux, M., Poibeau, T., & Koppel, M. (2011). Sentiment analysis using automatically labelled financial news items. In Sentiment Analysis Using Automatically Labelled Financial News Items (pp. 111-125).

Gillam, L. (2006). Sentiment Analysis and Financial Grids. In Workshop on Bridging Quantitative and Qualitative Methods for Social Sciences Using Text Mining Techniques.

Hamel, L. (2009). Knowledge Discovery with Support Vector Machines. John Wiley & Sons, Inc., Hoboken, New Jersey.

Han, J., & Kamber, M. (2006). Data Mining (Concepts and Techniques). (J. Widom & S. Ceri, Eds.). Elsevier (Morgan Kaufmann).

Koppel, M., & Shtrimberg, I. (2006). Good News or Bad News? Let the Market Decide. Computing Attitude and Affect in Text: Theory and Applications the Information Retrieval, 20, 297-301.

Liu, B. (2012). A Survey of Opinion Mining and Sentiment Analysis. (C. C. Aggarwal & C. Zhai, Eds.). Boston, MA: Springer US. doi:10.1007/978-1-4614-3223-4

Mayne, A. (2010). Sentiment Analysis for Financial News. University of Sydney.

Mittermayer, M. (2004). Forecasting Intraday Stock Price Trends with Text Mining Techniques *, 00(C), 1-10.

O'Hare, N., Davy, M., Bermingham, A., Ferguson, P., Sheridan, P., Gurrin, C., & Smeaton, A. F. (2009). Topic-dependent sentiment analysis of financial blogs. Proceeding of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion - TSA '09, 9. doi:10.1145/1651461.1651464

Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1-135. doi:10.1561/1500000011

Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques. In EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing (pp. 79-86). Philadelphia.

Schumaker, R. P., Zhang, Y., Huang, C., & Rochelle, N. (2009). Sentiment Analysis of Financial News Articles 1, 1-21.

Vohra, S. M., & Teraiya, J. B. (2012). A Comparative Study of Sentiment Analysis Techniques. Information, Knowledge and Research in Computer Engineering, 2(2), 313-317.

Zhai, J. J., Cohen, N., & Atreya, A. (2011). CS224N Final Project: Sentiment analysis of news articles for financial signal prediction, 1-8.

TABLE I - Research Work related to Machine Learning Classifier for Sentiment Analysis

| Author/Year | Techniques | Data Source & Dataset | Features | Accuracy |
| Moshe Koppel et al., 2004, 2006 | Linear SVM, NB, Decision Tree | The stocks in the Standard & Poor's index of 500 leading stocks (S&P 500): the entire 2000-2002 corpus and the 2003 corpus from the Multex Significant Developments corpus [2]. The total number of stories is over 12,000. The average length of each story is over 100 words. | Relevant words, feature presence, Information Gain (IG) | 70.3% for the 2000-2002 corpus, 65.9% for the 2003 corpus |
| Michel Généreux et al., 2008, 2011 | Linear SVM | The corpus is a subset of the one used in (Moshe Koppel et al., 2006): 6,277 news items averaging 71 words, covering 464 stocks listed in the Standard & Poor's 500 for the years 2000-2002 | Unigrams, stems, financial terms, health metaphors and agent metaphors; Document Frequency (DF), Information Gain (IG), Chi-square (X2), Term Frequency (TF), feature presence | See Table I(a) |
| Schumaker et al., 2009 | Support Vector Regression (SVR) | Sites of Comtex, PRNewswire, and Yahoo! Finance | Bag of Words, Nouns and Noun Phrases, Named Entities, Proper Nouns, feature presence | 71.18% |
| Zhai et al., 2011 | The Stanford Classifier v. 2.0, utilizing Maximum Entropy (ME) and Quasi-Newton optimization | The New York Times Annotated Corpus (Jan 1987 to Jun 2007), LDC corpus [3] | Unigrams and bigrams; the words in article headlines were used as one set of features and the words in the body as another; lists of words considered to have positive and negative sentiment on the Internet [4] | 70% |
| Alvim et al., 2010 | SVM, NB | A Portuguese annotated financial news corpus composed of a collection of 1500 newspaper reports about the Petrobras energy company | Part-of-speech tagging, text chunking, and negation. The Entropy Guided Transformation Learning algorithm is applied to obtain the required features. | 85.94% |
| Neil O'Hare et al., 2009 | SVM, multinomial Naïve Bayes (MNB) | Financial blog articles collected automatically from a predefined set of sources (232 are identified in two crawls: crawl 1 ran for 3 weeks in Feb 2009 and crawl 2 ran for 5 weeks from May to June 2009) | N-word, N-sentence, and N-paragraph, where N is the number of words, sentences, or paragraphs on either side of any topic word | See Table I(b) |

Table I(a) - Accuracy reported for Généreux et al.

| Feature | Feature selection | Feature count | Accuracy (%) |
| Unigram | Information Gain | Term Frequency | 67.6 |
| Unigram | Information Gain | Binary | 67.5 |
| Unigram | X2 | Binary | 66.1 |
| Unigram | Degree of Freedom | Binary | 59.4 |

Table I(b) - Accuracy (%) reported for O'Hare et al. The number of paragraphs, sentences, and words (N) differs between settings.

| Classification | Classifier | Paragraph | Sentence | Word |
| Binary (Pos-Neg) | SVM | 68.24 | 70.70 | 74.37 |
| Binary (Pos-Neg) | MNB | 73.34 | 72.59 | 75.07 |
| 3-Point (Pos-Neg-Neu) | SVM | 53.31 | 53.87 | 56.60 |
| 3-Point (Pos-Neg-Neu) | MNB | 57.72 | 57.48 | 59.46 |

[2] http://news.moneycentral.msn.com/ticker/sigdev.asp but has since been removed.
[3] http://www.ldc.upenn.edu/catalog/catalogentry.jsp?
[4] http://www.the-benefits-of-positive-thinking.com/