Twitter Analytics for Insider Trading Fraud Detection

W-J Ketty Gann, John Day, Shujia Zhou
Information Systems, Northrop Grumman Corporation, Annapolis Junction, MD, USA
{wan-ju.gann, john.day2,

Abstract

Twitter analytics have been developed to process Twitter data at a macro level for use in an insider trading detection system. This system establishes normal trading patterns between daily stock price change and public sentiment. Two machine learning models, Support Vector Machine (SVM) and Decision Tree, are built based on annotated historical Twitter data and the Stanford Sentiment140 Tweet corpus, respectively. This paper discusses polarized sentiment (positive and negative), a comparison of the SVM and Decision Tree models, the Sentiment Key Performance Index (SKPI), the Daily Sentiment Index (DSI), and mood analysis. The results illustrate that Twitter SKPI and DSI are useful indexes for predicting future stock price movement in regular stock trading.

Keywords: sentiment analysis; mood analysis; Twitter analytics; machine learning; Support Vector Machine; Decision Tree

1. Introduction

It has been reported that daily stock price change and public sentiment are correlated [1]. This paper focuses on the detection of insider trading fraud based on Twitter analytics. Insider trading fraud is considered one of the major types of financial fraud. Such frauds often deploy sophisticated schemes that current approaches cannot systematically detect in a timely manner. The proposed approach is to establish normal trading patterns between daily stock price and public sentiment at a macro level. When sentiment and mood lean positive, the stock price rises and the majority of investors are buying; when sentiment and mood lean negative, the stock price falls and the majority of investors are selling.
When insider trading data, such as US Securities and Exchange Commission (SEC) Form 4, are compared with normal trading patterns at a micro level, the system is able to detect abnormal timing in an insider's stock trade execution and to issue a warning on those trades for further investigation.

The remainder of this paper discusses the Twitter analytics for sentiment/mood analysis used within the insider trading detection system described above. Twitter is the world's largest micro-blogging platform. English is the most frequently used language, with 39% of total Twitter messages [2]. Two supervised machine learning models, Support Vector Machine (SVM) and Decision Tree, are built to classify the Twitter data. They are discussed along with the training data and feature selection. Two indexes, the Sentiment Key Performance Index (SKPI) and the Daily Sentiment Index (DSI), are implemented for tweet volume calculations. The index results suggest that daily tweets can form a basis for predicting the future stock trend.

2. Data Collection

Supervised machine learning models require annotated training and testing sets. The following are descriptions of data collection and preparation for each model.

2.1 Historical Twitter Data

We established a Twitter data set pertaining to Apple Inc. from November 2012 to February 2013. Each tweet is recorded with user ID (identifier), date/time of posting, source (where the tweet is published) and tweet content. After cleaning the data and removing duplicates from the collection, we retrieve tweets written in English that contain Apple-related Twitter hashtags and/or keywords. The resulting 167,345 tweets are then grouped by posting day.

2.2 Stanford Sentiment140 Tweet Corpus

The Sentiment140 data set is publicly available [3].
To facilitate the labeling of such a large corpus (1.6 million tweets), Stanford used only Twitter messages (i.e., tweets) containing emoticons to determine positive or negative sentiment. Each tweet is classified by the positive or negative emoticons it contains. Before training, the emoticons were removed in order to force the modeling software to build a sentiment model exclusively from the context of the words surrounding the emoticons. Stanford used this data to build their Sentiment140 maximum entropy classifier [4].

3. Training Data Preparation

For the historical Twitter data, we focus on opinionated tweets by filtering out information-bearing ones, such as tweets containing "http" and "www" strings. Strong sentiment words are selected from the filtered tweets to establish positive and negative sentiment training sets with 1,500 examples in each. The sentiment words (including their variations) are defined in the Linguistic Inquiry and Word Count (LIWC) 2007 dictionary. LIWC [5] is a popular text analysis software program. For example, it has been used to analyze people's mental and physical health in correlation with the words they use in speech and written samples. The LIWC 2007 dictionary is the core of this text analysis software and is composed of 4,500 words and word stems in various hierarchical cognitive and emotion categories. For instance, positive sentiment words include love, thrill, wonderful; negative words consist of negation and blame words, e.g., hate, frustrate, damn. When preparing the training data, we primarily include shorter tweets, since in longer tweets the sentiment tends to be balanced out or more than one sentiment is presented. Currently, neutral tweets are not categorized in the training data.

For the Stanford Sentiment140 Tweet corpus, the process is more complicated than that for the AAPL Twitter data. The following steps are used to prepare it for training:

- Tag each tweet using the Carnegie Mellon University (CMU) ARK part-of-speech tagger [6]. Each tokenized word within a given tweet is concatenated with an ARK tag, for example run`v or dog`n. The CMU ARK tagger defines 25 tags, specially customized for tagging tweets. A subset of 13 tags is used (V, N, A, R, !, &, G, E, #, T, Z, X, S).
- Select tagged tokens that occur 50 times or more in the overall data set. A total of 15,000 tokens are selected.
- Segregate these high frequency tagged tokens into positive and negative groups according to the POS (positive) or NEG (negative) label on each tweet.
- Calculate the Total Sentiment Index (TSI) for each selected tagged token. This represents the relative sentiment of a token based on the number of times p it occurred in positive-labeled tweets and the number of times n it occurred in negative-labeled tweets:

  TSI = (p - (tp/tn) * n) / (p + (tp/tn) * n)

  Thus, an index of -1 would occur if the token was always seen in a negative tweet, +1 if always seen in a positive tweet, and 0 if seen equally often in positive and negative tweets. To calibrate training sets with unequal numbers of positive and negative tweets, the total-positive over total-negative ratio (tp/tn) is used to rebalance the set so that TSI=0 represents neutral. (Note: the Stanford set was already balanced, with 800,000 positive and 800,000 negative tweets, so this ratio was 1 for our experiments.)
- Select feature tokens appearing in the training corpus more than 50 times, resulting in 15,000 tokens. For each of these tokens, a TSI value was computed. Tokens with a TSI value near zero were discarded, leaving 6,799 tokens with TSI values clearly greater than or less than zero.
The hypothesis is that the sum of these features tends to classify the tweets they occur in as positive or negative. All other tokens in each tweet are considered to have TSI=0. The following are examples of selected tokens:

- welcome`a (adjective) occurs 2,298 times in the whole data set; 137 occurrences are negative and 2,161 are positive. Both the TSI and the absolute TSI of welcome`a are 0.881.
- cavity`n (noun) has a high negative count of 54 and a low positive count of 3, resulting in a strongly negative TSI.

- Create a feature vector for each tweet in the data set with the following elements:
  - ID: a unique identifier.
  - Ground Truth (GT): Pos (Positive Sentiment), Neg (Negative Sentiment), or Neu (Neutral Sentiment). (Note: no Neu records were included in this training set; they were reserved for future test purposes.)
  - TSI: the sum of the TSI values of the tokens in the tweet. For words not among the top 6,799 tokens described above, the default value 0 is assigned.
  - A Boolean array indicating the presence or absence of each TSI-selected token in the tweet.

Thus, a feature vector contains 6,802 elements consisting of an ID, a GT (ground truth tag), a TSI sum and 6,799 Boolean features denoting the presence or absence of each key sentiment token.

4. Building Training Models

Probabilistic models using supervised machine learning algorithms are built using open source text/data mining tools.

4.1 SVM (Support Vector Machine) Model

SVM is a well-known classifier for text categorization. Given a set of training samples, each marked with one of two categories (positive and negative in this case), an SVM training algorithm builds a model that assigns a new instance to the positive or negative category. We used the historical Twitter data set described above to build an SVM model in RapidMiner [7], an open source text/data mining tool. A 10-fold cross validation is used to estimate the accuracy of the model.
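The 10-fold cross validation and the precision/recall figures reported for the SVM model can be illustrated with a minimal, standard-library sketch. This is not RapidMiner's implementation: the function names `k_fold_indices` and `precision_recall`, the shuffling seed, and the choice of "pos" as the positive class are assumptions made for illustration only.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    Indices are shuffled once, then dealt into k folds; each fold serves
    as the held-out test set exactly once.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def precision_recall(y_true, y_pred, positive="pos"):
    """Precision and recall for one class from parallel label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With 3,000 samples and k=10, each fold holds out 300 tweets and trains on the remaining 2,700; per-fold precision and recall would then be averaged to obtain a single estimate.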
Based on the total of 3,000 sample tweets in our training data (1,500 positive and 1,500 negative), the model reached a precision of 90.25% and a recall of 74.27%.

4.2 Decision Tree Model

A Decision Tree model is built as a general purpose sentiment classifier using the data mining tool RuleQuest C5 [8]. The model (an unboosted decision tree) was trained on the 1.6 million machine-labeled tweets from the Stanford Sentiment140 training set, which also includes 359 additional labeled tweets set aside for blind testing of the models. Machine-labeling was facilitated by using only tweets containing emoticons, which were removed after labeling to force the model to learn the matching sentiment from the surrounding text. Experiments are run with random samples of 10K tweets and 100K tweets, and with the full 1.6M tweet data set. The 10K random sample provides the best result, with 83.3% accuracy. This exceeds Stanford's reported accuracy of 79.9% (Naïve Bayes), 79.9% (Maximum Entropy) and 81.9% (SVM) when unigrams and POS tags are used as features [4].

5. Discussions of SVM and Decision Tree Models

The SVM and Decision Tree models were trained with different training data; therefore, the results are not directly comparable. However, the SVM model appears to achieve higher precision when using hand-labeled training data. On the other hand, the Decision Tree model was trained on a much larger corpus of 1.6 million machine-labeled examples, exposing it to many more expressions of positive and negative sentiment from real-world Twitter messages. Thus, one would expect such a model to be more robust and to generalize better than models trained on a smaller set of sentiment concepts.

The rationale for using qualified CMU ARK tags in the 1.6M tweet Sentiment140 data is to allow the same word to support different sentiment labels, depending on the part-of-speech context.
For example, the word wish has a positive weight when used as a noun (wish`n), while it is negative when used as a verb (wish`v). The intent of this qualification is to improve the accuracy of sentiment predictions. To test the effectiveness of ARK-tagged features versus untagged ones, we built two models on a randomly selected subset of 11,000 tweets from the Stanford training corpus. One model used bare words as features and the other used the same words tagged with the CMU ARK part-of-speech tags. The tagged model achieved 71.6% accuracy compared to 67.4% for the bare-word model, which supports the use of features tagged by part of speech. It should be noted that Stanford also tried using part-of-speech tags [4] and reported little success with this approach. However, they did not use the CMU ARK tagger, which has been optimized for recognizing unconventional tokens such as hashtags, emoticons and abbreviations as tokens in the specialized language of tweets.

It is also worth noting that our approach for building the Decision Tree model used only word tokens that occurred more than 50 times as modeling features. This was to mitigate the effect of misspellings and nonsense words, which tend to occur frequently in tweets. There were over 800,000 distinct tokens in the raw training corpus (including punctuation, but no emoticons). This is surprising because there are only about 50,000 words in standard English, which would suggest that the vocabulary used in Twitter messages is an order of magnitude larger than that of English. The additional tokens, however, come from proper names, misspelled and abbreviated words and many other specialized tokens (e.g., emoticons) that have become part of the unique tweet vocabulary and are not normally considered standard parts of natural languages. Filtering reduced the token set to approximately 15,000 tokens. These tokens were then further processed (as described in Section 3 above) to select the key sentiment tokens, which tend to classify tweets as positive or negative.
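The token scoring and selection described in Section 3 and discussed above can be sketched as follows. This is a minimal illustration, not the authors' code. The TSI form used here, (p - r*n) / (p + r*n) with r the ratio of positive to negative tweets, is an assumption consistent with the stated properties (range -1 to +1, 0 when balanced) and with the welcome`a example. The function names and the `min_abs_tsi` cut-off are hypothetical, since the paper does not state how close to zero a discarded TSI had to be.

```python
from collections import Counter


def tsi(p, n, ratio=1.0):
    """Assumed Total Sentiment Index: (p - r*n) / (p + r*n), in [-1, +1]."""
    weighted_n = ratio * n
    total = p + weighted_n
    return (p - weighted_n) / total if total else 0.0


def select_tokens(labeled_tweets, min_count=50, min_abs_tsi=0.2):
    """Score tagged tokens and keep frequent ones with TSI away from zero.

    labeled_tweets: iterable of (label, tokens), label "POS" or "NEG",
    tokens such as "welcome`a". min_abs_tsi is a hypothetical threshold.
    """
    pos, neg, tweets_per_label = Counter(), Counter(), Counter()
    for label, tokens in labeled_tweets:
        tweets_per_label[label] += 1
        (pos if label == "POS" else neg).update(tokens)
    # Rebalancing ratio tp/tn; equals 1 for a balanced corpus.
    ratio = (tweets_per_label["POS"] / tweets_per_label["NEG"]
             if tweets_per_label["NEG"] else 1.0)
    scores = {}
    for tok in set(pos) | set(neg):
        p, n = pos[tok], neg[tok]
        if p + n >= min_count:
            score = tsi(p, n, ratio)
            if abs(score) >= min_abs_tsi:
                scores[tok] = score
    return scores


def tsi_sum(tokens, scores):
    """Sum of TSI values over a tweet; unselected tokens contribute 0."""
    return sum(scores.get(tok, 0.0) for tok in tokens)
```

For the counts quoted for welcome`a, `tsi(2161, 137)` rounds to 0.881. A tweet whose `tsi_sum` is above zero would lean positive and one below zero would lean negative under this sketch.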
The key sentiment tokens are precisely those with relatively high positive or negative TSI values. Neutral words (and all other unknown or unrecognized words) are effectively ignored by setting their TSI values to zero. It was observed that the TSI sum feature (i.e., the sum of the individual TSI values for each word in a tweet) is a very strong feature by itself, capable of classifying the evaluation set with 75% accuracy. The additional 6,799 key token features are much weaker in classification strength, but collectively they raise the accuracy by an additional 8 percentage points, to 83%.

6. Granger Causality and Durbin-Watson Test

The tweets for days on which the stock market was open from November 13, 2012 to February 5, 2013 (167,345 tweets in total) are processed by the trained SVM model. Instead of using a daily count of positive and negative tweets as the metric, a Sentiment Key Performance Index (SKPI) is used as the indicator of sentiment. Granger Causality Analysis (GCA) is applied to the daily SKPI time series and the AAPL stock market value time series. GCA is a standard test in finance and economics for discovering causal links between independently generated time series. GCA is based on the assumption that if a variable X causes Y, then changes in X will systematically occur before changes in Y, and the lagged values of X will show a statistically significant correlation with Y. SKPI shows a Granger causality relation with AAPL stock price movement for lags ranging from 1 to 5 days (p<0.1), where the 3-day lag has the smallest p value (p<0.05). In other words, SKPI is shown to predict AAPL daily stock price movement with a 3-day lag (i.e., 3 days prior), as shown in Table 1.

Table 1. GCA Results for AAPL price/SKPI

Lag (Day)   Y=f(X)/X=f(Y)
1           /0.045*
2           /0.027*
3           /0.018**
4           /0.070*
5           /0.034*
6           /
7           /0.152

Legend: **p < 0.05, *p < 0.1
Format Y=f(X)/X=f(Y); X Granger causes Y/Y Granger causes X; X = AAPL daily stock value changes, Y = SKPI

However, correlation in GCA does not prove causation. The GCA results are therefore validated with a Durbin-Watson (DW) test to filter out any spurious results. The DW test result implies a valid test with a high DW value (DW = 2.77).

7. Daily Sentiment Index (DSI) and Stock Trend

DSI is created to summarize the daily positive and negative sentiment counts returned by the model. DSI is scaled between -1 and +1 to normalize daily fluctuations in tweet volume. It is defined in terms of p = daily positive tweet count, n = daily negative tweet count, tp = total sum of daily positive counts and tn = total sum of daily negative counts. DSI behaves like a time-derivative and spikes up or down during sentiment change. Three regression models are produced for comparison purposes. Model #3 shows the best results for predicting future stock trends given the past stock trend and multiple DSI values (Figure 1).

Model #1 => Predicted future AAPL Trend from Past Price
Model #2 => Predicted future AAPL Trend from Past Price + AAPL DSI
Model #3 => Predicted future AAPL Trend from Past Price + Multiple Source-DSI (Note: computed from 23 Tweet sources)

To maximize the persistence and effectiveness of using tweet source as a model feature, we used the 21 most frequently occurring sources. The remaining 2,386 sources were aggregated into a 22nd virtual source called OTHER. The 23rd source was the aggregate of all sources, called ALL. See the appendix for the source list.
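A DSI computation can be sketched from the four quantities defined in Section 7. The exact DSI formula did not survive in this transcription, so the form below is an assumption: it mirrors the TSI construction from Section 3, DSI = (p - (tp/tn)*n) / (p + (tp/tn)*n), which stays within -1 and +1 and equals 0 when the day's positive/negative mix matches the overall mix. The function name and input layout are hypothetical.

```python
def daily_sentiment_index(daily_counts):
    """Return {day: DSI} from a list of (day, p, n) tuples.

    p, n   : positive / negative tweet counts for the day
    tp, tn : totals of p and n over all days
    Assumed form: DSI = (p - r*n) / (p + r*n), with r = tp / tn.
    """
    tp = sum(p for _, p, _ in daily_counts)
    tn = sum(n for _, _, n in daily_counts)
    ratio = tp / tn if tn else 1.0
    dsi = {}
    for day, p, n in daily_counts:
        total = p + ratio * n
        dsi[day] = (p - ratio * n) / total if total else 0.0
    return dsi
```

Under this assumed form, a day with 30 positive and 10 negative tweets in a corpus that is balanced overall scores 0.5, while a day whose mix equals the corpus-wide mix scores 0 regardless of volume.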

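The Durbin-Watson statistic used to validate the GCA results can be computed directly from regression residuals. The sketch below uses the standard definition, the sum of squared successive differences divided by the sum of squared residuals; the function name is an assumption, and the residuals would come from whatever regression underlies the GCA, which the paper does not detail.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals.

    Values near 2 indicate no first-order autocorrelation, values toward 0
    positive autocorrelation, and values toward 4 negative autocorrelation.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    return numerator / denominator
```

Residuals that alternate in sign push the statistic toward 4, while residuals that drift slowly push it toward 0, which matches the legend given with Table 3.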
Figure 1. Plots for three models: left panel = Model #1; middle panel = Model #2; right panel = Model #3

Table 2. GCA Results for AAPL price/negative Mood Analysis

Columns: Lag (Day), Positive Emotion, Swear, Anxiety, Anger, Sadness, Tentative, Certain

/ / / / / / **/ / / / / / / */ / / / / / / */ / / / / / / **/ / /0.002** 0.40/ /0.003** 0.59/ / */ / /0.005** 0.53/ /0.008** 0.72/ / / / /0.01** 0.47/ /0.009** 0.69/ **/0.03** 0.03**/0.22

Legend: **p < 0.05, *p < 0.1
Format Y=f(X)/X=f(Y); X Granger causes Y/Y Granger causes X; X = AAPL daily stock value changes, Y = word ratio

Table 3. DW Validations of AAPL price/negative Mood Analysis GCA Results

Columns: Durbin Watson Test, Positive Emotion, Swear, Anxiety, Anger, Sadness, Tentative, Certain

Legend: 2 = no correlation; 0 = positive correlation; 4 = negative correlation

8. Mood Analysis and Stock Trend

We also investigated whether mood might form a better basis for modeling stock trends by analyzing the negative sentiment tweets using LIWC. The negative sentiment is subcategorized into Swear, Anxiety, Anger, and Sadness based on the LIWC 2007 dictionary. In addition, two cognitive categories, Tentative and Certain, are also processed. The output is the word ratio of each category in the daily tweets. GCA is performed on the word ratio and the AAPL daily changes in stock value. The results show that only Swear and Anger correlate with the AAPL stock price change, at a 5-day lag (Table 2). On the other hand, daily AAPL stock value correlates with Tentative at a 7-day lag, and with Certain at 1, 4, and 7 day lags. However, DW test validation indicates that Swear and Anger have only weak positive correlation with AAPL stock changes (DW = 1.989 for Anger; Table 3). This differs from the result of Bollen et al. [1], who concluded that mood is a better predictor than sentiment for the stock market. We believe that differences in data cleansing may be the cause.

9. Conclusion

Two supervised models, SVM and Decision Tree, have been built for Twitter analytics as part of an insider trading detection system. SVM achieved high precision and recall on historical AAPL tweet data, while the Decision Tree with positive features reached 83.3% accuracy on a 10K random sample from the Stanford Sentiment140 Tweet corpus. Two major indexes, SKPI and DSI, are discussed. They appear to be capable of predicting stock price movement with the data used in this project. DSI, when combined with Twitter sources, shows the best prediction results.

Acknowledgment

This project is funded by the Northrop Grumman Corporation 2013 Internal Research and Development program. Comments or opinions expressed in this paper do not necessarily represent the position of the company. We would like to thank Jim Sowder for helpful discussions.

References

[1] Johan Bollen, Huina Mao, Xiao-Jun Zeng. Twitter mood predicts the stock market. IEEE Computer, 44(10): 91-94, October
[2] Arabic highest growth on Twitter, Semiocast, November 24, URL: on_twitter
[3] Sentiment140 Tweet Corpus. URL:
[4] Alec Go, Richa Bhayani, Lei Huang. Twitter sentiment classification using distant supervision. Technical Report, Stanford University,
[5] James W. Pennebaker, Cindy K. Chung, Molly Ireland, Amy Gonzales, Roger J. Booth. The development and psychometric properties of LIWC2007. LIWC2007 Manual.
[6] Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, et al. Part-of-speech tagging for Twitter: annotation, features, and experiments. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, companion volume, Portland, OR, June
[7] RapidMiner. v Rapid-I GmbH, Stockumer Str. 475, Dortmund, Germany URL:
[8] RuleQuest C5. URL:

Appendix

List of Tweet Sources From Historical Tweet Data

The metadata for November 2012 to February 2013 that we analyzed included the source of each tweet, an identifier representing the particular stream feed from which the tweet was sampled. There were 2,400 unique sources in the entire corpus, which appeared to be distributed according to a power law. The most frequent source, web, accounted for about 20% of the corpus, whereas over 1,000 sources occurred only once.

Sources in rank order:

1. web
2. Twitter for iPhone
3. twitterfeed
4. dlvr.it
5. Twitter for Android
6. Instagram
7. TweetDeck
8. IFTTT
9. HootSuite
10. Tweet Button
11. Round Team
12. iOS
13. Twitter for BlackBerry
14. Mobile Web
15. Twitter for iPad
16. Google
17. Tweetbot for iOS
18. Sylvester Trends
19. StockTwits Web
20. Echophone
21. Twitter for Mac
22. Other (2,386 other sources)
23. All (all sources)
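The grouping of tweet sources into the 21 most frequent sources plus the virtual OTHER and ALL sources can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `bucket_sources` is hypothetical, ties at the cut-off are broken by first appearance (the behaviour of `Counter.most_common`), and no real source is assumed to be literally named OTHER or ALL.

```python
from collections import Counter


def bucket_sources(sources, top_k=21):
    """Count tweets per source bucket.

    Keeps the top_k most frequent sources, folds every remaining source
    into a virtual "OTHER" bucket, and adds an "ALL" bucket holding the
    total number of tweets.
    """
    counts = Counter(sources)
    keep = {name for name, _ in counts.most_common(top_k)}
    buckets = Counter()
    for name in sources:
        buckets[name if name in keep else "OTHER"] += 1
    buckets["ALL"] = len(sources)
    return buckets
```

Applied per day instead of over the whole corpus, the same bucketing would give the per-source daily counts from which a source-specific DSI could be derived for Model #3.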

Sentiment analysis on tweets in a financial domain

Sentiment analysis on tweets in a financial domain Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International

More information

Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015

Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015 Sentiment Analysis D. Skrepetos 1 1 Department of Computer Science University of Waterloo NLP Presenation, 06/17/2015 D. Skrepetos (University of Waterloo) Sentiment Analysis NLP Presenation, 06/17/2015

More information

Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement

Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement Ray Chen, Marius Lazer Abstract In this paper, we investigate the relationship between Twitter feed content and stock market

More information

Forecasting stock markets with Twitter

Forecasting stock markets with Twitter Forecasting stock markets with Twitter Argimiro Arratia argimiro@lsi.upc.edu Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013,

More information

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis Team members: Daniel Debbini, Philippe Estin, Maxime Goutagny Supervisor: Mihai Surdeanu (with John Bauer) 1 Introduction

More information

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Sentiment analysis of Twitter microblogging posts Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Introduction Popularity of microblogging services Twitter microblogging posts

More information

Using Twitter as a source of information for stock market prediction

Using Twitter as a source of information for stock market prediction Using Twitter as a source of information for stock market prediction Ramon Xuriguera (rxuriguera@lsi.upc.edu) Joint work with Marta Arias and Argimiro Arratia ERCIM 2011, 17-19 Dec. 2011, University of

More information

SI485i : NLP. Set 6 Sentiment and Opinions

SI485i : NLP. Set 6 Sentiment and Opinions SI485i : NLP Set 6 Sentiment and Opinions It's about finding out what people think... Can be big business Someone who wants to buy a camera Looks for reviews online Someone who just bought a camera Writes

More information

Semantic Sentiment Analysis of Twitter

Semantic Sentiment Analysis of Twitter Semantic Sentiment Analysis of Twitter Hassan Saif, Yulan He & Harith Alani Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom The 11 th International Semantic Web Conference

More information

Tweets Miner for Stock Market Analysis

Tweets Miner for Stock Market Analysis Tweets Miner for Stock Market Analysis Bohdan Pavlyshenko Electronics department, Ivan Franko Lviv National University,Ukraine, Drahomanov Str. 50, Lviv, 79005, Ukraine, e-mail: b.pavlyshenko@gmail.com

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Microblog Sentiment Analysis with Emoticon Space Model

Microblog Sentiment Analysis with Emoticon Space Model Microblog Sentiment Analysis with Emoticon Space Model Fei Jiang, Yiqun Liu, Huanbo Luan, Min Zhang, and Shaoping Ma State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory

More information

Analysis of Tweets for Prediction of Indian Stock Markets

Analysis of Tweets for Prediction of Indian Stock Markets Analysis of Tweets for Prediction of Indian Stock Markets Phillip Tichaona Sumbureru Department of Computer Science and Engineering, JNTU College of Engineering Hyderabad, Kukatpally, Hyderabad-500 085,

More information

Robust Sentiment Detection on Twitter from Biased and Noisy Data

Robust Sentiment Detection on Twitter from Biased and Noisy Data Robust Sentiment Detection on Twitter from Biased and Noisy Data Luciano Barbosa AT&T Labs - Research lbarbosa@research.att.com Junlan Feng AT&T Labs - Research junlan@research.att.com Abstract In this

More information

The Viability of StockTwits and Google Trends to Predict the Stock Market. By Chris Loughlin and Erik Harnisch

The Viability of StockTwits and Google Trends to Predict the Stock Market. By Chris Loughlin and Erik Harnisch The Viability of StockTwits and Google Trends to Predict the Stock Market By Chris Loughlin and Erik Harnisch Spring 2013 Introduction Investors are always looking to gain an edge on the rest of the market.

More information

A CRF-based approach to find stock price correlation with company-related Twitter sentiment

A CRF-based approach to find stock price correlation with company-related Twitter sentiment POLITECNICO DI MILANO Scuola di Ingegneria dell Informazione POLO TERRITORIALE DI COMO Master of Science in Computer Engineering A CRF-based approach to find stock price correlation with company-related

More information

Kea: Expression-level Sentiment Analysis from Twitter Data

Kea: Expression-level Sentiment Analysis from Twitter Data Kea: Expression-level Sentiment Analysis from Twitter Data Ameeta Agrawal Computer Science and Engineering York University Toronto, Canada ameeta@cse.yorku.ca Aijun An Computer Science and Engineering

More information

Multilanguage sentiment-analysis of Twitter data on the example of Swiss politicians

Multilanguage sentiment-analysis of Twitter data on the example of Swiss politicians Multilanguage sentiment-analysis of Twitter data on the example of Swiss politicians Lucas Brönnimann University of Applied Science Northwestern Switzerland, CH-5210 Windisch, Switzerland Email: lucas.broennimann@students.fhnw.ch

More information

Neural Networks for Sentiment Detection in Financial Text

Neural Networks for Sentiment Detection in Financial Text Neural Networks for Sentiment Detection in Financial Text Caslav Bozic* and Detlef Seese* With a rise of algorithmic trading volume in recent years, the need for automatic analysis of financial news emerged.

More information

Sentiment analysis using emoticons

Sentiment analysis using emoticons Sentiment analysis using emoticons Royden Kayhan Lewis Moharreri Steven Royden Ware Lewis Kayhan Steven Moharreri Ware Department of Computer Science, Ohio State University Problem definition Our aim was

More information

Twitter sentiment vs. Stock price!

Twitter sentiment vs. Stock price! Twitter sentiment vs. Stock price! Background! On April 24 th 2013, the Twitter account belonging to Associated Press was hacked. Fake posts about the Whitehouse being bombed and the President being injured

More information

Twitter Stock Bot. John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu

Twitter Stock Bot. John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu Twitter Stock Bot John Matthew Fong The University of Texas at Austin jmfong@cs.utexas.edu Hassaan Markhiani The University of Texas at Austin hassaan@cs.utexas.edu Abstract The stock market is influenced

More information

Projektgruppe. Categorization of text documents via classification

Projektgruppe. Categorization of text documents via classification Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

More information

Italian Journal of Accounting and Economia Aziendale. International Area. Year CXIV - 2014 - n. 1, 2 e 3
