Twitter Analytics for Insider Trading Fraud Detection

W-J Ketty Gann, John Day, Shujia Zhou
Information Systems, Northrop Grumman Corporation, Annapolis Junction, MD, USA
{wan-ju.gann, john.day2, shujia.zhou}@ngc.com

Abstract

Twitter analytics have been developed to process Twitter data at a macro level for use in an insider trading detection system. The system establishes normal trading patterns between daily stock price change and public sentiment. Two machine learning models, Support Vector Machine (SVM) and Decision Tree, are built on annotated historical Twitter data and on the Stanford Sentiment140 Tweet corpus, respectively. This paper focuses on polarized sentiment (positive and negative), a comparison of the SVM and Decision Tree models, the Sentiment Key Performance Index (SKPI), the Daily Sentiment Index (DSI), and mood analysis. The results illustrate that the Twitter-based SKPI and DSI are useful indexes for predicting future stock price movement in regular stock trading.

Keywords: sentiment analysis; mood analysis; Twitter analytics; machine learning; Support Vector Machine; Decision Tree

1. Introduction

It has been reported that daily stock price change and public sentiment are correlated [1]; this paper builds on that observation to detect insider trading fraud with Twitter analytics. Insider trading fraud is considered one of the major types of financial fraud. Such frauds often deploy sophisticated schemes that current approaches are not able to detect systematically in a timely manner.

The proposed approach is to establish normal trading patterns between daily stock price and public sentiment at a macro level. When sentiment and mood trend positive, the stock price rises and the majority of investors are buying; when sentiment and mood trend negative, the stock price falls and the majority of investors are selling. By comparing insider trading data, such as US Securities and Exchange Commission (SEC) Form 4 filings, against these normal trading patterns at a micro level, the system can detect abnormal timing in an insider's stock trade execution and issue a warning on those trades for further investigation.

The remainder of this paper discusses the Twitter analytics for sentiment/mood analysis used within the insider trading detection system described above. Twitter is the world's largest micro-blogging platform; English is its most frequently used language, accounting for 39% of all Twitter messages [2]. Two supervised machine learning models, Support Vector Machine (SVM) and Decision Tree, are built to classify the Twitter data; they are discussed along with the training data and feature selection. Two indexes, the Sentiment Key Performance Index (SKPI) and the Daily Sentiment Index (DSI), are computed from tweet volumes. The results show that daily tweets appear to form a basis for predicting the future stock trend.

2. Data Collection

Supervised machine learning models require annotated training and testing sets. The following sections describe data collection and preparation for each model.

2.1 Historical Twitter Data

We established a Twitter data set pertaining to Apple Inc. from November 2012 to February 2013. Each tweet is recorded with user ID (identifier), date/time of posting, source (where the tweet was published), and tweet content. After cleaning the data and removing duplicates from the collection, we retrieved the tweets written in English that carry Apple-related Twitter hashtags and/or keywords, as sketched below.
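As an illustration, the cleaning step can be written as the following minimal sketch (not the production code; the column names and the keyword list are assumptions for illustration):

    # Sketch of tweet cleaning: deduplicate, keep English Apple-related
    # tweets, and group by posting day.
    import pandas as pd

    APPLE_KEYWORDS = ("#aapl", "$aapl", "#apple", "iphone", "ipad")  # illustrative list

    def clean_tweets(raw: pd.DataFrame) -> pd.DataFrame:
        """raw is assumed to have columns: user_id, timestamp, source, lang, text."""
        df = raw.drop_duplicates(subset=["user_id", "text"])   # remove duplications
        df = df[df["lang"] == "en"]                            # English tweets only
        keep = df["text"].str.lower().apply(
            lambda t: any(k in t for k in APPLE_KEYWORDS))     # Apple hashtags/keywords
        df = df[keep].copy()
        df["day"] = pd.to_datetime(df["timestamp"]).dt.date    # posting day as group key
        return df

    # daily = clean_tweets(raw).groupby("day")  # 167,345 tweets over the period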
The resulting tweets are then grouped by posting day, for a total of 167,345 tweets.

2.2 Stanford Sentiment140 Tweet Corpus

The Sentiment140 data set is publicly available [3]. To facilitate the labeling of such a large corpus (1.6 million tweets), Stanford used only Twitter messages (i.e., tweets) containing emoticons to determine positive or negative sentiment; positive emoticons, such as :), and negative emoticons, such as :(, are used to classify each tweet. Before training, the emoticons were removed in order to force the modeling software to build a sentiment model exclusively from the context of the words surrounding the emoticons. Stanford used this data to build their Sentiment140 maximum entropy classifier [4].

3. Training Data Preparation

For the historical Twitter data, we focus on opinionated tweets by filtering out the information-bearing ones, such as tweets containing http and www strings. Strong sentiment words are selected from the filtered tweets to establish positive and negative sentiment training sets with 1,500 examples in each. The sentiment words (including their variations) are defined in the Linguistic Inquiry and Word Count (LIWC) 2007 dictionary. LIWC [5] is a popular text analysis software program; for example, it has been used to analyze people's mental and physical health in correlation with the words they use in speech and written samples. The LIWC 2007 dictionary is the core of this text analysis software and is composed of 4,500 words and word stems in various cognitive and emotion hierarchical categories. For instance, positive sentiment words include love, thrill, and wonderful; negative words consist of negation and blame words, e.g., hate, frustrate, and damn.

When preparing the training data, we primarily include tweets shorter than 15-20 words, since in longer tweets the sentiment tends to be balanced out or more than one sentiment is present. Currently, neutral tweets are not categorized in the training data. These selection rules are sketched below.
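The following minimal sketch illustrates the selection rules; the word lists contain only the example words named above (the real system uses the full LIWC 2007 categories), and the 15-word cutoff is one point in the stated 15-20 word range:

    # Sketch of training-set selection: drop link-bearing tweets, keep short
    # tweets, and label by strong LIWC-style sentiment words.
    POSITIVE = {"love", "thrill", "wonderful"}   # stand-in for LIWC positive words
    NEGATIVE = {"hate", "frustrate", "damn"}     # stand-in for LIWC negative words

    def training_label(text: str):
        if "http" in text or "www" in text:      # information-bearing, not opinionated
            return None
        words = text.lower().split()
        if len(words) >= 15:                     # long tweets: sentiment balances out
            return None
        pos = any(w in POSITIVE for w in words)
        neg = any(w in NEGATIVE for w in words)
        if pos and not neg:
            return "positive"
        if neg and not pos:
            return "negative"
        return None                              # neutral/mixed tweets are not used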
For the Stanford Sentiment140 Tweet corpus, the preparation process is more complicated than for the AAPL Twitter data. The following steps prepare it for training:

1) Tag each tweet using the Carnegie Mellon University (CMU) ARK part-of-speech tagger [6]. Each tokenized word within a given tweet is concatenated with its ARK tag, for example, run`v or dog`n. The CMU ARK tagger defines 25 tags specially customized for tagging tweets; a subset of 13 tags is used (V, N, A, R, !, &, G, E, #, T, Z, X, S).

2) Select tagged tokens that occur 50 times or more in the overall data set. A total of 15,000 tokens are selected.

3) Segregate these high-frequency tagged tokens into positive and negative groups according to the POS (positive) or NEG (negative) label on each tweet.

4) Calculate the Total Sentiment Index (TSI) for each selected tagged token. The TSI represents the relative sentiment of a token based on the number of times p it occurred in positive-labeled tweets and the number of times n it occurred in negative-labeled tweets:

    TSI = (p - s*n) / (p + s*n)

where s = tp/tn is the ratio of the total positive count to the total negative count, used to rebalance training sets with unequal numbers of positive and negative tweets so that TSI = 0 represents neutral. Thus, the index is -1 if the token is always seen in negative tweets, +1 if always seen in positive tweets, and 0 if seen equally often in positive and negative tweets. (Note: the Stanford set was already balanced, with 800,000 positive and 800,000 negative tweets, so s = 1 for our experiments.)

5) Discard tokens whose TSI value is near zero, leaving 6,799 of the 15,000 high-frequency tokens with TSI values well above or below zero. The hypothesis is that the sum of these features tends to classify the tweets they occur in as positive or negative; all other tokens in each tweet are considered to have TSI = 0. For example, welcome`a (as an adjective) occurs 2,298 times in the whole data set, 137 times in negative tweets and 2,161 times in positive tweets, so both the TSI and the absolute TSI of welcome`a are 0.881; cavity`n (as a noun) has a high negative count of 54 and a low positive count of 3, giving a TSI of -0.895.

6) Create a feature vector for each tweet in the data set with the following elements: ID, a unique identifier; Ground Truth (GT), one of Pos (positive sentiment), Neg (negative sentiment), or Neu (neutral sentiment) (no Neu records were included in this training set, but the label is reserved for future test purposes); TSI, the sum of the TSI values of the tokens in the tweet, with 0 assigned by default to words outside the top 6,799 tokens; and a Boolean array indicating the presence or absence of each TSI-selected token in the tweet. A feature vector therefore contains 6,802 elements: an ID, a GT tag, a TSI sum, and 6,799 Boolean features denoting the presence or absence of each key sentiment token. A sketch of this computation follows.
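A minimal sketch of steps 4-6, with illustrative data structures (the near-zero TSI cutoff value is an assumption; the paper does not state it):

    # Sketch of TSI computation and feature-vector construction.
    from collections import Counter

    def key_token_tsi(pos_counts: Counter, neg_counts: Counter, min_freq=50, cutoff=0.1):
        """TSI = (p - s*n) / (p + s*n), where s = total-positive/total-negative ratio
        (s = 1 for the balanced Stanford set). cutoff is an assumed threshold."""
        s = sum(pos_counts.values()) / sum(neg_counts.values())
        tsi = {}
        for tok in set(pos_counts) | set(neg_counts):
            p, n = pos_counts[tok], neg_counts[tok]
            if p + n >= min_freq:                        # high-frequency tokens only
                tsi[tok] = (p - s * n) / (p + s * n)
        return {t: v for t, v in tsi.items() if abs(v) > cutoff}  # key sentiment tokens

    def feature_vector(tweet_id, ground_truth, tokens, key_tsi):
        """ID + GT + TSI sum + one Boolean per key sentiment token (6,802 elements)."""
        token_set = set(tokens)
        tsi_sum = sum(key_tsi.get(t, 0.0) for t in tokens)   # unknown tokens: TSI = 0
        flags = [t in token_set for t in sorted(key_tsi)]    # presence/absence array
        return [tweet_id, ground_truth, tsi_sum, *flags]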
4. Building Training Models

Probabilistic models are built with supervised machine learning algorithms using open source text/data mining tools.

4.1 SVM (Support Vector Machine) Model

SVM is a well-known classifier for text categorization. Given a set of training samples, each marked as belonging to one of two categories (positive and negative in this case), an SVM training algorithm builds a model that assigns a new instance to the positive or the negative category. We used the historical Twitter data set described above to build an SVM model in RapidMiner [7], an open source text/data mining tool; version 5.3.007 is used for this project. A 10-fold cross validation is used to estimate the accuracy of the model. On the 3,000 sample tweets in our training data (1,500 each positive and negative), the model reached 90.25% precision and 74.27% recall.

4.2 Decision Tree Model

A Decision Tree model is built as a general-purpose sentiment classifier using the open source data mining tool RuleQuest C5 [8]. The model (an unboosted decision tree) was trained on the 1.6 million machine-labeled tweets from the Stanford Sentiment140 training set, which included 359 additional labeled tweets set aside for blind testing of the models. Machine labeling was facilitated by using only tweets containing emoticons, which were removed after labeling to force the model to learn the matching sentiment from the surrounding text. Experiments were run with random samples of 10K tweets and 100K tweets and with the full 1.6M-tweet data set. The 10K random sample provides the best result, with 83.3% accuracy. This exceeds Stanford's reported accuracy of 79.9% (Naive Bayes), 79.9% (Maximum Entropy), and 81.9% (SVM) when unigrams and POS tags are used as features [4]. A cross-validation sketch for both model types is shown below.
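Although the experiments above were run in RapidMiner and RuleQuest C5, the evaluation protocol can be sketched with scikit-learn as a stand-in (a sketch only, not the tooling used in this work):

    # 10-fold cross-validation of an SVM and an unboosted decision tree
    # over the Boolean/TSI feature matrix X and binary labels y.
    from sklearn.model_selection import cross_validate
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    def evaluate(X, y):
        models = [("SVM", LinearSVC()), ("Decision Tree", DecisionTreeClassifier())]
        for name, model in models:
            scores = cross_validate(model, X, y, cv=10,
                                    scoring=("accuracy", "precision", "recall"))
            print(f"{name}: precision={scores['test_precision'].mean():.4f}, "
                  f"recall={scores['test_recall'].mean():.4f}, "
                  f"accuracy={scores['test_accuracy'].mean():.4f}")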
5. Discussion of the SVM and Decision Tree Models

The SVM and Decision Tree models were trained on different training data, so their results are not directly comparable. However, the SVM model tends to have higher precision when using hand-labeled training data. The Decision Tree model, on the other hand, was trained on a much larger corpus of 1.6 million machine-labeled examples, exposing it to many more expressions of positive and negative sentiment from real-world Twitter messages; one would therefore expect such models to be more robust and to generalize better than models trained on a smaller set of sentiment concepts.

The rationale for using qualified CMU ARK tags in the 1.6M-tweet Sentiment140 data is to allow the same word to support different sentiment labels depending on its part-of-speech context. For example, the word wish has a positive weight when used as a noun (wish`n) but a negative weight when used as a verb (wish`v). The intent of this qualification is to improve the accuracy of sentiment predictions. To test the effectiveness of ARK-tagged features against untagged features, we built two models on a randomly selected subset of 11,000 tweets from the Stanford training corpus: one used bare words as features and the other used the same words tagged with CMU ARK part-of-speech tags. The tagged model achieved 71.6% accuracy compared to 67.4% for the bare-word model, supporting the superiority of part-of-speech-tagged features. It should be noted that Stanford also tried part-of-speech tags [4] and reported little success with that approach; however, they did not use the CMU ARK tagger, which has been optimized to recognize unconventional tokens such as hashtags, emoticons, and abbreviations in the specialized language of tweets.

It is also worth noting that our approach for building the Decision Tree model used only word tokens that occurred more than 50 times as modeling features. This was done to mitigate the effect of misspellings and nonsense words, which occur frequently in tweets. There were over 800,000 distinct tokens in the raw training corpus (including punctuation, but no emoticons). This is surprising, because standard English has only about 50,000 words, suggesting that the vocabulary of Twitter messages is an order of magnitude larger than that of English. These additional tokens come from proper names, misspelled and abbreviated words, and many other specialized tokens (e.g., emoticons) that have become part of the unique tweet vocabulary and are not normally considered part of standard natural language. The frequency filter reduced the token set to approximately 15,000 tokens, which were then further processed (as described in Section 3) to select the key sentiment tokens that tend to classify tweets as positive or negative, namely those with relatively high positive or negative TSI values. Neutral words (and all other unknown or unrecognized words) are effectively ignored by setting their TSI values to zero. We observed that the TSI sum (i.e., the sum of the individual TSI values of the words in a tweet) is a very strong feature by itself, capable of classifying the evaluation set with 75% accuracy. The 6,799 key-token features are individually much weaker, but collectively they raise the accuracy by an additional 8 percentage points to 83%.

6. Granger Causality and Durbin-Watson Test

The tweets for the days the stock market was open from November 13, 2012 to February 5, 2013 (167,345 in total) are processed by the trained SVM model. Instead of using a daily count of positive and negative tweets as the metric, a Sentiment Key Performance Index (SKPI) is used as the indicator of sentiment. Granger Causality Analysis (GCA) is applied to the daily SKPI time series and the AAPL stock market value time series. GCA is a standard test in finance and economics for discovering causal links between independently generated time series. It is based on the assumption that if a variable X causes Y, then changes in X will systematically occur before changes in Y, and the lagged values of X will show a statistically significant correlation with Y. A sketch of this test follows.
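The GCA step can be sketched with statsmodels (the paper does not name its statistics tooling, so this is an assumed stand-in):

    # p-values for 'cause Granger-causes effect' at lags 1..max_lag.
    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    def granger_pvalues(cause, effect, max_lag=7):
        """cause/effect are aligned daily series, e.g. SKPI and AAPL price change."""
        data = np.column_stack([effect, cause])   # statsmodels tests col 2 -> col 1
        results = grangercausalitytests(data, maxlag=max_lag, verbose=False)
        return {lag: res[0]["ssr_ftest"][1] for lag, res in results.items()}

    # granger_pvalues(skpi, price_change) corresponds to the X=f(Y) column of Table 1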
SKPI shows a Granger-causal relation with AAPL stock price movement for lags ranging from 1 to 5 days (p < 0.1), with the smallest p-value at a 3-day lag (p < 0.05). In other words, SKPI is shown to predict AAPL daily stock price movement with a 3-day lag (i.e., 3 days prior), as shown in Table 1.

Table 1. GCA results for AAPL price / SKPI

Lag (Day)   Y=f(X) / X=f(Y)
1           0.338 / 0.045*
2           0.407 / 0.027*
3           0.056 / 0.018**
4           0.138 / 0.070*
5           0.320 / 0.034*
6           0.480 / 0.104
7           0.408 / 0.152

Legend: ** p < 0.05, * p < 0.1. Format: Y=f(X) / X=f(Y), i.e., X Granger-causes Y / Y Granger-causes X, where X = AAPL daily stock value changes and Y = SKPI.

However, correlation in GCA does not prove causation, so the GCA results are validated with a Durbin-Watson (DW) test to filter out any spurious results. The DW test result implies a valid test with a high DW value (DW = 2.77).

7. Daily Sentiment Index (DSI) and Stock Trend

The DSI is computed from the daily positive and negative sentiment counts returned by the model. It is normalized to a -1 to +1 scale to remove daily fluctuations in tweet volume:

    DSI = (p/tp - n/tn) / (p/tp + n/tn)

where tp = total sum of daily positive counts, tn = total sum of daily negative counts, p = daily positive tweet count, and n = daily negative tweet count. DSI behaves like a time derivative, spiking up or down when sentiment changes.

Three regression models are produced for comparison. Model #3 shows the best results for predicting future stock trends given the past stock trend and multiple DSI values (Figure 1):

Model #1 => predicted future AAPL trend from past price
Model #2 => predicted future AAPL trend from past price + AAPL DSI
Model #3 => predicted future AAPL trend from past price + multiple source DSIs (computed from 23 tweet sources)

To maximize the persistence and effectiveness of tweet source as a model feature, we used the 21 most frequently occurring sources. The remaining 2,386 sources were aggregated into a 22nd virtual source called OTHER, and a 23rd source, ALL, aggregates all sources. See the appendix for the source list; a sketch of the DSI computation follows.
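A minimal sketch of the DSI computation (the exact normalization shown is our reading of the definition above and should be treated as an assumption):

    # Daily Sentiment Index from per-day positive/negative counts.
    import pandas as pd

    def daily_sentiment_index(daily: pd.DataFrame) -> pd.Series:
        """daily has columns 'p' and 'n': positive and negative counts per day."""
        tp, tn = daily["p"].sum(), daily["n"].sum()   # period totals (tp, tn)
        pos, neg = daily["p"] / tp, daily["n"] / tn   # volume-normalized shares
        return (pos - neg) / (pos + neg)              # in [-1, +1]; spikes on change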
Figure 1. Plots for the three models: left panel = Model #1; middle panel = Model #2; right panel = Model #3.

Table 2. GCA results for AAPL price / mood analysis

Lag (Day)  Positive Emotion  Swear         Anxiety    Anger         Sadness    Tentative      Certain
1          0.27/0.93         0.82/0.97     0.82/0.52  0.86/0.73     0.67/0.40  0.23/0.40      0.03**/0.90
2          0.41/0.97         0.85/0.75     0.89/0.68  0.77/0.74     0.43/0.76  0.27/0.61      0.06*/0.95
3          0.42/0.68         0.72/0.63     0.96/0.74  0.73/0.62     0.63/0.92  0.31/0.32      0.05*/0.66
4          0.22/0.70         0.10/0.78     0.36/0.57  0.16/0.78     0.54/0.94  0.18/0.39      0.04**/0.62
5          0.25/0.79         0.14/0.002**  0.40/0.70  0.21/0.003**  0.59/0.96  0.22/0.45      0.08*/0.72
6          0.32/0.87         0.22/0.005**  0.53/0.72  0.32/0.008**  0.72/0.92  0.29/0.41      0.12/0.50
7          0.16/0.33         0.32/0.01**   0.47/0.39  0.46/0.009**  0.69/0.79  0.09**/0.03**  0.03**/0.22

Legend: ** p < 0.05, * p < 0.1. Format: Y=f(X)/X=f(Y), i.e., X Granger-causes Y / Y Granger-causes X, where X = AAPL daily stock value changes and Y = category word ratio.

Table 3. DW validations of the AAPL price / mood analysis GCA results

Positive Emotion  Swear     Anxiety   Anger     Sadness   Tentative  Certain
2.035982          1.982703  2.025471  1.989783  2.105724  2.105724   2.150572

Legend: 2 = no correlation; 0 = positive correlation; 4 = negative correlation.

8. Mood Analysis and Stock Trend

We also investigated whether mood might form a better basis for modeling stock trends by analyzing the negative-sentiment tweets with LIWC. Negative sentiment is subcategorized into Swear, Anxiety, Anger, and Sadness based on the LIWC 2007 dictionary; two cognitive categories, Tentative and Certain, are also processed. The output is the word ratio of each category in the daily tweets. GCA is performed on these word ratios and the AAPL daily stock value changes. The results show that only Swear and Anger correlate with AAPL stock price change, beginning at a 5-day lag (Table 2). In the other direction, daily AAPL stock value correlates with Tentative at a 7-day lag and with Certain at 1-, 4-, and 7-day lags. However, DW validation indicates that Swear and Anger have only weak positive correlation with AAPL stock changes, with DW statistics of 1.982 and 1.989, respectively (Table 3). This differs from the result of Bollen et al. [1], who concluded that mood is a better predictor of the stock market than sentiment; we believe differences in data cleansing may be the potential cause. A sketch of the DW validation step follows.
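The DW validation can be sketched as follows (statsmodels assumed; variable names illustrative):

    # Durbin-Watson statistic of the residuals of a lagged regression;
    # a value near 2 suggests the GCA result is not spurious.
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    def dw_validate(y, x_lagged):
        """y = daily word ratio (or SKPI); x_lagged = lagged AAPL price changes."""
        fit = sm.OLS(y, sm.add_constant(x_lagged)).fit()
        return durbin_watson(fit.resid)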
9. Conclusion

Two supervised models, SVM and Decision Tree, have been built for Twitter analytics as part of an insider trading detection system. The SVM model achieved high precision and recall on the historical AAPL tweet data, while the Decision Tree model with POS-tagged features reached 83.3% accuracy on a 10K random sample from the Stanford Sentiment140 Tweet corpus. Two major indexes, SKPI and DSI, are discussed; both appear capable of predicting stock price movement on the data used in this project. DSI, when combined with Twitter sources, shows the best prediction results.

Acknowledgment

This project was funded by the Northrop Grumman Corporation 2013 Internal Research and Development program. Comments or opinions expressed in this paper do not necessarily represent the position of the company. We would like to thank Jim Sowder for helpful discussions.

References

[1] Johan Bollen, Huina Mao, Xiao-Jun Zeng. Twitter mood predicts the stock market. IEEE Computer, 44(10): 91-94, October 2011.
[2] Arabic highest growth on Twitter. Semiocast, November 24, 2011. URL: http://semiocast.com/publications/2011_11_24_arabic_highest_growth_on_twitter
[3] Sentiment140 Tweet Corpus. URL: http://help.sentiment140.com/for-students
[4] Alec Go, Richa Bhayani, Lei Huang. Twitter sentiment classification using distant supervision. Technical Report, Stanford University, 2009.
[5] James W. Pennebaker, Cindy K. Chung, Molly Ireland, Amy Gonzales, Roger J. Booth. The development and psychometric properties of LIWC2007. LIWC2007 Manual.
[6] Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, et al. Part-of-speech tagging for Twitter: annotation, features, and experiments. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, companion volume, Portland, OR, June 2011.
[7] RapidMiner, v5.3.007. Rapid-I GmbH, Stockumer Str. 475, 44227 Dortmund, Germany. URL: http://rapid-i.com/content/view/181/190/
[8] RuleQuest C5. URL: http://www.rulequest.com/see5-info.html
Appendix: List of Tweet Sources from the Historical Tweet Data

The metadata for the November 2012 to February 2013 tweets that we analyzed included the source of each tweet, an identifier representing the particular stream feed from which the tweet was sampled. There were 2,400 unique sources in the entire corpus, and they appeared to follow a power-law distribution: the most frequent source, web, accounted for about 20% of the corpus, whereas over 1,000 sources occurred only once.

Rank  Count    Source
1     108002   web
2     74432    Twitter for iPhone
3     62361    twitterfeed
4     22975    dlvr.it
5     22355    Twitter for Android
6     20897    Instagram
7     18805    TweetDeck
8     18416    IFTTT
9     15313    HootSuite
10    15221    Tweet Button
11    11388    Round Team
12    10820    iOS
13    10635    Twitter for BlackBerry
14    10520    Mobile Web
15    9336     Twitter for iPad
16    7771     Google
17    7736     Tweetbot for iOS
18    6561     Sylvester Trends
19    6074     StockTwits Web
20    5373     Echophone
21    5111     Twitter for Mac
22    89898    OTHER (2,386 other sources)
23    6000     ALL (all sources)