Twitter Analytics for Insider Trading Fraud Detection




Twitter Analytics for Insider Trading Fraud Detection

W-J Ketty Gann, John Day, Shujia Zhou
Information Systems, Northrop Grumman Corporation, Annapolis Junction, MD, USA
{wan-ju.gann, john.day2, shujia.zhou}@ngc.com

Abstract

Twitter analytics have been developed to process Twitter data at a macro level for use in an insider trading detection system. This system establishes normal trading patterns between daily stock price change and public sentiment. Two machine learning models, Support Vector Machine (SVM) and Decision Tree, are built based on annotated historical Twitter data and the Stanford Sentiment140 Tweet corpus, respectively. This paper focuses on the discussion of polarized sentiment (positive and negative), a comparison of the SVM and Decision Tree models, the Sentiment Key Performance Index (SKPI), the Daily Sentiment Index (DSI) and mood analysis. The results illustrate that the Twitter-based SKPI and DSI are useful indexes for predicting future stock price movement in regular stock trading.

Keywords: sentiment analysis; mood analysis; Twitter analytics; machine learning; Support Vector Machine; Decision Tree

1. Introduction

It has been reported that daily stock price change and public sentiment are correlated [1]; this paper focuses on the detection of insider trading fraud based on Twitter analytics. Insider trading fraud is considered one of the major types of financial fraud. Such frauds often deploy sophisticated schemes that current approaches are not able to detect systematically and in a timely manner. The proposed approach is to establish normal trading patterns between daily stock price and public sentiment at a macro level: when sentiment and mood trend positive, the stock price rises and the majority of investors are buying; when sentiment and mood trend negative, the stock price falls and the majority of investors are selling.
By comparing insider trading data, such as US Securities and Exchange Commission (SEC) Form 4 filings, against these normal trading patterns at the micro level, the system is able to detect abnormal timing in an insider's stock trade executions and issue a warning on those trades for further investigation. The remainder of this paper focuses on the Twitter analytics for sentiment/mood analysis used within the insider trading detection system described above.

Twitter is the world's largest micro-blogging platform. English is its most frequently used language, accounting for 39% of all Twitter messages [2]. Two supervised machine learning models, Support Vector Machine (SVM) and Decision Tree, are built to classify the Twitter data. They are discussed along with the training data and feature selection. Two indexes, the Sentiment Key Performance Index (SKPI) and the Daily Sentiment Index (DSI), are implemented for tweet volume calculations. The results of the indexes show that daily tweets appear to form a basis for predicting the future stock trend.

2. Data Collection

Supervised machine learning models require annotated training and testing sets. The following subsections describe data collection and preparation for each model.

2.1 Historical Twitter Data

We established a Twitter data set pertaining to Apple Inc. covering November 2012 to February 2013. Each tweet is recorded with user ID (identifier), date/time of posting, source (where the tweet was published) and tweet content. After cleaning the data and removing duplicates from the collection, we retrieved tweets written in English that contained Apple-related Twitter hashtags and/or keywords. The resulting tweets were then grouped by posting day, for a total of 167,345 tweets.

2.2 Stanford Sentiment140 Tweet Corpus

The Sentiment140 data set is publicly available [3].
To facilitate the labeling of such a large corpus (1.6 million tweets), Stanford used only Twitter messages (i.e., tweets) containing emoticons to determine positive or negative sentiment. Positive emoticons, such as :), or negative emoticons, such as :(, are used to classify each tweet. Before training, the emoticons were removed in order to force the modeling software to build a sentiment model exclusively from the context of the words surrounding the emoticons. Stanford used this data to build their Sentiment140 maximum entropy classifier [4].

3. Training Data Preparation

For the historical Twitter data, we focus on opinionated tweets by filtering out the information-bearing ones, such as tweets containing "http" and "www" strings. Strong sentiment words are selected from the filtered tweets to establish positive and negative sentiment training sets with 1,500 examples each. The sentiment words (including their variations) are defined in the Linguistic Inquiry and Word Count (LIWC) 2007 dictionary. LIWC [5] is a popular text analysis software program; for example, it has been used to analyze people's mental and physical health in correlation with the words they use in speech and writing samples. The LIWC 2007 dictionary is the core of this text analysis software and is composed of 4,500 words and word stems in various cognitive and emotion hierarchical categories. For instance, positive sentiment words include "love", "thrill" and "wonderful"; negative words consist of negation and blame words, e.g., "hate", "frustrate", "damn". When preparing the training data, we primarily include tweets of fewer than 15-20 words, since in longer tweets the sentiment tends to be balanced out, or more than one

sentiment presented. Currently, neutral tweets are not included in the training data.

For the Stanford Sentiment140 Tweet corpus, the preparation process is more involved than for the AAPL Twitter data. The following steps are used to prepare it for training:

- Tag each tweet using the Carnegie Mellon University (CMU) ARK part-of-speech tagger [6]. Each tokenized word within a given tweet is concatenated with an ARK tag, for example run`v or dog`n. The CMU ARK tagger defines 25 tags, specially customized for tagging tweets; a subset of 13 tags is used (V, N, A, R, !, &, G, E, #, T, Z, X, S).
- Select tagged tokens that occur 50 times or more in the overall data set. A total of 15,000 tokens are selected.
- Segregate these high-frequency tagged tokens into positive and negative groups according to the POS (positive) or NEG (negative) label on each tweet.
- Calculate the Total Sentiment Index (TSI) for each selected tagged token. The TSI represents the relative sentiment of a token based on the number of times p it occurred in positive-labeled tweets and the number of times n it occurred in negative-labeled tweets. Thus, an index of -1 results if the token was only ever seen in negative tweets, +1 if only in positive tweets, and 0 if seen in equal numbers of positive and negative tweets. To calibrate training sets with unequal numbers of positive and negative tweets, the total-positive over total-negative ratio (tp/tn) is used to rebalance the set so that TSI = 0 represents neutral. (Note: the Stanford set was already balanced, with 800,000 positive and 800,000 negative tweets, so this ratio was 1 for our experiments.)
- From the 15,000 selected feature tokens, discard those with TSI values near zero, leaving 6,799 tokens with TSI values appreciably greater than or less than zero.
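The TSI calculation in the steps above can be sketched as follows. This is a minimal sketch: the function name and the explicit ratio argument are our own, the closed form (p - r*n)/(p + r*n) is an assumed but natural reading of the description, and the token counts come from the worked examples given later in the paper.

```python
def token_tsi(pos_counts, neg_counts, ratio=1.0):
    """Total Sentiment Index per tagged token (assumed form).

    TSI = (p - r*n) / (p + r*n), where p and n are the token's counts in
    positive- and negative-labeled tweets and r is the total-positive over
    total-negative rebalancing ratio (r = 1 for the balanced Stanford set).
    The index is +1 for tokens seen only in positive tweets, -1 for tokens
    seen only in negative tweets, and 0 for ratio-adjusted equal occurrences.
    """
    tsi = {}
    for token in set(pos_counts) | set(neg_counts):
        p = pos_counts.get(token, 0)
        n = neg_counts.get(token, 0) * ratio
        tsi[token] = (p - n) / (p + n) if (p + n) else 0.0
    return tsi

# Counts taken from the paper's examples in Section 3:
pos = {"welcome`a": 2161, "cavity`n": 3}
neg = {"welcome`a": 137, "cavity`n": 54}
tsi = token_tsi(pos, neg)  # balanced corpus, so ratio = 1
```

With these counts the sketch reproduces the values reported below: 0.881 for welcome`a and -0.895 for cavity`n.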
The hypothesis is that the sum of these features tends to classify the tweets in which they occur as positive or negative. All other tokens in each tweet are treated as having TSI = 0. The following are examples of selected tokens: welcome`a as an adjective occurs 2,298 times in the whole data set, of which 137 occurrences are in negative tweets and 2,161 in positive tweets; both the TSI and the absolute TSI of welcome`a are 0.881. cavity`n as a noun has a high negative count of 54 and a low positive count of 3, resulting in a TSI of -0.895.

Finally, create a feature vector for each tweet in the data set with the following elements:

- ID: a unique identifier.
- Ground Truth (GT): Pos (positive sentiment), Neg (negative sentiment), or Neu (neutral sentiment). (Note: no Neu records were included in this training set; they were reserved for future test purposes.)
- TSI: the sum of the TSI values for each token in the tweet. For words not among the top 6,799 tokens described above, a default value of 0 is assigned.
- A Boolean array indicating the presence or absence of each TSI-selected token in the tweet.

Thus, a feature vector contains 6,802 elements: an ID, a GT (ground truth) tag, a TSI sum and 6,799 Boolean features denoting the presence or absence of each key sentiment token.

4. Building Training Models

Probabilistic models using supervised machine learning algorithms are built with open source text/data mining tools.

4.1 SVM (Support Vector Machine) Model

SVM is a well-known classifier for text categorization. Given a set of training samples, each marked with one of two categories (positive and negative in this case), an SVM training algorithm builds a model that assigns a new instance to the positive or negative category. We used the historical Twitter data set described above to build an SVM model in RapidMiner [7], an open source text/data mining tool; version 5.3.007 is used for this project. A 10-fold cross validation is used to estimate the accuracy of the model.
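The 10-fold cross-validation procedure can be illustrated with a stdlib-only sketch. The paper trained its SVM in RapidMiner; here a trivial lexicon classifier (our own stand-in, built from the LIWC-style sentiment words quoted in Section 3) replaces the SVM purely to show how fold-wise precision and recall are estimated and averaged.

```python
def k_fold_indices(n, k=10):
    """Deal indices 0..n-1 into k near-equal folds (round-robin).

    A real run would shuffle and stratify first; the toy data below is
    already class-interleaved, so round-robin keeps every fold mixed.
    """
    return [list(range(i, n, k)) for i in range(k)]

def precision_recall(y_true, y_pred, positive="pos"):
    """Precision and recall for the positive class."""
    tp = sum(t == positive and q == positive for t, q in zip(y_true, y_pred))
    fp = sum(t != positive and q == positive for t, q in zip(y_true, y_pred))
    fn = sum(t == positive and q != positive for t, q in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

def cross_validate(texts, labels, k=10):
    """Average precision/recall over k folds, training on the rest each time."""
    scores = []
    for held_out in k_fold_indices(len(texts), k):
        held = set(held_out)
        # "Train": learn words unique to one class from the training folds.
        pos_w, neg_w = set(), set()
        for i in range(len(texts)):
            if i not in held:
                (pos_w if labels[i] == "pos" else neg_w).update(texts[i].split())
        pos_only, neg_only = pos_w - neg_w, neg_w - pos_w
        y_true = [labels[i] for i in held_out]
        y_pred = ["pos" if len(set(texts[i].split()) & pos_only)
                  >= len(set(texts[i].split()) & neg_only) else "neg"
                  for i in held_out]
        scores.append(precision_recall(y_true, y_pred))
    return tuple(sum(s) / k for s in zip(*scores))

# Toy data built around the sentiment words quoted in Section 3.
texts = ["i love this phone", "what a thrill today", "simply wonderful news",
         "i hate this update", "damn battery again", "they frustrate me so much"] * 5
labels = ["pos", "pos", "pos", "neg", "neg", "neg"] * 5
prec, rec = cross_validate(texts, labels)
```

On this cleanly separable toy set every fold classifies perfectly; on real tweets the fold-averaged precision and recall are the figures reported below.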
Based on the 3,000 sample tweets in our training data (1,500 each positive and negative), the model reached a precision of 90.25% and a recall of 74.27%.

4.2 Decision Tree Model

A Decision Tree model is built as a general-purpose sentiment classifier using the open source data mining tool RuleQuest C5 [8]. The model (an unboosted decision tree) was trained on the 1.6 million machine-labeled tweets from the Stanford Sentiment140 training set, which included 359 additional labeled tweets set aside for blind testing of the models. Machine-labeling was facilitated by using only tweets containing emoticons, which were removed after labeling to force the model to learn the corresponding sentiment from the surrounding text. Experiments were run with random samplings of 10K tweets, 100K tweets, and the full 1.6M tweet data set, respectively. The 10K random sample provides the best result, with 83.3% accuracy. This exceeds Stanford's reported accuracies of 79.9% (Naive Bayes), 79.9% (Maximum Entropy) and 81.9% (SVM) when unigrams and POS are used as features [4].

5. Discussion of the SVM and Decision Tree Models

The SVM and Decision Tree models were trained on different training data; therefore, their results are not directly comparable. However, the SVM model tends to have higher precision when using hand-labeled training data. On the other hand, the Decision Tree model was trained on a much larger corpus of 1.6 million machine-labeled examples, exposing it to many more expressions of positive and negative sentiment from real-world Twitter messages. Thus, one would expect such a model to be more robust and to generalize better than models trained on a smaller set of sentiment concepts. The rationale for using qualified CMU ARK tags in the 1.6M tweet Sentiment140 data is to allow the same word to support different sentiment labels depending on its part-of-speech context.
For example, the word wish has a positive weight when used as a noun (wish`n), while it is negative

when used as a verb (wish`v). The intent of this qualification is to improve the accuracy of sentiment predictions. To test the effectiveness of ARK-tagged features versus untagged ones, we built two models on a randomly selected subset of 11,000 tweets from the Stanford training corpus: one model used bare words as features and the other used the same words tagged with the CMU ARK part-of-speech tags. The tagged model achieved 71.6% accuracy compared to 67.4% for the bare-word model, supporting the superiority of part-of-speech-tagged features. It should be noted that Stanford also tried using part-of-speech tags [4] and reported little success with this approach. However, they did not use the CMU ARK tagger, which has been optimized to recognize unconventional tokens such as hashtags, emoticons and abbreviations as tokens of the specialized language used in tweets.

It is also worth noting that our approach to building the Decision Tree model used only word tokens occurring more than 50 times as modeling features. This was to mitigate the effect of the misspellings and nonsense words that occur frequently in tweets. There were over 800,000 distinct tokens in the raw training corpus (including punctuation, but no emoticons). This is surprising given that standard English has only about 50,000 words, and it suggests that the vocabulary of Twitter messages is an order of magnitude larger than that of English. These additional tokens come from proper names, misspelled and abbreviated words, and many other specialized tokens (e.g., emoticons) that have become part of the unique tweet vocabulary but are not normally considered part of a standard natural language. Frequency filtering reduced the token set to approximately 15,000 tokens, which were then further processed (as described in Section 3 above) to select the key sentiment tokens that tend to classify tweets as positive or negative.
These are precisely the tokens with relatively high positive or negative TSI values; neutral words (and all other unknown or unrecognized words) are effectively ignored by setting their TSI values to zero. It was observed that the TSI sum feature (i.e., the sum of the individual TSI values for each word in a tweet) is a very strong feature by itself, capable of classifying the evaluation set with 75% accuracy. The additional 6,799 key token features are much weaker in classification strength individually, but collectively they raise the accuracy by an additional 8 percentage points, to 83%.

6. Granger Causality and Durbin-Watson Test

The tweets for the days the stock market was open between November 13, 2012 and February 5, 2013 (167,345 in total) are processed by the trained SVM model. Instead of using a daily count of positive and negative tweets as the metric, a Sentiment Key Performance Index (SKPI) is used as the indicator of sentiment. Granger Causality Analysis (GCA) is applied to the daily SKPI time series and the AAPL stock market value time series. GCA is a standard test in finance and economics for discovering causal links between independently generated time series. It is based on the assumption that if a variable X causes Y, then changes in X will systematically occur before changes in Y, and the lagged values of X will show a statistically significant correlation with Y. SKPI shows a Granger-causal relation with AAPL stock price movement for lags ranging from 1 to 5 days (p < 0.1), with the 3-day lag having the smallest p-value (p < 0.05). In other words, SKPI is shown to predict AAPL daily stock price movement with a 3-day lag (i.e., 3 days prior), as shown in Table 1.

Table 1. GCA results for AAPL price/SKPI

Lag (Day)   Y=f(X) / X=f(Y)
1           0.338 / 0.045*
2           0.407 / 0.027*
3           0.056 / 0.018**
4           0.138 / 0.070*
5           0.32  / 0.034*
6           0.48  / 0.104
7           0.408 / 0.152

Legend: **p < 0.05, *p < 0.1. Format Y=f(X)/X=f(Y): X Granger-causes Y / Y Granger-causes X, where X = AAPL daily stock value changes and Y = SKPI.

However, correlation in GCA does not prove causation, so the GCA results are validated with a Durbin-Watson (DW) test to filter out spurious results. The DW test result implies a valid test, with a high DW value (DW = 2.77).

7. Daily Sentiment Index (DSI) and Stock Trend

The DSI is computed from the daily positive and negative sentiment counts returned by the model. DSI values lie on a -1 to +1 scale to normalize daily fluctuations in tweet volume, where tp = total sum of daily positive counts, tn = total sum of daily negative counts, n = daily negative tweet count and p = daily positive tweet count. DSI behaves like a time derivative, spiking up or down during sentiment changes. Three regression models are produced for comparison:

Model #1: predicted future AAPL trend from past price.
Model #2: predicted future AAPL trend from past price + AAPL DSI.
Model #3: predicted future AAPL trend from past price + multiple source-DSI values (computed from 23 tweet sources).

Model #3 shows the best results for predicting future stock trends given the past stock trend and multiple DSI values (Figure 1). To maximize the persistence and effectiveness of using the tweet source as a model feature, we used the 21 most frequently occurring sources. The remaining 2,386 sources were aggregated into a 22nd virtual source called OTHER, and the 23rd source, called ALL, was the aggregate of all sources. See the appendix for the source list.
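The index-and-lag machinery of Sections 6 and 7 can be sketched with stdlib Python only. Two loud caveats: the exact DSI formula did not survive this transcription, so the dsi form below (the day's share of positive counts minus its share of negative counts) is an assumption consistent only with the stated -1 to +1 range and variable definitions; and a production analysis would use a proper Granger test (e.g., statsmodels' grangercausalitytests) rather than the bare lagged correlation shown here, which only captures the GCA intuition.

```python
def dsi(p, n, tp, tn):
    """Assumed Daily Sentiment Index: the day's positive count as a share of
    all positive counts minus the day's negative share; lies in [-1, +1] and
    normalizes for daily tweet-volume fluctuations. The paper's exact formula
    is not reproduced in this transcription."""
    return p / tp - n / tn

def pearson(a, b):
    """Plain Pearson correlation coefficient."""
    na = len(a)
    ma, mb = sum(a) / na, sum(b) / na
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def lagged_corr(x, y, lag):
    """Correlate x[t-lag] with y[t]: the GCA intuition that changes in the
    sentiment series X systematically precede changes in the price series Y."""
    return pearson(x[:len(x) - lag], y[lag:])

def durbin_watson(residuals):
    """DW statistic on regression residuals: ~2 means no autocorrelation,
    toward 0 positive autocorrelation, toward 4 negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    return num / sum(e * e for e in residuals)

# Toy check: a price-change series that simply echoes sentiment three days
# later correlates perfectly at lag 3, mirroring the 3-day SKPI lead above.
skpi = [0.1, -0.2, 0.3, 0.0, 0.2, -0.1, 0.4, -0.3, 0.1, 0.2] * 2
price_change = [0.0, 0.0, 0.0] + skpi[:-3]
```

The same scan over lags 1-7, run per mood category, is what produces the tables in the next section.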

Figure 1. Plots for the three models: left panel = Model #1; middle panel = Model #2; right panel = Model #3.

Table 2. GCA results for AAPL price/negative mood analysis

Lag (Day)  Positive Emotion  Swear         Anxiety    Anger         Sadness    Tentative      Certain
1          0.27/0.93         0.82/0.97     0.82/0.52  0.86/0.73     0.67/0.40  0.23/0.40      0.03**/0.9
2          0.41/0.97         0.85/0.75     0.89/0.68  0.77/0.74     0.43/0.76  0.27/0.61      0.06*/0.95
3          0.42/0.68         0.72/0.63     0.96/0.74  0.73/0.62     0.63/0.92  0.31/0.32      0.05*/0.66
4          0.22/0.70         0.10/0.78     0.36/0.57  0.16/0.78     0.54/0.94  0.18/0.39      0.04**/0.62
5          0.25/0.79         0.14/0.002**  0.40/0.70  0.21/0.003**  0.59/0.96  0.22/0.45      0.08*/0.72
6          0.32/0.87         0.22/0.005**  0.53/0.72  0.32/0.008**  0.72/0.92  0.29/0.41      0.12/0.50
7          0.16/0.33         0.32/0.01**   0.47/0.39  0.46/0.009**  0.69/0.79  0.09**/0.03**  0.03**/0.22

Legend: **p < 0.05, *p < 0.1. Format Y=f(X)/X=f(Y): X Granger-causes Y / Y Granger-causes X, where X = AAPL daily stock value changes and Y = word ratio.

Table 3. DW validation of the AAPL price/negative mood GCA results

Positive Emotion  Swear     Anxiety   Anger     Sadness   Tentative  Certain
2.035982          1.982703  2.025471  1.989783  2.105724  2.105724   2.150572

Legend: 2 = no correlation; 0 = positive correlation; 4 = negative correlation.

8. Mood Analysis and Stock Trend

We also investigated whether mood might form a better basis for modeling stock trends, by analyzing the negative-sentiment tweets with LIWC. The negative sentiment is subcategorized into Swear, Anxiety, Anger, and Sadness based on the LIWC 2007 dictionary; in addition, two cognitive categories, Tentative and Certain, are also processed. The output is the word ratio of each category present in the daily tweets. GCA is performed on these word ratios and the AAPL daily changes in stock value. The results show that only Swear and Anger correlate with the AAPL stock price change at a 5-day lag (Table 2). On the other hand, the daily AAPL stock value correlates with Tentative at a 7-day lag, and with Certain at 1-, 4-, and 7-day lags.
However, DW test validation indicates that Swear and Anger have only a weak positive correlation with AAPL stock changes, with DW values of 1.982 and 1.989, respectively (Table 3). This differs from the result of Bollen et al. [1], who concluded that mood is a better predictor than sentiment for the stock market. We believe that differences in data cleansing may be the cause.

9. Conclusion

Two supervised models, SVM and Decision Tree, have been built for Twitter analytics as part of an insider trading

detection system. SVM achieved high precision/recall on the historical AAPL tweet data, while the Decision Tree with positive features reached 83.3% accuracy on a 10K random sample from the Stanford Sentiment140 Tweet corpus. Two major indexes, SKPI and DSI, are discussed; they appear to be capable of predicting stock price movement on the data used in this project. DSI, when combined with Twitter sources, shows the best prediction results.

Acknowledgment

This project was funded by the Northrop Grumman Corporation 2013 Internal Research and Development program. Comments or opinions expressed in this paper do not necessarily represent the position of the company. We would like to thank Jim Sowder for helpful discussions.

References

[1] Johan Bollen, Huina Mao, Xiao-Jun Zeng. Twitter mood predicts the stock market. IEEE Computer, 44(10): 91-94, October 2011.
[2] Arabic highest growth on Twitter. Semiocast, November 24, 2011. URL: http://semiocast.com/publications/2011_11_24_arabic_highest_growth_on_twitter
[3] Sentiment140 Tweet Corpus. URL: http://help.sentiment140.com/for-students
[4] Alec Go, Richa Bhayani, Lei Huang. Twitter sentiment classification using distant supervision. Technical Report, Stanford University, 2009.
[5] James W. Pennebaker, Cindy K. Chung, Molly Ireland, Amy Gonzales, Roger J. Booth. The development and psychometric properties of LIWC2007. LIWC2007 Manual.
[6] Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, et al. Part-of-speech tagging for Twitter: annotation, features, and experiments. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, companion volume, Portland, OR, June 2011.
[7] RapidMiner v5.3.007. Rapid-I GmbH, Stockumer Str. 475, 44227 Dortmund, Germany. URL: http://rapid-i.com/content/view/181/190/
[8] RuleQuest C5. URL: http://www.rulequest.com/see5-info.html

Appendix: List of Tweet Sources from the Historical Tweet Data

The metadata for the November 2012 to February 2013 tweets we analyzed included the source of each tweet, an identifier representing the particular stream feed from which each tweet was sampled. There were 2,400 unique sources in the entire corpus, which appeared to be distributed according to a power law: the most frequent source, "web", accounted for about 20% of the corpus, whereas over 1,000 sources occurred only once.

Rank  Count   Source
1     108002  web
2     74432   Twitter for iPhone
3     62361   twitterfeed
4     22975   dlvr.it
5     22355   Twitter for Android
6     20897   Instagram
7     18805   TweetDeck
8     18416   IFTTT
9     15313   HootSuite
10    15221   Tweet Button
11    11388   Round Team
12    10820   iOS
13    10635   Twitter for Blackberry
14    10520   Mobile Web
15    9336    Twitter for iPad
16    7771    Google
17    7736    Tweetbot for iOS
18    6561    Sylvester Trends
19    6074    StockTwits Web
20    5373    Echophone
21    5111    Twitter for Mac
22    89898   OTHER (2,386 other sources)
23    6000    ALL (all sources)