Forecasting stock markets with Twitter



Similar documents
Using Twitter as a source of information for stock market prediction

Sentiment analysis on tweets in a financial domain

Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies

Analysis of Tweets for Prediction of Indian Stock Markets

Can Twitter provide enough information for predicting the stock market?

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis

Sentiment Analysis on Twitter with Stock Price and Significant Keyword Correlation. Abstract

An Introduction to Data Mining

Using News Articles to Predict Stock Price Movements

Stock Prediction Using Twitter Sentiment Analysis

Sentiment Analysis of Twitter Data within Big Data Distributed Environment for Stock Prediction

Semantic Sentiment Analysis of Twitter

On the Predictability of Stock Market Behavior using StockTwits Sentiment and Posting Volume

Supervised Learning (Big Data Analytics)

DATA MINING TECHNIQUES AND APPLICATIONS

Using Text and Data Mining Techniques to extract Stock Market Sentiment from Live News Streams

Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction

Introduction. A. Bellaachia Page: 1

How To Predict Web Site Visits

Using Tweets to Predict the Stock Market

Applying Machine Learning to Stock Market Trading Bryce Taylor

Neural Networks for Sentiment Detection in Financial Text

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

How To Use Neural Networks In Data Mining

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Emoticon Smoothed Language Models for Twitter Sentiment Analysis

Predict Influencers in the Social Network

CSE 598 Project Report: Comparison of Sentiment Aggregation Techniques

The Use of Twitter Activity as a Stock Market Predictor

NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE. A Thesis. Presented to. The Faculty of the Graduate School

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

A CRF-based approach to find stock price correlation with company-related Twitter sentiment

Hong Kong Stock Index Forecasting

Spam Detection on Twitter Using Traditional Classifiers M. McCord CSE Dept Lehigh University 19 Memorial Drive West Bethlehem, PA 18015, USA

Data Mining. Nonlinear Classification

Active Learning SVM for Blogs recommendation

Tweets Miner for Stock Market Analysis

The relation between news events and stock price jump: an analysis based on neural network

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Keywords social media, internet, data, sentiment analysis, opinion mining, business

The Viability of StockTwits and Google Trends to Predict the Stock Market. By Chris Loughlin and Erik Harnisch

Sentiment Analysis Tool using Machine Learning Algorithms

Machine learning for algo trading

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.

How can we discover stocks that will

ALGORITHMIC TRADING USING MACHINE LEARNING TECH-

Initial Report. Predicting association football match outcomes using social media and existing knowledge.

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Machine Learning Techniques for Stock Prediction. Vatsal H. Shah

SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING

Content-Based Recommendation

Predicting stocks returns correlations based on unstructured data sources

A Logistic Regression Approach to Ad Click Prediction

STA 4273H: Statistical Machine Learning

Crowdfunding Support Tools: Predicting Success & Failure

Twitter Sentiment Analysis of Movie Reviews using Machine Learning Techniques.

Data Pre-Processing in Spam Detection

Web Document Clustering

Machine Learning in FX Carry Basket Prediction

Microblog Sentiment Analysis with Emoticon Space Model

Data Mining - Evaluation of Classifiers

Why is Internal Audit so Hard?

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Spatio-Temporal Patterns of Passengers Interests at London Tube Stations

Challenges of Cloud Scale Natural Language Processing

Chapter 6. The stacking ensemble approach

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

Social Media Mining. Data Mining Essentials

Machine Learning Algorithms and Predictive Models for Undergraduate Student Retention

Equity forecast: Predicting long term stock price movement using machine learning

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

Twitter Stock Bot. John Matthew Fong The University of Texas at Austin

Twitter sentiment vs. Stock price!

MS1b Statistical Data Mining

Customer Relationship Management using Adaptive Resonance Theory

1. Classification problems

The Data Mining Process

Chapter 12 Discovering New Knowledge Data Mining

Italian Journal of Accounting and Economia Aziendale. International Area. Year CXIV n. 1, 2 e 3

Projektgruppe. Categorization of text documents via classification

Bug Report, Feature Request, or Simply Praise? On Automatically Classifying App Reviews

LCs for Binary Classification

Final Project Report. Twitter Sentiment Analysis

Financial Trading System using Combination of Textual and Numerical Data

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Machine Learning in Stock Price Trend Forecasting

Transcription:

Forecasting stock markets with Twitter Argimiro Arratia argimiro@lsi.upc.edu Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013, as part of larger study by the title Forecasting with Twitter data.

Motivations and Goals Does a public Sentiment Indicator extracted from daily Twitter messages can indeed improve the forecasting of social, economic, or commercial indicators, based on time series models? Answer: Experiment with all possible machine learning models reinforced with a Sentiment Index built from Twitter data, and compare their performance when trained with and without the Twitter index.

Some technical difficulties There are no public data sets available and Twitter have imposed several restrictions on retrieving on-line posted tweets. As of April 2010 Twitter forbids 3rd parties to redistribute tweets. Hence we have to create our own data set, with limitations. Begun on 22 March 2011. Use a Streaming API: One HTTP connection is kept alive to retrieve tweets as they are posted Filter the stream by keyword or user

The Twitter Gold Mine You can BUY data from official Twitter reseller. Very, very expensive! The Twitter-Hedge fund July 2011: Derwent Capital, a UK based hedge fund, in partnership with Bollen et al. began trading a $40 Million Hedge Fund using the Twitter Predictor. http://www.derwentcapitalmarkets.com/ http://mashable.com/2011/05/17/twitter-based-hedge-fund/

How we did it Target Dataset R.Filter Model Lag... Acc. w/o Acc. w/... GOOG GOOG TRUE SVM(sigmoid) 2... 0.567 0.578... GOOG GOOG TRUE SVM(sigmoid) 3... 0.578 0.589... GOOG GOOG TRUE SVM(sigmoid) 4... 0.589 0.612... GOOG GOOG TRUE SVM(sigmoid) 2... 0.567 0.578... GOOG GOOG TRUE SVM(sigmoid) 3... 0.578 0.589... GOOG GOOG TRUE SVM(sigmoid) 4... 0.589 0.612... GOOG GOOG TRUE SVM(sigmoid) 2... 0.567 0.578... GOOG GOOG TRUE SVM(sigmoid) 3... 0.578 0.589... GOOG GOOG TRUE SVM(sigmoid) 4... 0.589 0.612... GOOG GOOG TRUE SVM(sigmoid) 2... 0.567 0.578... GOOG GOOG TRUE SVM(sigmoid) 3... 0.578 0.589... GOOG GOOG TRUE SVM(sigmoid) 4... 0.589 0.612... GOOG GOOG TRUE SVM(sigmoid) 2... 0.567 0.578... GOOG GOOG TRUE SVM(sigmoid) 3... 0.578 0.589... GOOG GOOG TRUE SVM(sigmoid) 4... 0.589 0.612... GOOG GOOG TRUE SVM(sigmoid) 2... 0.567 0.578... GOOG GOOG TRUE SVM(sigmoid) 3... 0.578 0.589... GOOG GOOG TRUE SVM(sigmoid) 4... 0.589 0.612... GOOG GOOG TRUE SVM(sigmoid) 2... 0.567 0.578... GOOG GOOG TRUE SVM(sigmoid) 3... 0.578 0.589... GOOG GOOG TRUE SVM(sigmoid) 4... 0.589 0.612.............................. VIX YHOO TRUE SVM(sigmoid) 3... 0.521 0.601... VIX YHOO TRUE SVM(sigmoid) 4... 0.537 0.635.............................. Messages Text Processing Time Series Model Fitting and Evaluation Results table Summary Tree @twittov 20:33 You can choose exactly who you want to be a good person. I know because I'm human.... Relevance Filter Data Cleaning Bag of Words Sentiment Index Twitter Series lags lags L.Regression N. Network SVM Target Series Figure: Overview of data collection, preprocessing, forecasting and final analysis processes.

Twitter Online service that allows to build social networks based on microblogging Messages or tweets up to 140 characters Special or reserved symbols: @ for replying; # (hashtags) for topics assignment or keywords; RT (retweet) for sharing with followers; URLs.

Twitter social indicators implemented Our Twitter based indicators: Volume and Sentiment Hypothesis: (Volume) More messages more variance (Volatility)? Hypothesis: (Sentiment index) negative/positive decrease/increase benefits (returns)?

Sentiment Analysis Goal: Determine whether a message contains positive or negative impressions on a given subject Subjectivity Recognition Polarity Detection Begin with creating a labelled corpora for supervised training a sentiment classifier We apply a recent pictorial tagging idea [Bifet et al, 2010]: Create a dataset of tweets that are automatically labelled positive if contains a smiley of the form: :-),;-D or negative if contains :-(. (Formally, consider all regular expressions from [:=8][ -]?[)D] for positive, and from [:=8][ -]?( for negative.)

Sentiment Analysis Expand the training datasets by Feature Lists of words: classify by frequent words with at least 5 occurrences, and topic (e.g. a stock s ticker). Clean the data previously: Remove duplicates (mostly in retweets); Remove stopwords (e.g. pronouns, prepositions, many verbs); Stemming (reduce words to their roots by removing suffixes); Negation handling (use tagging: not good NOT good); Relevance filtering: text containing some of the keywords is no guarantee of its relevance for the subject we want to classify. E.g. I really love eating an apple Apple stock soared above $404 today

Sentiment Classifiers (SC) Multinomial Naïve Bayes: Given D the set of all docs. and C set of class labels, a Naive Bayes classifier will assign a doc. d the class with the highest conditional probability given that document. (d is represented as n-dim. vector (w 1,..., w n ) of Bernoulli-distributed variables indicating whether word w i occurs in d). The naive assumption is that the presence of a particular feature of a class is unrelated to the presence of any other feature. We only focus on binary classification (positive/negative) 3 classes of data sets to train different SCs: English, Multi-language, Stock. Variations of SC obtained by training on these data sets with different preprocessing, feature words to represent docs, etc. Best scoring SC extracted: C-En, C-Ml, C-Stk. Accuracy: 76.49% for C-En, 79.5% for C-Ml, 76% for C-Stk

Sentiment Index (or Twitter series) Sentiment: A time series consisting of the daily percentage of positive tweets (over the total number of tweets posted) for each top-scoring SC. Volume: time series of daily number of tweets concerning a subject.

Financial Time Series Goal To predict stock s return or volatility Focus on following companies and indices: AAPL, MSFT, GOOG, YHOO S&P100 (OEX), S&P100 s implicit volatility (VXO), S&P500 (GSPC), S&P500 implicit volatility (VIX). Returns Computed from Adjusted Close Log normally distributed Log returns Volatility Computed from log returns Exponential Weighted Moving Average m V (t n ) = (1 λ) λ i 1 Rn i 2 i=1

Forecasting time series Models for prediction Linear Regression y = n i=0 w ix i Neural Networks (feedforward) y j = ϕ ( n i=0 w ijx ij ) Support Vector Machines (with kernel polynomial, radial or sigmoid) Model Assessment Nonlinearity: we use White neural network test for neglected nonlinearity to assess the possible nonlinear relationship between two time series Causality: apply Granger test of causality, both parametric and non-parametric, to the two time series and for different lags. Prequential evaluation: Our data is from short period, we cannot spare an independent set for training. Use past data to update/retrain the classifier and predict the direction of the time series for the following day.

Experimental set up Combining all different parameters give us approx. 39000 experiments. Summary trees Decision trees built with REPTree (Weka), by greedy selection of attributes that give a higher performance gain. The tree give an indication of the attributes that most influent the increase on accuracy.

Experimental Results (General) Big failure of linear models (only improve 2.5% of the time: 64 improved/ 2418 not improved) Nonlinear models with either Twitter series improve as predictors of the trend of volatility indices (VXO, VIX) and historic volatilities of stocks. Predicting the trend of benefits is more dependent on the parameters, the input data and Twitter classifier. Best case is SVM with poly-2 kernel for predicting OEX with sentiment index obtained from C-Stk classifier.

Experimental Results (Specific) Success rates by model family for predicting the VIX index using Tweet Volume and lags = 3. Model family Successful Unsuccessful Success rate Neural Networks 100 35 74.07% Support Vector Machines 103 121 45.98% Success rates of SVM by kernel type for predicting the VXO index when lags = 2, plus either Twitter series. Kernel Type Successful Unsuccessful Success rate Polynomial Kernels 104 192 35.13% Radial 5 70 0.07% Sigmoid 68 7 90.67%

WARNING! Don t play with your home account The Twitter-hedge fund crash October 2011: Derwent Capital closes shop with reported returns of 1.86% after a month of trading with Bollen et al Twitter predictor, and is for sale today: http://www.derwentcapitalmarkets.com/auction/ The only machine model used in Bollen et al is a Fuzzy Logic Neural Network and they only care about price direction disregarding volatility. We were not able to test such machine (the program for it was of course not available), but in general Neural Network score low in our tests for predicting direction of returns for stocks (and the results of such experiments are very parameter-dependent). This is the END