Leveraging Big Data: A case study from Thomson Reuters
About the speakers: Chawapong Suriyajan, Development Group Leader; Sakol Suwinaitrakool, Senior Solution Architect 2
What's the problem we want to solve? "Behavioral finance is an area of increasing interest in financial markets, but it's been difficult for human traders to keep pace due to the sheer volume and detail of data and the need to interpret it and spot trends immediately." Philip Brittan, Chief Technology Officer & Global Head of Platform, Thomson Reuters 4
Introducing Social Media Monitor: the tool that helps overcome the challenges of analyzing social data and provides insights for investors 5
Awards: Corporate Entrepreneur Awards 2014, Best New Service; The Technical Analyst Awards 2014, Best Specialist Product; FStech Awards 2015, Financial Sector Innovation of the Year 6
What does SMM do? Perform Sentiment Analysis Visualize 7
Why Social Media? Fast!!! 8
Social Media is Fast On June 10, 2014, as Iraqi militants seized the Baiji oil refinery, the news broke on Twitter - six hours in advance of other media outlets covering the story. 9
Social Media is Fast In November 2013, The Globe and Mail tweeted that BlackBerry's $4.7 billion buyout had been scuttled. The tweet went out at 8:12 a.m., and by 8:19 a.m. BlackBerry stock had fallen 20%. 10
Why Social Media? Provides collective sentiment indicators 11
Social Media 13
The Growth of Social Data Source: http://www.searchenginejournal.com/growth-social-media-2-0-infographic/77055/ 14
Challenges in leveraging social media - The data are incredibly large - How can we make a machine analyze sentiment correctly? - How can we deal with noisy data? - How do we present a huge amount of data in a way that a human can easily understand? 15
Looking at the challenges - The data are incredibly large - How can we make a machine analyze sentiment correctly? - How can we deal with noisy data? - How do we present a huge amount of data in a way that a human can easily understand? 16
Emerging Technology Trends: Big Data Tools. Categories: File System, Document Store, Wide-column Store, Key-value Store 17
How big is our data? - Millions of tweets with a cashtag (e.g. $AAPL) per quarter - More than 1 terabyte of compressed data - Around 50 GB of data flowing into our system daily 18
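As a rough illustration of what "tweets with a cashtag" means in practice, the sketch below pulls cashtags out of tweet text with a regular expression. The pattern and the function name `extract_cashtags` are assumptions made for this example; they are not the filter used in Social Media Monitor, and real cashtag rules are more involved.

```python
import re

# Assumed pattern (illustrative only): "$" followed by 1-6 letters and an
# optional ".X" share-class suffix. The "$" must not follow a word character,
# and the symbol must not run on into further letters or digits.
CASHTAG_RE = re.compile(
    r"(?<![\w$])\$([A-Za-z]{1,6}(?:\.[A-Za-z]{1,2})?)(?![A-Za-z0-9])"
)


def extract_cashtags(text: str) -> list[str]:
    """Return upper-cased ticker symbols in order of appearance, without duplicates."""
    symbols: list[str] = []
    for match in CASHTAG_RE.finditer(text):
        symbol = match.group(1).upper()
        if symbol not in symbols:
            symbols.append(symbol)
    return symbols


if __name__ == "__main__":
    print(extract_cashtags("Buying $AAPL and $msft today, $AAPL looks strong"))
```

Dollar amounts such as "$4.7 billion" are skipped because the pattern requires a letter directly after the dollar sign.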
Social Media Monitor data sources (diagram): sources supplying 45,000 entries/day and 45,000,000 entries/day feed the Social Media Ingestor filter, which passes 215,000 entries/day 19
What is used to handle such big data? - Distributed - High availability - Full-text search - Document oriented - Schema free - RESTful API - Apache 2 open source license - Low cost 20
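The features listed above (document oriented, schema free, RESTful API) match Elasticsearch, which the talk names as part of its stack. As a minimal sketch of what a query against such a store could look like, the code below builds a JSON search body that counts tweets per sentiment label for one ticker. The field names `cashtags`, `created_at`, and `sentiment`, and the function `build_sentiment_search`, are hypothetical; the talk does not describe the actual index mapping.

```python
import json


def build_sentiment_search(symbol: str, since: str = "now-1h") -> dict:
    """Build an Elasticsearch query-DSL body (field names are assumptions)."""
    return {
        "size": 0,  # return only the aggregation, no individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"cashtags": symbol.upper()}},
                    {"range": {"created_at": {"gte": since}}},
                ]
            }
        },
        # Bucket the matching tweets by their sentiment label.
        "aggs": {"by_sentiment": {"terms": {"field": "sentiment"}}},
    }


if __name__ == "__main__":
    print(json.dumps(build_sentiment_search("aapl"), indent=2))
```

A body like this would typically be sent as a POST to an index's `_search` endpoint; the index name and cluster address depend on the deployment and are left out here.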
Our experience using Elasticsearch & Hadoop. Strengths: - Clean distributed deployment - Prior in-house testing already done. Challenges: - Determining the size of the cluster - Resource contention / resource sharing - Large dataset 21
Looking at the challenges - The data are incredibly large - How can we make a machine analyze sentiment correctly? - How can we deal with noisy data? - How do we present a huge amount of data in a way that a human can easily understand? 22
Analyzing Sentiments: Natural Language Processing. Example: "Apples are red. They are very delicious." - Tokenization: Apples, are, red, ... - Part-of-speech tagging: Apples = noun (the subject), are = verb - Lemmatization: Apples = apple, are = be - Named entity recognition: Apples = fruit, red = color - Coreference resolution: They = Apples 23
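To make the stages concrete, here is a toy walk-through of the slide's example sentence in plain Python. It is only a sketch: the lookup tables are hand-written for these two sentences, the coreference rule is a naive assumption (a plural pronoun refers to the first word of the previous sentence), and part-of-speech tagging is left out. A production system would use trained NLP models instead.

```python
import re

# Toy tables covering only the example sentences (assumptions, not real models).
LEMMAS = {"apples": "apple", "are": "be"}
ENTITY_TYPES = {"apple": "Fruit", "red": "Color"}
PLURAL_PRONOUNS = {"they", "them"}


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", sentence)


def lemmatize(token):
    """Map a token to its dictionary form; unknown words are just lower-cased."""
    return LEMMAS.get(token.lower(), token.lower())


def tag_entities(tokens):
    """Attach a coarse entity type to tokens whose lemma is in the toy table."""
    return {t: ENTITY_TYPES[lemmatize(t)] for t in tokens if lemmatize(t) in ENTITY_TYPES}


def resolve_coreference(sentences):
    """Naively link plural pronouns to the first token of the previous sentence."""
    links = {}
    for previous, current in zip(sentences, sentences[1:]):
        previous_tokens = tokenize(previous)
        for token in tokenize(current):
            if token.lower() in PLURAL_PRONOUNS and previous_tokens:
                links[token] = previous_tokens[0]
    return links


text = "Apples are red. They are very delicious"
sentences = [s.strip() for s in text.split(".") if s.strip()]
tokens = tokenize(sentences[0])
lemmas = [lemmatize(t) for t in tokens]
entities = tag_entities(tokens)
coreferences = resolve_coreference(sentences)

if __name__ == "__main__":
    print(tokens, lemmas, entities, coreferences)
```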
Analyzing Sentiments Machine Learning 24
Processing Tweets (diagram): tweets pass through the SM Ingestor (NLP, then machine-learning sentiment analysis) and come out as tweets tagged with a sentiment and a bullish/bearish label, ready for search. Processed in milliseconds. The SM Statistic Aggregator tracks: - Count of positive tweets - Count of negative tweets - Count of neutral tweets - Count of bullish tweets - Count of bearish tweets - Total tweet count 25
Looking at the challenges - The data are incredibly large - How can we make a machine analyze sentiment correctly? - How can we deal with noisy data? - How do we present a huge amount of data in a way that a human can easily understand? 26
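The six statistics on this slide can be sketched as a simple counting step over already-labeled tweets. The `ScoredTweet` type, its field names, and the label strings are assumptions for illustration; the talk does not show the aggregator's real data model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ScoredTweet:
    """A tweet after sentiment analysis (hypothetical shape)."""
    text: str
    sentiment: str  # "positive", "negative", or "neutral"
    stance: str     # "bullish", "bearish", or "none"


def aggregate(tweets):
    """Count tweets per sentiment and per stance, plus the overall total."""
    tweets = list(tweets)  # allow generators; we iterate more than once
    sentiments = Counter(t.sentiment for t in tweets)
    stances = Counter(t.stance for t in tweets)
    return {
        "positive": sentiments["positive"],
        "negative": sentiments["negative"],
        "neutral": sentiments["neutral"],
        "bullish": stances["bullish"],
        "bearish": stances["bearish"],
        "total": len(tweets),
    }


sample = [
    ScoredTweet("$AAPL earnings look great", "positive", "bullish"),
    ScoredTweet("Selling my $AAPL", "negative", "bearish"),
    ScoredTweet("$AAPL event is on Tuesday", "neutral", "none"),
    ScoredTweet("Love the new phone, buying more $AAPL", "positive", "bullish"),
]
stats = aggregate(sample)

if __name__ == "__main__":
    print(stats)
```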
What is the noise? What if people tweet about a company with strong bias? What if someone tweets jokes? Will this impact the analysis? Example: "Buy $Apple?" Is it positive or negative? 27
Minimizing the noise - Use proper filters for the PowerTrack API - Weight the sentiment score by Klout score - Focus on collective sentiment during a specific time period instead of individual tweets - Use enough training data to train our sentiment engine 28
Klout score The Klout Score is a number between 1 and 100 that represents your influence. The more influential you are, the higher your Klout Score. 29
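One way to combine Klout weighting with a time-window view of collective sentiment is a weighted average, sketched below. The formula, the numeric sentiment values (+1, 0, -1), and the function name `weighted_sentiment` are assumptions; the talk does not specify how Social Media Monitor actually weights scores.

```python
def weighted_sentiment(tweets, start, end):
    """Klout-weighted mean sentiment of tweets with start <= timestamp < end.

    tweets: iterable of (timestamp, sentiment, klout) tuples, where sentiment
    is +1 (positive), 0 (neutral), or -1 (negative) and klout is 1..100.
    Returns 0.0 when no tweet falls inside the window.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for timestamp, sentiment, klout in tweets:
        if not (start <= timestamp < end):
            continue  # outside the time window of interest
        if not (1 <= klout <= 100):
            raise ValueError(f"Klout score out of range: {klout}")
        weighted_sum += sentiment * klout
        total_weight += klout
    return weighted_sum / total_weight if total_weight else 0.0


# Timestamps are seconds from an arbitrary origin; the last tweet is outside the window.
sample = [(100, 1, 80), (160, -1, 20), (220, 0, 50), (900, -1, 99)]
score = weighted_sentiment(sample, start=0, end=600)

if __name__ == "__main__":
    print(round(score, 3))
```

In this sample the three tweets inside the window have an unweighted mean of 0, but the positive tweet comes from the most influential account, so the weighted score is (80 - 20) / 150 = 0.4.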
Looking at the challenges - The data are incredibly large - How can we make a machine analyze sentiment correctly? - How can we deal with noisy data? - How do we present a huge amount of data in a way that a human can easily understand? 30
Data Visualization: Bubble Chart 31
Data Visualization Heatmap 32
Data Visualization Technology 33
Strengths and Challenges. Strengths: - Server-side deployment: no installation on client machines - Presentation logic offloaded to client machines: lower resource requirements on the server side, more scalable (good code needed) - Scalable: Node.js is single-threaded with non-blocking I/O, so there is no overhead for context switching 34
Strengths and Challenges. Challenges: - Developer skills with the AngularJS framework - JavaScript performance - Node.js is quite sensitive to unhandled exceptions, which cause excessive memory usage 35
Q&A 36
Thank you 37