CS294-1 Final Project Report. Real-Time Event Detection by Twitter

Similar documents
Project Report CS294-1: Behavioral Data Mining Chronological and Geographical Visualization for Tweets. Hanzhong (Ayden) Ye, Hong Wu, Rohan Nagesh

Active Learning SVM for Blogs recommendation

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

EST.03. An Introduction to Parametric Estimating

Predicting Flight Delays

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

Machine Learning Final Project Spam Filtering

AN EXAMINATION OF ONLINE FRAUD COMPLAINT OCCURRENCES

Can Twitter provide enough information for predicting the stock market?

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

National Price Rankings

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Exploring Our World with GIS Lesson Plans Engage

Less naive Bayes spam detection

An Analysis of Commercial Vehicle Driver Traffic Conviction Data to Identify High Safety Risk Motor Carriers

REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION

Three-Year Moving Averages by States % Home Internet Access

Recognizing Informed Option Trading

Chapter 6. The stacking ensemble approach

Impacts of Sequestration on the States

Job Market Intelligence:

Applying Machine Learning to Stock Market Trading Bryce Taylor

Server Load Prediction

Public School Teacher Experience Distribution. Public School Teacher Experience Distribution

Information About Filing a Case in the United States Tax Court. Attached are the forms to use in filing your case in the United States Tax Court.

End-to-End Sentiment Analysis of Twitter Data

Net-Temps Job Distribution Network

Predicting the Stock Market with News Articles

Data Mining Yelp Data - Predicting rating stars from review text

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis

Sentiment analysis using emoticons

2016 ERCOT System Planning Long-Term Hourly Peak Demand and Energy Forecast December 31, 2015

NON-RESIDENT INDEPENDENT, PUBLIC, AND COMPANY ADJUSTER LICENSING CHECKLIST

BUSINESS DEVELOPMENT OUTCOMES

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

Twitter Analytics: Architecture, Tools and Analysis

VACA: A Tool for Qualitative Video Analysis

Supervised and unsupervised learning - 1

MAINE (Augusta) Maryland (Annapolis) MICHIGAN (Lansing) MINNESOTA (St. Paul) MISSISSIPPI (Jackson) MISSOURI (Jefferson City) MONTANA (Helena)

State Tax Information

ObamaCare s Impact on Small Business Wages and Employment

Search and Information Retrieval

Knowledge Discovery from patents using KMX Text Analytics

State Estate Taxes BECAUSE YOU ASKED ADVANCED MARKETS

Neural Networks and Support Vector Machines

Wyoming Community College Commission 2015 Tuition Review

Executive Summary. Public Support for Marriage for Same-sex Couples by State by Andrew R. Flores and Scott Barclay April 2013

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

Monitoring and Analyzing Customer Feedback Through Social Media Platforms for Identifying and Remedying Customer Problems

Avalanche: Prepare, Manage, and Understand Crisis Situations Using Social Media Analytics

Creating a NL Texas Hold em Bot

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Workers Compensation State Guidelines & Availability

West Case Reporters and Digests Research Guide

Predict the Popularity of YouTube Videos Using Early View Data

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Englishinusa.com Positions in MSN under different search terms.

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

Predicting Defaults of Loans using Lending Club s Loan Data

Blog Post Extraction Using Title Finding

Bayesian Spam Detection

Challenges for Big Data Applications in Japan: Hopes and Concerns

14-Sep-15 State and Local Tax Deduction by State, Tax Year 2013

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

Big Data Text Mining and Visualization. Anton Heijs

ITU Kaleidoscope 2015

Music Mood Classification

other distance education innovations, have changed distance education offerings.

Information Management course

US Department of Health and Human Services Exclusion Program. Thomas Sowinski Special Agent in Charge/ Reviewing Official

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

Sentiment analysis on tweets in a financial domain

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Chex Systems, Inc. does not currently charge a fee to place, lift or remove a freeze; however, we reserve the right to apply the following fees:

Sample Size and Power in Clinical Trials

Making Sense of the Mayhem: Machine Learning and March Madness

Experiments in Web Page Classification for Semantic Web

NTT DATA Big Data Reference Architecture Ver. 1.0

State Tax Information

Feature Selection for Electronic Negotiation Texts

High Risk Health Pools and Plans by State

Data show key role for community colleges in 4-year

Geovisualization. Geovisualization, cartographic transformation, cartograms, dasymetric maps, scientific visualization (ViSC), PPGIS

Statistical tests for SPSS

CAPSTONE ADVISOR: PROFESSOR MARY HANSEN

E-commerce Transaction Anomaly Classification

Concepts of digital forensics

ANALYTICS IN BIG DATA ERA

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Comparison of Data Mining Techniques used for Financial Data Analysis

Recruitment and Retention Resources By State List

Worksheet: Firearm Availability and Unintentional Firearm Deaths

Analysis of Social Media Streams

L25: Ensemble learning

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

Feature Subset Selection in Spam Detection

Data, Measurements, Features

Transcription:

CS294-1 Final Project Report Real-Time Event Detection by Twitter Cheng Lu & Yun Jin Abstract Twitter has become a popular microblogging service. An important characteristic of Twitter is its real-time nature. For example, when an earthquake occurs, people make many tweets related to the earthquake, which enables detection of earthquake occurrence promptly, simply by observing the tweets. In this report, we investigate the real-time interaction of events such as earthquakes, in Twitter, and propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. We also establish a temporal model to estimate time of occurrence. Finally, we use geocharts from Google to show our results on maps. In this report, section 1 describes our goal of this project. Section 2 introduces how we collect data and how we processed these data for classification. Section 3 talks about the approach we used for the project in detail and why we choose it. Section 4 shows the results and performance analysis. Section 5 gives a short scenario about what will be done for next step. Section 6 summarizes what we have done and what we have learned through this project. Project Goal Since real-time nature is one of the most important features for Twitter, we could detect real-time events by monitoring tweets. And we try to find the real time of location of such events. These events share several properties in common: i) They are of large scale (many users experience the event). ii) They particularly influence people s daily life (for that reason, they are induced to tweet about it). iii) They have both spatial and temporal regions (so that real-time location estimation would be possible). The research questions we are trying to answer in this project are: i) Can we detect such event occurrence in real-time by monitoring twitter? ii) Can we find the real time and location information of such event?

In this report, we focus on the detection of earthquake that has the above properties. The goal of our project is to detect earthquake events by collecting tweets data and to estimate the time of location of earthquake occurrence. Data Collecting & Data Processing We start our analysis by searching the queries from the MarkLogic Twitter Database by XQueries. In our project, we first search the keyword earthquake for a subset of tweets that all contains the keyword in our searching queries. Subsequently, we limit the location of all tweets to US and set the language as English since we concentrate on the events that are in the region of US. Finally, we get 20,000 tweets that are dated from Nov 22nd, 2011 to Mar 13th, 2012. Follow by that, we employed simple statistical analysis of the twitter data we got. From these tweets data, we found that most are located in west in accordance to the fact that earthquake happens most in the west coast in US and are pretty rare in Midwest and Northeast area (this may also due to the possibility that the twitter population is way more dense in west coast). Figure 1 shows the regional distribution of the data that contains the keyword earthquake located in US. Figure1. Distribution of earthquake in US from tweets data Approach To detect an event like earthquake and estimate the time and location of such event, we build an overall analysis platform for classification and estimation. Figure 2 shows the procedure of our approach. In this report, we first introduce two assumptions for detecting an event like earthquake. Then we describe the semantic analysis for classification. Subsequently, we propose a temporal model for event occurrence probability. Finally, we estimate the time of location of the earthquake.

Figure2. Frame of approach Assumption If we search the tweet and find out one user posted a relevant tweet, then we classify it into a positive class. The users function like a sensor of the event, therefore a tweet can be considered as a sensor output. In order to make our event detection feasible, we made the following assumptions: 1. Each Twitter user is a sensor, which detects a target event and makes a report following a certain probability. 2. Each tweet is associated with a time and location. We called our Twitter users as virtual sensors, which have various characteristics. Each sensor has different sensitivity, which is contradictory to traditional sensors. This means that some users are only activate for a limited time and specific events, while others are active to a wider range of time and events. The number of sensors is enormous. Therefore, we concluded that the virtual sensors are noisier than traditional physical sensors. Each tweet is associated with its post time, and we use this information as the estimated occurrence time of our target event. Figure 3 shows the conceptual model of our analysis platform. Each twitter user could be considered as a virtual sensor for our targeting event. Upon the occurrence of such events, each twitter users will have a certain probability to post a tweet about it, and the tweet they posted is the output of every sensor, just like the output signal of normal thermal sensor. Trained from known data, the classifier will take these tweets as input and giving out positive or negative result. The probabilistic model will finally indicate the signal of the targeting event.

Figure3. Conceptual model Semantic Analysis To obtain tweets on the target event precisely, we apply semantic analysis of a tweet: For example, users might make tweets such as Earthquake! thus earthquake could be keywords, but users might also make tweets such as I am attending an Earthquake Conference. Moreover, even though a tweet refers to the target event, it might not be appropriate as an event report; for example a user makes tweets such as The earthquake yesterday was scaring, or Three earthquakes in four days. Japan scares me. These tweets are truly the mentions of the target event, but they are not real-time reports of the events. Therefore, it is necessary to clarify that a tweet is actually referring to an actual earthquake occurrence, which is denoted as a positive class. By preparing positive and negative examples as a training set, we can produce a model to classify tweets automatically into positive and negative categories. We prepare three groups of features for each tweet as follows:

Features A (statistical features): The number of words in a tweet message Features B (word context features): The position of the query word within a tweet. Features C (keyword features): Bag of words in tweets We use the dictionary built in HW2, which has the length of around 260K. After gathering these features, we find the dimension of the feature vector is 260K. However, the model has more degrees of freedom than we have data. We decide to use linear dimensionality reduction (LDA) to compress data for both efficiency and accuracy. After the process of LDA, the dimension of feature vector has been reduced to 10K. Classification In the training stage, we have to manually classify the twitter data to prepare the training set. We manually classify about 1000 tweets. One interesting facts we found out is that most of the earthquake tweets are not really talking about real-time event of earthquake. Some people use earthquake as an expression, while others simply post earthquake news obtained from other media source, which are not our targeting positive tweets. To classify a tweet into a positive class or a negative class, we use several methods to train our classifier, including Naïve Bayes, linear regression and kernel SVMs to classify the data into positive and negative class. After training, we get top ten positive words and top ten negative words and we also provide weights for these words (Table 1). Table1. Weights of top ten positive and negative words

We can tell that the table shows some insights about our system. For the positive term, most authentic real-time earthquake tweets will tweet like I was shocked by an earthquake or Woke up from bed or I had a weird feeling, maybe an earthquake. The most significant negative terms are predict, report and magnitude, which usually don t indicate a real-time earthquake. People are also like to post Japan earthquake to show sympathy. Another interesting facts is that the number of words in each tweet has a negative correlation, meaning that most of the positive real-time earthquake tweets are shorter than the report and other expression, which makes sense too. In the testing stage, we get a comparison of rate of accuracy with LDA and without LDA for different classifier. Figure4. Rate of accuracy for different algorithms with LDA and without LDA From the result, the Intersection SVM, Linear Regression and Naïve Bayes methods outperform others, which all exceed 90% of success. We get our best result by using Intersection kernel SVM, baring only 8% of error rate. Event Occurrence Probability After classifying a tweet into positive or negative case, we have to calculate the event occurrence probability. We form a probabilistic model that gives the probability of event occurrence, for a given tweet that is a positive example. If the probability is larger than a predetermined threshold, then it determines an actual occurrence of the target event. To assess an alarm, we must calculate the reliability of multiple sensor values. For example, a user might make a false alarm by writing a tweet. It is also possible that the classifier misclassifies a tweet into a positive class. We can design the alarm probabilistically using the following two facts:

1. The false-positive ratio P f of a sensor is approximately 0.3. (The misclassification error we got is about 10%, while the rest are those who reports there is an earthquake while it does not happen.) 2. Sensors are assumed to be independent and identically distributed. Assuming that we have n sensors at one specific time, which produce positive signals, the probability of all n sensors returning a false alarm is P f n. Therefore, the probability of event occurrence can be estimated as P occur = 1 P f n. We determine the threshold as 0.95, so the probability of event occurrence should be P occur 0.95. We can calculate that the positive tweets we got should be no less than 3 (n 3). So our system will give out alarms when seeing more than 3 positive tweets about earthquake within a specific time. Estimate Time & Location For time estimation, we set the first post tweet time as the occurrence time. We didn t do much about the location estimation, since our data lost most of the GPS information. We simply used the city information for the location of events occurrence. Result We implemented a geochart by using HTML to show our final results of detecting the occurrence of earthquake in US from Nov 22nd, 2011 to Mar 13th, 2012 (Figure 5). We also collect real earthquake data from http://earthquake.usgs.gov/. Table 2 shows a comparison of our results and real earthquake data. From the comparison, we find that the estimated time of earthquake is a bit late than the real time of earthquake, but within a tolerable level. Figure5. Results of detecting earthquake Tweet Time Tweet Location Real Time Real Location 03:55:02 UTC, 11/26/2011 Oklahoma 03:53:10UTC, 11/26/2011 Oklahoma

11:58:20 UTC 11/30/2011 Seattle 11:56:30 UTC 11/30/2011 Seattle, Washington 03:48:37 UTC 12/23/2011 Southern Alaska 03:44:27 UTC 12/23/2011 Southern Alaska 12:09:10 UTC 1/3/2012 Beaver 12:06:36 UTC 1/3/2012 Beaver, Utah 07:25:34 UTC 1/5/2012 Boise 07:20:10 UTC 1/5/2012 Southern Idaho 07:15:05 UTC 1/20/2012 Butte 07:05:55 UTC 1/20/2012 Western Montana 05:26:18 UTC 1/21/2012 Camden 05:20:30 UTC 1/21/2012 Philadelphia 03:50:22 UTC 1/29/2012 Riverton 03:46:24 UTC 1/29/2012 Lander, Wyoming 03:59:49 UTC 1/31/2012 Hawaii 03:54:43 UTC 1/31/2012 Hawaii 04:50:11 UTC 2/1/2012 Wells 04:40:31 UTC 2/1/2012 Wells, Nevada 07:29:45 UTC 2/12/2012 New York 07:28:30 UTC 2/12/2012 Boston, Massachusetts 03:33:28 UTC 2/15/2012 Coos Bay 03:31:20 UTC 2/15/2012 Coos Bay, Oregon 05:07:30 UTC 3/5/2012 San Francisco 05:03:30 UTC 3/5/2012 Richmond, California 03:45:18 UTC 3/5/2012 Milpitas 03:33:12 UTC 3/5/2012 San Francisco 03:18:34 UTC 3/6/2012 El Paso 03:11:50 UTC 3/6/2012 Western Texas Table2. Comparison of estimated occurrence of earthquake and real earthquake Future Work We would focus on more accurate temporal and spatial model for time and location estimation if we have more data with location information. For temporal model, we should consider about the interval time between the occurrence time and the post time. So we should not use post time as occurrence time. For spatial model, we d like to apply Karman filtering or particle filtering into estimating the locations of events. We plan to detect real-time events of various kinds using Twitter, such as traffic jam, accidents and typhoon. Our model includes the assumption that a single instance of the target event exists at a certain time. For example, we assume that we do not have two or more earthquakes or typhoons simultaneously. Although the assumption is reasonable for these cases, it might not hold for other events such as traffic jams, accidents, and rainbows. To realize multiple event detection, we must produce advanced probabilistic models that allow hypotheses of multiple event occurrences. A search query is important to search possibly relevant tweets. For example, we set a query term as earthquake because most tweets mentioning an earthquake occurrence use this word. However, to improve the recall, it is necessary to obtain a good set of queries. We can use advanced algorithms for query expansion, which is a subject of our future work. Conclusion The project we propose is Real Time Event Detection by Twitter, which utilizes the real-time nature of Twitter. We consider each Twitter user as a virtual sensor, and use semantic analysis to classify tweets into a positive and negative class. Afterwards, we use probabilistic temporal model for detecting such event. Finally we use the post time as estimated occurrence time of the earthquake. Since we lose the GPS information of data, we cannot investigate into estimating the location of earthquake. From this project, we learn how to use MarkLogic XQuery to collect data. Also, we learn how to parse and tokenize XML data. The semantic analysis and feature

extraction are the key issue in the whole project. We also use LDA to get more accurate classification result. Additionally, we utilize different classification techniques to train and test data. Finally, we learnt how to use geocharts to visualize our results on maps. Reference 1. Sakaki T, Okazaki M, Matsuo Y (2010) Earthquake shakes Twitter users: Realtime event detection by social sensors.