CS294-1 Final Project Report. Real-Time Event Detection by Twitter

Transcription

1 CS294-1 Final Project Report Real-Time Event Detection by Twitter Cheng Lu & Yun Jin Abstract Twitter has become a popular microblogging service. An important characteristic of Twitter is its real-time nature. For example, when an earthquake occurs, people make many tweets related to the earthquake, which enables detection of earthquake occurrence promptly, simply by observing the tweets. In this report, we investigate the real-time interaction of events such as earthquakes, in Twitter, and propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. We also establish a temporal model to estimate time of occurrence. Finally, we use geocharts from Google to show our results on maps. In this report, section 1 describes our goal of this project. Section 2 introduces how we collect data and how we processed these data for classification. Section 3 talks about the approach we used for the project in detail and why we choose it. Section 4 shows the results and performance analysis. Section 5 gives a short scenario about what will be done for next step. Section 6 summarizes what we have done and what we have learned through this project. Project Goal Since real-time nature is one of the most important features for Twitter, we could detect real-time events by monitoring tweets. And we try to find the real time of location of such events. These events share several properties in common: i) They are of large scale (many users experience the event). ii) They particularly influence people s daily life (for that reason, they are induced to tweet about it). iii) They have both spatial and temporal regions (so that real-time location estimation would be possible). The research questions we are trying to answer in this project are: i) Can we detect such event occurrence in real-time by monitoring twitter? ii) Can we find the real time and location information of such event?

2 In this report, we focus on the detection of earthquake that has the above properties. The goal of our project is to detect earthquake events by collecting tweets data and to estimate the time of location of earthquake occurrence. Data Collecting & Data Processing We start our analysis by searching the queries from the MarkLogic Twitter Database by XQueries. In our project, we first search the keyword earthquake for a subset of tweets that all contains the keyword in our searching queries. Subsequently, we limit the location of all tweets to US and set the language as English since we concentrate on the events that are in the region of US. Finally, we get 20,000 tweets that are dated from Nov 22nd, 2011 to Mar 13th, Follow by that, we employed simple statistical analysis of the twitter data we got. From these tweets data, we found that most are located in west in accordance to the fact that earthquake happens most in the west coast in US and are pretty rare in Midwest and Northeast area (this may also due to the possibility that the twitter population is way more dense in west coast). Figure 1 shows the regional distribution of the data that contains the keyword earthquake located in US. Figure1. Distribution of earthquake in US from tweets data Approach To detect an event like earthquake and estimate the time and location of such event, we build an overall analysis platform for classification and estimation. Figure 2 shows the procedure of our approach. In this report, we first introduce two assumptions for detecting an event like earthquake. Then we describe the semantic analysis for classification. Subsequently, we propose a temporal model for event occurrence probability. Finally, we estimate the time of location of the earthquake.

3 Figure2. Frame of approach Assumption If we search the tweet and find out one user posted a relevant tweet, then we classify it into a positive class. The users function like a sensor of the event, therefore a tweet can be considered as a sensor output. In order to make our event detection feasible, we made the following assumptions: 1. Each Twitter user is a sensor, which detects a target event and makes a report following a certain probability. 2. Each tweet is associated with a time and location. We called our Twitter users as virtual sensors, which have various characteristics. Each sensor has different sensitivity, which is contradictory to traditional sensors. This means that some users are only activate for a limited time and specific events, while others are active to a wider range of time and events. The number of sensors is enormous. Therefore, we concluded that the virtual sensors are noisier than traditional physical sensors. Each tweet is associated with its post time, and we use this information as the estimated occurrence time of our target event. Figure 3 shows the conceptual model of our analysis platform. Each twitter user could be considered as a virtual sensor for our targeting event. Upon the occurrence of such events, each twitter users will have a certain probability to post a tweet about it, and the tweet they posted is the output of every sensor, just like the output signal of normal thermal sensor. Trained from known data, the classifier will take these tweets as input and giving out positive or negative result. The probabilistic model will finally indicate the signal of the targeting event.

4 Figure3. Conceptual model Semantic Analysis To obtain tweets on the target event precisely, we apply semantic analysis of a tweet: For example, users might make tweets such as Earthquake! thus earthquake could be keywords, but users might also make tweets such as I am attending an Earthquake Conference. Moreover, even though a tweet refers to the target event, it might not be appropriate as an event report; for example a user makes tweets such as The earthquake yesterday was scaring, or Three earthquakes in four days. Japan scares me. These tweets are truly the mentions of the target event, but they are not real-time reports of the events. Therefore, it is necessary to clarify that a tweet is actually referring to an actual earthquake occurrence, which is denoted as a positive class. By preparing positive and negative examples as a training set, we can produce a model to classify tweets automatically into positive and negative categories. We prepare three groups of features for each tweet as follows:

5 Features A (statistical features): The number of words in a tweet message Features B (word context features): The position of the query word within a tweet. Features C (keyword features): Bag of words in tweets We use the dictionary built in HW2, which has the length of around 260K. After gathering these features, we find the dimension of the feature vector is 260K. However, the model has more degrees of freedom than we have data. We decide to use linear dimensionality reduction (LDA) to compress data for both efficiency and accuracy. After the process of LDA, the dimension of feature vector has been reduced to 10K. Classification In the training stage, we have to manually classify the twitter data to prepare the training set. We manually classify about 1000 tweets. One interesting facts we found out is that most of the earthquake tweets are not really talking about real-time event of earthquake. Some people use earthquake as an expression, while others simply post earthquake news obtained from other media source, which are not our targeting positive tweets. To classify a tweet into a positive class or a negative class, we use several methods to train our classifier, including Naïve Bayes, linear regression and kernel SVMs to classify the data into positive and negative class. After training, we get top ten positive words and top ten negative words and we also provide weights for these words (Table 1). Table1. Weights of top ten positive and negative words

6 We can tell that the table shows some insights about our system. For the positive term, most authentic real-time earthquake tweets will tweet like I was shocked by an earthquake or Woke up from bed or I had a weird feeling, maybe an earthquake. The most significant negative terms are predict, report and magnitude, which usually don t indicate a real-time earthquake. People are also like to post Japan earthquake to show sympathy. Another interesting facts is that the number of words in each tweet has a negative correlation, meaning that most of the positive real-time earthquake tweets are shorter than the report and other expression, which makes sense too. In the testing stage, we get a comparison of rate of accuracy with LDA and without LDA for different classifier. Figure4. Rate of accuracy for different algorithms with LDA and without LDA From the result, the Intersection SVM, Linear Regression and Naïve Bayes methods outperform others, which all exceed 90% of success. We get our best result by using Intersection kernel SVM, baring only 8% of error rate. Event Occurrence Probability After classifying a tweet into positive or negative case, we have to calculate the event occurrence probability. We form a probabilistic model that gives the probability of event occurrence, for a given tweet that is a positive example. If the probability is larger than a predetermined threshold, then it determines an actual occurrence of the target event. To assess an alarm, we must calculate the reliability of multiple sensor values. For example, a user might make a false alarm by writing a tweet. It is also possible that the classifier misclassifies a tweet into a positive class. We can design the alarm probabilistically using the following two facts:

7 1. The false-positive ratio P f of a sensor is approximately 0.3. (The misclassification error we got is about 10%, while the rest are those who reports there is an earthquake while it does not happen.) 2. Sensors are assumed to be independent and identically distributed. Assuming that we have n sensors at one specific time, which produce positive signals, the probability of all n sensors returning a false alarm is P f n. Therefore, the probability of event occurrence can be estimated as P occur = 1 P f n. We determine the threshold as 0.95, so the probability of event occurrence should be P occur We can calculate that the positive tweets we got should be no less than 3 (n 3). So our system will give out alarms when seeing more than 3 positive tweets about earthquake within a specific time. Estimate Time & Location For time estimation, we set the first post tweet time as the occurrence time. We didn t do much about the location estimation, since our data lost most of the GPS information. We simply used the city information for the location of events occurrence. Result We implemented a geochart by using HTML to show our final results of detecting the occurrence of earthquake in US from Nov 22nd, 2011 to Mar 13th, 2012 (Figure 5). We also collect real earthquake data from Table 2 shows a comparison of our results and real earthquake data. From the comparison, we find that the estimated time of earthquake is a bit late than the real time of earthquake, but within a tolerable level. Figure5. Results of detecting earthquake Tweet Time Tweet Location Real Time Real Location 03:55:02 UTC, 11/26/2011 Oklahoma 03:53:10UTC, 11/26/2011 Oklahoma

8 11:58:20 UTC 11/30/2011 Seattle 11:56:30 UTC 11/30/2011 Seattle, Washington 03:48:37 UTC 12/23/2011 Southern Alaska 03:44:27 UTC 12/23/2011 Southern Alaska 12:09:10 UTC 1/3/2012 Beaver 12:06:36 UTC 1/3/2012 Beaver, Utah 07:25:34 UTC 1/5/2012 Boise 07:20:10 UTC 1/5/2012 Southern Idaho 07:15:05 UTC 1/20/2012 Butte 07:05:55 UTC 1/20/2012 Western Montana 05:26:18 UTC 1/21/2012 Camden 05:20:30 UTC 1/21/2012 Philadelphia 03:50:22 UTC 1/29/2012 Riverton 03:46:24 UTC 1/29/2012 Lander, Wyoming 03:59:49 UTC 1/31/2012 Hawaii 03:54:43 UTC 1/31/2012 Hawaii 04:50:11 UTC 2/1/2012 Wells 04:40:31 UTC 2/1/2012 Wells, Nevada 07:29:45 UTC 2/12/2012 New York 07:28:30 UTC 2/12/2012 Boston, Massachusetts 03:33:28 UTC 2/15/2012 Coos Bay 03:31:20 UTC 2/15/2012 Coos Bay, Oregon 05:07:30 UTC 3/5/2012 San Francisco 05:03:30 UTC 3/5/2012 Richmond, California 03:45:18 UTC 3/5/2012 Milpitas 03:33:12 UTC 3/5/2012 San Francisco 03:18:34 UTC 3/6/2012 El Paso 03:11:50 UTC 3/6/2012 Western Texas Table2. Comparison of estimated occurrence of earthquake and real earthquake Future Work We would focus on more accurate temporal and spatial model for time and location estimation if we have more data with location information. For temporal model, we should consider about the interval time between the occurrence time and the post time. So we should not use post time as occurrence time. For spatial model, we d like to apply Karman filtering or particle filtering into estimating the locations of events. We plan to detect real-time events of various kinds using Twitter, such as traffic jam, accidents and typhoon. Our model includes the assumption that a single instance of the target event exists at a certain time. For example, we assume that we do not have two or more earthquakes or typhoons simultaneously. Although the assumption is reasonable for these cases, it might not hold for other events such as traffic jams, accidents, and rainbows. To realize multiple event detection, we must produce advanced probabilistic models that allow hypotheses of multiple event occurrences. A search query is important to search possibly relevant tweets. For example, we set a query term as earthquake because most tweets mentioning an earthquake occurrence use this word. However, to improve the recall, it is necessary to obtain a good set of queries. We can use advanced algorithms for query expansion, which is a subject of our future work. Conclusion The project we propose is Real Time Event Detection by Twitter, which utilizes the real-time nature of Twitter. We consider each Twitter user as a virtual sensor, and use semantic analysis to classify tweets into a positive and negative class. Afterwards, we use probabilistic temporal model for detecting such event. Finally we use the post time as estimated occurrence time of the earthquake. Since we lose the GPS information of data, we cannot investigate into estimating the location of earthquake. From this project, we learn how to use MarkLogic XQuery to collect data. Also, we learn how to parse and tokenize XML data. The semantic analysis and feature

9 extraction are the key issue in the whole project. We also use LDA to get more accurate classification result. Additionally, we utilize different classification techniques to train and test data. Finally, we learnt how to use geocharts to visualize our results on maps. Reference 1. Sakaki T, Okazaki M, Matsuo Y (2010) Earthquake shakes Twitter users: Realtime event detection by social sensors.