Twitter and Natural Disasters

Peter Ney




Introduction

The growing popularity of mobile computing and social media has created new opportunities to incorporate social media data into crisis response. In crisis situations such as natural disasters and mass emergencies, social media data can be a valuable source of real-time, firsthand information from affected individuals and can help emergency managers and first responders efficiently assess and respond to needs. However, responders and affected people alike are unable to fully leverage this information, due in part to the massive volume of noisy data and the lack of tools and processes to help them filter and analyze these new information streams. The social media platform Twitter provides a rich public dataset of short, user-created messages called tweets that arrive in real time and provide an effective model for studying disruptive crisis situations.

In crisis situations it is especially important to get reliable information from primary sources on the ground, because they are in a unique position to find information that is not available elsewhere, and they often have knowledge of the local geography and culture [1]. One difficulty in finding reliable information on Twitter is that the majority of Twitter communications are derivative: they are reposts or simply repeat already-known information, which makes the data noisier [1]. Another limitation of Twitter data is that posts are restricted to 140 characters, which limits the effectiveness of Natural Language Processing (NLP) techniques, since most aim to derive meaning from longer pieces of text. There are also problems with the scale of the data, as previous crisis-related Twitter studies often involve hundreds of thousands of tweets and thousands of users [2].
To get around these problems we hope to leverage Twitter metadata, such as which hashtags were used, the location of the tweet, whether a tweet was retweeted, and date and time information, to improve crisis response. The retweet mechanism of Twitter is an especially useful way to find relevant information. Retweets containing important keywords have been found to be more on-topic to the crisis situation than non-retweets, which suggests that retweets act as a form of recommendation system [3]. It is also known that messages written by people at a crisis event show different retweet patterns than messages written by those who are away [3]. Other useful features include the frequency of tweets by a tweeter and the location of the original tweeter.

Data and Approach

Professor Kate Starbird has large crisis-related Twitter datasets from events such as hurricanes, tornadoes, and forest fires. Manually curating such large sets of crisis data is complex and time consuming. Our approach, therefore, is to use well-established data management techniques to import and structure crisis data into a relational database system. Specifically, our analysis is based on a Twitter dataset collected in the few weeks surrounding the May 2011 tornado disaster in Joplin, Missouri. The dataset contains approximately 1,000,000 tweets and requires 200 MB of storage. The tweets in the dataset are a subset of all tweets that contained keywords common to the event, like "tornado" or "Joplin". Twitter limits retroactive access to its data, so the data had to be collected in real time.

The raw tweets are stored in a SQL file that defines a single relation for tweet data and bulk INSERT statements to load the data into that relation. The raw data contains 9 attributes for each tweet: tweet id, text, author screen name, tweet source, time of tweet, GPS latitude, GPS longitude, location, and location url. Next we created a normalized E/R schema to structure and store the tweet data (Figure A). Most of the useful information in the raw tweet data, like hashtags, mentions, and retweets, is embedded in the text of the tweet, so extracting it required a significant amount of parsing. Any word preceded by a single # or @, with whitespace in front, was considered a hashtag or mention, respectively. To parse linked web pages out of the text we filtered for URL patterns with regular expressions. The main convention for retweets is "RT @username:"; any tweet containing this pattern was considered a retweet of the mentioned user. After parsing, the cleaned data was imported into the PostgreSQL database according to the E/R schema.

Figure A: Normalized E/R schema used to store cleaned tweet information in PostgreSQL.
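The parsing conventions described above can be sketched with regular expressions. This is an illustrative sketch, not the actual import pipeline; the function and pattern names are assumptions:

```python
import re

# Patterns matching the conventions described in the text:
# a hashtag/mention is a word preceded by # or @ with whitespace (or start) in front.
HASHTAG_RE = re.compile(r"(?:^|\s)#(\w+)")
MENTION_RE = re.compile(r"(?:^|\s)@(\w+)")
URL_RE = re.compile(r"https?://\S+")
RETWEET_RE = re.compile(r"RT @(\w+):")  # the main retweet convention

def parse_tweet(text):
    """Extract hashtags, mentions, links, and the retweeted user (if any)."""
    rt = RETWEET_RE.search(text)
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        "urls": URL_RE.findall(text),
        "retweet_of": rt.group(1) if rt else None,
    }
```

For example, `parse_tweet("RT @redcross: Help #Joplin http://x.co")` would classify the tweet as a retweet of `redcross` with the hashtag `Joplin` and one linked page.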

After formatting and cleaning the data, we ran queries on the database to learn the overall structure of the data and to look for features that can help determine whether a given tweet or set of tweets is reliable [1]. If the database can answer these queries quickly, it can serve as a model for a system used directly by first responders to filter for relevant Twitter information and improve crisis response.

Results

First we wanted to better understand the tweet frequency of all users in the database. For each user we plotted the number of tweets they sent, with users sorted in increasing order of tweet count (Figure B). This query required a GROUP BY and COUNT over user_id in the tweeted relation. Not surprisingly, the distribution is very heavy-tailed: most users sent only a few tweets, while a small fraction of users sent hundreds of tweets.

Figure B: For a given user (X-axis), the number of tweets that user sent (Y-axis, log scale). Users are sorted in increasing order by the number of tweets they sent.
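The tweets-per-user query behind Figure B can be sketched as follows. SQLite stands in for PostgreSQL here, and the `tweeted(tweet_id, user_id)` relation is a simplified assumption about the schema, not the report's exact table:

```python
import sqlite3

# Toy stand-in for the tweeted relation described in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweeted (tweet_id INTEGER, user_id TEXT)")
conn.executemany("INSERT INTO tweeted VALUES (?, ?)",
                 [(1, "a"), (2, "a"), (3, "b"), (4, "a"), (5, "c")])

# GROUP BY and CONT over user_id, sorted ascending by tweet count.
rows = conn.execute("""
    SELECT user_id, COUNT(*) AS n_tweets
    FROM tweeted
    GROUP BY user_id
    ORDER BY n_tweets ASC
""").fetchall()
```

On this toy data, users b and c each have one tweet and user a has three, so `rows` ends with `("a", 3)`.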

Since retweets act as a proxy for a recommendation, we wanted to see whether some users were being retweeted much more than others. To do this we ran a query that grouped and counted retweets by user_id, and then grouped and counted those count totals. This produced a histogram that bins users by how many times their tweets were retweeted (Figure C). This distribution is also very heavy-tailed: most users were retweeted only a few times, but a few users were retweeted thousands of times.

Figure C: Histogram of the number of retweeted tweets per user. Each point represents the number of users (Y-axis, log scale) who were retweeted a given number of times (X-axis, log scale).

Next we wanted to look at how the Twitter data changed in the days following the disaster. Using a COUNT aggregate, we counted the number of tweets sent in each of the nine days following the disaster. The resulting histogram shows an increase in activity until day 4, after which activity begins to decrease (Figure D). The spike in the initial days after the disaster probably reflects the increase in media activity that followed the tornado. Then, to see whether the types of tweets sent changed in the days after the tornado, we looked at the most common hashtags one, two, and three days after the disaster (Figure E). On the first day many of the hashtags were related to helping victims, like #redcross, #moneeds, and #mohaves,
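The two-level aggregation behind Figure C can be sketched as a nested GROUP BY. Again SQLite stands in for PostgreSQL, and the one-column `retweets(user_id)` relation (one row per time a user was retweeted) is an assumed simplification:

```python
import sqlite3

# Toy data: user a was retweeted 3 times, b twice, c once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE retweets (user_id TEXT)")
conn.executemany("INSERT INTO retweets VALUES (?)",
                 [("a",), ("a",), ("a",), ("b",), ("b",), ("c",)])

# Inner query: retweet count per user.
# Outer query: how many users share each retweet count (the histogram bins).
hist = conn.execute("""
    SELECT n_retweets, COUNT(*) AS n_users
    FROM (SELECT user_id, COUNT(*) AS n_retweets
          FROM retweets GROUP BY user_id)
    GROUP BY n_retweets
    ORDER BY n_retweets
""").fetchall()
# hist == [(1, 1), (2, 1), (3, 1)]: one user each with 1, 2, and 3 retweets
```

The same shape of query, with a date-truncation expression in place of `user_id`, yields the tweets-per-day counts used for Figure D.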

but by the third day the hashtags were much less specific to the Joplin tornado disaster. This may indicate that tweets sent soonest after the disaster are more useful.

Figure D: Histogram of the number of tweets sent in the nine days after the tornado.

Figure E: The twenty most popular hashtags one, two, and three days (left to right) following the tornado disaster. Hashtags in red were used to filter the tweets initially, so they are not informative.

Finally, we wanted to see whether we could find useful location information in the tweets. For all tweets with a known city, we did a GROUP BY and COUNT to see how many tweets came from each city. We then plotted the number of tweets per city in ascending order (Figure G). Most cities produced only a few tweets, but many of the larger cities, like New York City, produced many. Importantly, Joplin, MO, which is a relatively small town, had the third most tweets of any city. This indicates that there were many tweeters on the ground in Joplin.

Figure G: The number of tweets (Y-axis, log scale) from each city (X-axis), in ascending order of number of tweets.

Next, using GPS information, we looked at the distance in miles between each tweet and Joplin. The distance from each tweet to Joplin was approximated by computing the Euclidean distance between the GPS coordinates. We then plotted the distance of each tweet from Joplin in ascending order (Figure H). Since this calculation makes a flat-earth assumption, the most distant tweets have inaccurate distances, but the results work well for tweets within the United States. Not surprisingly, most of the tweets are within a few thousand miles of Joplin and mostly came from the United States.
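A flat-earth distance of this kind can be sketched as below. The report does not specify its exact degrees-to-miles conversion; this sketch uses the standard ~69 miles per degree of latitude and scales longitude by the cosine of Joplin's latitude, both of which are assumptions:

```python
import math

# Approximate coordinates of Joplin, MO (an assumed reference point).
JOPLIN_LAT, JOPLIN_LON = 37.084, -94.513
MILES_PER_DEG_LAT = 69.0  # standard approximation, not from the report

def approx_miles_from_joplin(lat, lon):
    """Flat-earth Euclidean distance in miles from a GPS point to Joplin.

    Longitude is scaled by cos(latitude) so east-west degrees map to the
    right mileage near Joplin; the error grows for very distant points.
    """
    dlat = (lat - JOPLIN_LAT) * MILES_PER_DEG_LAT
    dlon = (lon - JOPLIN_LON) * MILES_PER_DEG_LAT * math.cos(math.radians(JOPLIN_LAT))
    return math.hypot(dlat, dlon)
```

For example, a tweet exactly one degree of latitude north of Joplin comes out at about 69 miles, while the approximation degrades for tweets outside North America, consistent with the flat-earth caveat above.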

Figure H: An estimate of the distance of each tweet from Joplin using GPS information. Tweets are sorted in ascending order by distance from Joplin.

All of the data processing and analysis was performed on an Asus Zenbook laptop with an SSD. The data import, processing, and indexing took around 3 minutes, and each of the queries took no more than 10 seconds.

Conclusions and Future Work

This study shows that there are effective means of finding reliable tweet information. First, the very heavy-tailed distribution of the retweet information shows that there is a small number of prolific, widely retweeted tweeters. Since the amount of retweeting is a form of recommendation strength, these users are likely more reliable than average. The time data suggest some important cautions when collecting tweets. In the day immediately following the disaster, the hashtag information showed tweets that were more on-topic than on later days. Also, because the spike in national interest happens a few days after the disaster, it may be easier to find tweeters who are on the ground in the first day, when there is less tweet volume. Finally, the location data show that many people were tweeting on the ground in Joplin, so these users are good targets as primary sources of information.

There were two primary limitations of this study. First, the scope of the data was limited. We only had access to tweet text information, but there is other useful metadata at the user level, like the number and types of followers. This is especially important because a change in the number of followers during a crisis has been shown to be a useful feature for predicting whether a tweeter has reliable or firsthand information. Second, this study took a static view of the Twitter space. In a live crisis situation, new tweets arrive constantly and need to be processed in real time.

We believe that this database prototype fulfilled the goals of this study. It was able to rapidly process and structure a large number of tweets and to answer queries that are relevant to first responders. Moreover, if the limitations mentioned above are addressed, the speed and usefulness of this system suggest that a structured database paradigm is advantageous in live crisis situations.

Citations and Acknowledgements

I would like to thank Professor Kate Starbird for generously giving me access to the Joplin, MO tornado disaster Twitter data.

[1] K. Starbird, G. Muzny, and L. Palen. Learning from the crowd: collaborative filtering techniques for identifying on-the-ground Twitterers during mass disruptions. In Proc. of the 9th International ISCRAM Conference, Apr. 2012.

[2] L. Palen, K. M. Anderson, G. Mark, J. Martin, D. Sicker, M. Palmer, and D. Grunwald. A vision for technology-mediated support for public participation & assistance in mass emergencies & disasters. In 2010 BCS Conference on Visions of Computer Science, Apr. 2010. Article 8, 12 pages.

[3] K. Starbird and L. Palen. (How) Will the Revolution be Retweeted?: Information Propagation in the 2011 Egyptian Uprising. In Proc. of CSCW 2012.