Spatiotemporal Clustering of Twitter Feeds for Activity Summarization

N. Wayant 1, A. Crooks 2, A. Stefanidis 3, A. Croitoru 3, J. Radzikowski 3, J. Stahl 2, J. Shine 2

1 US Army ERDC Topographic Engineering Center, 7701 Telegraph Road, Alexandria, VA 22315-3802
Email: {Nicole.M.Wayant, Justin.D.Stahl, James.Shine}@usace.army.mil

2 Dept. of Computational Social Science, George Mason University, 4400 University Drive, MS 6C3, Fairfax, VA 22030
Email: acrooks2@gmu.edu

3 Center for Geospatial Intelligence, George Mason University, 4400 University Drive, MS 6C3, Fairfax, VA 22030
Email: {astefani, acroitor, jradziko}@gmu.edu

1. Introduction

Social media have drastically altered the concept of information contribution and dissemination by empowering the general public to publish and distribute user-generated content. These social media contributions may be viewed as expressions of humans acting as sensors, reporting events and activities in which they participate, or commenting on others that affect them or catch their attention. Thematically, the content of such media is diverse, ranging from reports of events like an earthquake (Crooks et al., 2012) to mundane comments, pop culture references, or daily activity reports (Mischaud, 2007). Regardless of the topic, this information always has a temporal component, in the form of its submission time. Social media feeds also often have geolocation information associated with them, available either as precise coordinates or as location descriptions listing, for example, only a city name (Croitoru et al., 2012).

In this paper we focus on the spatiotemporal content of Twitter feeds in order to assess their use as a hybrid form of sensor network for monitoring evolving events. Our objective is to investigate how social media contributions can be utilized to study the spatiotemporal evolution of dynamic sociocultural events.
Towards this goal we use as a representative case the events of the Occupy Wall Street (OWS) movement in New York City, NY, on the International Day of Action of November 17th, 2011 (OccupyWallSt.org, 2012). We use Twitter as a representative social media source for our study. We collected geolocated tweets from that day that made reference to the Occupy Wall Street movement (e.g. through its associated hashtags and usernames, such as #ows) and analyzed them to investigate how well they capture that day's activities.

2. The Events of the Day of Action

On November 17th, 2011, its two-month anniversary, the OWS movement planned a series of organized activities across Manhattan and the five boroughs of New York City, as a demonstration and celebration of its commitment to action. This was not a spontaneous demonstration, but rather a well-organized event with a set schedule, advertised broadly through a wide spectrum of communication avenues, ranging from the blogosphere (OccupyWallSt.org, 2012) to Facebook (Caren and Gaby, 2012) and even posters (Figure 1). The activities were organized around three key planned events, as shown in the event poster (Figure 1) and communicated through social media channels:

7:00am: Shutting down Wall Street
3:00pm: Occupying the subways (with a particular emphasis on Union Square as the site of a mass student strike)
5:00pm: Taking Foley Square, across from New York's City Hall.

These events were to be followed by a march towards Brooklyn Bridge to round out this day of action.

Figure 1. The poster of the OWS movement announcing the planned demonstrations of Thursday November 17th in Manhattan, moving from Wall Street at 7:00am (left tank) to the subway stations of the five boroughs at 3:00pm (middle tank) and Foley Square at 5:00pm (right tank).

3. Data Harvesting and Analysis

Harvesting information from social media feeds generally entails three operations: extracting data from the data providers (various social media servers) via application programming interfaces (APIs); parsing, integrating, and storing these data in a resident database (e.g. implemented using PostgreSQL); and then analyzing these data to extract information of interest. Using a system prototype that we developed to harvest such information (Croitoru et al., 2012) we collected Twitter feeds related to the events of the Day of Action in the days leading up to it and on the day itself. The data were collected through queries to Twitter's API, and in these queries we used the three hashtag terms that were widely adopted by the community and were most relevant to the event: #occupywallst, #ows, #occupywallstreet. The emergence and adoption of keywords and terms to refer to events in social media is a complex process that relates to the dynamics of social interactions within this community (see e.g. Kwak et al., 2010). Using these keywords we were able to collect a random worldwide sample of geolocated tweets over a period of two days (November 16th and 17th). Figure 2 shows the global
distribution of these tweets, with the highest concentration in the US and Western Europe, but spreading as far as Brazil, the Arabian Peninsula, and Australia. From among these tweets we selected a sample of 1,300 precisely geolocated tweets within New York City for our analysis.

Figure 2. A map of the geolocated tweets sample with references to the Occupy Wall Street movement on November 16th and 17th.

The event we are addressing in this paper is particularly suitable for our analysis for two reasons:

Demographics: The average age of Twitter users is 39.1 years, with approximately 63% of its user constituency being younger than 44 years (Pingdom.com, 2010). This is a good match to the average age of the OWS protesters, which, at 33, appears to be slightly higher than one would expect (Panagopoulos, 2011). It appears that for every college student in the crowd there was also a mid-career professional in their 40s participating in the event (Goodale, 2011).

Location: The stage for this protest, New York City, is one of the top cities worldwide in Twitter use (Java et al., 2007), ranked 4th behind London, Los Angeles, and Chicago (Grader.com, 2012).

Accordingly we argue that in this particular situation we have a good match between the demographics of the medium (Twitter users) and the demographics of the event, and we also have a sufficiently large volume of information to support our analysis.

The geolocated tweets referring to OWS from within New York City are shown in Figure 3. In order to derive a spatiotemporal summarization of the event they communicate we analyzed them to identify spatiotemporal clusters within them. A variety of techniques exist for the spatiotemporal clustering of data streams (see Cao et al., 2006). In our case, as the number
of data points was rather small (approx. 1,300) we opted for a two-step process: an initial spatial clustering of the aggregate group of points using DBSCAN (Ester et al., 1996), followed by temporal segmentation and re-clustering of the spatial clusters in order to derive the final spatiotemporal clusters.

Figure 3. Geolocated OWS tweets originating from Manhattan on November 17th, 2011. Each dot corresponds to the originating location of a geolocated tweet that contained in its body references to the selected OWS-related hashtags.

Figure 4. Spatially clustered geolocated OWS tweets are marked by different dot colors. Spatiotemporal clusters are delineated by colored polygons.

In Figure 4 we show the results of both the spatial and temporal clustering processes. A total of 19 spatial clusters were identified (corresponding to the different colors of the dots in
this figure) through DBSCAN. Subsequently, these clusters were reconfigured into 5 spatiotemporal clusters (annotated 1-5), which correspond very well to the planned schedule of activities presented in Section 2 above: cluster 1 captures the morning events (shutting down Wall Street), cluster 2 captures the "occupy the subways" portion, cluster 3 is centered around Foley Square, while clusters 4 and 5 show the crossing of Brooklyn Bridge and the landing in the other borough.

Figure 4 illustrates two important observations regarding the use of social media feeds to capture the spatiotemporal evolution of activities as they unfold. Firstly, we observe that Twitter is being used to provide real-time in-situ reports from the event. In this particular situation we see that people (either protesters or bystanders) are using their cell phones or other mobile devices to tweet during the march. Secondly, we observe that by harvesting this information we get an excellent overview of the activities on the ground, without deploying any local sensors, and we can derive successful summarizations of the individual parts of a composite event. With locals acting as sensors and providing steady feeds in the form of tweets we can remotely gain valuable situational awareness.

4. Outlook

While social media lack the homogeneity and standards of authoritative sources of data, they often capture emerging dynamic events and situations better than official sources. This was demonstrated quite vividly during the Arab Spring events across North Africa and the Middle East in early 2011. This paper has demonstrated how, by using one such type of social media, namely Twitter, and without advance knowledge of the events, we can identify evolving patterns of human activity linked directly to place.
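To make the two-step idea of Section 3 concrete, the following is a minimal illustrative sketch of spatial clustering with DBSCAN followed by a simple temporal split of each spatial cluster. It is not the implementation used in this study. The parameter names (eps, min_pts, max_gap), the rule of splitting a cluster wherever consecutive tweets are more than max_gap seconds apart, and the toy data (planar coordinates in meters, times in seconds since midnight) are all assumptions made for illustration. How the temporal segments are regrouped into the final spatiotemporal clusters is not specified above and is left out of the sketch.

```python
from math import hypot

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Assign each (x, y) point a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbrs)
        cluster += 1
    return labels


def split_by_time(tweets, labels, max_gap):
    """Split every spatial cluster at time gaps larger than max_gap.

    The gap rule is a hypothetical choice for this sketch.
    """
    by_cluster = {}
    for (x, y, t), lab in zip(tweets, labels):
        if lab != NOISE:
            by_cluster.setdefault(lab, []).append((t, x, y))
    segments = []
    for lab in sorted(by_cluster):
        members = sorted(by_cluster[lab])  # chronological order
        current = [members[0]]
        for prev, cur in zip(members, members[1:]):
            if cur[0] - prev[0] > max_gap:
                segments.append(current)
                current = []
            current.append(cur)
        segments.append(current)
    return segments


# Toy data (assumed): (x in m, y in m, time in s since midnight).
tweets = [
    (0, 0, 25200), (10, 5, 25500), (5, 12, 25800),   # place A, morning
    (2, 8, 61200), (8, 3, 61400),                    # place A, evening
    (1000, 1000, 54000), (1010, 1005, 54100), (1005, 995, 54400),  # place B
    (5000, 5000, 43200),                             # isolated tweet
]
labels = dbscan([(x, y) for x, y, _ in tweets], eps=50, min_pts=3)
segments = split_by_time(tweets, labels, max_gap=3600)
print([len(s) for s in segments])
```

In this toy example the morning and evening tweets at the same place fall into one spatial cluster and are only separated by the temporal step, which is the reason for clustering in space first and segmenting in time afterwards. Real coordinates would first need to be projected to a planar system, or the Euclidean distance replaced with a geodesic one.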
Through the analysis of individual tweets we can identify clusters of activity related to a specific event, as we demonstrated using the OWS International Day of Action of November 17th, 2011. Through spatiotemporal clustering of social media feeds we can derive an activity summarization that closely matches the planned (and actual) events of the OWS organizers in New York City as shown in Figure 1. We would therefore argue that using people as a distributed sensor system through mobile social media platforms can provide us with a new lens to study the manifestations and complexities of human activity. This opens up a wide range of future research applications for exploring issues relating to human geography. Through advances in computing and software architectures, one could imagine carrying out such analysis in real time for anywhere around the world, thus providing the ability to monitor events as they unfold in space and time.

Acknowledgements

We would like to thank the US Army Engineer Research and Development Center, Alexandria VA, for their support of this research.

References

Cao F, Ester M, Qian W and Zhou A, 2006, Density-based clustering over an evolving data stream with noise. In: Ghosh J, Lambert D, Skillicorn D and Srivastava J (eds), Proceedings of the 6th SIAM International Conference on Data Mining, USA, 328-339.

Caren N and Gaby S, 2012, Occupy online: Facebook and the spread of Occupy Wall Street (October 24, 2011). Social Science Research Network, http://dx.doi.org/10.2139/ssrn.1943168. Accessed on 26th April, 2012.

Croitoru A, Stefanidis A, Radzikowski J, Crooks A, Stahl J and Wayant N, 2012, Towards a collaborative geosocial analysis workbench. In: Proceedings COM.Geo, Washington, DC (in press).

Crooks A, Croitoru A, Stefanidis A and Radzikowski J, 2012, #Earthquake: Twitter as a distributed sensor system. Transactions in GIS (in press).

Ester M, Kriegel H-P, Sander J and Xu X, 1996, A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J and Fayyad U (eds), Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, 226-231.

Goodale G, 2011, Who is Occupy Wall Street? After six weeks, a profile finally emerges. Christian Science Monitor. Available at http://www.csmonitor.com/usa/politics/2011/1101/who-is-occupy-wall-street-After-six-weeks-a-profile-finally-emerges. Accessed on 26th April, 2012.

Grader.com, 2012, Top twitter cities. Available at http://tweet.grader.com/top/cities.

Java A, Song X, Finin T and Tseng B, 2007, Why we twitter: Understanding microblogging usage and communities. In: Proceedings WEBKDD/SNA-KDD '07, San Jose, CA, 56-65.

Kwak H, Lee C, Park H and Moon S, 2010, What is Twitter, a social network or a news media? In: Proceedings WWW '10, Raleigh, NC, 591-600.

Mischaud E, 2007, Twitter: expressions of the whole self. MEDIA@LSE Electronic Dissertation Series, London School of Economics and Political Science, UK. Available at http://www.mendeley.com/research/twitterexpressions-whole-self-8/

OccupyWallSt.org, 2012, November 17th Day of Action. Available at http://occupywallst.org/action/november-17th/. Accessed on 26th April, 2012.

Panagopoulos C, 2011, Occupy Wall Street Survey Results October 2011. Available at http://www.fordham.edu/images/academics/graduate_schools/gsas/elections_and_campaign_/occupy wall street survey results 102611.pdf. Accessed on 26th April, 2012.

Pingdom.com, 2010, Study: Ages of Social Network Users. Available at http://royal.pingdom.com/2010/02/16/study-ages-of-social-network-users/. Accessed on 26th April, 2012.