How To Analyse The Diffusion Patterns Of A Lexical Innovation In Twitter

GOOD MORNING TWEETHEARTS!: THE DIFFUSION OF A LEXICAL INNOVATION IN TWITTER

REBECCA MAYBAUM (University of Haifa)

Abstract

The paper analyses the diffusion patterns of a community-specific lexical innovation, tweethearts, in the online social network platform Twitter. The data for the study consist of all Twitter posts containing the innovation from its first appearance in February 2008 through February 2012. Analysis of the frequency data showed that the adoption of tweethearts by community members accelerated with respect to time. Comparison with a general English-language Twitter corpus (Zappavigna 2012) revealed a higher proportion of internet- and Twitter-related words in the tweethearts corpus. Monthly word frequency analysis of the tweethearts corpus showed changes over time, including a proportional increase in #hashtag frequency and a decrease in URL frequency. Advantages of using temporal data in combination with corpus methods for text analysis are illustrated using examples from the tweethearts corpus, and methodological implications are discussed.

1. Introduction

The paper investigates the mechanisms by which an innovative lexical item diffuses through a community, and analyses the changes in the use and meaning of the innovation that accompany its diffusion. The data were collected from the online social networking platform Twitter, and consist of all posts containing the community-specific slang word tweethearts from February 2008 to February 2012. The corpus contains 44,556 total posts (650,927 words), published by a total of 26,371 individual Twitter users.
The goals of the study are: (1) to analyse the diffusion pattern and functional/semantic changes of a new variant as it is adopted and spread throughout a community, in particular through the use of time-linked data, and (2) to illustrate some of the advantages of taking an interdisciplinary approach to the study of language change by integrating methods from nonlinguistic innovation diffusion research. The paper is organised as follows: Section 2 gives an overview of relevant concepts from innovation diffusion research, and places the study in context relative to language change research. Section 3 describes the quantitative frequency analysis and the time-linked corpus analysis carried out on the data. Section 4 discusses how the time-linked data can complement typical corpus methods by enriching contextual interpretation of the results and giving direction for closer examination of specific points of interest within the data.

2. Background

In the mid-20th century an interdisciplinary body of research began to develop that was interested in the way in which innovations (be they technological, agricultural, medical, political, ideological, or other) spread throughout a population. One of the major findings of this new field was that the diffusion patterns of many different types of innovations shared a characteristic shape when graphed with respect to time. This shape was classically described by Rogers (1962) as the S-curve of diffusion.

Newcastle Working Papers in Linguistics 19.1, 2013. Selected Papers from Sociolinguistics Summer School 4. Roberts, N. S. and Childs, C. (eds.)

Rogers also classified five different adopter categories within the population, depending on how early or late a person adopts the innovation relative to the rest of the population. He called these categories: innovators, early adopters, early majority, late majority, and laggards. The adopter categories and S-curve pattern typical of innovation diffusion are illustrated in Figure 1 (http://en.wikipedia.org/wiki/file:diffusionofideas.png).

Figure 1: Typical S-curve innovation diffusion pattern (axis labels: Population; Adoption over time)

The x-axis shows the number of adopters, and the y-axis shows the percentage of the total population. The yellow curve represents cumulative adoption, while the blue curve represents the number of new adopters at each point in time. The S-curve refers to the shape of the cumulative measure; the number of new adopters per point in time typically has a normal distribution.

While the analysis of the diffusion patterns is based on research traditions of nonlinguistic innovation diffusion, the results are interpreted in the framework of Third Wave variationist sociolinguistics, as described in Eckert (2012). The Third Wave perspective adds to the classic theoretical framework of sociolinguistics, shifting from the idea that social categories are existing structures to a view emphasising the agentive role of the speaker himself in constructing social meaning and social categories.

Previous studies investigating the way in which new variants appear and spread through a population have used empirical data from a relatively small number of speakers. For example, Milroy and Milroy (1985) recorded speech data from 48 speakers, and Eckert's (2000) study included 200 interviews, of which 69 were used to make up the Belten High corpus.
While these sample sizes are quite robust for the in-depth ethnographic methods employed, statistical analysis of the overall diffusion patterns requires numbers on a larger scale. On the other end of the spectrum, some linguists have begun using computer models to simulate large-scale diffusion of linguistic innovations. Nettle (1999) used computer simulations based on Social Impact Theory to test the parameters that influence the diffusion of linguistic changes. Building on Nettle's work, Ke et al. (2008) modelled the diffusion rates of a minority variant in four different social network structure environments, and found that the structure of the social network did, in fact, affect the rate of diffusion of the variant. These computer simulations allow researchers to investigate the macro-level mechanisms at work in instances of language change within a community.

The current study is based on a dataset that is both large in scale and empirical in nature: all Twitter posts containing tweethearts. While the size of the corpus (and the relative anonymity of the speakers/users) does not permit the level of detailed ethnographic analysis carried out in the studies mentioned above, the large number of individual Twitter users represented in the corpus opens the door for a greater understanding of the mechanisms driving language change at the macro, community-wide level.

3. Analysis

The analysis utilises some basic conventions and terminology used in the Twitter online community. Twitter users can publish posts, called tweets, of up to 140 characters each, which are broadcast to their subscribers. Users can follow other Twitter accounts (subscribe to receive their tweets), and/or be followed by them (the relationship is not necessarily reciprocal). In addition, conventions have developed whereby a user can retweet another user's post, similar to forwarding, or mention another user, linking that user's username to the tweet. Users can also include URLs within the text of the tweet. These communication practices influence the way in which innovative language spreads through the community.

The analysis of the tweethearts data consists of three parts: (1) analysis of the post frequency and diffusion pattern over time, (2) comparison of the tweethearts corpus with a general Twitter corpus, and (3) combination of time data and corpus analysis results to analyse changes over time.

3.1. Diffusion patterns

Each individual tweet is linked to a Unix timestamp, measured to the second, at which the post was published. The first tweet of each unique user in the community was identified, and the timestamp of the user's first post counted as the time of adoption for that user. The data were then divided into 30-day bins, and the number of new users per 30 days was calculated.
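The adoption-time procedure described in Section 3.1 can be illustrated with a minimal sketch. It is not the script used in the study: the function name `adoption_bins`, the input format (pairs of user name and Unix timestamp), and the choice to start the first bin at the earliest adoption in the data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import accumulate

BIN_SECONDS = 30 * 24 * 60 * 60  # one 30-day bin, expressed in seconds


def adoption_bins(tweets):
    """Return (new_per_bin, cumulative) from (user, unix_timestamp) pairs.

    Hypothetical helper: a user's time of adoption is taken to be the
    timestamp of that user's earliest post containing the innovation.
    """
    first_post = {}
    for user, ts in tweets:
        if user not in first_post or ts < first_post[user]:
            first_post[user] = ts
    if not first_post:
        return [], []

    # Assumption: bin 0 starts at the earliest adoption in the data.
    start = min(first_post.values())
    counts = Counter((ts - start) // BIN_SECONDS for ts in first_post.values())

    # New adopters per 30-day bin, including empty bins, and the running total.
    new_per_bin = [counts.get(i, 0) for i in range(max(counts) + 1)]
    cumulative = list(accumulate(new_per_bin))
    return new_per_bin, cumulative
```

The two returned lists correspond to the two views of the data used in this section: new adopters per 30-day period, and the cumulative number of adopters.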
Figure 2a shows the cumulative number of new tweethearts users over time. Figure 2b shows the number of new tweethearts users in each 30-day period.

Figure 2a: Adoption of tweethearts by new users, cumulative (axis labels: Population; Adoption over time)

Figure 2b: Adoption of tweethearts by new users, per 30-day period (axis labels: Population; Adoption over time)

In both figures, the x-axis shows time (month and year) and the y-axis shows the number of users. The graphs show clearly that the adoption of tweethearts by Twitter users increased over time. Figure 2a is marked to indicate sudden increases in the overall diffusion rate at specific points in time. The first tweethearts posts appeared in early 2008. There is a slight increase in slope starting in February 2009, another increase around January 2010, and again in July 2011. The diffusion pattern shares characteristics of the classic S-curve shape (Rogers 1962), though further data collection will be required over the next several years to find out whether the acceleration will indeed level off to complete the S-curve pattern predicted by the innovation diffusion literature.

3.2. Corpus word list comparison

Zappavigna's (2012) HERMES general English-language Twitter corpus is used as a point of comparison to the tweethearts corpus. HERMES contains approximately 7 million Twitter posts and 100 million words. The corpus analysis of the tweethearts data was carried out using AntConc concordance software (Anthony 2011). Table 1 shows the top 20 words by frequency of the tweethearts corpus (left column) and the HERMES corpus (right column). Frequency for each word is shown as the raw number of appearances as well as a percentage of total words in the corpus. For the purposes of the corpus analysis and comparison, specific usernames, hashtag topic markers, and HTML links were counted generically as @mentions, #hashtags, and urls respectively.
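The generic counting of usernames, hashtags, and links can be sketched as follows. This is an illustrative approximation only: the study's word lists were produced with AntConc, whose tokenisation may differ, and the names `normalise` and `word_frequencies`, the URL pattern, and the punctuation handling are assumptions. No lemmatisation is attempted here.

```python
import re
from collections import Counter

# Assumed pattern for links; real tweets may need a broader expression.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def normalise(token):
    """Map a raw whitespace-delimited token to its generic form."""
    token = token.lower()
    if URL_RE.match(token):
        return "url"
    if token.startswith("@") and len(token) > 1:
        return "@mention"
    if token.startswith("#") and len(token) > 1:
        return "#hashtag"
    # Ordinary words: strip surrounding punctuation only.
    return token.strip(".,!?:;\"'()")


def word_frequencies(posts):
    """Return {word: (raw count, percentage of all tokens)} for a list of posts."""
    counts = Counter()
    for post in posts:
        for raw in post.split():
            tok = normalise(raw)
            if tok:  # skip tokens that were punctuation only
                counts[tok] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: (n, 100.0 * n / total) for w, n in counts.items()}
```

Each entry pairs the raw count with the percentage of all tokens, which is the same two-part presentation used for the word lists in this section.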

Table 1: Tweethearts and HERMES top 20 word lists

N   Tweethearts corpus            HERMES Twitter corpus
    Word        Count    %        Word      Count      %
1   @mention    45,175   6.98%    @mention  4,037,829  4.04%
2   tweetheart  40,827   6.31%    the       3,358,659  3.15%
3   url         28,112   4.34%    to        2,379,223  2.23%
4   #hashtag    14,984   2.32%    i         2,236,470  2.10%
5   rt          13,523   2.09%    a         1,674,654  1.57%
6   you         13,432   2.08%    url       1,631,187  1.53%
7   be          13,005   2.01%    and       1,545,943  1.45%
8   to          11,827   1.83%    #hashtag  1,253,853  1.25%
9   my          11,709   1.81%    of        1,217,398  1.14%
10  i           11,208   1.73%    you       1,194,631  1.12%
11  a           10,578   1.63%    is        1,120,058  1.05%
12  the         10,483   1.62%    in        1,118,227  1.05%
13  for         7,683    1.19%    rt        990,287    0.93%
14  love        6,160    0.95%    on        906,996    0.85%
15  good        6,089    0.94%    for       891,100    0.84%
16  and         5,964    0.92%    that      871,925    0.82%
17  morning     5,779    0.89%    my        858,657    0.81%
18  all         5,446    0.84%    it        853,209    0.80%
19  of          5,345    0.83%    this      678,193    0.64%
20  have        5,036    0.78%    me        676,702    0.63%

The rows highlighted in red in Table 1 indicate words that are unusually frequent in the tweethearts corpus. That is, they appear in the tweethearts top 20 word list, but not in the HERMES list. It is not surprising that the word tweetheart itself is much more frequent in the tweethearts corpus; by design the word appears at least once per Twitter post. In addition to tweetheart, unusually frequent words include love, good, morning, all, and have.

The blue highlighted rows in Table 1 indicate internet- or Twitter-related words in the corpora. The frequency of this group of words is significantly higher in the tweethearts corpus than in the HERMES corpus. Even though @mention is the most frequent item in both corpora, it accounts for a much higher percentage of the tweethearts corpus (approximately 7%) than the HERMES corpus (approximately 4%).
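The difference in the internet- and Twitter-related items can be made concrete with a small calculation on the percentages reported in Table 1. The sketch below is only an arithmetic illustration; the names `TABLE_1`, `share_ratios`, and `combined_share` are hypothetical, and no statistical test is implied.

```python
# Percentages copied from Table 1: (tweethearts corpus %, HERMES corpus %).
TABLE_1 = {
    "@mention": (6.98, 4.04),
    "url": (4.34, 1.53),
    "#hashtag": (2.32, 1.25),
    "rt": (2.09, 0.93),
}


def share_ratios(table):
    """Ratio of the tweethearts share to the HERMES share for each item."""
    return {item: round(tw / hermes, 2) for item, (tw, hermes) in table.items()}


def combined_share(table):
    """Summed percentage of the listed items in each corpus."""
    tw_total = round(sum(tw for tw, _ in table.values()), 2)
    hermes_total = round(sum(h for _, h in table.values()), 2)
    return tw_total, hermes_total
```

A ratio above 1 means the item takes up a larger share of the tweethearts corpus than of HERMES; the combined share gives the same comparison for the four items taken together.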
This clustering of internet- and Twitter-related words at the top of the tweethearts word list would seem to indicate that in posts containing tweethearts, users have a higher degree of awareness of and explicit reference to their medium of communication.

3.3. Corpus results over time

Using the timestamp data attached to each Twitter post in the corpus, text files were generated for each month throughout the period represented in the dataset. Each text file was analysed using AntConc, and individual word lists were generated for each month. These outputs were combined to create a table showing percentages of all search terms for each month through February 2012. The trajectories of individual search terms in the tweethearts corpus could then be graphed individually with respect to time.

The comparative corpus results presented in Section 3.2 serve as a guide for which keywords to graph using the monthly corpus results. Figures 3a–3d show the corpus results over time for the internet- and Twitter-related words highlighted in Table 1. The x-axes in Figures 3a–3d range from January 2007 through February 2012. The y-axes show the frequency of each word as a percentage of the total word count of the corpus.

Figure 3a: Frequency of URLs in tweethearts corpus over time (as percentage of total word count)

Figure 3b: Frequency of #hashtags in tweethearts corpus over time (as percentage of total word count)
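The monthly word-list step described in Section 3.3 can be approximated in a single pass, as in the sketch below. It is an assumption-laden illustration rather than the workflow used in the study, which generated one text file per month and analysed each in AntConc. The function name `monthly_percentages`, the input format (Unix timestamp plus a list of already normalised tokens), and the use of UTC calendar months are all assumptions.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def monthly_percentages(posts, terms):
    """Return {"YYYY-MM": {term: percentage of that month's tokens}}.

    posts: iterable of (unix_timestamp, list_of_tokens) pairs.
    terms: the search terms to track, e.g. ["url", "#hashtag"].
    """
    counts = defaultdict(Counter)
    for ts, tokens in posts:
        # Assumption: months are calendar months in UTC.
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        counts[month].update(tokens)

    table = {}
    for month in sorted(counts):
        total = sum(counts[month].values())
        if total == 0:
            continue  # no tokens recorded for this month
        table[month] = {t: 100.0 * counts[month][t] / total for t in terms}
    return table
```

Each row of the resulting table holds one month's percentages, so the trajectory of any single term can be read off column-wise and graphed against time.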

Figure 3c: Frequency of retweets in tweethearts corpus over time (as percentage of total word count)

Figure 3d: Frequency of @mentions in tweethearts corpus over time (as percentage of total word count)

The URL graph (Figure 3a) shows a clear decrease in the frequency of hyperlinks over time, while the #hashtag graph (Figure 3b) shows the opposite trend. The retweet (Figure 3c) and @mention (Figure 3d) graphs are less clear in terms of an overall trend, but both show increased frequencies around January 2010, and the retweet results show a clear decrease around June 2011. Both of these dates coincide with the slope changes marked in Figure 2a, indicating accelerated adoption rates.

In addition to showing overall trends, the monthly word list frequency graphs also provide a starting point for closer text analysis at specific time periods, connected with specific keywords. For example, Figure 4 shows the changing frequency of love, one of the unusually frequent words in the tweethearts corpus as compared with the HERMES corpus.

Figure 4: Frequency of love in tweethearts corpus over time (as percentage of total word count)

The most immediately noticeable feature in Figure 4 is the sharp increase in the frequency of love in June, July, and December 2011. A concordance analysis in AntConc reveals that the 3-word cluster love love love was a popularly posted and reposted phrase during those months. It turns out that love love love is the catchphrase of Christian Beadles, known as a friend of Justin Bieber, and a minor internet celebrity in his own right with a large Twitter following. His use of the love love love tagline in June, July, and December 2011, which was heavily retweeted by his Twitter followers, explains the sharp increase at those times. Another celebrity Twitter user, Ariana Grande, seems to be partially responsible for the increase in the months preceding the major Christian Beadles spikes, due to the retweeting of her posts containing love and tweethearts by her Twitter followers. It would appear that celebrity posts are highly influential in the tweethearts corpus, as seen here in the love data.

Another pattern evident in Figure 4 is the increase of love frequency during the month of February each year. This pattern is most clear in 2009, 2010, and 2012; an increase during February 2011 is not as clear, since it was obscured by the celebrity posts and reposts discussed earlier.
The explanation for these increases turns out to be no surprise: Valentine's Day greetings to one's tweethearts often include the word love, which can be seen in the increased frequency each year during the month of February. A widely retweeted Christian Beadles love love love Valentine's Day post in February 2012 boosted that year's love frequency to an even higher level than in previous years.
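The kind of cluster search that surfaced love love love can be approximated by counting n-word windows that contain a keyword. The sketch below is not AntConc's implementation; the function name `top_clusters` and the assumption that posts are already tokenised are introduced here for illustration only.

```python
from collections import Counter


def top_clusters(posts, keyword, n=3, limit=5):
    """Return the most frequent n-word clusters containing `keyword`.

    posts: iterable of token lists (one list per post).
    """
    clusters = Counter()
    for tokens in posts:
        # Slide an n-token window across the post.
        for i in range(len(tokens) - n + 1):
            window = tuple(tokens[i:i + n])
            if keyword in window:
                clusters[" ".join(window)] += 1
    return clusters.most_common(limit)
```

Run separately on each month's posts, a count of this kind would show whether a spike in a single word's frequency is driven by one repeated phrase.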

4. Discussion

The study aims to gain insight into the mechanisms of language change by analysing the overall diffusion pattern of tweethearts through the Twitter community, as well as the evolving use and meaning of the word over time. The diffusion pattern can be cautiously compared to the archetypal S-curve of innovation diffusion. The tweethearts cumulative adoption curve in Figure 2a shares characteristics with the prototypical curve depicted in Figure 1, but thus far the tweethearts data match only the first half of the predicted pattern. It seems likely that the spread of tweethearts would eventually slow and level off, forming the second half of the S-curve, but for the complicating factor of the accelerating growth of the Twitter population itself. The S-curve model assumes a stable population, while in fact the population in Twitter is rising exponentially at the same time that tweethearts is spreading. As long as the overall population grows, there will continue to be new potential adopters, and the S-curve will not level off. Once the overall Twitter population itself stabilises, however, the number of tweethearts users can also be expected to stabilise.

The corpus analysis comparing the most frequent words in the tweethearts and HERMES corpora reveals a few interesting points. Both HERMES and the tweethearts corpus are specific to Twitter, so it is not surprising to find a high proportion of internet- and Twitter-related terms in both corpora. However, the increased predominance of these terms in the tweethearts corpus requires further examination. A plausible explanation may be that Twitter users who employ tweethearts in their posts identify more strongly with being Twitter and online insiders, and therefore tend to publish more posts using Twitter-specific conventions.

The corpus comparison with HERMES also highlighted unusually frequent words in the tweethearts corpus that were not related to Twitter itself or to the internet.
Of these words (love, good, morning, all, and have), only love has a semantic connection to tweethearts. The others are part of phrases commonly used with tweethearts:

(1) Happy Valentines Day to all my Tweethearts!!
(2) Have a fab evening tweethearts! See you tomorrow! url
(3) Good morning Tweethearts! Hope your day goes well!

These phrases are likely representative of the typical functions of the Twitter people lemma as a whole, rather than tweethearts in particular. This can be assessed in future research comparing the tweethearts results with those of tweeps, tweeties, tweople, etc.

The combination of accurate and detailed time data linked with text has great potential for uncovering patterns and exploring the way in which new linguistic phenomena behave. The examples reviewed here serve as a starting point for discovering the ways to take advantage of the unique properties of the dataset. General trends, as seen with the url and hashtag graphs, as well as specific stories, such as the explanation for love's behaviour, can be explored in new ways using large-scale quantitative analysis along with complementary qualitative analysis.

5. Conclusion

The study presented here represents the beginning stages of a promising research trajectory. The paper has successfully shown some of the advantages of combining methods from diverse fields and technologies that open new avenues of data collection and analysis. The introduction of perspectives from outside linguistics has the potential to give new insight into the mechanisms and processes at work that drive the phenomenon of language change.

The particular dataset used in the study has some powerful advantages that give it great potential for future analysis. It is unusual, and possibly unique, to have a corpus of real-world language data that is both large enough to analyse using statistical methods, and also, crucially, includes fine-grained time data accurate to the second for each individual utterance in the corpus. Time-linked language data on such a large scale give us a rare opportunity to study the full life cycle of a variant from its initial appearance through its successful (or unsuccessful) diffusion and integration into a community's linguistic repertoire. Tweethearts has indeed become adopted by a large number of Twitter users, perhaps disproportionately by celebrity users.

Future plans include analysis of other variants in the Twitter people lemma: tweeps, tweeties, tweeple, tweebs, tweople, tweetheads, twerps, twitterbugs, and twittertwatters, and comparison of these results. Other interesting avenues for future research include an analysis of user categories, i.e. differences in discourse practices among celebrities, laypeople, corporations, news organisations, etc. It may also be fruitful to analyse the changes over time at the individual level, and compare with the changes at the level of the community. Finally, mapping the social network structure of the Twitter community will allow us to see whether differences in network structure are correlated with observed patterns of language change.

References

Anthony, L. (2011). AntConc (Version 3.2.2) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.antlab.sci.waseda.ac.jp/
Eckert, P. (2000). Linguistic Variation as Social Practice: The Linguistic Construction of Identity in Belten High. Oxford: Wiley-Blackwell.
Eckert, P. (2012). Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology 41, 87–100.
Ke, J., Gong, T., & Wang, W. S-Y. (2008). Language change and social networks. Communications in Computational Physics 3(4), 935–949.
Milroy, J., & Milroy, L. (1985). Linguistic change, social network and speaker innovation. Journal of Linguistics 21(2), 339–384.
Nettle, D. (1999). Using social impact theory to simulate language change. Lingua 108, 95–117.
Rogers, E. M. (1962). Diffusion of Innovations. New York: Free Press.
Zappavigna, M. (2012). Discourse of Twitter and Social Media. London: Continuum.

Rebecca Maybaum
Department of English
University of Haifa
Israel
beccamaybaum@gmail.com