[IEEE 2013 IEEE 2nd Network Science Workshop (NSW) - West Point, NY, USA (2013.04.29-2013.05.1)]...




Initial Indicators of Topic Success in Twitter

Using topology entropy to predict the success of Twitter hashtags

Max Planck, Isis Lyman Pollard, Charles Brock, Alex George

Institute for Complex Additive Systems Analysis New Mexico Institute of Mining and Technology

Socorro, NM, USA {mplanck, ipollard, cbrock, ageorge} @icasa.nmt.edu

Abstract—The recent and dramatic increase in social media use by the general population across the globe has proven to be a valuable resource for understanding social dynamics. In this paper we focus on metrics that provide early indicators of the eventual impact of events, and attempt to show correlations between these early indicators and real-world events. Specifically, a measure of early-stage diffusion between social network communities is examined as a predictor of the eventual effect of a given meme. Online social media dynamics are examined in Twitter, where we tracked hashtags related to the 2012 US elections and the now-infamous campaign by the European Commission to encourage female interest in science. Community Entropy, a measure of topological information spread, is used, and we introduce a new metric, Community Entropy Ratio, to further extend the idea. Community Entropy Ratio seems to allow direct comparison across different graph topologies and shows encouraging potential for predicting the eventual persistence of a Twitter hashtag.

Keywords—social media, graph communities, information diffusion, community topology, Twitter

I. INTRODUCTION

Social media and general online access are increasingly relevant to the flow of information and opinion in the modern world, from people tweeting about earthquakes in real time [1] to prediction of regional flu outbreak peaks based on social media dynamics [2]. This communication medium has even been pivotal to large disruptive socioeconomic events such as the Arab Spring [3,4]. It is therefore useful and relevant to provide predictive metrics for the real-world impact of events through the lens of social media. In this paper we use Twitter to track the social media signatures of two events: the 2012 United States Presidential debates and election; and “Science: It’s a Girl Thing,” the European Commission’s campaign to attract women to science, posted on 22 June 2012 [5]. The campaign video was received poorly and viewed as sexist and superficial. A significant backlash occurred, including the counter-campaign tag #realwomeninscience.

Twitter is a "microblogging" service that allows users to provide information to each other in the form of short text messages. Each message is a "tweet." Tweets about a particular idea are commonly marked with a "hashtag" consisting of the '#' character followed by the tag content [6].

This paper builds upon previous work in capturing early dynamics of social events and providing early indicators of event impact [7]. This involves showing a correlation between a measurement of community entropy and the real-world impact of social media events. We extend this work into Twitter, a social media mechanism with very rapid dynamics. We thus provide more evidence for the use of topology entropy as an early indicator, and investigate its applicability across social systems. High event impact is determined by either intrinsic social media metrics or real-world measurable effect, as appropriate.

Topology/community entropy is a metric that measures the dispersion of a concept or meme across social network communities (densely clustered groups). Many social systems exhibit highly clustered topology [8], and information tends to spread more effectively in clustered networks, suggesting that this behavior could represent a signature of the optimum efficiency of human social interaction [9]. In this context, topology entropy is the proportion of such communities to which the idea has spread, and represents the dispersion of the meme or the topological diversity of those propagating the meme [7].

We use two in-house developed software packages: JCactus, an online content collector; and JNetDiscovery, a data analysis and processing environment [10].

A follower topology graph contains Twitter user nodes and follower/followee edges. A hashtag time series is the number of tweets with a given tag, binned over time.
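As an illustration of the hashtag time series just defined, the following minimal sketch bins tweet timestamps into fixed-width intervals. The function name, the choice of bin width, and the sample timestamps are assumptions made for this example; they are not taken from JCactus or JNetDiscovery.

```python
from collections import Counter
from datetime import datetime, timedelta


def hashtag_time_series(timestamps, start, bin_width):
    """Count tweets per time bin; index 0 is the bin that begins at `start`."""
    counts = Counter()
    for ts in timestamps:
        if ts < start:
            continue  # ignore tweets that precede the data window
        # timedelta // timedelta yields the integer bin index
        counts[(ts - start) // bin_width] += 1
    n_bins = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_bins)]


# Made-up timestamps for one tag, binned by the hour.
start = datetime(2012, 10, 16, 0, 0)
tweets = [start + timedelta(minutes=m) for m in (5, 20, 61, 190)]
series = hashtag_time_series(tweets, start, timedelta(hours=1))
print(series)  # [2, 1, 0, 1]
```

Empty bins are kept as zeros so that volume and entropy series computed over the same window stay aligned.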

For analysis of the spread of information on follower graphs, the following algorithms are used:

Modularity Communities, as defined in [11], identifies topological community structure in a graph.
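The algorithm of [11] searches for a partition that maximizes modularity. The sketch below does not implement that search; it only evaluates the modularity Q of a given partition, to make concrete the quantity being optimized. It assumes an undirected, unweighted graph (the follower graph is treated as undirected here purely for illustration), and all names are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (internal edge share - expected share)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by a single edge, split into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, partition))  # 5/14, about 0.357
```

Placing all six nodes in a single community gives Q = 0, which is why the two-triangle split is preferred.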

Community Entropy (CE), as defined in [7,12], is a measure of the proportional participation of the members of topological communities. More formally,

\[ \mathrm{CE} = -\sum_i X_i(t) \log_2 X_i(t) \tag{1} \]

where X_i(t) is the fraction of nodes in community i that are participating in the topic of interest at time t.

Community Entropy Ratio (CER) is an extension of CE that normalizes it by a theoretical maximum possible entropy under the circumstances:

\[ \mathrm{CER} = \frac{\mathrm{CE}}{\mathrm{MPE}} \tag{2} \]

where MPE is the maximum possible entropy, arrived at by distributing the known real-world volume of participants in the topological structure in such a way as to maximize the resultant CE value. Not all edge cases of MPE have been investigated, but the procedure is deterministic and so serves at least as a proxy measure. It distributes participatory nodes to the smallest communities first, maximizing their entropy contribution.
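The following sketch shows how CE and CER could be computed from per-community sizes and participant counts. The CE function follows (1) directly. The MPE function is only one possible reading of the smallest-communities-first description: it is an assumption of this example, not the procedure used in the paper, and it leaves unplaced any participants that remain after every community has received its share.

```python
import math


def entropy_term(x):
    """-x * log2(x), taken to be 0 at x = 0."""
    return -x * math.log2(x) if x > 0 else 0.0


def community_entropy(sizes, participants):
    """CE per (1): X_i = participants[i] / sizes[i] for each community i."""
    return sum(entropy_term(p / s) for s, p in zip(sizes, participants))


def max_possible_entropy(sizes, total_participants):
    """ASSUMED greedy MPE: visit communities smallest first and give each
    the participant count that maximizes its own entropy term."""
    remaining = total_participants
    mpe = 0.0
    for size in sorted(sizes):
        if remaining <= 0:
            break
        cap = min(size, remaining)
        ideal = size / math.e  # -x * log2(x) peaks at x = 1/e
        candidates = {
            min(max(math.floor(ideal), 1), cap),
            min(max(math.ceil(ideal), 1), cap),
        }
        k = max(candidates, key=lambda c: entropy_term(c / size))
        mpe += entropy_term(k / size)
        remaining -= k
    return mpe


def community_entropy_ratio(sizes, participants):
    """CER per (2); returns 0.0 when the MPE estimate is not positive."""
    mpe = max_possible_entropy(sizes, sum(participants))
    return community_entropy(sizes, participants) / mpe if mpe > 0 else 0.0


# Two communities of 10 nodes; 5 participants, all in the first community.
print(community_entropy([10, 10], [5, 0]))  # 0.5
print(community_entropy_ratio([10, 10], [5, 0]))
```

In this toy case the participants are concentrated in one community, so CE falls well below the greedy MPE estimate and the ratio is below 1.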

978-1-4799-0203-3/13/$31.00 ©2013 IEEE


II. MOTIVATION

Measures of intrinsic appeal or quality have been shown to be remarkably poor predictors of eventual success or "virality" of a concept in social media [13,14]. Topological/community entropy has shown great potential as a highly applicable early indicator metric. The authors of [7] applied community entropy to blog hyperlink topologies to quantify the likelihood of a controversial topic leading to violence.

In this paper we expand the existing work by applying the entropy metric to Twitter content and topology, to determine its utility for predictive analysis of Twitter dynamics, a relatively new data source. We also develop a new metric to address potential shortfalls of the existing work as applied to Twitter. While work has been done to characterize the dynamics of "successful" hashtags in Twitter [15], to our knowledge the question of differentiating "successful" tags/memes from "failed" ones has not been addressed.

TABLE I. SUMMARY OF HASHTAGS

Hashtag             Alias                  Volume   Birthdate
#bindersofwomen     #bindersfullofwomen    1643     16 October
#cantafford4more    -                      3780     02 October
#mittmath           -                      124      16 October
#realromney         #therealromney         690      10 October
#romneyshambles     -                      1933     26 July
#sketchydeal        -                      353      16 October
#sciencegirlthing   #realwomeninscience    645      22 June

III. METHODS

Twitter provides an API (Application Programming Interface). The API has two capture methods: real-time stream sampling, and historic tweet database queries. Twitter has three distinct real-time streams, but only the smallest, the “spritzer”, is freely available. The spritzer provides roughly 1%-2% of tweets with "statistically robust" sampling [16]. Historic queries return the first 1000 results from the past 10 days.

For each event, relevant hashtags were determined manually and used to obtain tweets through the API. Twitter limits the rate of queries, and to make the collection practical a node’s followers/followees were limited to 75. The global follower graph is constantly changing; the graph used in this analysis is necessarily a subgraph. Both spritzer and historic search were used in combination for optimum coverage. See Table I for tag information and Table II for data statistics. For Science Girl, both #sciencegirlthing and #realwomeninscience were binned together, as they both represent response to the video. The relevant user lists seeded the follower graphs.
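The capped collection of the follower graph can be sketched as a breadth-first crawl from the seed users. Here `fetch_neighbors` is a hypothetical stand-in for a rate-limited API call that returns a list of a user's followers/followees, and `max_depth` is an assumption of this example, since the crawl depth is not stated above. Only the cap of 75 comes from the text.

```python
from collections import deque


def build_follower_graph(seed_users, fetch_neighbors, cap=75, max_depth=1):
    """Crawl outward from the seed users, keeping at most `cap` neighbors
    per node. Returns a set of (user, neighbor) edges."""
    edges = set()
    seen = set(seed_users)
    queue = deque((user, 0) for user in seed_users)
    while queue:
        user, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand nodes beyond the depth limit
        for neighbor in fetch_neighbors(user)[:cap]:
            edges.add((user, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return edges


# Toy stand-in for the API: a fixed dictionary of neighbor lists.
fake_api = {"a": ["b", "c", "d"], "b": ["a", "e"]}
graph = build_follower_graph(["a"], lambda u: fake_api.get(u, []), cap=2)
print(sorted(graph))  # [('a', 'b'), ('a', 'c')]
```

Because each node contributes at most `cap` edges, the result is a subgraph of the full follower graph, as noted above.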

TABLE II. SUMMARY OF DATASETS

Event          Tweets   Users   Follower Graph                     Data Window
Election       8523     352     936,616 nodes, 2,994,022 edges     01 October - 17 October
Science Girl   645      156     523,150 nodes, 1,654,169 edges     21 June - 27 June

For the election event, success is defined as still being present during the 24 hours before the actual election (Table III). This measures tag persistence, and is a data-intrinsic measure. For the Science Girl event, success is defined as the observable real-world outrage and activism inspired by the video, and is a data-agnostic measure. Thus a combination of intrinsic and real-world measures is used [17-21].

TABLE III. DATA-INTERNAL DEFINITIONS OF SUCCESS

Hashtag             Tweets just prior to election   Successful
#bindersofwomen     18                              Yes
#cantafford4more    102                             Yes
#mittmath           0                               No
#realromney         1                               No
#romneyshambles     12                              Yes
#sketchydeal        0                               No
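The persistence measure in Table III can be sketched as a count of tweets in the 24 hours before a deadline, followed by a cutoff. The cutoff is not stated in the text; `min_tweets` is a hypothetical parameter, and the value 2 used below is chosen only because it reproduces the labels in Table III (any value from 2 to 12 would do the same).

```python
from datetime import datetime, timedelta


def tweets_before(timestamps, deadline, window=timedelta(hours=24)):
    """Number of tweets in the `window` immediately preceding `deadline`."""
    return sum(1 for ts in timestamps if deadline - window <= ts < deadline)


def is_successful(final_day_count, min_tweets):
    """Label a tag successful if it is still sufficiently present."""
    return final_day_count >= min_tweets


# Counts copied from Table III; min_tweets=2 is an illustrative cutoff.
table_iii = {
    "#bindersofwomen": 18,
    "#cantafford4more": 102,
    "#mittmath": 0,
    "#realromney": 1,
    "#romneyshambles": 12,
    "#sketchydeal": 0,
}
labels = {tag: is_successful(n, min_tweets=2) for tag, n in table_iii.items()}
print(labels)
```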

The general procedure is listed below. Note that it is adaptable to media other than Twitter; Twitter is simply the medium addressed here.

1. Determine an event’s signature memes/hashtags.
2. Use the markers to capture early-days data.
3. Construct a follower topology from the collected data.
4. Run Modularity Communities on the resulting topology graph to identify social communities.
5. Calculate CE and CER for each tag.
6. Use the results of step 5 to make early predictions about eventual impact.

In past work on hyperlink topologies, success was often measured by eventual high volume. As Twitter and web pages have radically different dynamics, that measure is inapplicable here. While idiomatic tags (such as #dontyouhate) may have more gradual adoption, event-related tags seem to have a different dynamic. We expect to see a large initial volume spike from real-time tweets. As observed in [15], political and other controversial tags seem to have particularly distinctive behavior in general.

IV. RESULTS

CE, CER, and tweet volume time series plots are shown in Figure 1. In general, entropy is represented by a red line and volume by a blue one. For CE, a value of ten times the smallest average value in the dataset is treated as a positive result. For CER we use a fixed threshold of 0.01; a fixed threshold is reasonable here because the ratio removes dependence upon the specific graph topology. Note that "early" in the event is relative to the event’s time scale. Figure 1 also shows that initial tweet volume is not correlated with eventual success.
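The two threshold rules can be sketched as follows. Several details are assumptions of this example rather than statements from the paper: the "smallest average value" is read as the smallest per-tag mean of the CE series, a value equal to the threshold counts as positive, and "early" is modeled as a fixed number of leading bins (`early_bins`). The sample series are made up.

```python
from statistics import mean


def ce_threshold(ce_series_by_tag, factor=10):
    """`factor` times the smallest per-tag average CE in the dataset."""
    return factor * min(mean(series) for series in ce_series_by_tag.values())


def is_positive(series, threshold, early_bins):
    """True if the metric reaches the threshold within the early bins."""
    return any(value >= threshold for value in series[:early_bins])


# Made-up CE series for two hypothetical tags.
ce_series = {"#tag_a": [0.0, 0.2, 0.1], "#tag_b": [0.01, 0.01, 0.01]}
threshold = ce_threshold(ce_series)
print(is_positive(ce_series["#tag_a"], threshold, early_bins=2))  # True
print(is_positive(ce_series["#tag_b"], threshold, early_bins=2))  # False

# CER uses the fixed threshold of 0.01 instead of a data-derived one.
print(is_positive([0.0, 0.02, 0.0], 0.01, early_bins=2))  # True
```

The contrast between the two calls is the point of the normalization: the CE threshold has to be recomputed for each dataset, while the CER threshold does not.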

The results indicate that CE is usually a good indicator of eventual success. It seems to work best with political hashtags, but it misses the Science Girl event. For example, with #bindersofwomen CE spikes early, and so provides a positive result. For #sciencegirlthing, CE was negative and CER was positive. Thus, CER improves upon CE without introducing any failures. See Table IV.

TABLE IV. METRIC RESULTS AND GROUND TRUTH

Hashtag             Successful   CE Positive   CER Positive
#bindersofwomen     Yes          Yes           Yes
#cantafford4more    Yes          Late          Late
#mittmath           No           No            No
#realromney         No           No            No
#romneyshambles     Yes          Yes           Yes
#sketchydeal        No           No            No
#sciencegirlthing   Yes          No            Yes
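The comparison in Table IV can be tallied mechanically. The sketch below copies the table rows and counts, for each metric, the rows that agree with ground truth, the rows marked "Late", and the misses. Treating "Late" as its own category is a choice made for this example.

```python
TABLE_IV = [
    # (hashtag, successful, CE positive, CER positive)
    ("#bindersofwomen", "Yes", "Yes", "Yes"),
    ("#cantafford4more", "Yes", "Late", "Late"),
    ("#mittmath", "No", "No", "No"),
    ("#realromney", "No", "No", "No"),
    ("#romneyshambles", "Yes", "Yes", "Yes"),
    ("#sketchydeal", "No", "No", "No"),
    ("#sciencegirlthing", "Yes", "No", "Yes"),
]


def tally(column):
    """Count agreements, late detections, and misses for one metric column
    (2 for CE, 3 for CER)."""
    result = {"agree": 0, "late": 0, "miss": 0}
    for row in TABLE_IV:
        truth, verdict = row[1], row[column]
        if verdict == "Late":
            result["late"] += 1
        elif verdict == truth:
            result["agree"] += 1
        else:
            result["miss"] += 1
    return result


print("CE: ", tally(2))  # {'agree': 5, 'late': 1, 'miss': 1}
print("CER:", tally(3))  # {'agree': 6, 'late': 1, 'miss': 0}
```

The single CE miss is #sciencegirlthing, which CER classifies correctly.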


Figure 1a. Entropy Metric and Volume Plots


Figure 1b. Entropy Metric and Volume Plots, Continued

V. CONCLUSIONS AND FUTURE WORK

In this paper, we examined the dynamics of the spread of information on Twitter, and re-applied a measure of entropy that has shown prior success on web page topologies to Twitter data. Further, we have introduced a metric, dubbed the Community Entropy Ratio. Based on the defined measures of success (hashtag persistence and/or observed real-world effect), Community Entropy alone was able to separate most of the events. However, the Community Entropy Ratio better predicted the success of the tags. Notably, the CER metric correctly categorized #sciencegirlthing, suggesting that the normalization allows for topology-agnostic result comparison. With CE alone, it is difficult at best to apply results from one topology to another, though both metrics have shown encouraging applicability to Twitter dynamics.

The hashtag #cantafford4more exhibits unexpected behavior, which may be because it was a purchased and promoted tag of the Romney campaign. Further examination of the behavior of promoted versus organically spreading hashtags would be necessary to better understand the possible complicating factors at work. Note again that this work was necessarily performed on a very small portion of the overall Twitter data, and yet yielded encouraging results toward determining the ultimate success of a hashtag or concept by characterizing early diffusion across topological communities.

There are other approaches that could support further investigation of Twitter information diffusion. Susceptible-Infected-Recovered (SIR) models, often used in epidemiology, may prove useful in drilling down to more fine-grained resolution of information spread and potentially support the ability to forecast when a hashtag will be “successful”.
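For orientation, a discrete-time SIR model can be sketched in a few lines. This is the standard textbook form and is not fitted to any data in this paper; the population size, `beta` (adoption rate), and `gamma` (abandonment rate) below are hypothetical values chosen only to make the example run.

```python
def simulate_sir(population, initial_infected, beta, gamma, steps):
    """Discrete-time SIR: returns a list of (S, I, R) tuples, one per step.

    In a hashtag setting, 'infected' would mean users actively tweeting
    the tag and 'recovered' users who have stopped.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical parameters for illustration only.
history = simulate_sir(population=1000, initial_infected=1,
                       beta=0.5, gamma=0.1, steps=50)
peak_step = max(range(len(history)), key=lambda t: history[t][1])
print("active users peak at step", peak_step)
```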

Applying this metric to more case studies will also provide more supporting evidence of its utility and applicability. At the same time, investigation into extensions upon the Community Entropy Ratio should be considered, in pursuit of an optimally versatile and predictive metric.

VI. REFERENCES

[1] G. Lotan, "All shook up: mapping earthquake news on Twitter from Virginia to Maine," SocialFlow, August 2011. Retrieved from http://blog.socialflow.com/post/7120244132/all-shook-up-mapping-earthquake-news-on-twitter-from-virginia-to-maine.

[2] J. Shaman and A. Karspeck, "Forecasting seasonal outbreaks of influenza," PNAS Early Edition, November 2012. Retrieved from http://www.pnas.org/content/early/2012/11/21/1208772109.abstract?sid=7757738c-6992-4178-bf8b-4a5838ab184b.

[3] L. Anderson, "Demystifying the Arab Spring," Foreign Affairs, April 2011. Retrieved from http://www.foreignaffairs.com/articles/67693/lisaanderson/demystifying-the-arab-spring.

[4] T.M. Chen, "How networks changed the world," Network, IEEE vol. 25, no. 6, pp. 2-3, November 2011.

[5] European Commission, "Science: it's a girl thing" [video file]. Retrieved from http://www.wired.co.uk/news/archive/201206/22/science-its-a-girl-thing.

[6] “What are hashtags (“#” symbols)?” Twitter.com. Retrieved from https://support.twitter.com/articles/49309-what-are-hashtags-symbols#.

[7] R. Colbaugh and K. Glass, "Early warning analysis for social diffusion events," Proc. 2010 International Conference on Intelligence and Security Informatics, May 2010.

[8] A. Java, X. Song, T. Finin, and B. Tseng, "Why we twitter: understanding microblogging usage and communities," Proc. 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, 2007.

[9] D. Centola, "The spread of behavior in an online social network experiment," Science, vol. 329, pp. 1194-1197, 2010.

[10] Institute for Complex Additive Systems Analysis, “ICASA analysis tools.” Retrieved from http://www.icasa.nmt.edu/index.php/dest/tools.

[11] A. Clauset, M. Newman, and C. Moore, “Finding community structure in very large networks,” Physical review E, vol. 70, no. 6, 2004.

[12] G. Ver Steeg and A. Galstyan, "Information transfer in social media," WWW 2012, April 2012.

[13] W. Walls, “Modeling movie success when ‘nobody knows anything’: Conditional stable-distribution analysis of film returns”, J. Cultural Economics, vol. 29, pp. 177-190, 2005.

[14] M. Salganik, P. Dodds, and D. Watts, “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market,” Science, vol. 311, pp. 854-856, 2006.

[15] D. Romero, B. Meeder, and J. Kleinberg, "Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter," Proc. 20th international conference on world wide web, ACM, 2011.

[16] “Public streams,” Twitter.com. Retrieved from https://dev.twitter.com/docs/streaming-apis/streams/public.

[17] "Internet goes wild over 'Binders full of women'," CNN Politics, October 2012. Retrieved from http://politicalticker.blogs.cnn.com/2012/10/16/binders-full-of-women-trending/.

[18] L. Petrecca, "'Binders full of women': jokes, then serious discussion," USA Today, October 2012. Retrieved from http://www.usatoday.com/story/news/nation/2012/10/17/romneys-binders-full-of-women/1639089/.

[19] A. Kantrowitz, "Did a promoted hashtag foreshadow Romney's foreign policy debate strategy?" Forbes, October 2012. Retrieved from http://www.forbes.com/sites/alexkantrowitz/2012/10/25/did-a-promoted-hashtag-foreshadow-romneys-foreign-policy-debate-strategy/.

[20] T. Edwards, "US and UK journalists take delight in romneyshambles," The Week, July 2012. Retrieved from http://www.theweek.co.uk/politics/us-election-2012/48196/us-and-uk-journalists-take-delight-romneyshambles.

[21] C. Rice, "Science: it's a girl thing! A viral fiasco," The Guardian, June 2012. Retrieved from http://www.guardian.co.uk/science/blog/2012/jun/29/science-girl-thing-viral-fiasco.
