Towards a Set Theoretical Approach to Big Data Analytics Raghava Rao Mukkamala Postdoc, Software and Systems Section IT University of Copenhagen, Denmark! Joint work with Ravi Vatrapu Professor, Department of IT Management Abid Hussain PhD student, Department of IT Management Copenhagen Business School, Denmark IEEE BigData 2014, Alaska, USA July 01, 2014 1
Road Map Part 1: Introduction and Motivation Part 2: Social Data Model Part4: Conclusion & Future work Part 3: Social Data Analytics Example: H & M Company 2
Social Media as Business Pla@orm Most used platforms: Facebook, Twitter, Youtube, LinkedIn Content published worldwide by million of social media users Social Data Sets contain valuable information http://60secondmarketer.com/blog/ 2010/04/09/top-52-social-media-platforms/ Can provide meaningful facts and actionable insights Vatrapu, R. (2013). Understanding social business. In Emerging Dimensions of Technology Management (pp. 147-158). Springer India.! Giglietto, F., Rossi, L., & Bennato, D. (2012). The Open Laboratory: Limits and Possibilities of Using Facebook, Twitter, and YouTube as a Research Data Source. Journal of Technology in Human Services, 30(3-4), 145-159. 3
Facebook- Social Media Pla@orm A Facebook wall offers 1 a new and more informal, but a public means for users provision to enter into discussions and debate with other users engagement can range from interactions under normal conditions to activities under extra-ordinary events such as Arab Spring Facebook gives you friends, while Twitter gives you followers! Structure and Data availability 2 Twitter: Simple and public Facebook: Complicated and Restricted privacy settings Research wise: Mostly on Twitter, very less on Facebook 1) ROBERTSON, S., VATRAPU, R. & MEDINA, R. (2010). Online Video Friends Social Networking: Overlapping Online Public Spheres in the 2008 U.S. Presidential Election. Journal of Information Technology & Politics, 7, 182-201 2) Giglietto, F., Rossi, L., & Bennato, D. (2012). The Open Laboratory: Limits and Possibilities of Using Facebook, Twitter, and YouTube as a Research Data Source. Journal of Technology in Human Services, 30(3-4), 145-159. 4
Social Data Research Approaches Ethnographical Approaches 1 social meaning inferred from content usually qualitative techniques based on triangulation methods Statistical Approaches 1 Computational supported statistical approach (correlation, regression, etc) e.g. study of twitter usage in natural disasters 2 Computational Approaches Computational Social Science: interdisciplinary approach, social scientists + computer scientists + mathematicians 3 building models, methods and concepts for analysis of large volume of data 1) Giglietto, F., Rossi, L., & Bennato, D. (2012). The Open Laboratory: Limits and Possibilities of Using Facebook, Twitter, and YouTube as a Research Data Source. Journal of Technology in Human Services, 30(3-4), 145-159. 2) Earle, P. (2010). Earthquake Twitter. Nature Geoscience, 3(4), 221 222. 3) Conte, R, et al. "Manifesto of computational social science." The European Physical Journal Special Topics. 5
Our Research Approach Formal Methods to develop advanced data analysis techniques for social data Formal methods: technique to model complex phenomena as mathematical entities abstract, precise and complete Current techniques limited to Social Network Analysis based on graph-theoretical approach (Relational Sociology) Our approach: based on Set theory and Fuzzy Logics (Associational Sociology) Primarily focussed on Facebook data 6
Research Methodology! Using Integrated Modeling approach! Conceptual Model of Social Data! Formal Model based on Set Theory! Social Data Analytics Tool (SODATO) 7
Social Data Conceptual Model Social Graph Analysis: structure of relationships emerging from social media use Which actors involved? What actions they perform? What activities they undertake? What artifacts they create and interact with? A. Hussain, R. Vatrapu, D. Hardt, and Z. Jaffari, Social data analytics tool: A demonstrative case study of methodology and software. in Analysing Social Media Data and Web Networks, M. C. Rachel Gibson and S. Ward, Eds. Palgrave Macmillan, 2014 (in press) 8
Social Data Conceptual Model Social Text Analysis: substantive nature of the interactions How the topics are discussed? Which keywords appear? Which pronouns are used? How are far positive/negative sentiments expressed? A. Hussain, R. Vatrapu, D. Hardt, and Z. Jaffari, Social data analytics tool: A demonstrative case study of methodology and software. in Analysing Social Media Data and Web Networks, M. C. Rachel Gibson and S. Ward, Eds. Palgrave Macmillan, 2014 (in press) 9
Formal Model - Social Data 10
Formal Model - Social Graph 11
Formal Model - Social Text 12
Formal Model - AcLons 13
Facebook Post - Example 14
H & M Facebook Data set H&M Swedish fast fashion retail cloths company 2009/01/01 to 2013/12/31 Total entries: 12.60 Million Albums+ 18% Posts 1% Comments 2% 9.95 Million likes 112,000 posts, Likes 79% 300, 000 comments Albums + comments & likes on Albums: 2.26 Million 15
SenLment Analysis Artifact (text) can be analysed by machine learning tools such as Google Prediction API 1 default sentiment label: positive (+) /neutral (0)/negative (-) a sentiment score such as {(+):82, (0):15, (-): 03} Only posts and comments can have sentiments likes and shares carry forward their parent artifact s sentiment H&M (u0) post u4 share Featuring new summer collection cloths. (r1) (+):58, (0):35, (-): 07 Yes, I love summer (r2) (+):35, (0):55, (-): 10 I love H &M (r3) (+):82, (0):15, (-): 03 like u1 like comment u6 Finally my high heel sandals arrived, and they broke!!! :-( (r6) (+):12, (0):21, (-): 67 u2 comment like u5 comment u8 u3 u7 1) Google-Inc, Google prediction api, September 2012, https:// developers.google.com/prediction/. 16
ArLfact SenLment DistribuLon Positive Negative Nuetral (+) (-) 2% Negative (-) 18 % (20,521) (-) (0) 4% 1.400.000 1.225.000 1.050.000 875.000 Positive (+) 33% (36,784) (+) (-) (0) 6% (+) (0) 6% Neutral (-) 31% (33,910) 700.000 525.000 350.000 175.000 0 Quarters 2009-01 2009-02 2009-03 2009-04 2010-01 2010-02 2010-03 2010-04 2011-01 2011-02 2011-03 2011-04 2012-01 2012-02 2012-03 2012-04 2013-01 2013-02 2013-03 2013-04 Distribution of post artifacts based on the post and comments sentiments Temporal distribution of Artifact sentiments (quarterly) 17
Actor SenLments - Actor Profiling Actors don't carry any direct sentiment Actors perform Actions on Artifacts Actor sentiment can be derived from their actions on artifacts Distribution of actors sentiments based on the actions they do on artifacts sentiments Positive (+) 17.9% (667 307) (+) (-) 3.3% Negative (-) 10.3% (384 906) (+) (-) (0) 11.1% (+) (0) 14.6% (-) (0) 3.9% Neutral (-) 38.9% (1 451 635) Total: 3.8 million unique users 800.000 700.000 600.000 500.000 400.000 300.000 200.000 100.000 Positive Negative Nuetral 0 Quarters 2009-01 2009-02 2009-03 2009-04 2010-01 2010-02 2010-03 2010-04 2011-01 2011-02 2011-03 2011-04 2012-01 2012-02 2012-03 2012-04 2013-01 2013-02 2013-03 2013-04 18
Actor SenLments - Actor Profiling - II 5000% Ratio of -ve/+ve sentiment Average -ve /+ve ratio (154%) 3750% 2500% 1250% 0% 2013-51 2013-52 2013-48 2013-45 2013-42 2013-39 2013-36 2013-33 2013-30 2013-27 2013-24 2013-21 2013-18 2013-15 2013-12 2013-09 2013-06 2013-03 2012-53 2012-50 2012-47 2012-44 2012-41 2012-38 2012-35 2012-32 2012-29 2012-26 2012-23 2012-20 2012-17 2012-14 2012-11 2012-08 2012-05 2012-02 2011-52 2011-49 2011-46 2011-43 2011-40 2011-37 2011-34 2011-31 2011-28 2011-25 2011-22 2011-19 2011-16 2011-13 2011-10 2011-07 2011-04 2011-01 Temporal distribution of -ve/+ve actor sentiment ratio (Weekly basis) Some of the peaks corresponds to real-world events such as Factory collapses in Bangladesh Rana Plaza incident (week 2013-17 - 2013-20) where 1129 people died Seven people killed in week 2013-41 19
H & M Sales - ArLfact SenLments Strong correlation between sales and +ve comments on non-h&m posts Strong correlation between sales and -ve posts by non-h&m Strong correlation between sales and -ve comments on non-h&m posts Strong correlation between sales and neutral posts by non-h&m 20
Future Work Formal model can abstracted further to model data from other social media channels such as Twitter Use of fuzzy sets and fuzzy logic to develop advanced analysis techniques Modeling of networks of groups and friends of users in an online social media platform More case studies to study consumer behaviour in case of crisis events Questions & Comments? 21