Big data, big research?



Similar documents
The bad and the ugly: Online firestorms and hate propagation in social media networks

The Big Data Paradigm Shift. Insight Through Automation

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank

IC05 Introduction on Networks &Visualization Nov

Advanced Big Data Analytics with R and Hadoop

DIGITS CENTER FOR DIGITAL INNOVATION, TECHNOLOGY, AND STRATEGY THOUGHT LEADERSHIP FOR THE DIGITAL AGE

Outline. What is Big data and where they come from? How we deal with Big data?

Large-Scale Data Processing

An NSA Big Graph experiment. Paul Burkhardt, Chris Waring. May 20, 2013

Exploring Big Data in Social Networks

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Collaborations between Official Statistics and Academia in the Era of Big Data

Can Flash help you ride the Big Data Wave? Steve Fingerhut Vice President, Marketing Enterprise Storage Solutions Corporation

BIG DATA: BIG BOOST TO BIG TECH

Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料

PROGRAM DIRECTOR: Arthur O Connor Contact: URL : THE PROGRAM Careers in Data Analytics Admissions Criteria CURRICULUM Program Requirements

Big Graph Processing: Some Background

AN INTRO TO DATA MANAGEMENT

The 4 Pillars of Technosoft s Big Data Practice

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

What is Data Science? Girl Develop It! Meetup Renée M. P. Teate, March 2015

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Hadoop. Sunday, November 25, 12

Information Flow and the Locus of Influence in Online User Networks: The Case of ios Jailbreak *

Big Analytics: A Next Generation Roadmap

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Big Data Analytics. Lucas Rego Drumond

5 Point Social Media Action Plan.

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Unlocking the Intelligence in. Big Data. Ron Kasabian General Manager Big Data Solutions Intel Corporation

Extreme Computing. Big Data. Stratis Viglas. School of Informatics University of Edinburgh Stratis Viglas Extreme Computing 1

Big Data: Study in Structured and Unstructured Data

Finding a needle in Haystack: Facebook s photo storage IBM Haifa Research Storage Systems

Temporal Visualization and Analysis of Social Networks

Are You Ready for Big Data?

Big Systems, Big Data

New Jersey Big Data Alliance

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

How To Handle Big Data With A Data Scientist

ONLINE SOCIAL NETWORK ANALYTICS

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Big Data a threat or a chance?

SOCIAL MEDIA ADVERTISING STRATEGIES THAT WORK

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

DATA MINING TECHNIQUES FOR CRM

Chapter 1. Contrasting traditional and visual analytics approaches

Sources: Summary Data is exploding in volume, variety and velocity timely

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Cray: Enabling Real-Time Discovery in Big Data

Big Data Mining: Challenges and Opportunities to Forecast Future Scenario

MLg. Big Data and Its Implication to Research Methodologies and Funding. Cornelia Caragea TARDIS November 7, Machine Learning Group

How To Scale Out Of A Nosql Database

The Data Engineer. Mike Tamir Chief Science Officer Galvanize. Steven Miller Global Leader Academic Programs IBM Analytics

BIG DATA CHALLENGES AND PERSPECTIVES

Data Centric Computing Revisited

DEVELOPING A SOCIAL MEDIA STRATEGY

What is Data Science? Data, Databases, and the Extraction of Knowledge Renée November 2014

Application Development. A Paradigm Shift

A Big Data Analytical Framework For Portfolio Optimization Abstract. Keywords. 1. Introduction

**NEW CLIENTS MAY NEED AN INITIAL SET- UP and ANALYSIS

We are Big Data A Sonian Whitepaper

Banking On A Customer-Centric Approach To Data

How To Understand The Benefits Of Big Data

NetApp Big Content Solutions: Agile Infrastructure for Big Data

CSC590: Selected Topics BIG DATA & DATA MINING. Lecture 2 Feb 12, 2014 Dr. Esam A. Alwagait

Big Data and Analytics: Challenges and Opportunities

Laurence Liew General Manager, APAC. Economics Is Driving Big Data Analytics to the Cloud

COMP9321 Web Application Engineering

Hadoop for Enterprises:

Putting Apache Kafka to Use!

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

How To Use Hadoop For Gis

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Social-Sensed Multimedia Computing

[Hadoop, Storm and Couchbase: Faster Big Data]

Web Archiving and Scholarly Use of Web Archives

How Companies are! Using Spark

Smart Policing Initiative Website and Social Media

July 2014

Big Data: Image & Video Analytics

Big Data Executive Survey

The Emerging Ethics of Studying Social Media Use with a Heritage Twist

EVERYTHING THAT MATTERS IN ADVANCED ANALYTICS

SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS

Transcription:

Big data, big research? Opportunities and constraints for computer supported social science Jürgen Pfeffer Digital Methods Vienna, Austria, November 2013

Agenda Look and feel of big data research How is big data research different from traditional social science research? Methodological problems Big data Online social networks How big are big data? Technical/algorithmic problems 2

Goals Understanding big data research approach Seeing the current limitations Feeling the future potentials 3

Jürgen Pfeffer Assistant Research Professor School of Computer Science Carnegie Mellon University Vienna University of Technology: BA: Computer Science PhD: Business Informatics Corporate Consultant, Freelancer Research Studios Austria Trainer for Rhetoric and Personal Performance 4

Jürgen Pfeffer Research focus: Computational analysis of organizations and societies Special emphasis on large scale systems Methodological and algorithmic challenges Methods: Network analysis theories and methods Visual analytics, geographic information systems Agent based simulations, system dynamics Center for Computational Analysis of Social and Organizational Systems 5

Challenges for Analyzing Large Scale Systems Data Mining Text Mining Data to Network Model Algorithms Change Detection Visual Analytics Geo Analysis Modeling Simulation Mining of large amounts of diverse data Automated data to network processing Dynamic network analysis and change detection Visual analytics of network data Modeling and simulation of real world networks Toward a Real Time Analysis of Large Scale Dynamic Socio Cultural Systems 6

Toward a Real Time Analysis of Large Scale Dynamic Socio Cultural Systems 7

Motivation & Hope A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors. access to terabytes of data describing minute by minute interactions and locations of entire populations of individuals [will] offer qualitatively new perspectives on collective human behavior. Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D., & Van Alstyne, M. (2009). Computational social science. Science, 323, 721 723. 8

Motivation & Hope Social media offers us the opportunity for the first time to both observe human behavior and interaction in real time and on a global scale. Golder, S. A., & Macy, M. W. (2012, January). Social science with social media. ASA footnotes, 40(1), 7. 9

Example: Interplay Social Media/Traditional Media Offline and online media reinforce one another Social media are an important information source for traditional media (Diakopoulos et al., 2012). Twitter is used as radar Social media hooks are connected to the media story Significant amount of dynamics are external events and factors outside the network (Myers et al., 2012) Online firestorms: Social Media Traditional Media Cross media dynamics 10

Interplay Social Media/Traditional Media Traditional Social Science approaches: Survey Twitter/Facebook users Interview journalists Observe media web sites Content analysis Etc. 11

Interplay Social Media/Traditional Media Data driven approach: Contrast Arabic tweets with English news articles (2 weeks): 7,763 English news articles ( Syria ) ( سوريا ( Syria, 61,633 Arabic written tweets from 10,186 users Arabic written keywords related to humanitarian crisis, e.g. violence, death, food, shelter, etc. to reduce tweets Pfeffer, J., Carley, K. M. (2012). Social Networks, Social Media, Social Change. Proceedings of the 2nd International Conference on Cross Cultural Decision Making: Focus 2012, San Francisco, CA. 12

Interplay Social Media/Traditional Media Data mining approach: Carlos Castillo (Qatar Computing Research Institute, Doha, Qatar) Mohammed El Haddad (Al Jazeera, Doha, Qatar) Matt Stempeck (MIT Media Lab, Cambridge, USA) Jürgen Pfeffer (Carnegie Mellon University, Pittsburgh, USA) 13

Data Collection AlJazeera.com beacon embedded in all article pages events are processed using Apache S4 collect and aggregate the visits with a 1 minute granularity data is stored using a Cassandra NoSQL database Facebook.com collect messages from Facebook discussing the articles using the Facebook Query Language API Twitter.com collect messages from Twitter discussing the articles Using the Twitter Search API 14

Data Collection Case Study, 1 week of data: Number of articles 606 Visits after 7 days 3.6 M Facebook shares 155 K# Tweets 80 K Where do the article visits come from 15

Interplay Social Media/Traditional Media Castillo, Carlos & El-Haddad, Mohammed & Pfeffer, Jürgen & Stempeck, Mat (2014, forthcoming). Characterizing the Life Cycle of Online News Stories Using Social Media Reactions. 17th ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW 2014), February 15-19, Baltimore, Maryland. 16

Interplay Traditional and Social Media Describing life cycle of online news stories Using early social media reactions 20 minutes of Social Media activities Can we estimate the 7 day visiting volume? Results: Social media reactions can contribute substantially to the understanding of visitation patterns in online news. After 20 Minutes In-depth News Facebook shares * * Twitter avg. followers *** - Volume of unique tweets - *** Twitter entropy *** *** 17 17

Al Jazeera Web Analytics Platform Al Jazeera launches predictive web analytics platform based on our research Media coverage: Qatar Tribune Doha News Gulf Times Fana News Albawaba Wan Ifra Rapid TV News Etc. 18

Big Data Principles: Collect All Data Collect all available data No sampling, N = all There are no unrelated data Messy data and bad data is good Thousands of ( independent ) variables We (the system) can decide later what is useful and what not 19

Data Driven Research Processes Social Science 1. Problem 2. Research Question/ Hypotheses 3. Theories 4. Methods 5. Data 6. Analysis 7. Result Presentation Typical Big Data Analysis 1. Methods 2. Data 3. Analysis 4. Result Presentation 5. Problem 20

Correlation not Cause: Babies and Storks Social Science Collect other (sociodemographic) variables Build hypotheses about underlying variables Figure out that education is a good predictor for babies and storks (non cities) Question: Why? Big Data Analysis Include ~1,200 variables in a regression like model. Number of storks and avg. car gas consumption are good enough predictors for number of babies Goodness of fit 21

Many Variables: Statistical Issues I 1 st example: 1 variable y, 100 elements, random 0 1 1 variable x, 100 elements, random 0 1 Cor(x,y) = ~0.00 2 nd example: 1 variable y, 100 elements, random 0 1 100 variable x n, 100 elements, random 0 1 Cor(x n,y) =? Cor(x n,y) Something always correlates x n 22

Many Variables: Statistical Issues II 1 st example: 1 variable y, 100 elements, random 0 1 1 variable x, 100 elements, random 0 1 r² lm(x,y) = ~.0 2 nd example: 1 variable y, 100 elements, random 0 1 100 variable x n, 100 elements, random 0 1 r² lm(x 1 x n,y) =? r² If you use enough variables, your r² is always high Number of variables 23

N = All Is it all? All of what? Is it all of what we want? Is it all of what we think it is? 24

Multi Level Bias Problem 1. Do the people online represent society? 2. Do the people that are online behave like offline? 3. Do the created data represent human behavior? 4. Do the analyzed data represent the created data? C A B 25

Do Created Data Represent Human Behavior? Pfeffer, J. & Zorbach, T. & Carley, K.M. (2013). Understanding online firestorms: Negative word of mouth dynamics in social media networks. Journal of Marketing Communications 26

Empirical Observations/Factors Hundreds of friends create many information Offline: Hierarchical groups of alters (Zhou et al., 2005) Strength of ties amount of time, the emotional intensity, the intimacy, and the reciprocal service (Granovetter, 1973) In social media, every connection gets the same amount of attention Massive unrestrained information flow 27

Empirical Observations/Factors Amplified epidemic spreading, network clusters Average Facebook user Ann: 130 friends Ben posts a very interesting piece of information Ben s friends like what Ben says (Homophily) Ben s friends are also friends with Ann (Transitivity) Ann receive a large amount of posts to one topic Amplifying effects of opinion forming: echo chambers (Key, 1966) Network clusters & echo chambers 28

Empirical Observations/Factors Amplified epidemic spreading, network clusters Transitive link creations (Heider, 1946) Interpersonal communication networks have significant local clustering (e.g. Pfeffer and Carley, 2011) Local clusters are important for diffusion (e.g. Pfeffer and Carley, 2013) C A B Pfeffer, Jürgen & Carley, Kathleen M. (2013). The Importance of Local Clusters for the Diffusion of Opinions and Beliefs in Interpersonal Communication Networks. International Journal of Innovation and Technology Management 10 (5) Pfeffer, Jürgen & Carley, Kathleen M. (2011). Modeling and Calibrating Real World Interpersonal Networks. Proceedings of the IEEE NSW 2011, 1st International Workshop on Network Science, 9 16. 29

Do Created Data Represent Human Behavior? Transitive link creations (Heider, 1946) Programmers of social media know this E.g. ~70% of new links on LinkedIn are triadic closure Groups to follow, etc. Social media systems intensify this effect with link suggestions Do we analyze society or a software implementation? C A B 30

Do the Analyzed Data Represent the System? Morstatter, F. & Pfeffer, J.& Liu, Huan & Carley, K.M. (2013). Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose. ICWSM, Boston, MA. 31

Sampling Twitter Data Firehose feed 100% costly. Streaming API feed 1% free. We don t know how Twitter samples data. Is the sampled data from the Streaming API representative of the true activity on Twitter s Firehose? 32

Sampling Twitter Data 33

Sampling Twitter Data 42% Overall Coverage Daily Coverage from 17% to 89%. Can we find the right key player? Measure k Average Agreement (min-max) All 28 Days In-Degree 10 4.21 (0-9) 4 In-Degree 100 53.4 (36-82) 73 Betweenness 100 54.8 (41-81) 55 Potential Reach 100 59.2 (32-83) 80 34

Multi Level Bias Problem? Do the people online represent society?? Do the people that are online behave like offline?? Do the created data represent human behavior?? Do the analyzed data represent the created data? C A B 35

How Big are Big Data? Facebook is collecting your data 500 terabytes a day 2.5 billion status updates, posts, photos, videos, comments per day 2.7 billion Likes per day 300 million photos uploaded per day $10M $20M/year http://gigaom.com/2012/08/22/facebook is collecting your data 500 terabytes a day/ Total world Internet traffic in 2012: 1.1 Exabytes per day (1000petabytes = 1 million terabytes = 1 billion gigabytes) http://www.cisco.com/web/solutions/sp/vni/vni_forecast_highlights/index.html Store just metadata! 36

Big Networks: Memory Space 1 billion vertices, 100 billion edges 111 PB adjacency matrix 2.92 TB adjacency list 2.92 TB edge list Burkhardt & Waring, An NSA Big Graph experiment 37

Top Supercomputer Installations Titan Cray XK7 at ORNL #1 Top500 in 2012 0.5 million cores 710 TB memory 8.2 Megawatts 4300 sq.ft. Sequoia IBM Blue Gene/Q at LLNL #1 Graph500 in 2012 1.5 million cores 1 PB memory 7.9 Megawatts 3000 sq.ft. $7 million per year energy costs Source: Burkhardt & Waring, An NSA Big Graph experiment 38

Calculation Time State of the art algorithms for SNA metrics require Θ space and run in Θ time, some in Θ or Θ with n = number of nodes, m = number of edges. Example: 50k nodes, 193k edges Betweenness centrality (Freeman 1979) 1 processor, laptop: 51.23 min 39

Calculation Time Betweenness Centrality with Facebook 264,399,256,813 min (500k years) With 1,000,000 cores: 0.5 years With 10x faster cores: 18.4 days Approximations and localized algorithms 40

Large Networks? 41

Large Networks? 42

Large Networks: New Algorithms? What are the interpretations of our traditional measures for large networks? E.g., does node nr. 1 sit in the center or rather on the fence? What does it mean to be the most central actor on Facebook? Approximation algorithms are new metrics! What are the research questions for large networked data at all? 43

Summary Big hopes and dreams related to big data especially to social media data Data driven research is different Combining this with traditional social science research is not trivial Multi level bias problem of social media data Sampling issues Representative Big data are really big But it is possible to handle big data/networks Many metrics have been developed for small groups, validity for big data is often not guaranteed 44

Other Issues Privacy Surveillance You are the product not the customer When correlations lead to predictions and interventions Predictive policing 45

What needs to be done? Don t be blinded by big data Ask questions: What do we learn from a study? Do the authors ask why? Good old research process is still important Don t be satisfied with one needle (especially, when you dream of the haystack) Let s utilize big data! But with care. 46 46

Conclusions: Mixed Methods The digital records of online behavior and social interaction hold the promise of opening up a new era in the social and behavioral sciences, but when and whether this opportunity is realized may depend on the involvement and leadership of sociologists with the necessary technical and computational skills. Online data should therefore be viewed as a complement to, and not substitute for, data collected by traditional methods. Indeed, in many cases, the value of online data may depend on opportunities to integrate with data obtained from surveys. Golder, S. A., & Macy, M. W. (2012, January). Social science with social media. ASA footnotes, 40(1), 7. 47

Conclusions Initially, computational social science needs to be the work of teams of social and computer scientists. In the long run, the question will be whether academia should nurture computational social scientists, or teams of computationally literate social scientists and socially literate computer scientists. Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D., & Van Alstyne, M. (2009). Computational social science. Science, 323, 721 723. Ph.D. program in Computation, Organizations and Society (COS) Computing About and For Society Apply: http://www.isri.cmu.edu/education/cos-phd/application.html 48

Our mission is to go forward, and it has only just begun. There's still much to do, still so much to learn. Engage! Jean-Luc Picard, TNG Season 1 Ep. 26 Jürgen Pfeffer jpfeffer@cs.cmu.edu