Exploring Big Data in Social Networks




Exploring Big Data in Social Networks. virgilio@dcc.ufmg.br (meira@dcc.ufmg.br). INWEB, National Science and Technology Institute for Web, Federal University of Minas Gerais (UFMG). May 2013

Some thoughts about computing, future and innovation

What happens in 60 seconds on the Internet?

Explosion of Web Data

BIG DATA: data collection, storage, management, automated large-scale analysis

Research interests (algorithms around social networks): VERY large graphs; data mining; analytics; BIG DATA; algorithms and MACHINE LEARNING; systems; infrastructure; cloud; characterization; SOCIAL and ECONOMICS; characterization; models; incentives; privacy; network effects; crowdsourcing; anti-social behavior; spam and malware

The fundamental challenge of Big Data is not collecting data -- it's making sense of it. 1) What is the starting point? 2) What are the computation paths to discovery? 3) What are the appropriate algorithms? 4) How to visualize the findings?

Analysis: Experimental Methodology. Steps: Measure, Analyze, Model, Synthesize. [Diagram labels: Models; What-if questions; Distributions of Random Variables; Algorithms; Logs and Traces; Synthetic Workloads; Observations; Validation; Artifacts]

Challenges in Online Social Networking Research. Explosive growth in size, complexity, and unstructured data; enabled by various experimental methods: observational studies, simulations, ..., huge amounts of data. "It is big data, the vast sets of information gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers' privacy." (New York Times, May 21)

Enablers of Big Data. Hardware capability: storage capacity; network bandwidth; processing capacity (exponentially increasing capability at constant cost). Applications & Algorithms: online social networking; algorithmic breakthroughs in machine learning and data mining. Cloud: cost reductions and scalability improvements in computation. Sensors everywhere.

Price of 1 gigabyte of storage over time

Year    Cost
1981    $300,000
1987    $50,000
1990    $10,000
1994    $1,000
1997    $100
2000    $10
2004    $1
2012    $0.10
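The price series above implies a steady exponential decline. As a rough illustration (not part of the original slides), the sketch below computes the average yearly price drop between the first and last data points, using a geometric mean. The function name `annual_decline` is made up for this example.

```python
# Year/cost pairs copied from the slide (US dollars per gigabyte).
PRICE_PER_GB = {
    1981: 300_000.0, 1987: 50_000.0, 1990: 10_000.0, 1994: 1_000.0,
    1997: 100.0, 2000: 10.0, 2004: 1.0, 2012: 0.10,
}


def annual_decline(prices):
    """Average fractional price drop per year between the first and last year."""
    first_year, last_year = min(prices), max(prices)
    years = last_year - first_year
    ratio = prices[last_year] / prices[first_year]
    # Geometric mean of the year-over-year price ratio, expressed as a drop.
    return 1.0 - ratio ** (1.0 / years)


if __name__ == "__main__":
    print(f"Average yearly price drop: {annual_decline(PRICE_PER_GB):.1%}")
```

A geometric mean is used because the prices fall by a multiplicative factor each year, so an arithmetic average of the dollar differences would be dominated by the earliest years.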

OSN Research Focus. 1. Understand: characteristics of social graphs of real data; 2. Discover: properties of social graphs; 3. Engineer: social graph built.

OSN research approach. Computational sociology: a natural sciences approach. Gather and analyze OSN data to study problems in sociology. Social computing: an engineering approach. Build systems that support / leverage human social interactions. Understand human behavior (as opposed to considering it annoying noise). Inspired by sociological theories.

The Atlantic


Understanding Factors that Affect Response Rates in Twitter (*). Active users can receive 1000 tweets per day. Approximately 36% of all tweets are worth reading, 39% are neutral, and 25% are junk. Interesting questions: Do Twitter users receive more information than they are able to consume? Is it possible to identify factors that affect interactions (replies and retweets)? (*) ACM Hypertext 2012, joint work with Giovanni Comarela, Mark Crovella, F. Benevenuto

Datasets: big data. Collected in August/September 2009, the dataset contains the following information. Users: 54,981,152; Tweets: 1,755,925,520 (almost a complete history); Social graph: 1,963,263,821 social links. It also contains information related to replies and retweets (interactions).

Characterization. Waiting times (overload evidence): how long does a tweet wait in the timeline before it is replied to (retweeted)? Factors that affect interactions: message age; previous interactions; sending rate.

Waiting Times
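As a minimal sketch of how a waiting time can be measured, assume each tweet has a posting timestamp and each reply or retweet records the tweet it reacts to plus its own timestamp. The record layout and the name `waiting_times` are hypothetical; the slides do not describe the data format.

```python
from statistics import median


def waiting_times(posted_at, interactions):
    """Seconds between a tweet being posted and each reply/retweet to it.

    posted_at:    dict mapping tweet_id -> posting time (seconds since epoch)
    interactions: iterable of (tweet_id, reaction_time) pairs
    Interactions that refer to unknown tweets are skipped.
    """
    waits = []
    for tweet_id, reacted_at in interactions:
        if tweet_id in posted_at and reacted_at >= posted_at[tweet_id]:
            waits.append(reacted_at - posted_at[tweet_id])
    return waits


if __name__ == "__main__":
    posted = {"a": 0, "b": 100}
    reactions = [("a", 30), ("b", 160), ("c", 5)]
    waits = waiting_times(posted, reactions)
    print(waits, median(waits))
```

The distribution of these values (not just the median) is what would reveal overload, since a long tail suggests that tweets sit unread for a while before being acted on.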

Message Age

Previous interaction. Are users who were previously replied to (retweeted) more likely to be replied to (retweeted) again? We computed, for each user i, the conditional probability that a message m will be replied to (retweeted) by i given that i has replied to (retweeted) the sender of m before.
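The conditional probability described above can be estimated by counting. The sketch below assumes that, for one user i, we have a chronological list of received messages as (sender, replied) pairs. This input format and the function name are assumptions made for illustration, not the authors' pipeline.

```python
def reply_probability_by_history(timeline):
    """Estimate P(reply | prior interaction) and P(reply | no prior interaction).

    timeline: chronological list of (sender, replied) pairs for one user i,
              where replied is True if i replied to (or retweeted) the message.
    Returns a pair of floats; an entry is None if that group has no messages.
    """
    seen = set()  # senders that user i has already replied to
    counts = {True: [0, 0], False: [0, 0]}  # prior -> [replies, messages]
    for sender, replied in timeline:
        prior = sender in seen
        counts[prior][1] += 1
        if replied:
            counts[prior][0] += 1
            seen.add(sender)

    def ratio(pair):
        return pair[0] / pair[1] if pair[1] else None

    return ratio(counts[True]), ratio(counts[False])


if __name__ == "__main__":
    events = [("a", True), ("a", True), ("b", False), ("c", False)]
    print(reply_probability_by_history(events))
```

Reporting the no-prior-interaction rate alongside the conditional one gives a baseline, which is what makes the comparison meaningful.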

Sending rate. Are users with a higher sending rate more likely to be replied to (retweeted)? For each user i and each j ∈ Out_i, we compared the sending rate of j with the fraction of her tweets replied to (retweeted) by i.
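One way to set up this comparison is to build, for a single user i, one (sending rate, fraction replied) pair per followed account. The sketch assumes that the set written Out_i on the slide is the set of accounts that user i follows and that per-account counts are already available; both are assumptions, and the names are hypothetical.

```python
def rate_vs_response(tweets_from, replied_by_i, days):
    """Pair each followed account's sending rate with i's response fraction.

    tweets_from:  dict j -> number of tweets j sent in the window (j in Out_i)
    replied_by_i: dict j -> number of those tweets that user i replied to
    days:         length of the observation window in days
    Returns a list of (tweets per day, fraction replied), sorted by rate.
    """
    pairs = []
    for j, sent in tweets_from.items():
        if sent == 0:
            continue  # no tweets means the fraction is undefined
        pairs.append((sent / days, replied_by_i.get(j, 0) / sent))
    return sorted(pairs)


if __name__ == "__main__":
    sent = {"x": 10, "y": 100, "z": 0}
    replied = {"x": 5, "y": 10}
    print(rate_vs_response(sent, replied, days=10))
```

Plotting or correlating these pairs across many users would then show whether heavy senders receive a smaller share of responses.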

Reorganizing the Twitter Timeline. Use the knowledge presented to create a new way to show tweets to users: more interesting tweets (those more likely to be replied to or retweeted) go at the top of the timeline. Two schemes: Naive Bayes (NB); Support Vector Machine (SVM). Three attributes: Age(m): age of m; SR(m): sending rate of the sender of m; I(m): binary indicator for previous interactions with the sender of m.
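To make the Naive Bayes scheme concrete, here is a small self-contained sketch that ranks messages by their estimated probability of interaction using the three attributes. The slides do not say how Age(m) and SR(m) were modeled, so the binary thresholds in `discretize`, the Laplace smoothing, and all names below are assumptions for illustration only.

```python
import math


def discretize(age_hours, sending_rate, interacted_before):
    """Map (Age(m), SR(m), I(m)) to binary features; thresholds are illustrative."""
    return (age_hours < 1.0, sending_rate < 10.0, bool(interacted_before))


class NaiveBayes:
    """Naive Bayes over binary features with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = {True: 0, False: 0}
        self.feature_counts = {}  # (label, position, value) -> count
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for pos, value in enumerate(row):
                key = (label, pos, value)
                self.feature_counts[key] = self.feature_counts.get(key, 0) + 1
        return self

    def log_score(self, row, label):
        total = sum(self.class_counts.values())
        n_label = self.class_counts[label]
        score = math.log((n_label + 1) / (total + 2))  # smoothed class prior
        for pos, value in enumerate(row):
            count = self.feature_counts.get((label, pos, value), 0)
            score += math.log((count + 1) / (n_label + 2))  # two values per feature
        return score

    def prob_interaction(self, row):
        pos, neg = self.log_score(row, True), self.log_score(row, False)
        return 1.0 / (1.0 + math.exp(neg - pos))


def rank_timeline(model, messages):
    """messages: list of (message_id, age_hours, sending_rate, interacted_before)."""
    scored = [(model.prob_interaction(discretize(*m[1:])), m[0]) for m in messages]
    return [message_id for _, message_id in sorted(scored, reverse=True)]


if __name__ == "__main__":
    rows = [discretize(0.5, 2, 1), discretize(0.2, 3, 1),
            discretize(5, 50, 0), discretize(8, 40, 0)]
    model = NaiveBayes().fit(rows, [True, True, False, False])
    print(rank_timeline(model, [("old", 9, 60, 0), ("fresh", 0.3, 1, 1)]))
```

An SVM would slot into the same `rank_timeline` step by replacing `prob_interaction` with the classifier's decision score.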

Results

Google+. New Kid on the Block: Exploring the Google+ Social Graph, ACM Internet Measurement Conference, Sigcomm, 2012, Boston. Joint work with G. Magno, G. Comarela, D. Saez and Meeyoung Cha.

Online Social Networks. OSNs now reach 82% of the world's Internet-using population (1.2 billion). Social networking accounts for 19% of all time spent online. Social networking is the most popular online activity worldwide. Source: comScore, December 21, 2011

Google+ Growth. [Figure: number of users vs. days] Google+ is the fastest growing OSN.

Goal: characterization. Analyze how much and what kind of personal information people share in Google+. Measure statistics of the Google+ social graph and compare with other OSNs. Evaluate the impact of geography on user behavior in Google+.

Dataset: big data. Nov. 11th to Dec. 27th (2011). 27,556,390 profiles; 35,114,957 nodes; 575,141,097 edges.

What kind of information do people share more?

Privacy Concerns. Users who reveal more information on their profiles face greater privacy risk. In Facebook (young users, to friends)¹: 64.1% share e-mail; 10.7% share telephone; 10.7% share home address.

What kind of information do people share more? In Google+ (public): 0.22% share work contact; 0.21% share home contact; 0.26% share telephone numbers (72,736 users). Users who shared a telephone number are called tel-users.

Number of fields shared in profile. Tel-users share more information.

Information shared by users. Women are less likely to share a phone number. The majority of tel-users are single; a smaller fraction of them are in a relationship. Fraction of Indian users in the tel-users group is twice as big as in other countries.

How are people connected on Google+?

Structural Characteristics of Social Graphs. Hidden edges: higher avg. path length. Higher reciprocity = more social. Diameter similar to Twitter, lower than Facebook. New network: lower number of friends.

Structural Characteristics: clustering coefficient. Higher clustering coefficient than Twitter.
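Reciprocity and clustering coefficient can both be computed directly from an edge list. The sketch below uses common textbook definitions: reciprocity as the fraction of directed edges whose reverse edge also exists, and the average local clustering coefficient on the undirected version of the graph. The slides do not state which variants were used, so treat these definitions and function names as assumptions.

```python
from itertools import combinations


def reciprocity(edges):
    """Fraction of directed edges (u, v) for which (v, u) is also present."""
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


def average_clustering(edges):
    """Mean local clustering coefficient, ignoring edge direction."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    coefficients = []
    for adjacent in neighbors.values():
        if len(adjacent) < 2:
            coefficients.append(0.0)  # fewer than two neighbors: no triangles possible
            continue
        pairs = list(combinations(adjacent, 2))
        linked = sum(1 for a, b in pairs if b in neighbors[a])
        coefficients.append(linked / len(pairs))
    return sum(coefficients) / len(coefficients) if coefficients else 0.0


if __name__ == "__main__":
    graph = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "a")]
    print(reciprocity(graph), average_clustering(graph))
```

This brute-force version is only meant for small examples; graphs with hundreds of millions of edges, like the one in the dataset, need sampling or distributed computation.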

What is the impact of geography on the social relationships?

Geo-location Information. Question: is the geographical location of users an important factor in the formation of social links? Extract GPS coordinates from map image. Retrieve country information. 6,621,644 users with valid country information.

Patterns Across Geo-locations. [Figure: average path miles] 58% of friends were separated by less than a thousand miles. Physical distance influences the intensity of the relationship.
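A figure like the share of friends within a thousand miles can be derived from pairs of coordinates with a great-circle distance. The sketch below uses the haversine formula with a mean Earth radius. The input layout (pairs of latitude/longitude tuples) and the function names are assumptions, and the slides do not say which distance formula was used.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def fraction_within(friend_pairs, limit_miles=1000.0):
    """Share of friend pairs closer than limit_miles.

    friend_pairs: list of ((lat, lon), (lat, lon)) tuples, one per social link.
    """
    distances = [haversine_miles(*a, *b) for a, b in friend_pairs]
    if not distances:
        return 0.0
    return sum(d < limit_miles for d in distances) / len(distances)


if __name__ == "__main__":
    pairs = [((0.0, 0.0), (0.0, 1.0)), ((0.0, 0.0), (0.0, 90.0))]
    print(fraction_within(pairs))
```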

Social Links Across Geography. Are users in the same country more likely to be friends than users in different countries? The US dominates the influx of edges. Populous countries have more self-loops.
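These observations can be reproduced by collapsing the user graph into a country-level graph. The sketch below reads a "self-loop" as an edge whose two endpoints are in the same country and "influx" as edges arriving from a different country. That reading, the input layout, and the function name are assumptions for illustration.

```python
from collections import Counter


def country_edge_stats(edges, country_of):
    """Summarize directed user edges at the country level.

    edges:      iterable of (source_user, target_user)
    country_of: dict mapping user -> country code (users without one are skipped)
    Returns (share of edges within a single country, Counter of incoming
    cross-country edges per destination country).
    """
    flows = Counter()
    for u, v in edges:
        if u in country_of and v in country_of:
            flows[(country_of[u], country_of[v])] += 1
    total = sum(flows.values())
    same = sum(n for (src, dst), n in flows.items() if src == dst)
    influx = Counter()
    for (src, dst), n in flows.items():
        if src != dst:
            influx[dst] += n
    return (same / total if total else 0.0), influx


if __name__ == "__main__":
    countries = {"a": "US", "b": "US", "c": "BR", "d": "IN"}
    links = [("a", "b"), ("c", "a"), ("d", "a"), ("c", "d"), ("x", "a")]
    share, influx = country_edge_stats(links, countries)
    print(share, influx.most_common(1))
```

Comparing the same-country share against what random linking would give (based on each country's share of users) is what would answer the question on the slide. The raw share alone favors populous countries.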

G+ Observations. Google+ is more social than Twitter: higher reciprocity; higher clustering coefficient; reflects offline relationships. Users exhibit different notions and expectations in Google+, based on geography: privacy; content; connections.

Concluding Remarks. Big data has created new opportunities for scientific discoveries in the realm of social computing: user preference understanding; data mining; summarization and aggregation; explorative analysis of large data sets; privacy; scalable services.