CMU SCS Large Graph Mining Patterns, Tools and Cascade analysis

Similar documents
Graph Mining Techniques for Social Media Analysis

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

Scaling Up HBase, Hive, Pegasus

Introduction to Graph Mining

Big Graph Mining: Algorithms and Discoveries

Big Data Analytics Process & Building Blocks

Graph Processing and Social Networks

Big Data Analytics Building Blocks. Simple Data Storage (SQLite)

Graph models for the Web and the Internet. Elias Koutsoupias University of Athens and UCLA. Crete, July 2003

Big Data Analytics Building Blocks. Simple Data Storage (SQLite)

Big Data Analytics Building Blocks; Simple Data Storage (SQLite)

Cost effective Outbreak Detection in Networks

Jure Leskovec Stanford University

The Internet Is Like A Jellyfish

Social Network Mining

MMap: Fast Billion-Scale Graph Computation on a PC via Memory Mapping

Search in BigData2 - When Big Text meets Big Graph 1. Introduction State of the Art on Big Data

Analyzing Big Data with AWS

Exploring Big Data in Social Networks

How To Find Out How A Graph Densifies

Reconstruction and Analysis of Twitter Conversation Graphs

Influence Propagation in Social Networks: a Data Mining Perspective

AN INTRODUCTION TO SOCIAL NETWORK DATA ANALYTICS

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

An Analysis of Verifications in Microblogging Social Networks - Sina Weibo

INFO 2950 Intro to Data Science. Lecture 17: Power Laws and Big Data

Social Networks and Social Media

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop

Data Mining Meets HCI: Making Sense of Large Graphs Duen Horng (Polo) Chau

Dynamics of information spread on networks. Kristina Lerman USC Information Sciences Institute

Large-Scale Data Processing

Evaluating Online Payment Transaction Reliability using Rules Set Technique and Graph Model

Bayesian networks - Time-series models - Apache Spark & Scala

Marko Grobelnik Jozef Stefan Institute Ljubljana, Slovenia

The Big Data Paradigm Shift. Insight Through Automation

Big Graph Processing: Some Background

Social Prediction in Mobile Networks: Can we infer users emotions and social ties?

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Outline. What is Big data and where they come from? How we deal with Big data?

Mining Large Graphs. ECML/PKDD 2007 tutorial. Part 2: Diffusion and cascading behavior

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

MLg. Big Data and Its Implication to Research Methodologies and Funding. Cornelia Caragea TARDIS November 7, Machine Learning Group

Estimating PageRank Values of Wikipedia Articles using MapReduce

Introduction. A. Bellaachia Page: 1

Big Data Analytics. Lucas Rego Drumond

Introduction to Data Mining

BIG DATA TRENDS AND TECHNOLOGIES

Next presentation starting soon Business Analytics using Big Data to gain competitive advantage

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Using Data Mining to Detect Insurance Fraud

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Information Management course

Time Series Analysis and Forecasting Methods for Temporal Mining of Interlinked Documents

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Concept and Project Objectives

Social Network Analysis

Hadoop for Enterprises:

Part 1: Link Analysis & Page Rank

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Sunnie Chung. Cleveland State University


BIG DATA TOOLS. Top 10 open source technologies for Big Data

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Hadoop Parallel Data Processing

Large Scale Social Network Analysis

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group)

Community-Aware Prediction of Virality Timing Using Big Data of Social Cascades

Unlocking the Intelligence in. Big Data. Ron Kasabian General Manager Big Data Solutions Intel Corporation

SOCIAL NETWORK DATA ANALYTICS

Tree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems

Dynamics of Large Networks. Jurij Leskovec

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Transcription:

Large Graph Mining Patterns, Tools and Cascade analysis Christos Faloutsos CMU

Roadmap Introduction Motivation Why big data Why (big) graphs? Patterns in graphs Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 2

Why big data Why big data? What is the problem definition? What are the major research challenges? C. Faloutsos (CMU) 3

Main message: Big data: often > experts Super Crunchers Why Thinking-By-Numbers is the New Way To Be Smart by Ian Ayres, 2008 Google won the machine translation competition 2005 http://www.itl.nist.gov/iad/mig//tests/mt/2005/doc/ mt05eval_official_results_release_20050801_v3.html C. Faloutsos (CMU) 4

Problem definition big picture Tera/Peta-byte data Analytics Insights, outliers C. Faloutsos (CMU) 5

Roadmap Introduction Motivation Why big data Why (big) graphs? Patterns in graphs Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 6

Graphs - why should we care? >$10B revenue Food Web [Martinez 91] >0.5B users Internet Map [lumeta.com] C. Faloutsos (CMU) 7

Graphs - why should we care? IR: bi-partite graphs (doc-terms) D 1...... T 1 web: hyper-text graph D N T M... and more: C. Faloutsos (CMU) 8

Graphs - why should we care? viral marketing web-log ( blog ) news propagation computer network security: email/ip traffic and anomaly detection... Subject-verb-object -> graph Many-to-many db relationship -> graph C. Faloutsos (CMU) 9

Outline Introduction Motivation Patterns in graphs Static graphs Time evolving graphs Radius, conn components Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 10

Problem - network and graph mining What does the Internet look like? What does FaceBook look like? What is normal / abnormal? which patterns/laws hold? C. Faloutsos (CMU) 11

Problem - network and graph mining What does the Internet look like? What does FaceBook look like? What is normal / abnormal? which patterns/laws hold? To spot anomalies (rarities), we have to discover patterns C. Faloutsos (CMU) 12

Problem - network and graph mining What does the Internet look like? What does FaceBook look like? What is normal / abnormal? which patterns/laws hold? To spot anomalies (rarities), we have to discover patterns Large datasets reveal patterns/anomalies that may be invisible otherwise C. Faloutsos (CMU) 13

Graph mining Are real graphs random? C. Faloutsos (CMU) 14

Laws and patterns Are real graphs random? A: NO!! Diameter in- and out- degree distributions other (surprising) patterns So, let s look at the data C. Faloutsos (CMU) 15

Solution# S.1 Power law in the degree distribution [SIGCOMM99] internet domains log(degree) att.com ibm.com log(rank) C. Faloutsos (CMU) 16

Solution# S.1 Power law in the degree distribution [SIGCOMM99] internet domains log(degree) att.com ibm.com -0.82 log(rank) C. Faloutsos (CMU) 17

But: How about graphs from other domains? C. Faloutsos (CMU) 18

More power laws: web hit counts [w/ A. Montgomery] Web Site Traffic Count (log scale) Zipf ``ebay users sites in-degree (log scale) C. Faloutsos (CMU) 19

epinions.com count who-trusts-whom [Richardson + Domingos, KDD 2001] trusts-2000-people user (out) degree C. Faloutsos (CMU) 20

And numerous more # of sexual contacts Income [Pareto] 80-20 distribution Duration of downloads [Bestavros+] Duration of UNIX jobs ( mice and elephants ) Size of files of a user Black swans C. Faloutsos (CMU) 21

Roadmap Introduction Motivation Problem#1: Patterns in graphs Static graphs degree, diameter, triangles Cliques Time evolving graphs C. Faloutsos (CMU) 22

Solution# S.2: Triangle Laws Real social networks have a lot of triangles C. Faloutsos (CMU) 23

Solution# S.2: Triangle Laws Real social networks have a lot of triangles Friends of friends are friends Any patterns? C. Faloutsos (CMU) 24

Triangle Law: #S.2 [Tsourakakis ICDM 2008] Reuters SN Epinions X-axis: degree Y-axis: mean # triangles n friends -> ~n 1.6 triangles C. Faloutsos (CMU) 25

Triangle Law: #S.2 [Tsourakakis ICDM 2008] Reuters SN Epinions X-axis: degree Y-axis: mean # triangles n friends -> ~n 1.6 triangles C. Faloutsos (CMU) 26

Triangle counting for large graphs???? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD 11] C. Faloutsos (CMU) 27

Triangle counting for large graphs? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD 11] C. Faloutsos (CMU) 28

Triangle counting for large graphs? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD 11] C. Faloutsos (CMU) 29

Triangle counting for large graphs? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD 11] C. Faloutsos (CMU) 30

Roadmap Introduction Motivation Patterns in graphs Static graphs Time evolving graphs Tools C. Faloutsos (CMU) 31

Problem: Time evolution with Jure Leskovec (CMU -> Stanford) and Jon Kleinberg (Cornell sabb. @ CMU) C. Faloutsos (CMU) 32

T.1 Evolution of the Diameter Prior work on Power Law graphs hints at slowly growing diameter: diameter ~ O(log N) diameter ~ O(log log N) What is happening in real data? C. Faloutsos (CMU) 33

T.1 Evolution of the Diameter Prior work on Power Law graphs hints at slowly growing diameter: diameter ~ O(log N) diameter ~ O(log log N) What is happening in real data? Diameter shrinks over time C. Faloutsos (CMU) 34

T.1 Diameter Patents Patent citation network 25 years of data @1999 2.9 M nodes 16.5 M edges diameter time [years] C. Faloutsos (CMU) 35

T.2 Temporal Evolution of the Graphs N(t) nodes at time t E(t) edges at time t Suppose that N(t+1) = 2 * N(t) Q: what is your guess for E(t+1) =? 2 * E(t) C. Faloutsos (CMU) 36

T.2 Temporal Evolution of the Graphs N(t) nodes at time t E(t) edges at time t Suppose that N(t+1) = 2 * N(t) Q: what is your guess for E(t+1) =? 2 * E(t) A: over-doubled! But obeying the ``Densification Power Law C. Faloutsos (CMU) 37

T.2 Densification Patent Citations Citations among patents granted @1999 2.9 M nodes 16.5 M edges Each year is a datapoint E(t) 1.66 N(t) C. Faloutsos (CMU) 38

T.3 : popularity over time # in links 1 2 3 lag: days after post Post popularity drops-off exponentially? @t @t + lag C. Faloutsos (CMU) 39

T.3 : popularity over time # in links (log) Post popularity drops-off exponentially? POWER LAW! Exponent? days after post (log) C. Faloutsos (CMU) 40

T.3 : popularity over time # in links (log) -1.6 Post popularity drops-off exponentially? POWER LAW! Exponent? -1.6 close to -1.5: Barabasi s stack model and like the zero-crossings of a random walk days after post (log) C. Faloutsos (CMU) 41

-1.5 slope J. G. Oliveira & A.-L. Barabási Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005). [PDF] Prob(RT > x) (log) Response time (log) C. Faloutsos (CMU) 42

Roadmap Introduction Motivation Patterns in graphs Diameter Connected components Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 43

HADI for diameter estimation Radius Plots for Mining Tera-byte Scale Graphs U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM 10 Naively: diameter needs O(N**2) space and up to O(N**3) time prohibitive (N~1B) Our HADI: linear on E (~10B) Near-linear scalability wrt # machines Several optimizations -> 5x faster C. Faloutsos (CMU) 44

Count???? 19+ [Barabasi+] ~1999, ~1M nodes Radius C. Faloutsos (CMU) 45

Count?????? 19+ [Barabasi+] ~1999, ~1M nodes Radius YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) Largest publicly available graph ever studied. C. Faloutsos (CMU) 46

Count 14 (dir.)???? ~7 (undir.) 19+? [Barabasi+] Radius YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) Largest publicly available graph ever studied. C. Faloutsos (CMU) 47

Count 14 (dir.)???? ~7 (undir.) 19+? [Barabasi+] YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) 7 degrees of separation (!) Diameter: shrunk C. Faloutsos (CMU) Radius 48

Count???? ~7 (undir.) Radius YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) Q: Shape? C. Faloutsos (CMU) 49

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) effective diameter: surprisingly small. Multi-modality (?!) C. Faloutsos (CMU) 50

Roadmap Introduction Motivation Patterns in graphs Diameter Connected components Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 51

Generalized Iterated Matrix Vector Multiplication (GIMV) PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. (ICDM) 2009, Miami, Florida, USA. Best Application Paper (runner-up). C. Faloutsos (CMU) 52

Example: GIM-V At Work Connected Components 4 observations: Count Size C. Faloutsos (CMU) 53

Example: GIM-V At Work Connected Components Count 1) 10K x larger than next Size C. Faloutsos (CMU) 54

Example: GIM-V At Work Connected Components Count 2) ~0.7B singleton nodes Size C. Faloutsos (CMU) 55

Example: GIM-V At Work Connected Components Count 3) SLOPE! Size C. Faloutsos (CMU) 56

Example: GIM-V At Work Connected Components Count 300-size cmpt X 500. 1100-size cmpt Why? X 65. Why? 4) Spikes! Size C. Faloutsos (CMU) 57

Example: GIM-V At Work Connected Components Count suspicious financial-advice sites (not existing now) Size C. Faloutsos (CMU) 58

Roadmap Introduction Motivation Patterns in graphs Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 59

E-bay Fraud detection w/ Polo Chau & Shashank Pandit, CMU [www 07] C. Faloutsos (CMU) 60

E-bay Fraud detection C. Faloutsos (CMU) 61

E-bay Fraud detection C. Faloutsos (CMU) 62

E-bay Fraud detection - NetProbe C. Faloutsos (CMU) 63

Popular press And less desirable attention: E-mail from Belgium police ( copy of your code? ) UMN CRAY, 2012 C. Faloutsos (CMU) 64

OVERALL CONCLUSIONS low level: Several new patterns (fortification, triangle-laws, conn. components, etc) New tools: belief propagation, gigatensor, etc Scalability: PEGASUS / hadoop C. Faloutsos (CMU) 65

OVERALL CONCLUSIONS high level BIG DATA: Large datasets reveal patterns/ outliers that are invisible otherwise C. Faloutsos (CMU) 66

Project info & thanks www.cs.cmu.edu/~pegasus Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, C. Faloutsos (CMU) 67 Google, INTEL, HP, ilab

Cast Akoglu, Leman Beutel, Alex Chau, Polo Kang, U Koutra, Danai McGlohon, Mary Prakash, Aditya Papalexakis, Vagelis Tong, Hanghang C. Faloutsos (CMU) 68

Take-home message Tera/Peta-byte data Analytics Insights, outliers Big data reveal insights that would be invisible otherwise (even to experts) C. Faloutsos (CMU) 69