Large Graph Mining Patterns, Tools and Cascade analysis Christos Faloutsos CMU
Roadmap Introduction Motivation Why big data Why (big) graphs? Patterns in graphs Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 2
Why big data Why big data? What is the problem definition? What are the major research challenges? C. Faloutsos (CMU) 3
Main message: Big data: often > experts Super Crunchers Why Thinking-By-Numbers is the New Way To Be Smart by Ian Ayres, 2008 Google won the machine translation competition 2005 http://www.itl.nist.gov/iad/mig//tests/mt/2005/doc/ mt05eval_official_results_release_20050801_v3.html C. Faloutsos (CMU) 4
Problem definition big picture Tera/Peta-byte data Analytics Insights, outliers C. Faloutsos (CMU) 5
Roadmap Introduction Motivation Why big data Why (big) graphs? Patterns in graphs Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 6
Graphs - why should we care? >$10B revenue Food Web [Martinez 91] >0.5B users Internet Map [lumeta.com] C. Faloutsos (CMU) 7
Graphs - why should we care? IR: bi-partite graphs (doc-terms) D 1...... T 1 web: hyper-text graph D N T M... and more: C. Faloutsos (CMU) 8
Graphs - why should we care? viral marketing web-log ( blog ) news propagation computer network security: email/ip traffic and anomaly detection... Subject-verb-object -> graph Many-to-many db relationship -> graph C. Faloutsos (CMU) 9
Outline Introduction Motivation Patterns in graphs Static graphs Time evolving graphs Radius, conn components Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 10
Problem - network and graph mining What does the Internet look like? What does FaceBook look like? What is normal / abnormal? which patterns/laws hold? C. Faloutsos (CMU) 11
Problem - network and graph mining What does the Internet look like? What does FaceBook look like? What is normal / abnormal? which patterns/laws hold? To spot anomalies (rarities), we have to discover patterns C. Faloutsos (CMU) 12
Problem - network and graph mining What does the Internet look like? What does FaceBook look like? What is normal / abnormal? which patterns/laws hold? To spot anomalies (rarities), we have to discover patterns Large datasets reveal patterns/anomalies that may be invisible otherwise C. Faloutsos (CMU) 13
Graph mining Are real graphs random? C. Faloutsos (CMU) 14
Laws and patterns Are real graphs random? A: NO!! Diameter in- and out- degree distributions other (surprising) patterns So, let s look at the data C. Faloutsos (CMU) 15
Solution# S.1 Power law in the degree distribution [SIGCOMM99] internet domains log(degree) att.com ibm.com log(rank) C. Faloutsos (CMU) 16
Solution# S.1 Power law in the degree distribution [SIGCOMM99] internet domains log(degree) att.com ibm.com -0.82 log(rank) C. Faloutsos (CMU) 17
But: How about graphs from other domains? C. Faloutsos (CMU) 18
More power laws: web hit counts [w/ A. Montgomery] Web Site Traffic Count (log scale) Zipf ``ebay users sites in-degree (log scale) C. Faloutsos (CMU) 19
epinions.com count who-trusts-whom [Richardson + Domingos, KDD 2001] trusts-2000-people user (out) degree C. Faloutsos (CMU) 20
And numerous more # of sexual contacts Income [Pareto] 80-20 distribution Duration of downloads [Bestavros+] Duration of UNIX jobs ( mice and elephants ) Size of files of a user Black swans C. Faloutsos (CMU) 21
Roadmap Introduction Motivation Problem#1: Patterns in graphs Static graphs degree, diameter, triangles Cliques Time evolving graphs C. Faloutsos (CMU) 22
Solution# S.2: Triangle Laws Real social networks have a lot of triangles C. Faloutsos (CMU) 23
Solution# S.2: Triangle Laws Real social networks have a lot of triangles Friends of friends are friends Any patterns? C. Faloutsos (CMU) 24
Triangle Law: #S.2 [Tsourakakis ICDM 2008] Reuters SN Epinions X-axis: degree Y-axis: mean # triangles n friends -> ~n 1.6 triangles C. Faloutsos (CMU) 25
Triangle Law: #S.2 [Tsourakakis ICDM 2008] Reuters SN Epinions X-axis: degree Y-axis: mean # triangles n friends -> ~n 1.6 triangles C. Faloutsos (CMU) 26
Triangle counting for large graphs???? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD 11] C. Faloutsos (CMU) 27
Triangle counting for large graphs? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD 11] C. Faloutsos (CMU) 28
Triangle counting for large graphs? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD 11] C. Faloutsos (CMU) 29
Triangle counting for large graphs? Anomalous nodes in Twitter(~ 3 billion edges) [U Kang, Brendan Meeder, +, PAKDD 11] C. Faloutsos (CMU) 30
Roadmap Introduction Motivation Patterns in graphs Static graphs Time evolving graphs Tools C. Faloutsos (CMU) 31
Problem: Time evolution with Jure Leskovec (CMU -> Stanford) and Jon Kleinberg (Cornell sabb. @ CMU) C. Faloutsos (CMU) 32
T.1 Evolution of the Diameter Prior work on Power Law graphs hints at slowly growing diameter: diameter ~ O(log N) diameter ~ O(log log N) What is happening in real data? C. Faloutsos (CMU) 33
T.1 Evolution of the Diameter Prior work on Power Law graphs hints at slowly growing diameter: diameter ~ O(log N) diameter ~ O(log log N) What is happening in real data? Diameter shrinks over time C. Faloutsos (CMU) 34
T.1 Diameter Patents Patent citation network 25 years of data @1999 2.9 M nodes 16.5 M edges diameter time [years] C. Faloutsos (CMU) 35
T.2 Temporal Evolution of the Graphs N(t) nodes at time t E(t) edges at time t Suppose that N(t+1) = 2 * N(t) Q: what is your guess for E(t+1) =? 2 * E(t) C. Faloutsos (CMU) 36
T.2 Temporal Evolution of the Graphs N(t) nodes at time t E(t) edges at time t Suppose that N(t+1) = 2 * N(t) Q: what is your guess for E(t+1) =? 2 * E(t) A: over-doubled! But obeying the ``Densification Power Law C. Faloutsos (CMU) 37
T.2 Densification Patent Citations Citations among patents granted @1999 2.9 M nodes 16.5 M edges Each year is a datapoint E(t) 1.66 N(t) C. Faloutsos (CMU) 38
T.3 : popularity over time # in links 1 2 3 lag: days after post Post popularity drops-off exponentially? @t @t + lag C. Faloutsos (CMU) 39
T.3 : popularity over time # in links (log) Post popularity drops-off exponentially? POWER LAW! Exponent? days after post (log) C. Faloutsos (CMU) 40
T.3 : popularity over time # in links (log) -1.6 Post popularity drops-off exponentially? POWER LAW! Exponent? -1.6 close to -1.5: Barabasi s stack model and like the zero-crossings of a random walk days after post (log) C. Faloutsos (CMU) 41
-1.5 slope J. G. Oliveira & A.-L. Barabási Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005). [PDF] Prob(RT > x) (log) Response time (log) C. Faloutsos (CMU) 42
Roadmap Introduction Motivation Patterns in graphs Diameter Connected components Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 43
HADI for diameter estimation Radius Plots for Mining Tera-byte Scale Graphs U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM 10 Naively: diameter needs O(N**2) space and up to O(N**3) time prohibitive (N~1B) Our HADI: linear on E (~10B) Near-linear scalability wrt # machines Several optimizations -> 5x faster C. Faloutsos (CMU) 44
Count???? 19+ [Barabasi+] ~1999, ~1M nodes Radius C. Faloutsos (CMU) 45
Count?????? 19+ [Barabasi+] ~1999, ~1M nodes Radius YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) Largest publicly available graph ever studied. C. Faloutsos (CMU) 46
Count 14 (dir.)???? ~7 (undir.) 19+? [Barabasi+] Radius YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) Largest publicly available graph ever studied. C. Faloutsos (CMU) 47
Count 14 (dir.)???? ~7 (undir.) 19+? [Barabasi+] YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) 7 degrees of separation (!) Diameter: shrunk C. Faloutsos (CMU) Radius 48
Count???? ~7 (undir.) Radius YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) Q: Shape? C. Faloutsos (CMU) 49
YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges) effective diameter: surprisingly small. Multi-modality (?!) C. Faloutsos (CMU) 50
Roadmap Introduction Motivation Patterns in graphs Diameter Connected components Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 51
Generalized Iterated Matrix Vector Multiplication (GIMV) PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. (ICDM) 2009, Miami, Florida, USA. Best Application Paper (runner-up). C. Faloutsos (CMU) 52
Example: GIM-V At Work Connected Components 4 observations: Count Size C. Faloutsos (CMU) 53
Example: GIM-V At Work Connected Components Count 1) 10K x larger than next Size C. Faloutsos (CMU) 54
Example: GIM-V At Work Connected Components Count 2) ~0.7B singleton nodes Size C. Faloutsos (CMU) 55
Example: GIM-V At Work Connected Components Count 3) SLOPE! Size C. Faloutsos (CMU) 56
Example: GIM-V At Work Connected Components Count 300-size cmpt X 500. 1100-size cmpt Why? X 65. Why? 4) Spikes! Size C. Faloutsos (CMU) 57
Example: GIM-V At Work Connected Components Count suspicious financial-advice sites (not existing now) Size C. Faloutsos (CMU) 58
Roadmap Introduction Motivation Patterns in graphs Tools: fraud detection on e-bay Conclusions C. Faloutsos (CMU) 59
E-bay Fraud detection w/ Polo Chau & Shashank Pandit, CMU [www 07] C. Faloutsos (CMU) 60
E-bay Fraud detection C. Faloutsos (CMU) 61
E-bay Fraud detection C. Faloutsos (CMU) 62
E-bay Fraud detection - NetProbe C. Faloutsos (CMU) 63
Popular press And less desirable attention: E-mail from Belgium police ( copy of your code? ) UMN CRAY, 2012 C. Faloutsos (CMU) 64
OVERALL CONCLUSIONS low level: Several new patterns (fortification, triangle-laws, conn. components, etc) New tools: belief propagation, gigatensor, etc Scalability: PEGASUS / hadoop C. Faloutsos (CMU) 65
OVERALL CONCLUSIONS high level BIG DATA: Large datasets reveal patterns/ outliers that are invisible otherwise C. Faloutsos (CMU) 66
Project info & thanks www.cs.cmu.edu/~pegasus Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, C. Faloutsos (CMU) 67 Google, INTEL, HP, ilab
Cast Akoglu, Leman Beutel, Alex Chau, Polo Kang, U Koutra, Danai McGlohon, Mary Prakash, Aditya Papalexakis, Vagelis Tong, Hanghang C. Faloutsos (CMU) 68
Take-home message Tera/Peta-byte data Analytics Insights, outliers Big data reveal insights that would be invisible otherwise (even to experts) C. Faloutsos (CMU) 69