Big Data Science. Prof. Lise Getoor University of Maryland, College Park. October 17, 2013


 Stephen Poole
 1 years ago
 Views:
Transcription
1 Big Data Science Prof Lise Getoor University of Maryland, College Park October 17, 2013
2 BIG Data is not flat lonnitaylor
3 Data is multimodal, multirelational, spatiotemporal, multimedia Entities and relationships are important!
4 NEED: Data Science for Graphs New V: Vinculate
5 Diverse applications areas: social networks & social media, computational biology, computer vision, computational linguistics, computational sustainability, computational social science, personalized medicine, cybersecurity, etc Research Overview Combining statistical & logical methods for machine learning and reasoning under uncertainty ML Visual Analytic Tools MIT Press 2007 Probabilistic databases, data integration and alignment DB DDupe VIZ CGroup
6 Statistical Relational Learning (SRL) AI/DB representations + statistics for multirelational data l Entities can be of different types l Entities can participate in a variety of relationships l examples: Markov logic networks, relational dependency networks, Bayesian logic programs, probabilistic relational models, many others Key ideas l Relational feature construction l Collective reasoning l Lifted representation, inference and learning For more details, see NIPS 2012 Tutorial, nips2012pdf
7 Common Graph Data Analysis Patterns Joint inference over large networks for: l Collective Classification inferring labels of nodes in graph l Link Prediction inferring the existence of edges in graph l Entity Resolution clustering nodes that refer to the same underlying entity
8 Common Graph Data Analysis Patterns Joint inference over large networks for: l Collective Classification inferring labels of nodes in graph l Link Prediction inferring the existence of edges in graph l Entity Resolution clustering nodes that refer to the same underlying entity
9 Common Graph Data Analysis Patterns Joint inference over large networks for: l Collective Classification inferring labels of nodes in graph l Link Prediction inferring the existence of edges in graph l Entity Resolution clustering nodes that refer to the same underlying entity
10 Common Graph Data Analysis Patterns Joint inference over large networks for: l Collective Classification inferring labels of nodes in graph l Link Prediction inferring the existence of edges in graph l Entity Resolution clustering nodes that refer to the same underlying entity
11 What s Needed Next? Methods which can perform and interleave these tasks Desiderata: Flexible, scalable, declarative support for collective classification, link prediction, entity resolution and other information fusion and information alignment problems
12 Probabilistic Soft Logic (PSL) Stephen Bach Matthias Broecheler Alex Memory Lily Mihalkova Stanley Kok Bert Huang Angelika Kimmig
13 Probabilistic Soft Logic (PSL) Declarative language based on logics to express collective probabilistic inference problems  Predicate = relationship or property  Atom = (continuous) random variable  Rule = capture dependency or constraint  Set = define aggregates PSL Program = Rules + Input DB
14 Probabilistic Soft Logic (PSL) Declarative language based on logics to express collective probabilistic inference problems  Predicate = relationship or property  Atom = (continuous) random variable  Rule = capture dependency or constraint  Set = define aggregates PSL Program = Rules + Input DB
15 Node Labeling?
16 Voter Opinion Modeling? Status update $ $ Tweet
17 Voter Opinion Modeling spouse colleague friend friend spouse friend spouse friend colleague
18 Voter Opinion Modeling vote(a,p) friend(b,a) à vote(b,p) : 03 spouse colleague friend friend spouse friend spouse friend colleague vote(a,p) spouse(b,a) à vote(b,p) : 08
19 Link Prediction Entities  People, s Attributes  Words in s Relationships  communication, work relationship Goal: Identify work relationships  Supervisor, subordinate, colleague #
20 Link Prediction People, s, words, communication, relations Use rules to express evidence  If content suggests type X, it is of type X  If A sends deadline s to B, then A is the supervisor of B  If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues #
21 Link Prediction People, s, words, communication, relations Use rules to express evidence  If content suggests type X, it is of type X  If A sends deadline s to B, then A is the supervisor of B  If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues # complete by due
22 Link Prediction People, s, words, communication, relations Use rules to express evidence  If content suggests type X, it is of type X  If A sends deadline s to B, then A is the supervisor of B  If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues #
23 Link Prediction People, s, words, communication, relations Use rules to express evidence  If content suggests type X, it is of type X  If A sends deadline s to B, then A is the supervisor of B  If A is the supervisor of B, and A is the supervisor of C, then B and C are colleagues #
24 Entity Resolution Entities  People References Attributes  Name Relationships  Friendship Goal: Identify references that denote the same person John Smith J Smith name name C E A B friend friend D F G = = H
25 Entity Resolution References, names, friendships Use rules to express evidence  If two people have similar names, they are probably the same  If two people have similar friends, they are probably the same  If A=B and B=C, then A and C must also denote the same person John Smith J Smith name name C E A B friend friend D F G = H =
26 Entity Resolution References, names, friendships Use rules to express evidence  If two people have similar names, they are probably the same  If two people have similar friends, they are probably the same  If A=B and B=C, then A and C must also denote the same person Aname {str_sim} Bname => A B : 08 C John Smith A E name friend J Smith B name D F G = = H friend
27 Entity Resolution References, names, friendships Use rules to express evidence  If two people have similar names, they are probably the same  If two people have similar friends, they are probably the same  If A=B and B=C, then A and C must also denote the same person John Smith J Smith name name C E A B friend friend D F G = H {Afriends} {} {Bfriends} => A B : 06 =
28 Entity Resolution References, names, friendships Use rules to express evidence  If two people have similar names, they are probably the same  If two people have similar friends, they are probably the same  If A=B and B=C, then A and C must also denote the same person John Smith J Smith name name C E A B friend friend D F G = H = A B ^ B C => A C :
29 Mathematical Foundation
30 Probabilistic Model Probability density over interpretation I Ground rules distance to satisfaction d r (I) = max{i r,body I r,head, 0} # P (I) = 1 Z exp " X r2r w r (d r (I)) p r Normalization constant Ground rules Rule weight Distance exponent (in {1, 2})
31 Hingeloss MRFs
32 Hingeloss Markov Random Fields 2 P (Y X) = 1 Z exp 4 3 mx w j max{`j(y, X), 0} p j 5 j=1 Continuous variables in [0,1] Potentials are hingeloss functions Subject to arbitrary linear constraints Logconcave!
33 Inference as Convex Optimization Maximum Aposteriori Probability (MAP) Objective: arg max P (Y X) Y mx = arg min Y This is convex! Can solve using offtheshelf convex optimization packages or custom solver j=1 w j max{`j(y, X), 0} p j
34 Consensus Optimization Idea: Decompose problem and solve subproblems independently (in parallel), then merge results  Subproblems are ground rules  Auxiliary variables enforce consensus across subproblems Framework: Alternating direction method of multipliers (ADMM) [Boyd, 2011] Inference with ADMM is fast, scalable, and straightforward to implement [Bach et al, NIPS 2012, UAI 2013]
35 [Bach et al, UAI 2013; London et al, 2013] Speed Average running time Cora Citeseer Epinions Ac/vity Discrete MRF 1109 s 1843 s 2124 s 3442 s HL MRF 04 s 07 s 12 s 06 s Variables 10k 10k 1k 8k Poten/als 14k 19k 18k 75k Inference in HLMRFs is orders of magnitude faster than in discrete MRFs which use MCMC approximate inference In practice, scales linearly with the number of potentials
36 Distributed Processing
37 Distributed MAP Inference ADMM consensus optimization problem can be implemented naturally in distributed setting For k+1 iteration, it consists three steps in which sub problems can run independently (1 st and 2 nd step): 1 Update Lagrangian multiplier Miao, Liu, Huang, Getoor, IEEE Big Data Update each sub problem OR 3 Update the global variables
38 Distributed MAP: MapReduce Job Bootstrap Mapper sub problem load global variable X as side data read global variable X read/write subproblem Pros: Straightforward Design Cons: Job bootstrapping cost between iterations Difficult to schedule subset of nodes to run local variable copy z1 z2 zq zp z1 zq z2 z1 Reducer update global component z1 z1 z1 z2 z2 zq zq zp HDFS or HBase write new global variable
39 Distributed MAP: GraphLab sub problem node global variable component Advantages: No need to touch disk, no job bootstrapping cost Easy to express local convergence conditions to schedule only subset of nodes gather get z gather get local z,y apply update y update x scatter notify z apply update z scatter unless converge notify X update i update i+1
40 Experimental Results Using PSL for knowledge graph cleaning task  16M+ vertices, 22M+ edges, for small running instances  Takes 100 minutes to finish in Java single machine implementation using 40G+ memory  Distributed GraphLab implementation takes less than 15 minutes using 4 smaller machines  Possible to use commodity machines on large models!
41 Miao, Liu, Huang, Getoor, BigData 13 Experimental Results Voter model using commodity machines Name Subproblem Consensus Edge Fit in One Machine? Run time (sec) m = 8 SN 1M 33M 11M 6M Yes 2230 SN 2M 66M 21M 12M No 3997 SN 3M 10M 31M 18M No 4395 SN 4M 13M 42M 24M No 5376 Machine: Intel Core2 Quad CPU 266GHz machines with 4GB RAM running Ubuntu 1204 Linux Running time (sec) Hyper Greedy Number of Machines (SN 2M ) Running time (sec) Hyper Greedy Number of machines Strong scaling with fixed dataset Weak scaling with increasing size
42 Applications
43 Application Domains Computer Vision  Lowlevel image reconstruction  Activity recognition in Video Computational Biology & Health Informatics  Drugtarget prediction  Event discovery in EMR data Computational Social Science  Voter Opinion Analysis & discovery of Latent Groups in Social Media  Inferring bias in political discourse  Psychological modeling on online social networks Information Integration & Extraction  Entity resolution  Ontology alignment & schema mapping  Knowledge graph identification Upcoming: climate graphs, discourse analysis, MOOCs, more!
44 Closing Comments Great opportunities to do good work and do useful things in the current era of big data, information overload and network science big data science Statistical relational learning provides some of the tools, much work still needed, developing theoretical bounds for relational learning, scalability, etc Compelling applications abound! Looking for students, programmers, & postdocs
45 pslumiacsumdedu
Lifted FirstOrder Belief Propagation
Lifted FirstOrder Belief Propagation Parag Singla Pedro Domingos Department of Computer Science and Engineering University of Washington Seattle, WA 981952350, U.S.A. {parag, pedrod}@cs.washington.edu
More informationTowards a Benchmark for Graph Data Management and Processing
Towards a Benchmark for Graph Data Management and Processing Michael Grossniklaus University of Konstanz Germany michael.grossniklaus @unikonstanz.de Stefania Leone University of Southern California United
More informationNo One (Cluster) Size Fits All: Automatic Cluster Sizing for Dataintensive Analytics
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Dataintensive Analytics Herodotos Herodotou Duke University hero@cs.duke.edu Fei Dong Duke University dongfei@cs.duke.edu Shivnath Babu Duke
More informationMining Big Data: Current Status, and Forecast to the Future
Mining Big Data: Current Status, and Forecast to the Future ABSTRACT Wei Fan Huawei Noah s Ark Lab Hong Kong Science Park Shatin, Hong Kong david.fanwei@huawei.com Big Data is a new term used to identify
More informationTop 10 algorithms in data mining
Knowl Inf Syst (2008) 14:1 37 DOI 10.1007/s1011500701142 SURVEY PAPER Top 10 algorithms in data mining Xindong Wu Vipin Kumar J. Ross Quinlan Joydeep Ghosh Qiang Yang Hiroshi Motoda Geoffrey J. McLachlan
More information1 An Introduction to Conditional Random Fields for Relational Learning
1 An Introduction to Conditional Random Fields for Relational Learning Charles Sutton Department of Computer Science University of Massachusetts, USA casutton@cs.umass.edu http://www.cs.umass.edu/ casutton
More informationCost aware real time big data processing in Cloud Environments
Cost aware real time big data processing in Cloud Environments By Cristian Montero Under the supervision of Professor Rajkumar Buyya and Dr. Amir Vahid A minor project thesis submitted in partial fulfilment
More informationLearning Spatial Context: Using Stuff to Find Things
Learning Spatial Context: Using Stuff to Find Things Geremy Heitz Daphne Koller Department of Computer Science, Stanford University {gaheitz,koller}@cs.stanford.edu Abstract. The sliding window approach
More informationBig Data Analytic and Mining with Machine Learning Algorithm
International Journal of Information and Computation Technology. ISSN 09742239 Volume 4, Number 1 (2014), pp. 3340 International Research Publications House http://www. irphouse.com /ijict.htm Big Data
More informationA survey on platforms for big data analytics
Singh and Reddy Journal of Big Data 2014, 1:8 SURVEY PAPER Open Access A survey on platforms for big data analytics Dilpreet Singh and Chandan K Reddy * * Correspondence: reddy@cs.wayne.edu Department
More informationClean Answers over Dirty Databases: A Probabilistic Approach
Clean Answers over Dirty Databases: A Probabilistic Approach Periklis Andritsos University of Trento periklis@dit.unitn.it Ariel Fuxman University of Toronto afuxman@cs.toronto.edu Renée J. Miller University
More informationScaling Datalog for Machine Learning on Big Data
Scaling Datalog for Machine Learning on Big Data Yingyi Bu, Vinayak Borkar, Michael J. Carey University of California, Irvine Joshua Rosen, Neoklis Polyzotis University of California, Santa Cruz Tyson
More informationEmerging Methods in Predictive Analytics:
Emerging Methods in Predictive Analytics: Risk Management and Decision Making William H. Hsu Kansas State University, USA A volume in the Advances in Data Mining and Database Management (ADMDM) Book Series
More informationMultiway Clustering on Relation Graphs
Multiway Clustering on Relation Graphs Arindam Banerjee Sugato Basu Srujana Merugu Abstract A number of realworld domains such as social networks and ecommerce involve heterogeneous data that describes
More informationNESSI White Paper, December 2012. Big Data. A New World of Opportunities
NESSI White Paper, December 2012 Big Data A New World of Opportunities Contents 1. Executive Summary... 3 2. Introduction... 4 2.1. Political context... 4 2.2. Research and Big Data... 5 2.3. Purpose of
More informationISSUES, CHALLENGES, AND SOLUTIONS: BIG DATA MINING
ISSUES, CHALLENGES, AND SOLUTIONS: BIG DATA MINING Jaseena K.U. 1 and Julie M. David 2 1,2 Department of Computer Applications, M.E.S College, Marampally, Aluva, Cochin, India 1 jaseena.mes@gmail.com,
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationReasoning With Neural Tensor Networks for Knowledge Base Completion
Reasoning With Neural Tensor Networks for Knowledge Base Completion Richard Socher, Danqi Chen*, Christopher D. Manning, Andrew Y. Ng Computer Science Department, Stanford University, Stanford, CA 94305,
More informationStreaming Predictive Analytics on Apache Flink
DEGREE PROJECT, IN DISTRIBUTED SYSTEMS AND SERVICES, SECOND LEVEL STOCKHOLM, SWEDEN 2015 Streaming Predictive Analytics on Apache Flink DEGREE PROJECT IN DISTRIBUTED SYSTEMS AND SERVICES AT KTH INFORMATION
More informationWTF: The Who to Follow Service at Twitter
WTF: The Who to Follow Service at Twitter Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh Twitter, Inc. @pankaj @ashishgoel @lintool @aneeshs @dongwang218 @reza_zadeh ABSTRACT
More informationSumProduct Networks: A New Deep Architecture
SumProduct Networks: A New Deep Architecture Hoifung Poon and Pedro Domingos Computer Science & Engineering University of Washington Seattle, WA 98195, USA {hoifung,pedrod}@cs.washington.edu Abstract
More informationThe Performance Cost of Virtual Machines on Big Data Problems in Compute Clusters
The Performance Cost of Virtual Machines on Big Data Problems in Compute Clusters Neal Barcelo Denison University Granville, OH 4023 barcel_n@denison.edu Nick Legg Denison University Granville, OH 4023
More informationExpert Finding for Question Answering via Graph Regularized Matrix Completion
1 Expert Finding for Question Answering via Graph Regularized Matrix Completion Zhou Zhao, Lijun Zhang, Xiaofei He and Wilfred Ng Abstract Expert finding for question answering is a challenging problem
More informationA Secure Face Recognition System for Mobiledevices without The Need of Decryption
A Secure Face Recognition System for Mobiledevices without The Need of Decryption Shibnath Mukherjee Deloitte & Touche Consulting India Pvt. Ltd shibnath@gmail.com Aryya Gangopadhyay University of Maryland
More informationAn Efficient Approach for Cost Optimization of the Movement of Big Data
2015 by the authors; licensee RonPub, Lübeck, Germany. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
More informationGetting Started with Big Data Analytics in Retail
Getting Started with Big Data Analytics in Retail Learn how Intel and Living Naturally* used big data to help a health store increase sales and reduce inventory carrying costs. SOLUTION BLUEPRINT Big Data
More informationPASIF A Framework for Supporting Smart Interactions with Predictive Analytics
PASIF A Framework for Supporting Smart Interactions with Predictive Analytics by Sarah Marie Matheson A thesis submitted to the School of Computing in conformity with the requirements for the degree of
More informationANALYTICS ON BIG FAST DATA USING REAL TIME STREAM DATA PROCESSING ARCHITECTURE
ANALYTICS ON BIG FAST DATA USING REAL TIME STREAM DATA PROCESSING ARCHITECTURE Dibyendu Bhattacharya ArchitectBig Data Analytics HappiestMinds Manidipa Mitra Principal Software Engineer EMC Table of Contents
More informationMaximizing the Spread of Influence through a Social Network
Maximizing the Spread of Influence through a Social Network David Kempe Dept. of Computer Science Cornell University, Ithaca NY kempe@cs.cornell.edu Jon Kleinberg Dept. of Computer Science Cornell University,
More informationClassification in Networked Data: A Toolkit and a Univariate Case Study
Journal of Machine Learning Research 8 (27) 935983 Submitted /5; Revised 6/6; Published 5/7 Classification in Networked Data: A Toolkit and a Univariate Case Study Sofus A. Macskassy Fetch Technologies,
More information