Filtering: A Method for Solving Graph Problems in MapReduce
|
|
- Winfred Nichols
- 7 years ago
- Views:
Transcription
1 Filtering: A Method for Solving Graph Problems in MapReduce Benjamin Moseley Silvio Lattanzi Siddharth Suri Sergei Vassilvitski UIUC Google Yahoo! Yahoo!
2 Overview Introduction to MapReduce model Our settings Our results Open questions
3 Introduction to the MapReduce model
4 XXL Data Huge amount of data Main problem is to analyze information quickly
5 XXL Data Huge amount of data Main problem is to analyze information quickly New tools Suitable efficient algorithms
6 MapReduce MapReduce is the platform of choice for processing massive data INPUT MAP PHASE REDUCE PHASE
7 MapReduce MapReduce is the platform of choice for processing massive data Data are represented as tuple INPUT < key, value > MAP PHASE REDUCE PHASE
8 MapReduce MapReduce is the platform of choice for processing massive data Data are represented as tuple INPUT < key, value > Mapper decides how data is distributed MAP PHASE REDUCE PHASE
9 MapReduce MapReduce is the platform of choice for processing massive data Data are represented as tuple INPUT < key, value > Mapper decides how data is distributed Reducer performs non-trivial computation locally MAP PHASE REDUCE PHASE
10 How can we model MapReduce? PRAM No limit on the number of processors Memory is uniformly accessible from any processor No limit on the memory available
11 How can we model MapReduce? STREAMING Just one processor There is a limited amount of memory No parallelization
12 MapReduce model [Karloff, Suri and Vassilvitskii] N is the input size and > 0 is some fixed constant
13 MapReduce model [Karloff, Suri and Vassilvitskii] N > 0 is the input size and is some fixed constant Less than N 1 machines
14 MapReduce model [Karloff, Suri and Vassilvitskii] N > 0 is the input size and is some fixed constant Less than Less than N 1 N 1 machines memory on each machine
15 MapReduce model [Karloff, Suri and Vassilvitskii] N > 0 is the input size and is some fixed constant Less than Less than N 1 N 1 machines memory on each machine MRC i : problem that can be solved in O(log i N) rounds
16 Combining map and reduce phase Mapper and Reducer work only on a subgraph
17 Combining map and reduce phase Mapper and Reducer work only on a subgraph Keep time in each round polynomial
18 Combining map and reduce phase Mapper and Reducer work only on a subgraph Keep time in each round polynomial Time is constrained by the number of rounds
19 Algorithmic challenges No machine can see the entire input
20 Algorithmic challenges No machine can see the entire input No communication between machines during each phase
21 Algorithmic challenges No machine can see the entire input No communication between machines during each phase Total memory is N
22 MapReduce vs MUD algorithms In MUD framework each reducer operates on a stream of data. In MUD, each reducer is restricted to only using polylogarithmic space.
23 Our settings
24 Our settings We study the Karloff, Suri and Vassilvitskii model
25 Our settings We study the Karloff, Suri and Vassilvitskii model We focus on class MRC 0
26 Our settings m = n 1+c We assume to work with dense graph, for some constant c>0
27 Our settings m = n 1+c We assume to work with dense graph, for some constant c>0 Empirical evidences that social networks are dense graphs [Leskovec, Kleinberg and Faloutsos]
28 Dense graph motivation [Leskovec, Kleinberg and Faloutsos] They study 9 different social networks
29 Dense graph motivation [Leskovec, Kleinberg and Faloutsos] They study 9 different social networks They show that several graphs from different domains have n 1+c edges
30 Dense graph motivation [Leskovec, Kleinberg and Faloutsos] They study 9 different social networks They show that several graphs from different domains have n 1+c edges Lowest value of c founded.08 and four graphs have c>.5
31 Our results
32 Results Constant rounds algorithms for Maximal matching Minimum cut 8-approx for maximum weighted matching -approx for vertex cover 3/-approx for edge cover
33 Notation G =(V,E) : Input graph n : number of nodes m : number of edges η : memory available on each machine N : input size
34 Filtering Part of the input is dropped or filtered on the first stage in parallel
35 Filtering Part of the input is dropped or filtered on the first stage in parallel Next some computation is performed on the filtered input
36 Filtering Part of the input is dropped or filtered on the first stage in parallel Next some computation is performed on the filtered input Finally some patchwork is done to ensure a proper solution
37 Filtering
38 Filtering FILTER
39 Filtering FILTER COMPUTE SOLUTION
40 Filtering FILTER COMPUTE SOLUTION PATCHWORK ON REST OF THE GRAPH
41 Filtering FILTER COMPUTE SOLUTION PATCHWORK ON REST OF THE GRAPH
42 Warm-up: Compute the minimum spanning tree m = n 1+c
43 Warm-up: Compute the minimum spanning tree m = n 1+c PARTITION EDGES n c/
44 Warm-up: Compute the minimum spanning tree m = n 1+c PARTITION EDGES n c/ n 1+c/ edges
45 Warm-up: Compute the minimum spanning tree n 1+c/ edges
46 Warm-up: Compute the minimum spanning tree n 1+c/ edges COMPUTE MST n c/
47 Warm-up: Compute the minimum spanning tree n 1+c/ edges COMPUTE MST n c/ SECOND ROUND n c/ n 1+c/ edges
48 Show that the algorithm is in MRC 0 The algorithm is correct
49 Show that the algorithm is in MRC 0 The algorithm is correct No edge in the final solution is discarded in partial solution
50 Show that the algorithm is in MRC 0 The algorithm is correct No edge in the final solution is discarded in partial solution The algorithm runs in two rounds
51 Show that the algorithm is in MRC 0 The algorithm is correct No edge in the final solution is discarded in partial solution The algorithm runs in two rounds No more than O n c machines are used
52 Show that the algorithm is in MRC 0 The algorithm is correct No edge in the final solution is discarded in partial solution The algorithm runs in two rounds No more than O n c machines are used No machine uses memory greater than O(n 1+c/ )
53 Maximal matching Algorithmic difficulty is that each machine can only see edges assigned to the machine
54 Maximal matching Algorithmic difficulty is that each machine can only see edges assigned to the machine Is a partitioning based algorithm feasible?
55 Partitioning algorithm
56 Partitioning algorithm PARTITIONING
57 Partitioning algorithm PARTITIONING COMBINE MATCHINGS
58 Partitioning algorithm PARTITIONING COMBINE MATCHINGS It is impossible to build a maximal matching using only red edges!!
59 Algorithmic insight Consider any subset of the edges E
60 Algorithmic insight Consider any subset of the edges E Let M G[E ] be a maximal matching on
61 Algorithmic insight Consider any subset of the edges E Let M G[E ] be a maximal matching on The unmatched vertices form a independent set
62 Algorithmic insight We pick each edge with probability p
63 Algorithmic insight We pick each edge with probability p Find a matching on a sample and then find a matching on unmatched vertices
64 Algorithmic insight We pick each edge with probability p Find a matching on a sample and then find a matching on unmatched vertices For dense portions of the graph, many edges should be sampled
65 Algorithmic insight We pick each edge with probability p Find a matching on a sample and then find a matching on unmatched vertices For dense portions of the graph, many edges should be sampled Sparse portions of the graph are small and can be placed on a single machine
66 Algorithm Sample each edge independently with probability 10 log n p = n c/
67 Algorithm Sample each edge independently with probability 10 log n p = n c/ Find a matching on the sample
68 Algorithm Sample each edge independently with probability 10 log n p = n c/ Find a matching on the sample Consider the induced subgraph on unmatched vertices
69 Algorithm Sample each edge independently with probability 10 log n p = n c/ Find a matching on the sample Consider the induced subgraph on unmatched vertices Find a matching on this graph and output the union
70 Algorithm
71 Algorithm SAMPLE
72 Algorithm SAMPLE FIRST MATCHING
73 Algorithm SAMPLE FIRST MATCHING MATCHING ON UNMATCHED NODES
74 Algorithm SAMPLE FIRST MATCHING MATCHING ON UNMATCHED NODES UNION
75 Correctness Consider the last step of the algorithm
76 Correctness Consider the last step of the algorithm All unmatched vertices are placed on a machine
77 Correctness Consider the last step of the algorithm All unmatched vertices are placed on a machine All unmatched vertices or are matched at the last step or have only matched neighbors
78 Bounding the rounds Three rounds: Sampling the edges
79 Bounding the rounds Three rounds: Sampling the edges Find a matching on a single machine for the sampled edges
80 Bounding the rounds Three rounds: Sampling the edges Find a matching on a single machine for the sampled edges Find a matching for the unmatched vertices
81 Bounding the memory Each edge was sampled with probability p = 10 log n n c/
82 Bounding the memory Each edge was sampled with probability p = 10 log n n c/ Using Chernoff the sampled graph has size with high probability Õ(n 1+c/ )
83 Bounding the memory Each edge was sampled with probability p = 10 log n n c/ Using Chernoff the sampled graph has size with high probability Õ(n 1+c/ ) Can we bound the size of the induced subgraph on the unmatched vertices?
84 Bounding the memory For a fixed induced subgraph with at least edges the probability an edge is not sampled is: 1 10 log n n c/ n 1+c/ exp( 10n log n) n 1+c/
85 Bounding the memory For a fixed induced subgraph with at least edges the probability an edge is not sampled is: 1 10 log n n c/ n 1+c/ exp( 10n log n) n 1+c/ Union bounding over all n induced subgraphs shows that at least one edge is sampled from every dense subgraph with probability 1 exp( 10 log n) 1 1 n 10
86 Using less memory Can we run the algorithm with less then memory? Õ(n 1+c/ )
87 Using less memory Can we run the algorithm with less then memory? Õ(n 1+c/ ) We can amplify our sampling technique!
88 Using less memory Sample as many edges as possible that fits in memory
89 Using less memory Sample as many edges as possible that fits in memory Find a matching
90 Using less memory Sample as many edges as possible that fits in memory Find a matching If the edges between unmatched vertices fit onto a single machine, find a matching on those vertices
91 Using less memory Sample as many edges as possible that fits in memory Find a matching If the edges between unmatched vertices fit onto a single machine, find a matching on those vertices Otherwise, recurse on the unmatched nodes
92 Using less memory Sample as many edges as possible that fits in memory Find a matching If the edges between unmatched vertices fit onto a single machine, find a matching on those vertices Otherwise, recurse on the unmatched nodes n 1+ With memory each iteration a factor of edges are removed resulting in rounds O(c/) n
93 Parallel computation power Maximal matching algorithm does not use parallelization
94 Parallel computation power Maximal matching algorithm does not use parallelization We use a single machine in every step
95 Parallel computation power Maximal matching algorithm does not use parallelization We use a single machine in every step When parallelization is used?
96 Maximum weighted matching
97 Maximum weighted matching SPLIT THE GRAPH
98 Maximum weighted matching
99 Maximum weighted matching MATCHINGS 7 4 3
100 Maximum weighted matching 7 4 3
101 Maximum weighted matching SCAN THE EDGE SEQUENTIALLY 7 4 3
102 Bounding the rounds and memory Four rounds: Splitting the graph and running the maximal matching: 3 rounds and memory Õ(n 1+c/ ) Compute the final solution: 1 round and memory O(n log n)
103 Approximation guarantee (sketch) An edge in the solution can block at most edges in each subgraph of smaller weight
104 Approximation guarantee (sketch) An edge in the solution can block at most edges in each subgraph of smaller weight We loose a factor or because we do not consider the weight
105 Other algorithms Based on the same intuition: -approx for vertex cover 3/-approx for edge cover
106 Minimum cut Partition does not work, because we loose structural informations
107 Minimum cut Partition does not work, because we loose structural informations Sampling does not seem to work either
108 Minimum cut Partition does not work, because we loose structural informations Sampling does not seem to work either We can use the first steps of Karger algorithm as a filtering technique
109 Minimum cut Partition does not work, because we loose structural informations Sampling does not seem to work either We can use the first steps of Karger algorithm as a filtering technique The random choices made in the early rounds succeed with high probability, whereas the later rounds have a much lower probability of success
110 Minimum cut 1
111 Minimum cut RANDOM WEIGHT n δ
112 Minimum cut RANDOM WEIGHT n δ COMPRESS EDGES WITH SMALL WEIGTH
113 Minimum cut RANDOM WEIGHT n δ COMPRESS EDGES WITH SMALL WEIGTH From Karger at least one graph will contain a min cut w.h.p.
114 Minimum cut
115 Minimum cut COMPUTE MINIMUM CUT We will find the minimum cut w.h.p.
116 Empirical result (matching) %#!" '()(**+*",*-.)/01" %!!" 30)+(/4-""!"#$%&'(%)#"*&+,' $#!" $!!" #!"!"!&!!$"!&!!#"!&!$"!&!%"!&!#"!&$" $" -.%/0&'134.4)0)*5'(/,'
117 Open problems
118 Open problems 1 Maximum matching Shortest path Dynamic programming
119 Open problems Algorithms for sparse graph Does connected components require more than rounds? Lower bounds
120 Thank you!
MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationA Model of Computation for MapReduce
A Model of Computation for MapReduce Howard Karloff Siddharth Suri Sergei Vassilvitskii Abstract In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms
More informationCIS 700: algorithms for Big Data
CIS 700: algorithms for Big Data Lecture 6: Graph Sketching Slides at http://grigory.us/big-data-class.html Grigory Yaroslavtsev http://grigory.us Sketching Graphs? We know how to sketch vectors: v Mv
More informationMapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12
MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationmap/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
More informationAn Empirical Study of Two MIS Algorithms
An Empirical Study of Two MIS Algorithms Email: Tushar Bisht and Kishore Kothapalli International Institute of Information Technology, Hyderabad Hyderabad, Andhra Pradesh, India 32. tushar.bisht@research.iiit.ac.in,
More informationLossless Data Compression Standard Applications and the MapReduce Web Computing Framework
Lossless Data Compression Standard Applications and the MapReduce Web Computing Framework Sergio De Agostino Computer Science Department Sapienza University of Rome Internet as a Distributed System Modern
More informationSubgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro
Subgraph Patterns: Network Motifs and Graphlets Pedro Ribeiro Analyzing Complex Networks We have been talking about extracting information from networks Some possible tasks: General Patterns Ex: scale-free,
More informationarxiv:1412.2333v1 [cs.dc] 7 Dec 2014
Minimum-weight Spanning Tree Construction in O(log log log n) Rounds on the Congested Clique Sriram V. Pemmaraju Vivek B. Sardeshmukh Department of Computer Science, The University of Iowa, Iowa City,
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationMining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
More informationONLINE DEGREE-BOUNDED STEINER NETWORK DESIGN. Sina Dehghani Saeed Seddighin Ali Shafahi Fall 2015
ONLINE DEGREE-BOUNDED STEINER NETWORK DESIGN Sina Dehghani Saeed Seddighin Ali Shafahi Fall 2015 ONLINE STEINER FOREST PROBLEM An initially given graph G. s 1 s 2 A sequence of demands (s i, t i ) arriving
More informationAnalyzing the Facebook graph?
Logistics Big Data Algorithmic Introduction Prof. Yuval Shavitt Contact: shavitt@eng.tau.ac.il Final grade: 4 6 home assignments (will try to include programing assignments as well): 2% Exam 8% Big Data
More informationPart 2: Community Detection
Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -
More informationSmall Maximal Independent Sets and Faster Exact Graph Coloring
Small Maximal Independent Sets and Faster Exact Graph Coloring David Eppstein Univ. of California, Irvine Dept. of Information and Computer Science The Exact Graph Coloring Problem: Given an undirected
More informationProtein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
More informationBIG DATA VISUALIZATION. Team Impossible Peter Vilim, Sruthi Mayuram Krithivasan, Matt Burrough, and Ismini Lourentzou
BIG DATA VISUALIZATION Team Impossible Peter Vilim, Sruthi Mayuram Krithivasan, Matt Burrough, and Ismini Lourentzou Let s begin with a story Let s explore Yahoo s data! Dora the Data Explorer has a new
More informationWhy? A central concept in Computer Science. Algorithms are ubiquitous.
Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online
More informationExponential time algorithms for graph coloring
Exponential time algorithms for graph coloring Uriel Feige Lecture notes, March 14, 2011 1 Introduction Let [n] denote the set {1,..., k}. A k-labeling of vertices of a graph G(V, E) is a function V [k].
More informationHow To Cluster Of Complex Systems
Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving
More informationCSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92.
Name: Email ID: CSE 326, Data Structures Section: Sample Final Exam Instructions: The exam is closed book, closed notes. Unless otherwise stated, N denotes the number of elements in the data structure
More informationDistance Degree Sequences for Network Analysis
Universität Konstanz Computer & Information Science Algorithmics Group 15 Mar 2005 based on Palmer, Gibbons, and Faloutsos: ANF A Fast and Scalable Tool for Data Mining in Massive Graphs, SIGKDD 02. Motivation
More informationCSC2420 Fall 2012: Algorithm Design, Analysis and Theory
CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin November 15, 2012; Lecture 10 1 / 27 Randomized online bipartite matching and the adwords problem. We briefly return to online algorithms
More informationGraph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph
More informationPage 1. CSCE 310J Data Structures & Algorithms. CSCE 310J Data Structures & Algorithms. P, NP, and NP-Complete. Polynomial-Time Algorithms
CSCE 310J Data Structures & Algorithms P, NP, and NP-Complete Dr. Steve Goddard goddard@cse.unl.edu CSCE 310J Data Structures & Algorithms Giving credit where credit is due:» Most of the lecture notes
More informationSocial Media Mining. Graph Essentials
Graph Essentials Graph Basics Measures Graph and Essentials Metrics 2 2 Nodes and Edges A network is a graph nodes, actors, or vertices (plural of vertex) Connections, edges or ties Edge Node Measures
More informationPLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo.
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo. VLDB 2009 CS 422 Decision Trees: Main Components Find Best Split Choose split
More informationStatistical Learning Theory Meets Big Data
Statistical Learning Theory Meets Big Data Randomized algorithms for frequent itemsets Eli Upfal Brown University Data, data, data In God we trust, all others (must) bring data Prof. W.E. Deming, Statistician,
More informationGraphs over Time Densification Laws, Shrinking Diameters and Possible Explanations
Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations Jurij Leskovec, CMU Jon Kleinberg, Cornell Christos Faloutsos, CMU 1 Introduction What can we do with graphs? What patterns
More informationMinimum spanning tree based classification model for massive data with MapReduce implementation
2010 IEEE International Conference on Data Mining Workshops Minimum spanning tree based classification model for massive data with MapReduce implementation Jin Chang, Jun Luo, Joshua Zhexue Huang, Shengzhong
More informationGraph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis. Contents. Introduction. Maarten van Steen. Version: April 28, 2014
Graph Theory and Complex Networks: An Introduction Maarten van Steen VU Amsterdam, Dept. Computer Science Room R.0, steen@cs.vu.nl Chapter 0: Version: April 8, 0 / Contents Chapter Description 0: Introduction
More informationAlgorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)
Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven) Algorithm design Algorithm: Set of steps to solve a problem (by a computer) Studied since 1950 s. Given a problem: Find (i) best solution (ii)
More informationOutline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits
Outline NP-completeness Examples of Easy vs. Hard problems Euler circuit vs. Hamiltonian circuit Shortest Path vs. Longest Path 2-pairs sum vs. general Subset Sum Reducing one problem to another Clique
More informationNetwork Design with Coverage Costs
Network Design with Coverage Costs Siddharth Barman 1 Shuchi Chawla 2 Seeun William Umboh 2 1 Caltech 2 University of Wisconsin-Madison APPROX-RANDOM 2014 Motivation Physical Flow vs Data Flow vs. Commodity
More informationNetwork Algorithms for Homeland Security
Network Algorithms for Homeland Security Mark Goldberg and Malik Magdon-Ismail Rensselaer Polytechnic Institute September 27, 2004. Collaborators J. Baumes, M. Krishmamoorthy, N. Preston, W. Wallace. Partially
More information5.1 Bipartite Matching
CS787: Advanced Algorithms Lecture 5: Applications of Network Flow In the last lecture, we looked at the problem of finding the maximum flow in a graph, and how it can be efficiently solved using the Ford-Fulkerson
More informationBig Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationGraph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis
Graph Theory and Complex Networks: An Introduction Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.0, steen@cs.vu.nl Chapter 06: Network analysis Version: April 8, 04 / 3 Contents Chapter
More informationSoftware tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
More informationScheduling Shop Scheduling. Tim Nieberg
Scheduling Shop Scheduling Tim Nieberg Shop models: General Introduction Remark: Consider non preemptive problems with regular objectives Notation Shop Problems: m machines, n jobs 1,..., n operations
More informationBig Data Systems CS 5965/6965 FALL 2015
Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html
More informationBicolored Shortest Paths in Graphs with Applications to Network Overlay Design
Bicolored Shortest Paths in Graphs with Applications to Network Overlay Design Hongsik Choi and Hyeong-Ah Choi Department of Electrical Engineering and Computer Science George Washington University Washington,
More informationLoad Balancing. Load Balancing 1 / 24
Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait
More informationHadoop SNS. renren.com. Saturday, December 3, 11
Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December
More informationThe Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,
The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints
More informationIntroduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
More informationApproximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs
Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs Yong Zhang 1.2, Francis Y.L. Chin 2, and Hing-Fung Ting 2 1 College of Mathematics and Computer Science, Hebei University,
More informationReminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new
Reminder: Complexity (1) Parallel Complexity Theory Lecture 6 Number of steps or memory units required to compute some result In terms of input size Using a single processor O(1) says that regardless of
More informationReminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new GAP (2) Graph Accessibility Problem (GAP) (1)
Reminder: Complexity (1) Parallel Complexity Theory Lecture 6 Number of steps or memory units required to compute some result In terms of input size Using a single processor O(1) says that regardless of
More informationEvery tree contains a large induced subgraph with all degrees odd
Every tree contains a large induced subgraph with all degrees odd A.J. Radcliffe Carnegie Mellon University, Pittsburgh, PA A.D. Scott Department of Pure Mathematics and Mathematical Statistics University
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction
More informationProblem Set 7 Solutions
8 8 Introduction to Algorithms May 7, 2004 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik Demaine and Shafi Goldwasser Handout 25 Problem Set 7 Solutions This problem set is due in
More informationDistributed Structured Prediction for Big Data
Distributed Structured Prediction for Big Data A. G. Schwing ETH Zurich aschwing@inf.ethz.ch T. Hazan TTI Chicago M. Pollefeys ETH Zurich R. Urtasun TTI Chicago Abstract The biggest limitations of learning
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationVisual Data Mining. Motivation. Why Visual Data Mining. Integration of visualization and data mining : Chidroop Madhavarapu CSE 591:Visual Analytics
Motivation Visual Data Mining Visualization for Data Mining Huge amounts of information Limited display capacity of output devices Chidroop Madhavarapu CSE 591:Visual Analytics Visual Data Mining (VDM)
More informationPrivacy-Preserving Big Data Publishing
Privacy-Preserving Big Data Publishing Hessam Zakerzadeh 1, Charu C. Aggarwal 2, Ken Barker 1 SSDBM 15 1 University of Calgary, Canada 2 IBM TJ Watson, USA Data Publishing OECD * declaration on access
More informationAll trees contain a large induced subgraph having all degrees 1 (mod k)
All trees contain a large induced subgraph having all degrees 1 (mod k) David M. Berman, A.J. Radcliffe, A.D. Scott, Hong Wang, and Larry Wargo *Department of Mathematics University of New Orleans New
More informationPractical Graph Mining with R. 5. Link Analysis
Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities
More informationKEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS
ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,
More informationA Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
More informationPOMPDs Make Better Hackers: Accounting for Uncertainty in Penetration Testing. By: Chris Abbott
POMPDs Make Better Hackers: Accounting for Uncertainty in Penetration Testing By: Chris Abbott Introduction What is penetration testing? Methodology for assessing network security, by generating and executing
More informationParallel Algorithms for Small-world Network. David A. Bader and Kamesh Madduri
Parallel Algorithms for Small-world Network Analysis ayssand Partitioning atto g(s (SNAP) David A. Bader and Kamesh Madduri Overview Informatics networks, small-world topology Community Identification/Graph
More informationCSC2420 Spring 2015: Lecture 3
CSC2420 Spring 2015: Lecture 3 Allan Borodin January 22, 2015 1 / 1 Announcements and todays agenda Assignment 1 due next Thursday. I may add one or two additional questions today or tomorrow. Todays agenda
More informationHome Page. Data Structures. Title Page. Page 1 of 24. Go Back. Full Screen. Close. Quit
Data Structures Page 1 of 24 A.1. Arrays (Vectors) n-element vector start address + ielementsize 0 +1 +2 +3 +4... +n-1 start address continuous memory block static, if size is known at compile time dynamic,
More informationTHE PROBLEM WORMS (1) WORMS (2) THE PROBLEM OF WORM PROPAGATION/PREVENTION THE MINIMUM VERTEX COVER PROBLEM
1 THE PROBLEM OF WORM PROPAGATION/PREVENTION I.E. THE MINIMUM VERTEX COVER PROBLEM Prof. Tiziana Calamoneri Network Algorithms A.y. 2014/15 2 THE PROBLEM WORMS (1)! A computer worm is a standalone malware
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationHealthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw
Healthcare data analytics Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics
More informationDistributed Dynamic Load Balancing for Iterative-Stencil Applications
Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
More informationB669 Sublinear Algorithms for Big Data
B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of
More informationA Sublinear Bipartiteness Tester for Bounded Degree Graphs
A Sublinear Bipartiteness Tester for Bounded Degree Graphs Oded Goldreich Dana Ron February 5, 1998 Abstract We present a sublinear-time algorithm for testing whether a bounded degree graph is bipartite
More informationNimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff
Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear
More informationAnalysis of Computer Algorithms. Algorithm. Algorithm, Data Structure, Program
Analysis of Computer Algorithms Hiroaki Kobayashi Input Algorithm Output 12/13/02 Algorithm Theory 1 Algorithm, Data Structure, Program Algorithm Well-defined, a finite step-by-step computational procedure
More informationThe Classes P and NP. mohamed@elwakil.net
Intractable Problems The Classes P and NP Mohamed M. El Wakil mohamed@elwakil.net 1 Agenda 1. What is a problem? 2. Decidable or not? 3. The P class 4. The NP Class 5. TheNP Complete class 2 What is a
More informationParallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationScheduling Home Health Care with Separating Benders Cuts in Decision Diagrams
Scheduling Home Health Care with Separating Benders Cuts in Decision Diagrams André Ciré University of Toronto John Hooker Carnegie Mellon University INFORMS 2014 Home Health Care Home health care delivery
More informationCS 598CSC: Combinatorial Optimization Lecture date: 2/4/2010
CS 598CSC: Combinatorial Optimization Lecture date: /4/010 Instructor: Chandra Chekuri Scribe: David Morrison Gomory-Hu Trees (The work in this section closely follows [3]) Let G = (V, E) be an undirected
More informationA number of tasks executing serially or in parallel. Distribute tasks on processors so that minimal execution time is achieved. Optimal distribution
Scheduling MIMD parallel program A number of tasks executing serially or in parallel Lecture : Load Balancing The scheduling problem NP-complete problem (in general) Distribute tasks on processors so that
More informationEfficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case.
Algorithms Efficiency of algorithms Computational resources: time and space Best, worst and average case performance How to compare algorithms: machine-independent measure of efficiency Growth rate Complexity
More informationGuessing Game: NP-Complete?
Guessing Game: NP-Complete? 1. LONGEST-PATH: Given a graph G = (V, E), does there exists a simple path of length at least k edges? YES 2. SHORTEST-PATH: Given a graph G = (V, E), does there exists a simple
More informationSIMS 255 Foundations of Software Design. Complexity and NP-completeness
SIMS 255 Foundations of Software Design Complexity and NP-completeness Matt Welsh November 29, 2001 mdw@cs.berkeley.edu 1 Outline Complexity of algorithms Space and time complexity ``Big O'' notation Complexity
More informationCloud Computing. Lectures 10 and 11 Map Reduce: System Perspective 2014-2015
Cloud Computing Lectures 10 and 11 Map Reduce: System Perspective 2014-2015 1 MapReduce in More Detail 2 Master (i) Execution is controlled by the master process: Input data are split into 64MB blocks.
More informationMining Social Network Graphs
Mining Social Network Graphs Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata November 13, 17, 2014 Social Network No introduc+on required Really? We s7ll need to understand
More informationAn Approximation Algorithm for Bounded Degree Deletion
An Approximation Algorithm for Bounded Degree Deletion Tomáš Ebenlendr Petr Kolman Jiří Sgall Abstract Bounded Degree Deletion is the following generalization of Vertex Cover. Given an undirected graph
More informationDistributed Computing over Communication Networks: Maximal Independent Set
Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.
More informationLecture 22: November 10
CS271 Randomness & Computation Fall 2011 Lecture 22: November 10 Lecturer: Alistair Sinclair Based on scribe notes by Rafael Frongillo Disclaimer: These notes have not been subjected to the usual scrutiny
More informationWell-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications
Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications Jie Gao Department of Computer Science Stanford University Joint work with Li Zhang Systems Research Center Hewlett-Packard
More informationPrivate Approximation of Clustering and Vertex Cover
Private Approximation of Clustering and Vertex Cover Amos Beimel, Renen Hallak, and Kobbi Nissim Department of Computer Science, Ben-Gurion University of the Negev Abstract. Private approximation of search
More informationBig Data Begets Big Database Theory
Big Data Begets Big Database Theory Dan Suciu University of Washington 1 Motivation Industry analysts describe Big Data in terms of three V s: volume, velocity, variety. The data is too big to process
More information8.1 Min Degree Spanning Tree
CS880: Approximations Algorithms Scribe: Siddharth Barman Lecturer: Shuchi Chawla Topic: Min Degree Spanning Tree Date: 02/15/07 In this lecture we give a local search based algorithm for the Min Degree
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based
More informationSeminar. Path planning using Voronoi diagrams and B-Splines. Stefano Martina stefano.martina@stud.unifi.it
Seminar Path planning using Voronoi diagrams and B-Splines Stefano Martina stefano.martina@stud.unifi.it 23 may 2016 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International
More informationNan Kong, Andrew J. Schaefer. Department of Industrial Engineering, Univeristy of Pittsburgh, PA 15261, USA
A Factor 1 2 Approximation Algorithm for Two-Stage Stochastic Matching Problems Nan Kong, Andrew J. Schaefer Department of Industrial Engineering, Univeristy of Pittsburgh, PA 15261, USA Abstract We introduce
More informationOn the k-path cover problem for cacti
On the k-path cover problem for cacti Zemin Jin and Xueliang Li Center for Combinatorics and LPMC Nankai University Tianjin 300071, P.R. China zeminjin@eyou.com, x.li@eyou.com Abstract In this paper we
More informationSteiner Tree Approximation via IRR. Randomized Rounding
Steiner Tree Approximation via Iterative Randomized Rounding Graduate Program in Logic, Algorithms and Computation μπλ Network Algorithms and Complexity June 18, 2013 Overview 1 Introduction Scope Related
More informationRethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
More informationAnalysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model
More informationDiscuss the size of the instance for the minimum spanning tree problem.
3.1 Algorithm complexity The algorithms A, B are given. The former has complexity O(n 2 ), the latter O(2 n ), where n is the size of the instance. Let n A 0 be the size of the largest instance that can
More information