Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

Size: px
Start display at page:

Download "Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)"

Transcription

1 Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven)

2 Goal: Look from the lens of theoretical CS Theory of Algorithms: Past, present and future? Some cool ideas Some work we have done.

3 Algorithm design Algorithm: Set of steps to solve a problem (by a computer) Studied since 1950 s. Given a problem: Find (i) best solution (ii) quickly Traveling salesman problem (TSP) n! possibilities ( for n=60) Ideally: Polynomial running time (n 2, n 3 ) 33 cities, 1962 competition

4 70 s: Problems Polynomial time (n log n, n 2,n 3 ) E.g. Shortest path, matching, max-flow,... NP-Hard: TSP, 3-SAT, most problems (brute force 2 n = only option) Late 80 s-now: Coping with NP-hardness Approximation algorithms: Even if NP-Hard, may be a 95% optimal solution can be found in polynomial time? (very rich theory/connections)

5 3-SAT: (x 1 x 7 x 13 ) (x 2 x 1 x 4 ) (maximize number of satisfied clauses) 7/8 approximation trivial: Why?

6 3-SAT: (x 1 x 7 x 13 ) (x 2 x 1 x 4 ) (maximize number of satisfied clauses) 7/8 approximation trivial: Random assignment PCP theorem(90 s): Complexity, coding, fourier analysis, Better than 7/8 approximation implies P = NP. Good understanding: Though many questions still open. All possible 3-SAT instances 7/8- Hard instance

7 Random inputs: Very rich area Assume input chosen from some nice distribution 3-SAT: random clauses TSP: points randomly chosen in plane All possible 3-SAT instances Well- behaved random inputs 7/8- Hard instance

8 Future (1) Document Clustering New York Times database All possible instances Some ground truth: should make it easier Can encode 3-SAT Approaches: Semi-random models, HMM, smoothed analysis (just the beginning) Explain performance of heuristics: K-means; SVMs; Deep Learning

9 Future (2) Polynomial/non-polynomial view is too limited (Even n 2 time is prohibitive for huge n) Google: Figure out which documents are similar (various reasons: show diverse pages for a query) Which ad to show on a click Don t care about perfect answer. Pages change/disappear, No best answer anyway Fine-grained view of polynomial time (weak understanding)

10 Needed: New ways of thinking Discard old beliefs Beautiful new ideas emerging

11 Rest of the talk A glimpse of some ideas 1) Streaming/ fast algorithms 2) Local Partitioning 3) Clustering via eigenvalues New ways of looking at the problem

12 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { }

13 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3 }

14 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3, 4}

15 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3, 4}

16 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3, 4, 2}

17 Counting Distinct elements Note: The algorithm tracks the numbers seen thus far. Question: What if it can remember only 1 number? (very limited memory) Trouble: Can barely remember anything about past. Stream: When you scan next element, no clue if already seen?

18 Seems completely hopeless? Intuition only partly right Cannot hope to count exactly. But who cares if answer is 3,425,269 or 3,425,587? Approximate counting possible!! Technique: Min-hashing (beautiful use of approximation and randomization)

19 Number of distinct elements Basic Idea [Flajolet-Martin 82]: Use a random hash function (map). (e.g. encryption function) h:[1,n] -> [1,n ] say n >> n Algorithm: Keep track of min h(i) Stream = h(2) n n

20 Number of distinct elements Basic Idea [Flajolet-Martin 82]: Use a random hash function (map). (e.g. encryption function) h:[1,n] -> [1,n ] say n >> n Algorithm: Keep track of min h(i) Stream = h(8) n n

21 Number of distinct elements Basic Idea [Flajolet-Martin 82]: Use a random hash function (map). (e.g. encryption function) h:[1,n] -> [1,n ] say n >> n Algorithm: Keep track of min h(i) Stream = h(8) n n Note: h(i) is same every time i is encountered.

22 Number of distinct elements Basic Idea [Flajolet-Martin 82]: Use a random hash function (map). (e.g. encryption function) h:[1,n] -> [1,n ] say n >> n Algorithm: Keep track of min h(i) Stream = h(8) n n Note: h(i) is same every time i is encountered.

23 Number of distinct elements Keep track of min h(i) Suppose 1 distinct element (stream = ) Min h(i) n / n n 1 n

24 Number of distinct elements Keep track of min h(i) Suppose 2 distinct elements Min h(i) n / n n 1 n If k items seen, expect min-value to be around n /(k+1) Solution: Estimate of # elements = (n / min h(i)) 1

25 Number of distinct elements Randomness could mess things up. E.g. May just 1 element, But min(h(i)) could be far. expect min(h(i)) = n /2 Standard trick: O(1) such hash functions, take median entry. Theorem: For any ε > 0, can estimate distinct elements to within 1 ± ε factor accuracy with high probability. Space = O 1 ε 2

26 A closer look Random hash function h. We need that h(i) value be same every time we see i. One has to store each h(i) some where. h(1), h(2), h(3),, h(n) need n log n space?? Did we just disguise our inherent problem? There is a fix! Key idea: Do not need full power of randomness

27 What is randomness? Do not need full randomness Pairwise independence: For any a 1, a 2, x 1, x 2 Pr [ h(x 1 ) = a 1 and h(x 2 ) = a 2 ] = 1/n n n Such an h is very simple to store h(x) = ax + b mod (p) [just need to store a and b]

28 Min-Hashing: Applications Similarity of Web pages (if mostly similar words) Google: Page -> Few min-hash values (few bits) Similar page detection: quadratic -> Linear time Sketching Complex Simple Tons of amazing applications ( several researchers )

29 The Frequency moments Stream of m numbers in range [1,n] Ex: A= , S(A) = (m 1,,m n ) summarizes the stream. m i : occurrences of i in the stream A. How to compute L p norms of S(A), 1-pass, O(log n) space? L 0 = # of distinct occurrences (non-zero entries) L 1 = # length of stream (m) ( i m i ) L 2 = skewness ( i m 2 i ) Saw how to compute L 0 in O(log n) space.

30 Computing L 2 ( m i2 ) Notation: m i is # of occurrences of i. Algorithm (Tug of War): [Alon, Mathias, Szegedy 96] Let h be a random hash fn. h(i) -> {-1,+1} w/ prob ½ h is random, yet consistent for a given i. Every time you see an element i compute h(i) Track X sum of hash values X = i m i h(i) Output: Estimate Y=X 2

31 Pf: E[Y] = E[X 2 ] = E[( m i h(i)) 2 ] (expectation over hash fns) = E[ m 2 i h i 2 + i,j:i j h i h(j) m i m j ] Now, E[h(i) 2 ] = 1, (because h(i) = ±1) E[h(i) h(j)] = 0 (h(i) and h(j) independent) So, E[Y]= m i 2 As usual, we take O(1) such estiamtes + their average Can show: Estimator has low variance 4-wise independent hash functions suffice.

32 Approximate Max-flow Major breakthroughs in last couple of years. (1+ε)-approximate max-flow in near linear time Decompose Graph into simpler structures (expander-like) Expanders: All cuts are well Captured by simpler vertex cuts Get the rough flow right; Fix errors in subsequent iterations. Graph Expander Tree of Expanders

33 Local Partitioning (Light Networks, Philips) Project of Britt Mathijsen w/ Johan van Leeuwaarden + Bansal

34 Light Networks Wireless capability: Control, monitor, exchange performance data Segment controller for a region

35 Light Networks Goal: Partition network in pieces, s.t. each piece (i) Good intra-connectvity (ii) Roughly equal size (iii) Small diameter (few hops) (iv) Low failure probability Impossible to approach via traditional algorithms Idea: Local partitioning algorithms are easy to tailor.

36 Local Partitioning Algorithms Studied by Spielman-Teng and Andersen-Peres (inspired by big data) Find a well-connected piece containing node v. v Time proportional to size of output piece

37 Local Algorithms

38 Local Algorithms

39 Local Algorithms

40 Local Algorithms

41 ASML Error log Problem (clustering with eigenvalues) Tom Slenders w/ Peter van den Hamer (ASML) + Bansal

42 Problem description Slide 42 3/31/2015

43 Problem description Slide 43 3/31/2015

44 Problem description Slide 44 3/31/2015 Site A Site B Similar?

45 ASML has huge database of error logs Elaborate pattern-matching system; using domain knowledge For us: No clue what these errors mean Starting point: Unsupervised learning (Cluster via eigenvalues, SVD)

46 Clustering with Eigenvalues Document = Bag of words (100k dimensional vector) Suppose reality = Few topics (k) Document = 0.3 topic topic 2 +. Can we automatically find topics? Word 1 Word 2. Word m Rank k Factorization: Basic linear algebra (mxn matrix) M = AB SVD: Best rank k approximation (space spanned by k largest singular values) M m x n = A m x k B k x n

47 Very widely used Finds hidden topics without even knowing them Recommendation systems (netflix): Users preferences = combination of genres Movies = combination of genres In practice: Also combine with semi-supervised methods etc.

48 Drawing with eigenvectors Eigenvector v: Av = λv Some eigenvector of the Laplacian: [ , 0.26, 0.44, 0.26] What can you do with this?

49 3 rd smallest eigenvector Drawing with eigenvectors Eigenvectors of Laplacian 2 nd smallest eigenvector

50 Drawing with eigenvectors Slide 50 3/31/2015 Eigenvectors of Laplacian

51 Drawing with eigenvectors

52 Drawing with eigenvectors Eigenvectors of Laplacian

53 Drawing with eigenvectors Eigenvectors of Normalized Laplacian

54 Demo on NY times database (Tom Slenders)

55 Why does it work? My motivation was fake: Can have negative combinations Document = -1.2 topic topic 2 + Right notion: Non-negative rank factorization (NP-hard) Perhaps easy in non-worst case instances? (big open question)

56 Clustering (2): new angle Clustering: Points in space (optimize some objective: k-means, min-sum, k-center, ) Hope: There is some ground truth Optimizing some objective will get you close. Apply factor 100 approximation algorithm. New Model [Balcan, Blum, Gupta]: Let us make this hope explicit Assume: Every <= optimal clustering is ε-close to target. Result(s): Can find 1+O ε close clustering, even though is much smaller than best approximation possible.

57 All possible instances Instances with the property Algorithm Idea: Strong properties on how input looks like (e.g. few outliers). Exploit these. Punchline: These algorithms may actually perform much better than standard k-means, min-sum type algorithms.

58 Correlation Clustering (new model motivated by big data) [Bansal, Blum, Chawla]

59 Clustering Traditional Approach: Objects -> points in some high dim. space Some distance function Some objective (k-median, k-means, ) Hope something useful comes out

60 Clustering Document clustering: Bag of words (traditional approach) Another approach (Bansal, Blum, Chawla) Correlation Clustering: Clustering via pairwise similarity. Classifier: Takes two documents and tells how similar they are dissimilar Doc 1 Doc 2 similar Idea: Use this classifier for clustering.

61 Correlation Clustering E.g. Run classifier on every pair of items dissimilar similar

62 In general, there could be inconsistencies dissimilar similar Any clustering, disagrees on at least one edge

63 In general, there could be inconsistencies dissimilar similar Any clustering, disagrees on at least one edge

64 In general, there could be inconsistencies dissimilar similar Any clustering, disagrees on at least one edge Goal: Find clustering agreeing on most edges Interesting approximation algorithms Quite successful Several extensions (not all pairs, which pair to probe, )

65 Concluding Remarks Very small glimpse (streaming and sketching, statistical learning, machine learning, dealing with noisy data, sub-linear algorithms, ) Exciting new algorithmic problems 1) Huge impact 2) Beautiful ideas 3) Interdisciplinary: Diverse skills + knowledge

66 Next Talk

67 Thanks for your attention!

Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven) Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven) Algorithm design Algorithm: Set of steps to solve a problem (by a computer) Studied since 1950 s. Given a problem: Find (i) best solution (ii)

More information

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear

More information

DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II. Matrix Algorithms DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Guessing Game: NP-Complete?

Guessing Game: NP-Complete? Guessing Game: NP-Complete? 1. LONGEST-PATH: Given a graph G = (V, E), does there exists a simple path of length at least k edges? YES 2. SHORTEST-PATH: Given a graph G = (V, E), does there exists a simple

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Linear Algebra Notes

Linear Algebra Notes Linear Algebra Notes Chapter 19 KERNEL AND IMAGE OF A MATRIX Take an n m matrix a 11 a 12 a 1m a 21 a 22 a 2m a n1 a n2 a nm and think of it as a function A : R m R n The kernel of A is defined as Note

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

More information

Part 2: Community Detection

Part 2: Community Detection Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -

More information

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions. Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

More information

(67902) Topics in Theory and Complexity Nov 2, 2006. Lecture 7

(67902) Topics in Theory and Complexity Nov 2, 2006. Lecture 7 (67902) Topics in Theory and Complexity Nov 2, 2006 Lecturer: Irit Dinur Lecture 7 Scribe: Rani Lekach 1 Lecture overview This Lecture consists of two parts In the first part we will refresh the definition

More information

Linear Algebra Review. Vectors

Linear Algebra Review. Vectors Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka kosecka@cs.gmu.edu http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length

More information

Discuss the size of the instance for the minimum spanning tree problem.

Discuss the size of the instance for the minimum spanning tree problem. 3.1 Algorithm complexity The algorithms A, B are given. The former has complexity O(n 2 ), the latter O(2 n ), where n is the size of the instance. Let n A 0 be the size of the largest instance that can

More information

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92.

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92. Name: Email ID: CSE 326, Data Structures Section: Sample Final Exam Instructions: The exam is closed book, closed notes. Unless otherwise stated, N denotes the number of elements in the data structure

More information

Distance Degree Sequences for Network Analysis

Distance Degree Sequences for Network Analysis Universität Konstanz Computer & Information Science Algorithmics Group 15 Mar 2005 based on Palmer, Gibbons, and Faloutsos: ANF A Fast and Scalable Tool for Data Mining in Massive Graphs, SIGKDD 02. Motivation

More information

Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013

Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013 Notes on Orthogonal and Symmetric Matrices MENU, Winter 201 These notes summarize the main properties and uses of orthogonal and symmetric matrices. We covered quite a bit of material regarding these topics,

More information

Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method

Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method Lecture 3 3B1B Optimization Michaelmas 2015 A. Zisserman Linear Programming Extreme solutions Simplex method Interior point method Integer programming and relaxation The Optimization Tree Linear Programming

More information

3 Some Integer Functions

3 Some Integer Functions 3 Some Integer Functions A Pair of Fundamental Integer Functions The integer function that is the heart of this section is the modulo function. However, before getting to it, let us look at some very simple

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace

More information

Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy. Blue vs. Orange. Review Jeopardy Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round $200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

More information

CMPSCI611: Approximating MAX-CUT Lecture 20

CMPSCI611: Approximating MAX-CUT Lecture 20 CMPSCI611: Approximating MAX-CUT Lecture 20 For the next two lectures we ll be seeing examples of approximation algorithms for interesting NP-hard problems. Today we consider MAX-CUT, which we proved to

More information

B490 Mining the Big Data. 2 Clustering

B490 Mining the Big Data. 2 Clustering B490 Mining the Big Data 2 Clustering Qin Zhang 1-1 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern

More information

Complexity Theory. IE 661: Scheduling Theory Fall 2003 Satyaki Ghosh Dastidar

Complexity Theory. IE 661: Scheduling Theory Fall 2003 Satyaki Ghosh Dastidar Complexity Theory IE 661: Scheduling Theory Fall 2003 Satyaki Ghosh Dastidar Outline Goals Computation of Problems Concepts and Definitions Complexity Classes and Problems Polynomial Time Reductions Examples

More information

Chapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

Chapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one

More information

Linearly Independent Sets and Linearly Dependent Sets

Linearly Independent Sets and Linearly Dependent Sets These notes closely follow the presentation of the material given in David C. Lay s textbook Linear Algebra and its Applications (3rd edition). These notes are intended primarily for in-class presentation

More information

Lecture 1: Course overview, circuits, and formulas

Lecture 1: Course overview, circuits, and formulas Lecture 1: Course overview, circuits, and formulas Topics in Complexity Theory and Pseudorandomness (Spring 2013) Rutgers University Swastik Kopparty Scribes: John Kim, Ben Lund 1 Course Information Swastik

More information

10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html 10-601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html Course data All up-to-date info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

More information

Vieta s Formulas and the Identity Theorem

Vieta s Formulas and the Identity Theorem Vieta s Formulas and the Identity Theorem This worksheet will work through the material from our class on 3/21/2013 with some examples that should help you with the homework The topic of our discussion

More information

Math 4310 Handout - Quotient Vector Spaces

Math 4310 Handout - Quotient Vector Spaces Math 4310 Handout - Quotient Vector Spaces Dan Collins The textbook defines a subspace of a vector space in Chapter 4, but it avoids ever discussing the notion of a quotient space. This is understandable

More information

Chapter 17. Orthogonal Matrices and Symmetries of Space

Chapter 17. Orthogonal Matrices and Symmetries of Space Chapter 17. Orthogonal Matrices and Symmetries of Space Take a random matrix, say 1 3 A = 4 5 6, 7 8 9 and compare the lengths of e 1 and Ae 1. The vector e 1 has length 1, while Ae 1 = (1, 4, 7) has length

More information

Notes on Factoring. MA 206 Kurt Bryan

Notes on Factoring. MA 206 Kurt Bryan The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

SIMS 255 Foundations of Software Design. Complexity and NP-completeness

SIMS 255 Foundations of Software Design. Complexity and NP-completeness SIMS 255 Foundations of Software Design Complexity and NP-completeness Matt Welsh November 29, 2001 mdw@cs.berkeley.edu 1 Outline Complexity of algorithms Space and time complexity ``Big O'' notation Complexity

More information

Computer Algorithms. NP-Complete Problems. CISC 4080 Yanjun Li

Computer Algorithms. NP-Complete Problems. CISC 4080 Yanjun Li Computer Algorithms NP-Complete Problems NP-completeness The quest for efficient algorithms is about finding clever ways to bypass the process of exhaustive search, using clues from the input in order

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

The last three chapters introduced three major proof techniques: direct,

The last three chapters introduced three major proof techniques: direct, CHAPTER 7 Proving Non-Conditional Statements The last three chapters introduced three major proof techniques: direct, contrapositive and contradiction. These three techniques are used to prove statements

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Why? A central concept in Computer Science. Algorithms are ubiquitous. Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online

More information

MATH 4330/5330, Fourier Analysis Section 11, The Discrete Fourier Transform

MATH 4330/5330, Fourier Analysis Section 11, The Discrete Fourier Transform MATH 433/533, Fourier Analysis Section 11, The Discrete Fourier Transform Now, instead of considering functions defined on a continuous domain, like the interval [, 1) or the whole real line R, we wish

More information

Collaborative Filtering. Radek Pelánek

Collaborative Filtering. Radek Pelánek Collaborative Filtering Radek Pelánek 2015 Collaborative Filtering assumption: users with similar taste in past will have similar taste in future requires only matrix of ratings applicable in many domains

More information

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm. Approximation Algorithms 11 Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of three

More information

A Working Knowledge of Computational Complexity for an Optimizer

A Working Knowledge of Computational Complexity for an Optimizer A Working Knowledge of Computational Complexity for an Optimizer ORF 363/COS 323 Instructor: Amir Ali Ahmadi TAs: Y. Chen, G. Hall, J. Ye Fall 2014 1 Why computational complexity? What is computational

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28 Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

More information

Epipolar Geometry. Readings: See Sections 10.1 and 15.6 of Forsyth and Ponce. Right Image. Left Image. e(p ) Epipolar Lines. e(q ) q R.

Epipolar Geometry. Readings: See Sections 10.1 and 15.6 of Forsyth and Ponce. Right Image. Left Image. e(p ) Epipolar Lines. e(q ) q R. Epipolar Geometry We consider two perspective images of a scene as taken from a stereo pair of cameras (or equivalently, assume the scene is rigid and imaged with a single camera from two different locations).

More information

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm. Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of

More information

Basics of Polynomial Theory

Basics of Polynomial Theory 3 Basics of Polynomial Theory 3.1 Polynomial Equations In geodesy and geoinformatics, most observations are related to unknowns parameters through equations of algebraic (polynomial) type. In cases where

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

B669 Sublinear Algorithms for Big Data

B669 Sublinear Algorithms for Big Data B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Bindel, Spring 2012 Intro to Scientific Computing (CS 3220) Week 3: Wednesday, Feb 8

Bindel, Spring 2012 Intro to Scientific Computing (CS 3220) Week 3: Wednesday, Feb 8 Spaces and bases Week 3: Wednesday, Feb 8 I have two favorite vector spaces 1 : R n and the space P d of polynomials of degree at most d. For R n, we have a canonical basis: R n = span{e 1, e 2,..., e

More information

Hidden Markov Models

Hidden Markov Models 8.47 Introduction to omputational Molecular Biology Lecture 7: November 4, 2004 Scribe: Han-Pang hiu Lecturer: Ross Lippert Editor: Russ ox Hidden Markov Models The G island phenomenon The nucleotide frequencies

More information

How To Solve A K Path In Time (K)

How To Solve A K Path In Time (K) What s next? Reductions other than kernelization Dániel Marx Humboldt-Universität zu Berlin (with help from Fedor Fomin, Daniel Lokshtanov and Saket Saurabh) WorKer 2010: Workshop on Kernelization Nov

More information

Chapter 3. if 2 a i then location: = i. Page 40

Chapter 3. if 2 a i then location: = i. Page 40 Chapter 3 1. Describe an algorithm that takes a list of n integers a 1,a 2,,a n and finds the number of integers each greater than five in the list. Ans: procedure greaterthanfive(a 1,,a n : integers)

More information

Network Flow I. Lecture 16. 16.1 Overview. 16.2 The Network Flow Problem

Network Flow I. Lecture 16. 16.1 Overview. 16.2 The Network Flow Problem Lecture 6 Network Flow I 6. Overview In these next two lectures we are going to talk about an important algorithmic problem called the Network Flow Problem. Network flow is important because it can be

More information

BALTIC OLYMPIAD IN INFORMATICS Stockholm, April 18-22, 2009 Page 1 of?? ENG rectangle. Rectangle

BALTIC OLYMPIAD IN INFORMATICS Stockholm, April 18-22, 2009 Page 1 of?? ENG rectangle. Rectangle Page 1 of?? ENG rectangle Rectangle Spoiler Solution of SQUARE For start, let s solve a similar looking easier task: find the area of the largest square. All we have to do is pick two points A and B and

More information

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints

More information

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014 LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

More information

Manifold Learning Examples PCA, LLE and ISOMAP

Manifold Learning Examples PCA, LLE and ISOMAP Manifold Learning Examples PCA, LLE and ISOMAP Dan Ventura October 14, 28 Abstract We try to give a helpful concrete example that demonstrates how to use PCA, LLE and Isomap, attempts to provide some intuition

More information

Linear functions Increasing Linear Functions. Decreasing Linear Functions

Linear functions Increasing Linear Functions. Decreasing Linear Functions 3.5 Increasing, Decreasing, Max, and Min So far we have been describing graphs using quantitative information. That s just a fancy way to say that we ve been using numbers. Specifically, we have described

More information

6.3 Conditional Probability and Independence

6.3 Conditional Probability and Independence 222 CHAPTER 6. PROBABILITY 6.3 Conditional Probability and Independence Conditional Probability Two cubical dice each have a triangle painted on one side, a circle painted on two sides and a square painted

More information

The PageRank Citation Ranking: Bring Order to the Web

The PageRank Citation Ranking: Bring Order to the Web The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized

More information

The Singular Value Decomposition in Symmetric (Löwdin) Orthogonalization and Data Compression

The Singular Value Decomposition in Symmetric (Löwdin) Orthogonalization and Data Compression The Singular Value Decomposition in Symmetric (Löwdin) Orthogonalization and Data Compression The SVD is the most generally applicable of the orthogonal-diagonal-orthogonal type matrix decompositions Every

More information

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Roy D. Williams, 1990 Presented by Chris Eldred Outline Summary Finite Element Solver Load Balancing Results Types Conclusions

More information

1 Formulating The Low Degree Testing Problem

1 Formulating The Low Degree Testing Problem 6.895 PCP and Hardness of Approximation MIT, Fall 2010 Lecture 5: Linearity Testing Lecturer: Dana Moshkovitz Scribe: Gregory Minton and Dana Moshkovitz In the last lecture, we proved a weak PCP Theorem,

More information

Quadratic forms Cochran s theorem, degrees of freedom, and all that

Quadratic forms Cochran s theorem, degrees of freedom, and all that Quadratic forms Cochran s theorem, degrees of freedom, and all that Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 1, Slide 1 Why We Care Cochran s theorem tells us

More information

Factor Analysis. Chapter 420. Introduction

Factor Analysis. Chapter 420. Introduction Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.

More information

NOTES ON LINEAR TRANSFORMATIONS

NOTES ON LINEAR TRANSFORMATIONS NOTES ON LINEAR TRANSFORMATIONS Definition 1. Let V and W be vector spaces. A function T : V W is a linear transformation from V to W if the following two properties hold. i T v + v = T v + T v for all

More information

MATH 551 - APPLIED MATRIX THEORY

MATH 551 - APPLIED MATRIX THEORY MATH 55 - APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

1. Write the number of the left-hand item next to the item on the right that corresponds to it.

1. Write the number of the left-hand item next to the item on the right that corresponds to it. 1. Write the number of the left-hand item next to the item on the right that corresponds to it. 1. Stanford prison experiment 2. Friendster 3. neuron 4. router 5. tipping 6. small worlds 7. job-hunting

More information

Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications

Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications Jie Gao Department of Computer Science Stanford University Joint work with Li Zhang Systems Research Center Hewlett-Packard

More information

Near Optimal Solutions

Near Optimal Solutions Near Optimal Solutions Many important optimization problems are lacking efficient solutions. NP-Complete problems unlikely to have polynomial time solutions. Good heuristics important for such problems.

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively.

Au = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively. Chapter 7 Eigenvalues and Eigenvectors In this last chapter of our exploration of Linear Algebra we will revisit eigenvalues and eigenvectors of matrices, concepts that were already introduced in Geometry

More information

Introduction to Learning & Decision Trees

Introduction to Learning & Decision Trees Artificial Intelligence: Representation and Problem Solving 5-38 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning? - more than just memorizing

More information

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n Principles of Data Mining Pham Tho Hoan hoanpt@hnue.edu.vn References [1] David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT press, 2002 [2] Jiawei Han and Micheline Kamber,

More information

Infrastructures for big data

Infrastructures for big data Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)

More information

Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin November 15, 2012; Lecture 10 1 / 27 Randomized online bipartite matching and the adwords problem. We briefly return to online algorithms

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Shortest Inspection-Path. Queries in Simple Polygons

Shortest Inspection-Path. Queries in Simple Polygons Shortest Inspection-Path Queries in Simple Polygons Christian Knauer, Günter Rote B 05-05 April 2005 Shortest Inspection-Path Queries in Simple Polygons Christian Knauer, Günter Rote Institut für Informatik,

More information

Going Big in Data Dimensionality:

Going Big in Data Dimensionality: LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für

More information

MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set.

MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set. MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set. Vector space A vector space is a set V equipped with two operations, addition V V (x,y) x + y V and scalar

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Compact Summaries for Large Datasets

Compact Summaries for Large Datasets Compact Summaries for Large Datasets Big Data Graham Cormode University of Warwick G.Cormode@Warwick.ac.uk The case for Big Data in one slide Big data arises in many forms: Medical data: genetic sequences,

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

Practical Survey on Hash Tables. Aurelian Țuțuianu

Practical Survey on Hash Tables. Aurelian Țuțuianu Practical Survey on Hash Tables Aurelian Țuțuianu In memoriam Mihai Pătraşcu (17 July 1982 5 June 2012) I have no intention to ever teach computer science. I want to teach the love for computer science,

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R. 5. Link Analysis Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities

More information

Examples of Functions

Examples of Functions Examples of Functions In this document is provided examples of a variety of functions. The purpose is to convince the beginning student that functions are something quite different than polynomial equations.

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Binary Adders: Half Adders and Full Adders

Binary Adders: Half Adders and Full Adders Binary Adders: Half Adders and Full Adders In this set of slides, we present the two basic types of adders: 1. Half adders, and 2. Full adders. Each type of adder functions to add two binary bits. In order

More information

Practical Guide to the Simplex Method of Linear Programming

Practical Guide to the Simplex Method of Linear Programming Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear

More information

Lecture Topic: Low-Rank Approximations

Lecture Topic: Low-Rank Approximations Lecture Topic: Low-Rank Approximations Low-Rank Approximations We have seen principal component analysis. The extraction of the first principle eigenvalue could be seen as an approximation of the original

More information

Introduction to computer science

Introduction to computer science Introduction to computer science Michael A. Nielsen University of Queensland Goals: 1. Introduce the notion of the computational complexity of a problem, and define the major computational complexity classes.

More information

Microeconomic Theory: Basic Math Concepts

Microeconomic Theory: Basic Math Concepts Microeconomic Theory: Basic Math Concepts Matt Van Essen University of Alabama Van Essen (U of A) Basic Math Concepts 1 / 66 Basic Math Concepts In this lecture we will review some basic mathematical concepts

More information