Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

Size: px
Start display at page:

Download "Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)"

Transcription

1 Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven)

2 Algorithm design Algorithm: Set of steps to solve a problem (by a computer) Studied since 1950 s. Given a problem: Find (i) best solution (ii) quickly Traveling salesman problem (TSP) n! possibilities ( for n=60) Ideally: Polynomial running time (n 2, n 3 ) 33 cities, 1962 competition

3 70 s: Problems Polynomial time (n log n, n 2,n 3 ) E.g. Shortest path, matching, max-flow,... NP-Hard: TSP + most other problems (brute force 2 n = only option) Late 80 s-now: Coping with NP-hardness Approximation algorithms: Even if NP-Hard, may be a 95% optimal solution can be found in polynomial time? (very rich theory/connections) Heuristic Approaches: (often successful in practice)

4 Heuristic methods (TSP) 120 Germany, USA, ,509 USA, ,978 Sweden, ,900 VLSI ,000 USA, still unsolved

5 Tools seemed quite powerful Tools Problems

6 Last few years

7 Google: Billions of webpages Example Figure out which documents are similar (various reasons: show diverse pages for a query) Polynomial/non-polynomial view is too limited (Even n 2 time is prohibitive for huge n) Don t care about perfect answer. Pages change/disappear, No perfect notion of similarity anyway Want a quick solution. Some error is alright.

8 Example 2 Facebook: Which 10,000 users (among many millions) should be shown this ad? Want a quick solution. Some error is alright.

9 Very different questions Needed: Totally new ways of thinking Discard old beliefs Beautiful new ideas emerging

10 Rest of the talk A glimpse of some ideas 1) Counting distinct elements 2) Correlation Clustering 3) Local Partitioning Concluding Remarks

11 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { }

12 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3 }

13 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3, 4}

14 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3, 4}

15 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3, 4, 2}

16 Counting Distinct elements Note: The algorithm tracks the numbers seen thus far. Question: What if it can remember only 1 number? (very limited memory) Trouble: Can barely remember anything about past. Stream: When you scan next element, no clue if already seen?

17 Seems completely hopeless? Intuition only partly right Cannot hope to count exactly. But who cares if answer is 3,425,269 or 3,425,587? Approximate counting possible!! Technique: Min-hashing (beautiful use of approximation and randomization)

18 Number of distinct elements Basic Idea [Flajolet-Martin 82]: Use a random hash function (map). (e.g. encryption function) h:[1,n] -> [1,n ] say n >> n Algorithm: Keep track of min h(i) Stream = h(2) n n

19 Number of distinct elements Basic Idea [Flajolet-Martin 82]: Use a random hash function (map). (e.g. encryption function) h:[1,n] -> [1,n ] say n >> n Algorithm: Keep track of min h(i) Stream = h(8) n n

20 Number of distinct elements Basic Idea [Flajolet-Martin 82]: Use a random hash function (map). (e.g. encryption function) h:[1,n] -> [1,n ] say n >> n Algorithm: Keep track of min h(i) Stream = h(8) n n Note: h(i) is same every time i is encountered.

21 Number of distinct elements Basic Idea [Flajolet-Martin 82]: Use a random hash function (map). (e.g. encryption function) h:[1,n] -> [1,n ] say n >> n Algorithm: Keep track of min h(i) Stream = h(8) n n Note: h(i) is same every time i is encountered.

22 Number of distinct elements Keep track of min h(i) Suppose 1 distinct element (stream = ) Min h(i) n / n n 1 n

23 Number of distinct elements Keep track of min h(i) Suppose 2 distinct elements Min h(i) n / n n 1 n If k items seen, expect min-value to be around n /(k+1) Solution: Estimate of # elements = n / min h(i) 1

24 Number of distinct elements Randomness could mess things up. E.g. May just 1 element, But min(h(i)) could be far. expect min(h(i)) = n /2 Standard trick: O(1) such hash functions, take median entry. Theorem: For any ε > 0, can estimate distinct elements to within 1 ± ε factor accuracy with high probability. Space = O 1 ε 2

25 A closer look Random hash function h. We need that h(i) value be same every time we see i. One has to store each h(i) some where. h(1), h(2), h(3),, h(n) need n log n space?? Did we just disguise our inherent problem? There is a fix! Key idea: Do not need full power of randomness

26 What is randomness? Do not need full randomness Pairwise independence: For any a 1, a 2, x 1, x 2 Pr [ h(x 1 ) = a 1 and h(x 2 ) = a 2 ] = 1/n n n Such an h is very simple to store h(x) = ax + b mod (p) [just need to store a and b]

27 Min-Hashing: Applications Similarity of Web pages (if mostly similar words) Google: Page -> Few min-hash values (few bits) Similar page detection: quadratic -> Linear time Sketching Complex Simple Tons of amazing applications ( several researchers )

28 Correlation Clustering (new model motivated by big data)

29 Clustering Cluster documents by topic Cluster users by items bought One of the most fundamental operations in data mining Traditional Approach: Objects -> points in some high dim. space Some distance function Some objective (k-median, k-means, ) Hope something useful comes out

30 Clustering Document clustering: Bag of words (traditional approach) Another approach (Bansal, Blum, Chawla) Correlation Clustering: Clustering via pairwise similarity. Classifier: Takes two documents and tells how similar they are dissimilar Doc 1 Doc 2 similar Idea: Use this classifier for clustering.

31 Correlation Clustering E.g. Run classifier on every pair of items dissimilar similar

32 In general, there could be inconsistencies dissimilar similar Any clustering, disagrees on at least one edge

33 In general, there could be inconsistencies dissimilar similar Any clustering, disagrees on at least one edge

34 In general, there could be inconsistencies dissimilar similar Any clustering, disagrees on at least one edge Goal: Find clustering agreeing on most edges Interesting approximation algorithms Quite successful Several extensions (not all pairs, which pair to probe, )

35 Local Partitioning (Light Networks, Philips)

36 Light Networks Wireless capability: Control, monitor, exchange performance data Segment controller for a region

37 Light Networks Goal: Partition network in pieces, s.t. each piece (i) Good intra-connectvity (ii) Roughly equal size (iii) Small diameter (few hops) (iv) Low failure probability Impossible to approach via traditional algorithms Idea (Bansal, Leeuwaarden, Mathijsen) Local partitioning algorithms are easy to tailor.

38 Local Partitioning Algorithms Studied by Spielman-Teng and Andersen-Peres (inspired by big data) Find a well-connected piece containing node v. v Time proportional to size of output piece

39 Local Algorithms

40 Local Algorithms

41 Local Algorithms

42 Local Algorithms Details: Poster of Britt Mathijsen

43 Concluding Remarks Very small glimpse (streaming and sketching, statistical learning, machine learning, dealing with noisy data, sub-linear algorithms, ) Exciting new algorithmic problems 1) Huge impact 2) Beautiful ideas 3) Interdisciplinary DSCE a platform to bring diverse groups and skills together

44 Thanks for your attention!

B490 Mining the Big Data. 0 Introduction

B490 Mining the Big Data. 0 Introduction B490 Mining the Big Data 0 Introduction Qin Zhang 1-1 Data Mining What is Data Mining? A definition : Discovery of useful, possibly unexpected, patterns in data. 2-1 Data Mining What is Data Mining? A

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

More information

Computer Algorithms. NP-Complete Problems. CISC 4080 Yanjun Li

Computer Algorithms. NP-Complete Problems. CISC 4080 Yanjun Li Computer Algorithms NP-Complete Problems NP-completeness The quest for efficient algorithms is about finding clever ways to bypass the process of exhaustive search, using clues from the input in order

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear

More information

Distance Degree Sequences for Network Analysis

Distance Degree Sequences for Network Analysis Universität Konstanz Computer & Information Science Algorithmics Group 15 Mar 2005 based on Palmer, Gibbons, and Faloutsos: ANF A Fast and Scalable Tool for Data Mining in Massive Graphs, SIGKDD 02. Motivation

More information

A Working Knowledge of Computational Complexity for an Optimizer

A Working Knowledge of Computational Complexity for an Optimizer A Working Knowledge of Computational Complexity for an Optimizer ORF 363/COS 323 Instructor: Amir Ali Ahmadi TAs: Y. Chen, G. Hall, J. Ye Fall 2014 1 Why computational complexity? What is computational

More information

1 Message Authentication

1 Message Authentication Theoretical Foundations of Cryptography Lecture Georgia Tech, Spring 200 Message Authentication Message Authentication Instructor: Chris Peikert Scribe: Daniel Dadush We start with some simple questions

More information

Scalable Machine Learning - or what to do with all that Big Data infrastructure

Scalable Machine Learning - or what to do with all that Big Data infrastructure - or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection

More information

Introduction to computer science

Introduction to computer science Introduction to computer science Michael A. Nielsen University of Queensland Goals: 1. Introduce the notion of the computational complexity of a problem, and define the major computational complexity classes.

More information

Data Science Center Eindhoven. Big Data: Challenges and Opportunities for Mathematicians. Alessandro Di Bucchianico

Data Science Center Eindhoven. Big Data: Challenges and Opportunities for Mathematicians. Alessandro Di Bucchianico Data Science Center Eindhoven Big Data: Challenges and Opportunities for Mathematicians Alessandro Di Bucchianico Dutch Mathematical Congress April 15, 2015 Contents 1. Big Data terminology 2. Various

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m. Universal hashing No matter how we choose our hash function, it is always possible to devise a set of keys that will hash to the same slot, making the hash scheme perform poorly. To circumvent this, we

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set

More information

SIMS 255 Foundations of Software Design. Complexity and NP-completeness

SIMS 255 Foundations of Software Design. Complexity and NP-completeness SIMS 255 Foundations of Software Design Complexity and NP-completeness Matt Welsh November 29, 2001 mdw@cs.berkeley.edu 1 Outline Complexity of algorithms Space and time complexity ``Big O'' notation Complexity

More information

Complexity Theory. IE 661: Scheduling Theory Fall 2003 Satyaki Ghosh Dastidar

Complexity Theory. IE 661: Scheduling Theory Fall 2003 Satyaki Ghosh Dastidar Complexity Theory IE 661: Scheduling Theory Fall 2003 Satyaki Ghosh Dastidar Outline Goals Computation of Problems Concepts and Definitions Complexity Classes and Problems Polynomial Time Reductions Examples

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12 MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:

More information

Lecture 4 Online and streaming algorithms for clustering

Lecture 4 Online and streaming algorithms for clustering CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

More information

B669 Sublinear Algorithms for Big Data

B669 Sublinear Algorithms for Big Data B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of

More information

Chapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

Chapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one

More information

Graph Theory and Complex Networks: An Introduction. Chapter 08: Computer networks

Graph Theory and Complex Networks: An Introduction. Chapter 08: Computer networks Graph Theory and Complex Networks: An Introduction Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 08: Computer networks Version: March 3, 2011 2 / 53 Contents

More information

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm. Approximation Algorithms 11 Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of three

More information

Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Topological Properties

Topological Properties Advanced Computer Architecture Topological Properties Routing Distance: Number of links on route Node degree: Number of channels per node Network diameter: Longest minimum routing distance between any

More information

B490 Mining the Big Data. 2 Clustering

B490 Mining the Big Data. 2 Clustering B490 Mining the Big Data 2 Clustering Qin Zhang 1-1 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern

More information

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman To motivate the Bloom-filter idea, consider a web crawler. It keeps, centrally, a list of all the URL s it has found so far. It

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Near Optimal Solutions

Near Optimal Solutions Near Optimal Solutions Many important optimization problems are lacking efficient solutions. NP-Complete problems unlikely to have polynomial time solutions. Good heuristics important for such problems.

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Analyzing the Facebook graph?

Analyzing the Facebook graph? Logistics Big Data Algorithmic Introduction Prof. Yuval Shavitt Contact: shavitt@eng.tau.ac.il Final grade: 4 6 home assignments (will try to include programing assignments as well): 2% Exam 8% Big Data

More information

Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman 2 In a DBMS, input is under the control of the programming staff. SQL INSERT commands or bulk loaders. Stream management is important

More information

NP-Completeness I. Lecture 19. 19.1 Overview. 19.2 Introduction: Reduction and Expressiveness

NP-Completeness I. Lecture 19. 19.1 Overview. 19.2 Introduction: Reduction and Expressiveness Lecture 19 NP-Completeness I 19.1 Overview In the past few lectures we have looked at increasingly more expressive problems that we were able to solve using efficient algorithms. In this lecture we introduce

More information

Discuss the size of the instance for the minimum spanning tree problem.

Discuss the size of the instance for the minimum spanning tree problem. 3.1 Algorithm complexity The algorithms A, B are given. The former has complexity O(n 2 ), the latter O(2 n ), where n is the size of the instance. Let n A 0 be the size of the largest instance that can

More information

Extreme Computing. Big Data. Stratis Viglas. School of Informatics University of Edinburgh sviglas@inf.ed.ac.uk. Stratis Viglas Extreme Computing 1

Extreme Computing. Big Data. Stratis Viglas. School of Informatics University of Edinburgh sviglas@inf.ed.ac.uk. Stratis Viglas Extreme Computing 1 Extreme Computing Big Data Stratis Viglas School of Informatics University of Edinburgh sviglas@inf.ed.ac.uk Stratis Viglas Extreme Computing 1 Petabyte Age Big Data Challenges Stratis Viglas Extreme Computing

More information

Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits

Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits Outline NP-completeness Examples of Easy vs. Hard problems Euler circuit vs. Hamiltonian circuit Shortest Path vs. Longest Path 2-pairs sum vs. general Subset Sum Reducing one problem to another Clique

More information

Part 2: Community Detection

Part 2: Community Detection Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -

More information

VEHICLE ROUTING PROBLEM

VEHICLE ROUTING PROBLEM VEHICLE ROUTING PROBLEM Readings: E&M 0 Topics: versus TSP Solution methods Decision support systems for Relationship between TSP and Vehicle routing problem () is similar to the Traveling salesman problem

More information

Private Approximation of Clustering and Vertex Cover

Private Approximation of Clustering and Vertex Cover Private Approximation of Clustering and Vertex Cover Amos Beimel, Renen Hallak, and Kobbi Nissim Department of Computer Science, Ben-Gurion University of the Negev Abstract. Private approximation of search

More information

Guessing Game: NP-Complete?

Guessing Game: NP-Complete? Guessing Game: NP-Complete? 1. LONGEST-PATH: Given a graph G = (V, E), does there exists a simple path of length at least k edges? YES 2. SHORTEST-PATH: Given a graph G = (V, E), does there exists a simple

More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

CAS CS 565, Data Mining

CAS CS 565, Data Mining CAS CS 565, Data Mining Course logistics Course webpage: http://www.cs.bu.edu/~evimaria/cs565-10.html Schedule: Mon Wed, 4-5:30 Instructor: Evimaria Terzi, evimaria@cs.bu.edu Office hours: Mon 2:30-4pm,

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Mining Data Streams. Chapter 4. 4.1 The Stream Data Model

Mining Data Streams. Chapter 4. 4.1 The Stream Data Model Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make

More information

Data Structures in Java. Session 15 Instructor: Bert Huang http://www1.cs.columbia.edu/~bert/courses/3134

Data Structures in Java. Session 15 Instructor: Bert Huang http://www1.cs.columbia.edu/~bert/courses/3134 Data Structures in Java Session 15 Instructor: Bert Huang http://www1.cs.columbia.edu/~bert/courses/3134 Announcements Homework 4 on website No class on Tuesday Midterm grades almost done Review Indexing

More information

Lecture 9 - Message Authentication Codes

Lecture 9 - Message Authentication Codes Lecture 9 - Message Authentication Codes Boaz Barak March 1, 2010 Reading: Boneh-Shoup chapter 6, Sections 9.1 9.3. Data integrity Until now we ve only been interested in protecting secrecy of data. However,

More information

File Management. Chapter 12

File Management. Chapter 12 Chapter 12 File Management File is the basic element of most of the applications, since the input to an application, as well as its output, is usually a file. They also typically outlive the execution

More information

Research Statement Immanuel Trummer www.itrummer.org

Research Statement Immanuel Trummer www.itrummer.org Research Statement Immanuel Trummer www.itrummer.org We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses

More information

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P) Distributed Computing over Communication Networks: Topology (with an excursion to P2P) Some administrative comments... There will be a Skript for this part of the lecture. (Same as slides, except for today...

More information

Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications

Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications Well-Separated Pair Decomposition for the Unit-disk Graph Metric and its Applications Jie Gao Department of Computer Science Stanford University Joint work with Li Zhang Systems Research Center Hewlett-Packard

More information

1 Formulating The Low Degree Testing Problem

1 Formulating The Low Degree Testing Problem 6.895 PCP and Hardness of Approximation MIT, Fall 2010 Lecture 5: Linearity Testing Lecturer: Dana Moshkovitz Scribe: Gregory Minton and Dana Moshkovitz In the last lecture, we proved a weak PCP Theorem,

More information

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research Algorithmic Techniques for Big Data Analysis Barna Saha AT&T Lab-Research Challenges of Big Data VOLUME Large amount of data VELOCITY Needs to be analyzed quickly VARIETY Different types of structured

More information

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm. Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of

More information

Statistical Learning Theory Meets Big Data

Statistical Learning Theory Meets Big Data Statistical Learning Theory Meets Big Data Randomized algorithms for frequent itemsets Eli Upfal Brown University Data, data, data In God we trust, all others (must) bring data Prof. W.E. Deming, Statistician,

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Diversity Coloring for Distributed Data Storage in Networks 1

Diversity Coloring for Distributed Data Storage in Networks 1 Diversity Coloring for Distributed Data Storage in Networks 1 Anxiao (Andrew) Jiang and Jehoshua Bruck California Institute of Technology Pasadena, CA 9115, U.S.A. {jax, bruck}@paradise.caltech.edu Abstract

More information

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it KNIME TUTORIAL Anna Monreale KDD-Lab, University of Pisa Email: annam@di.unipi.it Outline Introduction on KNIME KNIME components Exercise: Market Basket Analysis Exercise: Customer Segmentation Exercise:

More information

Ian Stewart on Minesweeper

Ian Stewart on Minesweeper Ian Stewart on Minesweeper It's not often you can win a million dollars by analysing a computer game, but by a curious conjunction of fate, there's a chance that you might. However, you'll only pick up

More information

Security-Aware Beacon Based Network Monitoring

Security-Aware Beacon Based Network Monitoring Security-Aware Beacon Based Network Monitoring Masahiro Sasaki, Liang Zhao, Hiroshi Nagamochi Graduate School of Informatics, Kyoto University, Kyoto, Japan Email: {sasaki, liang, nag}@amp.i.kyoto-u.ac.jp

More information

CS335 Sample Questions for Exam #2

CS335 Sample Questions for Exam #2 CS335 Sample Questions for Exam #2.) Compare connection-oriented with connectionless protocols. What type of protocol is IP? How about TCP and UDP? Connection-oriented protocols Require a setup time to

More information

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs. Multimedia Databases Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Previous Lecture 13 Indexes for Multimedia Data 13.1

More information

A Sublinear Bipartiteness Tester for Bounded Degree Graphs

A Sublinear Bipartiteness Tester for Bounded Degree Graphs A Sublinear Bipartiteness Tester for Bounded Degree Graphs Oded Goldreich Dana Ron February 5, 1998 Abstract We present a sublinear-time algorithm for testing whether a bounded degree graph is bipartite

More information

IC05 Introduction on Networks &Visualization Nov. 2009. <mathieu.bastian@gmail.com>

IC05 Introduction on Networks &Visualization Nov. 2009. <mathieu.bastian@gmail.com> IC05 Introduction on Networks &Visualization Nov. 2009 Overview 1. Networks Introduction Networks across disciplines Properties Models 2. Visualization InfoVis Data exploration

More information

U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009. Notes on Algebra

U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009. Notes on Algebra U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, 2009 Notes on Algebra These notes contain as little theory as possible, and most results are stated without proof. Any introductory

More information

P vs NP problem in the field anthropology

P vs NP problem in the field anthropology Research Article P vs NP problem in the field anthropology Michael.A. Popov, Oxford, UK Email Michael282.eps@gmail.com Keywords P =?NP - complexity anthropology - M -decision - quantum -like game - game-theoretical

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Infrastructures for big data

Infrastructures for big data Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)

More information

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 4, April 2015,

More information

1. Write the number of the left-hand item next to the item on the right that corresponds to it.

1. Write the number of the left-hand item next to the item on the right that corresponds to it. 1. Write the number of the left-hand item next to the item on the right that corresponds to it. 1. Stanford prison experiment 2. Friendster 3. neuron 4. router 5. tipping 6. small worlds 7. job-hunting

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Six Degrees of Separation in Online Society

Six Degrees of Separation in Online Society Six Degrees of Separation in Online Society Lei Zhang * Tsinghua-Southampton Joint Lab on Web Science Graduate School in Shenzhen, Tsinghua University Shenzhen, Guangdong Province, P.R.China zhanglei@sz.tsinghua.edu.cn

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction

More information

Methods & Tools Peer-to-Peer Jakob Jenkov

Methods & Tools Peer-to-Peer Jakob Jenkov Methods & Tools Peer-to-Peer Jakob Jenkov Peer-to-Peer (P2P) Definition(s) Potential Routing and Locating Proxy through firewalls and NAT Searching Security Pure P2P There is no central server or router.

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information Eric Hsueh-Chan Lu Chi-Wei Huang Vincent S. Tseng Institute of Computer Science and Information Engineering

More information

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere! Interconnection Networks Interconnection Networks Interconnection networks are used everywhere! Supercomputers connecting the processors Routers connecting the ports can consider a router as a parallel

More information

Lecture 15 - Digital Signatures

Lecture 15 - Digital Signatures Lecture 15 - Digital Signatures Boaz Barak March 29, 2010 Reading KL Book Chapter 12. Review Trapdoor permutations - easy to compute, hard to invert, easy to invert with trapdoor. RSA and Rabin signatures.

More information

Introduction to Machine Learning Using Python. Vikram Kamath

Introduction to Machine Learning Using Python. Vikram Kamath Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression

More information

QUANTUM COMPUTERS AND CRYPTOGRAPHY. Mark Zhandry Stanford University

QUANTUM COMPUTERS AND CRYPTOGRAPHY. Mark Zhandry Stanford University QUANTUM COMPUTERS AND CRYPTOGRAPHY Mark Zhandry Stanford University Classical Encryption pk m c = E(pk,m) sk m = D(sk,c) m??? Quantum Computing Attack pk m aka Post-quantum Crypto c = E(pk,m) sk m = D(sk,c)

More information

Zeros of Polynomial Functions

Zeros of Polynomial Functions Review: Synthetic Division Find (x 2-5x - 5x 3 + x 4 ) (5 + x). Factor Theorem Solve 2x 3-5x 2 + x + 2 =0 given that 2 is a zero of f(x) = 2x 3-5x 2 + x + 2. Zeros of Polynomial Functions Introduction

More information

Practical Survey on Hash Tables. Aurelian Țuțuianu

Practical Survey on Hash Tables. Aurelian Țuțuianu Practical Survey on Hash Tables Aurelian Țuțuianu In memoriam Mihai Pătraşcu (17 July 1982 5 June 2012) I have no intention to ever teach computer science. I want to teach the love for computer science,

More information

Exploring Big Data in Social Networks

Exploring Big Data in Social Networks Exploring Big Data in Social Networks virgilio@dcc.ufmg.br (meira@dcc.ufmg.br) INWEB National Science and Technology Institute for Web Federal University of Minas Gerais - UFMG May 2013 Some thoughts about

More information

Managing Incompleteness, Complexity and Scale in Big Data

Managing Incompleteness, Complexity and Scale in Big Data Managing Incompleteness, Complexity and Scale in Big Data Nick Duffield Electrical and Computer Engineering Texas A&M University http://nickduffield.net/work Three Challenges for Big Data Complexity Problem:

More information

The Feasibility of Supporting Large-Scale Live Streaming Applications with Dynamic Application End-Points

The Feasibility of Supporting Large-Scale Live Streaming Applications with Dynamic Application End-Points The Feasibility of Supporting Large-Scale Live Streaming Applications with Dynamic Application End-Points Kay Sripanidkulchai, Aditya Ganjam, Bruce Maggs, and Hui Zhang Instructor: Fabian Bustamante Presented

More information

Deduplication as security issue in cloud services, and its representation in Terms of Service Agreements

Deduplication as security issue in cloud services, and its representation in Terms of Service Agreements Deduplication as security issue in cloud services, and its representation in Terms of Service Agreements Cecilia Wirfelt Louise Wallin Email: {cecwi155, louwa538}@student.liu.se Supervisor: Jan-Åke Larsson,

More information

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering

More information

Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

More information

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015 Clustering Big Data Efficient Data Mining Technologies J Singh and Teresa Brooks June 4, 2015 Hello Bulgaria (http://hello.bg/) A website with thousands of pages... Some pages identical to other pages

More information

Introduction to Learning & Decision Trees

Introduction to Learning & Decision Trees Artificial Intelligence: Representation and Problem Solving 5-38 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning? - more than just memorizing

More information

Single machine models: Maximum Lateness -12- Approximation ratio for EDD for problem 1 r j,d j < 0 L max. structure of a schedule Q...

Single machine models: Maximum Lateness -12- Approximation ratio for EDD for problem 1 r j,d j < 0 L max. structure of a schedule Q... Lecture 4 Scheduling 1 Single machine models: Maximum Lateness -12- Approximation ratio for EDD for problem 1 r j,d j < 0 L max structure of a schedule 0 Q 1100 11 00 11 000 111 0 0 1 1 00 11 00 11 00

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

The Classes P and NP

The Classes P and NP The Classes P and NP We now shift gears slightly and restrict our attention to the examination of two families of problems which are very important to computer scientists. These families constitute the

More information

Factoring & Primality

Factoring & Primality Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount

More information

Outline. Computer Science 418. Digital Signatures: Observations. Digital Signatures: Definition. Definition 1 (Digital signature) Digital Signatures

Outline. Computer Science 418. Digital Signatures: Observations. Digital Signatures: Definition. Definition 1 (Digital signature) Digital Signatures Outline Computer Science 418 Digital Signatures Mike Jacobson Department of Computer Science University of Calgary Week 12 1 Digital Signatures 2 Signatures via Public Key Cryptosystems 3 Provable 4 Mike

More information

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92.

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92. Name: Email ID: CSE 326, Data Structures Section: Sample Final Exam Instructions: The exam is closed book, closed notes. Unless otherwise stated, N denotes the number of elements in the data structure

More information

DATA ANALYSIS IN PUBLIC SOCIAL NETWORKS

DATA ANALYSIS IN PUBLIC SOCIAL NETWORKS International Scientific Conference & International Workshop Present Day Trends of Innovations 2012 28 th 29 th May 2012 Łomża, Poland DATA ANALYSIS IN PUBLIC SOCIAL NETWORKS Lubos Takac 1 Michal Zabovsky

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information