Big Data. Lecture 6: Locality Sensitive Hashing (LSH)

1 Big Data Lecture 6: Locality Sensitive Hashing (LSH)

2 Nearest Neighbor Given a set P of n points in $\mathbb{R}^d$

3 Nearest Neighbor Want to build a data structure to answer nearest neighbor queries

4 Voronoi Diagram Build a Voronoi diagram and a point location data structure

5 Curse of dimensionality In $\mathbb{R}^2$ the Voronoi diagram has size $O(n)$ and a query takes $O(\log n)$ time. In $\mathbb{R}^d$ the complexity is $O(n^{\lceil d/2 \rceil})$. Other techniques also scale badly with the dimension.

6 Locality Sensitive Hashing We will use a family of hash functions such that close points tend to hash to the same bucket. Put all points of P in their buckets; ideally we want the query q to find its nearest neighbor in its own bucket.

7 Locality Sensitive Hashing Def (Charikar): A family H of functions is locality sensitive with respect to a similarity function $0 \le \mathrm{sim}(p,q) \le 1$ if $\Pr_{h \in H}[h(p) = h(q)] = \mathrm{sim}(p,q)$.

8 Example: Hamming Similarity Think of the points as strings of m bits and consider the similarity $\mathrm{sim}(p,q) = 1 - \mathrm{ham}(p,q)/m$. The family $H = \{h_i(p) = \text{the } i\text{-th bit of } p\}$ is locality sensitive with respect to this similarity: $\Pr[h(p) = h(q)] = 1 - \mathrm{ham}(p,q)/m$, and $1 - \mathrm{sim}(p,q) = \mathrm{ham}(p,q)/m$.
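
The bit-sampling claim above can be checked empirically. The following is a minimal sketch, not part of the original slides: the helper names `hamming` and `collision_rate`, the two example bit strings, and the trial count are all illustrative assumptions.

```python
import random

def hamming(p, q):
    """Number of positions where the bit strings p and q differ."""
    return sum(a != b for a, b in zip(p, q))

def collision_rate(p, q, trials=20000, seed=0):
    """Empirical Pr[h_i(p) = h_i(q)] when the bit index i is drawn uniformly."""
    rng = random.Random(seed)
    m = len(p)
    hits = 0
    for _ in range(trials):
        i = rng.randrange(m)      # draw one hash function h_i from H
        hits += p[i] == q[i]      # collision iff the i-th bits agree
    return hits / trials

# Hypothetical example points: they differ in 2 of m = 8 bits.
p = [1, 0, 1, 1, 0, 0, 1, 0]
q = [1, 0, 0, 1, 0, 1, 1, 0]
sim = 1 - hamming(p, q) / len(p)  # similarity from the slide's definition
est = collision_rate(p, q)
print(sim, est)
```

The estimate should land close to `sim`, since a uniformly chosen bit position agrees on exactly `m - ham(p,q)` of the `m` positions.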

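9 Example: Jaccard Think of p and q as sets. $\mathrm{sim}(p,q) = \mathrm{jaccard}(p,q) = |p \cap q| / |p \cup q|$. $H = \{h_\pi(p) = \text{the minimum item of } p \text{ under the permutation } \pi\}$. $\Pr[h_\pi(p) = h_\pi(q)] = \mathrm{jaccard}(p,q)$. We need to pick $\pi$ from a min-wise independent family of permutations.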
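
A small simulation can illustrate the min-hash property. This sketch assumes fully random permutations (which are min-wise independent) rather than a compact min-wise independent family; the function names, the example sets, and the universe `range(10)` are hypothetical.

```python
import random

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two non-empty sets."""
    return len(a & b) / len(a | b)

def minhash_collision_rate(a, b, universe, trials=5000, seed=1):
    """Empirical Pr[h_pi(a) = h_pi(b)] over uniformly random permutations pi."""
    rng = random.Random(seed)
    items = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(items)                          # a fresh random permutation
        rank = {x: i for i, x in enumerate(items)}  # position of each item under pi
        # h_pi(s) is the item of s that comes first under pi
        hits += min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__)
    return hits / trials

a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
exact = jaccard(a, b)                               # 2 shared items out of 6
est = minhash_collision_rate(a, b, range(10))
print(exact, est)
```

The two hashes agree exactly when the first item of the union under the permutation lies in the intersection, which is why the collision rate tracks the Jaccard similarity.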

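10 Map to {0,1} Draw a function b to 0/1 from a pairwise independent family B. So: $h(p) \ne h(q) \Rightarrow \Pr[b(h(p)) = b(h(q))] = 1/2$. $H' = \{b(h(\cdot)) \mid h \in H,\ b \in B\}$. $\Pr[b(h(p)) = b(h(q))] = \mathrm{sim}(p,q) + \frac{1 - \mathrm{sim}(p,q)}{2} = \frac{1 + \mathrm{sim}(p,q)}{2}$.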

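11 Another example ("simhash") $H = \{h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}$

12 Another example $H = \{h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}$. $\Pr[h_r(p) = h_r(q)] = {?}$

13 Another example $H = \{h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}$. $\Pr[h_r(p) = h_r(q)] = 1 - \theta/\pi$, where $\theta$ is the angle between p and q.

14 Another example $H = \{h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}$. $\Pr[h_r(p) = h_r(q)] = 1 - \theta/\pi \equiv \mathrm{sim}(p,q)$, where $\theta$ is the angle between p and q.

15 Another example $H = \{h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}$. $\Pr[h_r(p) = h_r(q)] = 1 - \theta/\pi \equiv \mathrm{sim}(p,q)$, where $\theta$ is the angle between p and q. For binary vectors (like term-document incidence vectors): $\theta = \cos^{-1}\left(\frac{|A \cap B|}{\sqrt{|A|\,|B|}}\right)$.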
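
The random-hyperplane collision probability can be checked numerically. In this sketch the random direction r is drawn as a vector of independent Gaussians; it is not normalized because only the sign of the dot product matters. The helper names and the example vectors are assumptions made for illustration.

```python
import math
import random

def angle(p, q):
    """Angle in radians between the non-zero vectors p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def simhash_bit(r, p):
    """h_r(p) = 1 if r . p > 0, else 0."""
    return 1 if sum(a * b for a, b in zip(r, p)) > 0 else 0

def collision_rate(p, q, trials=20000, seed=2):
    """Empirical Pr[h_r(p) = h_r(q)] over random directions r."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        r = [rng.gauss(0.0, 1.0) for _ in p]   # Gaussian vector: uniform direction
        hits += simhash_bit(r, p) == simhash_bit(r, q)
    return hits / trials

p = [1.0, 0.0]
q = [1.0, 1.0]                                 # angle pi/4 with p
expected = 1 - angle(p, q) / math.pi
est = collision_rate(p, q)
print(expected, est)
```

A hyperplane separates p from q exactly when its normal falls in a wedge of total angle 2θ out of 2π, which gives the 1 - θ/π collision probability.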

16 How do we really use it? Reduce the number of false positives by concatenating hash functions to get new hash functions (a "signature"): $\mathrm{sig}(p) = h_1(p)\,h_2(p)\,h_3(p)\,h_4(p) \ldots$ Very close documents are hashed to the same bucket or to "close" buckets ($\mathrm{ham}(\mathrm{sig}(p), \mathrm{sig}(q))$ is small). See papers on removing almost duplicates.
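
As one way to picture signatures, the sketch below concatenates 64 random-hyperplane bits and compares signatures by Hamming distance. The choice of simhash bits, the signature length, and the three example vectors are assumptions; the slide itself does not fix a hash family here.

```python
import random

def make_signature(dim, bits, seed=0):
    """Return sig(p): the concatenation of `bits` random-hyperplane hash bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

    def sig(p):
        return tuple(
            1 if sum(a * b for a, b in zip(r, p)) > 0 else 0 for r in planes
        )

    return sig

def ham(s, t):
    """Hamming distance between two equal-length signatures."""
    return sum(a != b for a, b in zip(s, t))

sig = make_signature(dim=3, bits=64)
doc = [1.0, 0.0, 0.0]
near = [1.0, 0.05, 0.0]    # almost the same direction as doc
far = [0.0, 1.0, 0.0]      # orthogonal to doc

d_near = ham(sig(doc), sig(near))
d_far = ham(sig(doc), sig(far))
print(d_near, d_far)
```

Near-duplicate inputs are expected to differ in only a few signature bits, while unrelated inputs differ in roughly half of them.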

17 A theoretical result on NN

18 Locality Sensitive Hashing Thm: If there exists a family H of hash functions such that $\Pr[h(p) = h(q)] = \mathrm{sim}(p,q)$, then $d(p,q) = 1 - \mathrm{sim}(p,q)$ satisfies the triangle inequality.
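
One standard way to see why the theorem holds is a union bound; the slides do not spell out a proof, so the following is a sketch, with w denoting an arbitrary third point (a symbol introduced here). If h(p) differs from h(q), then h(w) must differ from at least one of them.

```latex
\begin{align*}
d(p,q) &= 1 - \mathrm{sim}(p,q) = \Pr_{h \in H}[h(p) \ne h(q)] \\
       &\le \Pr_{h \in H}\bigl[h(p) \ne h(w) \ \text{or}\ h(w) \ne h(q)\bigr] \\
       &\le \Pr_{h \in H}[h(p) \ne h(w)] + \Pr_{h \in H}[h(w) \ne h(q)]
        = d(p,w) + d(w,q).
\end{align*}
```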

19 Locality Sensitive Hashing Alternative Def (Indyk-Motwani): A family H of functions is $(r_1 < r_2,\ p_1 > p_2)$-sensitive if $d(p,q) \le r_1 \Rightarrow \Pr[h(p) = h(q)] \ge p_1$ and $d(p,q) \ge r_2 \Rightarrow \Pr[h(p) = h(q)] \le p_2$. If $d(p,q) = 1 - \mathrm{sim}(p,q)$ then this holds with $p_1 = 1 - r_1$ and $p_2 = 1 - r_2$ for all $r_1, r_2$.

20 Locality Sensitive Hashing Alternative Def (Indyk-Motwani): A family H of functions is $(r_1 < r_2,\ p_1 > p_2)$-sensitive if $d(p,q) \le r_1 \Rightarrow \Pr[h(p) = h(q)] \ge p_1$ and $d(p,q) \ge r_2 \Rightarrow \Pr[h(p) = h(q)] \le p_2$. If $d(p,q) = \mathrm{ham}(p,q)$ then this holds with $p_1 = 1 - r_1/m$ and $p_2 = 1 - r_2/m$ for all $r_1, r_2$.

21 (r,ε)-neighbor problem 1) If there is a neighbor p such that $d(p,q) \le r$, return a point p' such that $d(p',q) \le (1+\varepsilon) r$. 2) If there is no p such that $d(p,q) \le (1+\varepsilon) r$, return nothing. ((1) is the real requirement: if we satisfy (1) only, we can satisfy (2) by filtering out answers that are too far.)

22 (r,ε)-neighbor problem 1) If there is a neighbor p such that $d(p,q) \le r$, return a point p' such that $d(p',q) \le (1+\varepsilon) r$. (Figure: radii r and (1+ε)r.)

23 (r,ε)-neighbor problem 2) Never return p such that $d(p,q) > (1+\varepsilon) r$. (Figure: radii r and (1+ε)r.)

24 (r,ε)-neighbor problem We can return p' such that $r \le d(p',q) \le (1+\varepsilon) r$. (Figure: radii r and (1+ε)r.)

25 (r,ε)-neighbor problem Let's construct a data structure that succeeds with constant probability. Focus on the Hamming distance first.

26 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon) r,\ 1 - r/m > 1 - (1+\varepsilon) r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$.

27 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon) r,\ 1 - r/m > 1 - (1+\varepsilon) r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$, so to guarantee catching it we need $1/p_1$ functions.

28 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon) r,\ 1 - r/m > 1 - (1+\varepsilon) r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$, so to guarantee catching it we need $1/p_1$ functions. But we also get false positives in our $1/p_1$ buckets. How many?

29 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon) r,\ 1 - r/m > 1 - (1+\varepsilon) r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$, so to guarantee catching it we need $1/p_1$ functions. But we also get false positives in our $1/p_1$ buckets. How many? $n p_2 / p_1$.

30 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon) r,\ 1 - r/m > 1 - (1+\varepsilon) r/m)$-sensitive family. Make a new function by concatenating k of these basic functions. We get a $(r_1 < r_2,\ (p_1)^k > (p_2)^k)$-sensitive family. If there is a neighbor at distance r we catch it with probability $(p_1)^k$, so to guarantee catching it we need $1/(p_1)^k$ functions. But we also get false positives in our $1/(p_1)^k$ buckets. How many? $n (p_2)^k / (p_1)^k$.

31 (r,ε)-neighbor with constant probability Scan the first $4n (p_2)^k / (p_1)^k$ points in the buckets and return the closest. A close neighbor ($\le r_1$) is in one of the buckets with probability $\ge 1 - 1/e$. There are $\le 4n (p_2)^k / (p_1)^k$ false positives with probability $\ge 3/4$. Both events happen with constant probability.
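
The scheme above can be sketched as a small index over bit strings: L tables, each keyed by k sampled bit positions, with a query that scans a bounded number of bucket entries. The class name `HammingLSH`, its method names, and the parameter values in the usage lines are hypothetical; in the analysis L would be about $1/(p_1)^k$ and the scan limit about $4n (p_2)^k / (p_1)^k$.

```python
import random
from collections import defaultdict

class HammingLSH:
    """L hash tables, each keyed by the concatenation of k sampled bits."""

    def __init__(self, m, k, L, seed=0):
        rng = random.Random(seed)
        # positions[t] holds the k bit indices that define table t's hash
        self.positions = [[rng.randrange(m) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, t, p):
        return tuple(p[i] for i in self.positions[t])

    def insert(self, p):
        for t, table in enumerate(self.tables):
            table[self._key(t, p)].append(p)

    def query(self, q, limit):
        """Scan at most `limit` bucket entries; return (closest point, distance)."""
        best, best_d, seen = None, None, 0
        for t, table in enumerate(self.tables):
            for p in table.get(self._key(t, q), ()):
                d = sum(a != b for a, b in zip(p, q))
                if best is None or d < best_d:
                    best, best_d = p, d
                seen += 1
                if seen >= limit:
                    return best, best_d
        return best, best_d

# Illustrative usage with made-up parameters.
rng = random.Random(1)
points = [tuple(rng.randrange(2) for _ in range(16)) for _ in range(50)]
index = HammingLSH(m=16, k=4, L=3)
for p in points:
    index.insert(p)
best, dist = index.query(points[7], limit=1000)
print(best, dist)
```

A stored point always shares every bucket with itself, so querying with a stored point finds it at distance 0 as long as the scan limit is not exhausted first.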

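32 Analysis Total query time (each operation takes time proportional to the dimension): $k \left(\frac{1}{p_1}\right)^k + n \left(\frac{p_2}{p_1}\right)^k$. We want to choose k to minimize this. The first term grows with k and the second shrinks, so the k at which the two terms are equal gives a time of at most $2 \cdot \min$.

33 Analysis Total query time (each operation takes time proportional to the dimension): $k \left(\frac{1}{p_1}\right)^k + n \left(\frac{p_2}{p_1}\right)^k$. We want to choose k to minimize this: $k = n (p_2)^k \Rightarrow \frac{n}{k} = \left(\frac{1}{p_2}\right)^k \Rightarrow k = \log_{1/p_2}(n) - \Theta(\log\log n)$.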

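34 Summary Put $k = \log_{1/p_2}(n) - \Theta(\log\log n)$. Total query time: $k \left(\frac{1}{p_1}\right)^k + n \left(\frac{p_2}{p_1}\right)^k = O\left(n^{\rho} \log_{1/p_2} n\right)$, where $\rho = \frac{\log(1/p_1)}{\log(1/p_2)}$. Total space: $n \cdot n^{\rho}$.

35 What is ρ? Query time: $O\left(n^{\rho} \log_{1/p_2} n\right)$. Total space: $n \cdot n^{\rho}$. $\rho = \frac{\log(1/p_1)}{\log(1/p_2)} = \frac{\log\frac{1}{1 - r/m}}{\log\frac{1}{1 - (1+\varepsilon) r/m}} \le \frac{1}{1+\varepsilon}$.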
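
To get a feel for the numbers, the sketch below computes $p_1$, $p_2$, ρ, k and the number of tables for Hamming distance. As a simplification it uses $k = \lceil \log_{1/p_2} n \rceil$ and ignores the $\log\log n$ correction; the function name and the example values of n, m, r and ε are hypothetical.

```python
import math

def lsh_parameters(n, m, r, eps):
    """Parameters of bit-sampling LSH on m-bit strings for radius r and slack eps."""
    p1 = 1 - r / m                      # collision probability at distance r
    p2 = 1 - (1 + eps) * r / m          # collision probability at distance (1+eps) r
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(n) / math.log(1 / p2))   # simplified: ceil(log_{1/p2} n)
    L = math.ceil((1 / p1) ** k)        # number of tables, about 1 / p1^k
    return p1, p2, rho, k, L

n, m, r, eps = 10**6, 128, 8, 1.0
p1, p2, rho, k, L = lsh_parameters(n, m, r, eps)
print(p1, p2, round(rho, 4), k, L)
```

With this choice of k the number of tables stays within a factor $1/p_1$ of $n^{\rho}$, far below n, which is the point of the sublinear query bound.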

36 (1+ε)-approximate NN Given q, find p such that $d(q,p) \le (1+\varepsilon)\, d(q,p')$ for every $p' \in P$. We can use our solution to the (r,ε)-neighbor problem.

37 (1+ε)-approximate NN vs (r,ε)-neighbor problem If we know $r_{\min}$ and $r_{\max}$ we can find a (1+ε)-approximate NN using $\log(r_{\max}/r_{\min})$ $(r, \varepsilon' \approx \varepsilon/2)$-neighbor problems. (Figure: radii r and (1+ε)r.)
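
The reduction can be sketched as a scan over geometrically growing radii, returning the answer of the first (r, ε')-neighbor structure that reports a point. This is an illustration only: `brute_force_neighbor` is a hypothetical stand-in for an LSH structure, and all names and example values are assumptions.

```python
def radii(r_min, r_max, eps):
    """Radii r_min, (1+eps) r_min, (1+eps)^2 r_min, ... until r_max is covered."""
    out = [r_min]
    while out[-1] < r_max:
        out.append(out[-1] * (1 + eps))
    return out

def brute_force_neighbor(points, q, r, eps, dist):
    """Stand-in for an (r, eps)-neighbor structure.

    It answers exactly (eps is unused): the nearest point if it is within r,
    otherwise None. This satisfies the (r, eps)-neighbor contract.
    """
    best = min(points, key=lambda p: dist(p, q))
    return best if dist(best, q) <= r else None

def approx_nn(points, q, r_min, r_max, eps, dist):
    """Return (point, radius) from the smallest radius whose structure answers."""
    for r in radii(r_min, r_max, eps):
        p = brute_force_neighbor(points, q, r, eps, dist)
        if p is not None:
            return p, r
    return None, None

points = [0.0, 10.0]
result = approx_nn(points, 3.0, 1, 16, 0.5, lambda a, b: abs(a - b))
print(result)
```

If the structure at radius r answers and the one at the previous radius r/(1+ε') found nothing, the returned point is within a factor $(1+\varepsilon')^2$ of the true nearest distance, which is why ε' is taken to be about ε/2.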

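38 LSH using p-stable distributions Definition: A distribution D is 2-stable if, when $X_1, \ldots, X_d$ are drawn independently from D, $\sum_i v_i X_i$ has the same distribution as $\|v\|_2\, X$, where X is drawn from D. So what do we do with this? $h(p) = \sum_i p_i X_i$, so $h(p) - h(q) = \sum_i p_i X_i - \sum_i q_i X_i = \sum_i (p_i - q_i) X_i$, which is distributed as $\|p - q\|_2\, X$.

39 LSH using p-stable distributions Definition: A distribution D is 2-stable if, when $X_1, \ldots, X_d$ are drawn independently from D, $\sum_i v_i X_i$ has the same distribution as $\|v\|_2\, X$, where X is drawn from D. So what do we do with this? $h(p) = \left\lfloor \frac{p \cdot X + b}{r} \right\rfloor$. Pick r to minimize ρ.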
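
A minimal sketch of such a hash, assuming the Gaussian distribution as the 2-stable distribution and an offset b drawn uniformly from [0, r]. The function names, the bucket width r = 4 and the test points are illustrative choices, not values from the slides.

```python
import math
import random

def make_hash(dim, r, rng):
    """Return h(p) = floor((p . X + b) / r) with Gaussian X and uniform offset b."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # 2-stable (Gaussian) coordinates
    b = rng.uniform(0.0, r)                          # random shift of the bucket grid

    def h(p):
        projection = sum(pi * xi for pi, xi in zip(p, x))
        return math.floor((projection + b) / r)

    return h

def collision_rate(p, q, r=4.0, trials=2000, seed=3):
    """Empirical Pr[h(p) = h(q)] over independently drawn hash functions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_hash(len(p), r, rng)
        hits += h(p) == h(q)
    return hits / trials

near = collision_rate([0.0, 0.0], [0.1, 0.0])    # Euclidean distance 0.1
far = collision_rate([0.0, 0.0], [10.0, 0.0])    # Euclidean distance 10
print(near, far)
```

Because the projected gap is distributed as the Euclidean distance times a Gaussian, close points usually fall in the same width-r bucket while distant points rarely do.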

40 Bibliography
M. Charikar: Similarity estimation techniques from rounding algorithms. STOC 2002.
P. Indyk, R. Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
A. Gionis, P. Indyk, R. Motwani: Similarity Search in High Dimensions via Hashing. VLDB 1999.
M. R. Henzinger: Finding near-duplicate web pages: a large-scale evaluation of algorithms. SIGIR 2006.
G. S. Manku, A. Jain, A. Das Sarma: Detecting near-duplicates for web crawling. WWW 2007.


More information

A Review on Duplicate and Near Duplicate Documents Detection Technique

A Review on Duplicate and Near Duplicate Documents Detection Technique International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-03 E-ISSN: 2347-2693 A Review on Duplicate and Near Duplicate Documents Detection Technique Patil Deepali

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Building A Scalable Multimedia Search Engine Using Infiniband

Building A Scalable Multimedia Search Engine Using Infiniband Building A Scalable Multimedia Search Engine Using Infiniband Qi Chen 1, Yisheng Liao 2, Christopher Mitchell 2, Jinyang Li 2, and Zhen Xiao 1 1 Department of Computer Science, Peing University 2 Department

More information

Caching Dynamic Skyline Queries

Caching Dynamic Skyline Queries Caching Dynamic Skyline Queries D. Sacharidis 1, P. Bouros 1, T. Sellis 1,2 1 National Technical University of Athens 2 Institute for Management of Information Systems R.C. Athena Outline Introduction

More information

Big Data Begets Big Database Theory

Big Data Begets Big Database Theory Big Data Begets Big Database Theory Dan Suciu University of Washington 1 Motivation Industry analysts describe Big Data in terms of three V s: volume, velocity, variety. The data is too big to process

More information

Part II: Bidding, Dynamics and Competition. Jon Feldman S. Muthukrishnan

Part II: Bidding, Dynamics and Competition. Jon Feldman S. Muthukrishnan Part II: Bidding, Dynamics and Competition Jon Feldman S. Muthukrishnan Campaign Optimization Budget Optimization (BO): Simple Input: Set of keywords and a budget. For each keyword, (clicks, cost) pair.

More information

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan Handwritten Signature Verification ECE 533 Project Report by Ashish Dhawan Aditi R. Ganesan Contents 1. Abstract 3. 2. Introduction 4. 3. Approach 6. 4. Pre-processing 8. 5. Feature Extraction 9. 6. Verification

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

More information

SECTION 6: FIBER BUNDLES

SECTION 6: FIBER BUNDLES SECTION 6: FIBER BUNDLES In this section we will introduce the interesting class o ibrations given by iber bundles. Fiber bundles lay an imortant role in many geometric contexts. For examle, the Grassmaniann

More information

Product quantization for nearest neighbor search

Product quantization for nearest neighbor search Product quantization for nearest neighbor search Hervé Jégou, Matthijs Douze, Cordelia Schmid Abstract This paper introduces a product quantization based approach for approximate nearest neighbor search.

More information

Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

More information

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?

More information

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm. Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of

More information

Can linear programs solve NP-hard problems?

Can linear programs solve NP-hard problems? Can linear programs solve NP-hard problems? p. 1/9 Can linear programs solve NP-hard problems? Ronald de Wolf Linear programs Can linear programs solve NP-hard problems? p. 2/9 Can linear programs solve

More information

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs. Multimedia Databases Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Previous Lecture 13 Indexes for Multimedia Data 13.1

More information

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

More information

Text Clustering Using LucidWorks and Apache Mahout

Text Clustering Using LucidWorks and Apache Mahout Text Clustering Using LucidWorks and Apache Mahout (Nov. 17, 2012) 1. Module name Text Clustering Using Lucidworks and Apache Mahout 2. Scope This module introduces algorithms and evaluation metrics for

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

To determine vertical angular frequency, we need to express vertical viewing angle in terms of and. 2tan. (degree). (1 pt)

To determine vertical angular frequency, we need to express vertical viewing angle in terms of and. 2tan. (degree). (1 pt) Polytechnic University, Dept. Electrical and Computer Engineering EL6123 --- Video Processing, S12 (Prof. Yao Wang) Solution to Midterm Exam Closed Book, 1 sheet of notes (double sided) allowed 1. (5 pt)

More information

PulsON RangeNet / ALOHA Guide to Optimal Performance. Brandon Dewberry, CTO

PulsON RangeNet / ALOHA Guide to Optimal Performance. Brandon Dewberry, CTO TIME DOMAIN PulsON RangeNet / ALOHA Guide to Optimal Performance Brandon Dewberry, CTO 320-0318A November 2013 4955 Corporate Drive, Suite 101, Huntsville, Alabama 35805 Phone: 256.922.9229 Fax: 256.922.0387

More information

Storage Basics Architecting the Storage Supplemental Handout

Storage Basics Architecting the Storage Supplemental Handout Storage Basics Architecting the Storage Sulemental Handout INTRODUCTION With digital data growing at an exonential rate it has become a requirement for the modern business to store data and analyze it

More information

Comparison of Standard and Zipf-Based Document Retrieval Heuristics

Comparison of Standard and Zipf-Based Document Retrieval Heuristics Comparison of Standard and Zipf-Based Document Retrieval Heuristics Benjamin Hoffmann Universität Stuttgart, Institut für Formale Methoden der Informatik Universitätsstr. 38, D-70569 Stuttgart, Germany

More information

RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING

RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING = + RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING Stefan Savev Berlin Buzzwords June 2015 KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data sparsity:

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University

High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University High Performance Spatial Queries and Analytics for Spatial Big Data Fusheng Wang Department of Biomedical Informatics Emory University Introduction Spatial Big Data Geo-crowdsourcing:OpenStreetMap Remote

More information