Big Data. Lecture 6: Locality Sensitive Hashing (LSH)
- Darcy Clinton Green
Transcription
1 Big Data Lecture 6: Locality Sensitive Hashing (LSH)
2 Nearest Neighbor Given a set P of n points in $R^d$
3 Nearest Neighbor Want to build a data structure to answer nearest neighbor queries
4 Voronoi Diagram Build a Voronoi diagram & a point location data structure
5 Curse of dimensionality In $R^2$ the Voronoi diagram is of size $O(n)$ and a query takes $O(\log n)$ time. In $R^d$ the complexity is $O(n^{\lceil d/2 \rceil})$. Other techniques also scale badly with the dimension.
6 Locality Sensitive Hashing We will use a family of hash functions such that close points tend to hash to the same bucket. Put all points of P in their buckets; ideally we want the query q to find its nearest neighbor in its bucket.
7 Locality Sensitive Hashing Def (Charikar): A family H of functions is locality sensitive with respect to a similarity function $0 \le sim(p,q) \le 1$ if $\Pr[h(p) = h(q)] = sim(p,q)$
8 Example Hamming Similarity Think of the points as strings of m bits and consider the similarity $sim(p,q) = 1 - ham(p,q)/m$. $H = \{h_i(p) = \text{the } i\text{-th bit of } p\}$ is locality sensitive wrt $sim(p,q) = 1 - ham(p,q)/m$: $\Pr[h(p) = h(q)] = 1 - ham(p,q)/m$, and $1 - sim(p,q) = ham(p,q)/m$
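The bit-sampling claim can be checked empirically. The following is a minimal sketch, assuming points are Python lists of 0/1 values; the names `hamming` and `collision_rate` and the example vectors are illustrative and not taken from the slides.

```python
import random

def hamming(p, q):
    """Number of positions where the bit strings p and q differ."""
    return sum(a != b for a, b in zip(p, q))

def collision_rate(p, q, trials=20000, seed=0):
    """Estimate Pr[h_i(p) = h_i(q)] for a uniformly random bit index i."""
    rng = random.Random(seed)
    m = len(p)
    hits = 0
    for _ in range(trials):
        i = rng.randrange(m)      # draw h_i from the family H
        hits += p[i] == q[i]      # collision iff the i-th bits agree
    return hits / trials

p = [1, 0, 1, 1, 0, 0, 1, 0]
q = [1, 1, 1, 0, 0, 0, 1, 1]
sim = 1 - hamming(p, q) / len(p)  # exact similarity 1 - ham(p,q)/m
est = collision_rate(p, q)        # empirical collision probability
print(sim, est)
```

The estimate should approach `sim` as the number of trials grows, since each agreeing bit position contributes a collision with probability 1/m.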
9 Example - Jaccard Think of p and q as sets. $sim(p,q) = jaccard(p,q) = |p \cap q| / |p \cup q|$. $H = \{h_\pi(p) = \text{min in } \pi \text{ of the items in } p\}$. $\Pr[h_\pi(p) = h_\pi(q)] = jaccard(p,q)$. Need to pick $\pi$ from a min-wise independent family of permutations.
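A small simulation illustrates why the min-hash collision probability equals the Jaccard similarity. This sketch uses fully random permutations (via `random.shuffle`) as a stand-in for a min-wise independent family, and it permutes only the union of the two sets, which does not change which element comes first; the function names are assumptions for illustration.

```python
import random

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two non-empty sets."""
    return len(a & b) / len(a | b)

def minhash_collision_rate(a, b, trials=5000, seed=0):
    """Estimate Pr[h_pi(a) = h_pi(b)] over random permutations pi."""
    rng = random.Random(seed)
    universe = sorted(a | b)
    hits = 0
    for _ in range(trials):
        perm = universe[:]
        rng.shuffle(perm)                        # a random permutation pi
        rank = {x: i for i, x in enumerate(perm)}
        # h_pi(s) is the item of s that appears first under pi
        hits += min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__)
    return hits / trials

a = {1, 2, 3, 4, 5}
b = {4, 5, 6, 7}
print(jaccard(a, b), minhash_collision_rate(a, b))
```

The two hashes agree exactly when the first element of the union under the permutation lies in the intersection, which happens with probability |a ∩ b| / |a ∪ b|.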
10 Map to {0,1} Draw a function b to 0/1 from a pairwise independent family B. So: $h(p) \ne h(q) \Rightarrow \Pr[b(h(p)) = b(h(q))] = 1/2$. $H' = \{b(h(\cdot)) \mid h \in H,\ b \in B\}$. $\Pr[b(h(p)) = b(h(q))] = sim(p,q) + \frac{1 - sim(p,q)}{2} = \frac{1 + sim(p,q)}{2}$
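11 Another example ("simhash") $H = \{h_r \mid r \text{ is a random unit vector}\}$, where $h_r(p) = 1$ if $r \cdot p > 0$ and $0$ otherwise
12 Another example $H = \{h_r \mid r \text{ is a random unit vector}\}$, where $h_r(p) = 1$ if $r \cdot p > 0$ and $0$ otherwise. $\Pr[h_r(p) = h_r(q)] = ?$
13 Another example $H = \{h_r \mid r \text{ is a random unit vector}\}$, where $h_r(p) = 1$ if $r \cdot p > 0$ and $0$ otherwise. $\Pr[h_r(p) = h_r(q)] = 1 - \theta/\pi$, where $\theta$ is the angle between p and q
14 Another example $H = \{h_r \mid r \text{ is a random unit vector}\}$, where $h_r(p) = 1$ if $r \cdot p > 0$ and $0$ otherwise. $\Pr[h_r(p) = h_r(q)] = 1 - \theta/\pi = sim(p,q)$
15 Another example $H = \{h_r \mid r \text{ is a random unit vector}\}$, where $h_r(p) = 1$ if $r \cdot p > 0$ and $0$ otherwise. $\Pr[h_r(p) = h_r(q)] = 1 - \theta/\pi = sim(p,q)$. For binary vectors (like term-doc incidence vectors) of sets A and B: $\theta = \cos^{-1}\left(\frac{|A \cap B|}{\sqrt{|A| \cdot |B|}}\right)$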
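The random-hyperplane collision probability $1 - \theta/\pi$ can be verified numerically. This is a sketch under the assumption that a vector of independent Gaussians is an acceptable substitute for a random unit vector (only the sign of $r \cdot p$ matters, so normalization is skipped); all names are illustrative.

```python
import math
import random

def h(r, p):
    """Random-hyperplane hash: 1 if r . p > 0, else 0."""
    return 1 if sum(ri * pi for ri, pi in zip(r, p)) > 0 else 0

def angle(p, q):
    """Angle theta between the vectors p and q, in radians."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def collision_rate(p, q, trials=20000, seed=0):
    """Estimate Pr[h_r(p) = h_r(q)] over random directions r."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Gaussian coordinates give a uniformly random direction.
        r = [rng.gauss(0.0, 1.0) for _ in range(len(p))]
        hits += h(r, p) == h(r, q)
    return hits / trials

p = [1.0, 0.0, 1.0]
q = [1.0, 1.0, 0.0]
expected = 1 - angle(p, q) / math.pi   # theta = pi/3 for these vectors
est = collision_rate(p, q)
print(expected, est)
```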
16 How do we really use it? Reduce the number of false positives by concatenating hash functions to get new hash functions (a "signature"): $sig(p) = h_1(p) h_2(p) h_3(p) h_4(p) \dots$ Very close documents are hashed to the same bucket or to "close" buckets ($ham(sig(p), sig(q))$ is small). See papers on removing almost duplicates.
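As an illustration of signatures, the sketch below concatenates k random-hyperplane bits and compares signatures by Hamming distance. The choice of hyperplane hashes, the signature length of 64, and the helper names `make_signature_fn` and `ham` are assumptions made for this example only.

```python
import random

def make_signature_fn(dim, k, seed=0):
    """Return sig(p): the concatenation of k random-hyperplane bits."""
    rng = random.Random(seed)
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

    def sig(p):
        return tuple(
            1 if sum(r_i * p_i for r_i, p_i in zip(r, p)) > 0 else 0
            for r in directions
        )

    return sig

def ham(s, t):
    """Hamming distance between two signatures of equal length."""
    return sum(a != b for a, b in zip(s, t))

sig = make_signature_fn(dim=4, k=64)
doc = [3.0, 1.0, 0.0, 2.0]
near = [3.0, 1.0, 0.1, 2.0]      # almost the same direction as doc
far = [-3.0, -1.0, 0.0, -2.0]    # exactly opposite direction
print(ham(sig(doc), sig(near)), ham(sig(doc), sig(far)))
```

Vectors at a small angle disagree on few signature bits, while opposite vectors disagree on every bit, so a small Hamming distance between signatures is evidence of near-duplication.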
17 A theoretical result on NN
18 Locality Sensitive Hashing Thm: If there exists a family H of hash functions such that $\Pr[h(p) = h(q)] = sim(p,q)$ then $d(p,q) = 1 - sim(p,q)$ satisfies the triangle inequality
19 Locality Sensitive Hashing Alternative Def (Indyk-Motwani): A family H of functions is $(r_1 < r_2,\ p_1 > p_2)$-sensitive if $d(p,q) \le r_1 \Rightarrow \Pr[h(p) = h(q)] \ge p_1$ and $d(p,q) \ge r_2 \Rightarrow \Pr[h(p) = h(q)] \le p_2$. If $d(p,q) = 1 - sim(p,q)$ then this holds with $p_1 = 1 - r_1$ and $p_2 = 1 - r_2$ for any $r_1 < r_2$
20 Locality Sensitive Hashing Alternative Def (Indyk-Motwani): A family H of functions is $(r_1 < r_2,\ p_1 > p_2)$-sensitive if $d(p,q) \le r_1 \Rightarrow \Pr[h(p) = h(q)] \ge p_1$ and $d(p,q) \ge r_2 \Rightarrow \Pr[h(p) = h(q)] \le p_2$. If $d(p,q) = ham(p,q)$ then this holds with $p_1 = 1 - r_1/m$ and $p_2 = 1 - r_2/m$ for any $r_1 < r_2$
21 (r,ε)-neighbor problem 1) If there is a neighbor p such that $d(p,q) \le r$, return p' s.t. $d(p',q) \le (1+\varepsilon) r$. 2) If there is no p s.t. $d(p,q) \le (1+\varepsilon) r$, return nothing. ((1) is the real requirement, since if we satisfy (1) only, we can satisfy (2) by filtering answers that are too far.)
22 (r,ε)-neighbor problem 1) If there is a neighbor p such that $d(p,q) \le r$, return p' s.t. $d(p',q) \le (1+\varepsilon) r$.
23 (r,ε)-neighbor problem 2) Never return p such that $d(p,q) > (1+\varepsilon) r$
24 (r,ε)-neighbor problem We can return p' s.t. $r \le d(p',q) \le (1+\varepsilon) r$.
25 (r,ε)-neighbor problem Let's construct a data structure that succeeds with constant probability. Focus on the Hamming distance first.
26 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon) r,\ 1 - r/m > 1 - (1+\varepsilon) r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$
27 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon) r,\ 1 - r/m > 1 - (1+\varepsilon) r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$, so to catch it (with constant probability) we need $1/p_1$ functions.
28 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon) r,\ 1 - r/m > 1 - (1+\varepsilon) r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$, so to catch it (with constant probability) we need $1/p_1$ functions. But we also get false positives in our $1/p_1$ buckets, how many?
29 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon) r,\ 1 - r/m > 1 - (1+\varepsilon) r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$, so to catch it (with constant probability) we need $1/p_1$ functions. But we also get false positives in our $1/p_1$ buckets, how many? In expectation, $n p_2 / p_1$
30 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon) r,\ 1 - r/m > 1 - (1+\varepsilon) r/m)$-sensitive family. Make a new function by concatenating k of these basic functions. We get a $(r_1 < r_2,\ p_1^k > p_2^k)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1^k$, so to catch it (with constant probability) we need $1/p_1^k$ functions. But we also get false positives in our $1/p_1^k$ buckets, how many? In expectation, $n p_2^k / p_1^k$
31 (r,ε)-neighbor with constant prob. Scan the first $4 n p_2^k / p_1^k$ points in the buckets and return the closest. A close neighbor ($\le r_1$) is in one of the buckets with probability $\ge 1 - 1/e$. There are $\le 4 n p_2^k / p_1^k$ false positives with probability $\ge 3/4$. Both events happen with constant probability.
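The scheme for Hamming distance (several tables, each keyed by k sampled bit positions, with a bounded scan at query time) could be sketched as follows. The class name `HammingLSH`, its parameters, and the example sizes are hypothetical; this is an illustration of the idea, not the lecture's implementation.

```python
import random
from collections import defaultdict

class HammingLSH:
    """Tables keyed by k sampled bit positions (bit-sampling LSH)."""

    def __init__(self, m, k, num_tables, seed=0):
        rng = random.Random(seed)
        # Each table uses its own concatenation of k basic functions h_i.
        self.positions = [[rng.randrange(m) for _ in range(k)]
                          for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, t, p):
        return tuple(p[i] for i in self.positions[t])

    def insert(self, p):
        for t, table in enumerate(self.tables):
            table[self._key(t, p)].append(p)

    def query(self, q, max_scan):
        """Scan at most max_scan bucket entries; return (closest, distance)."""
        best, best_d, scanned = None, None, 0
        for t, table in enumerate(self.tables):
            for p in table.get(self._key(t, q), []):
                d = sum(a != b for a, b in zip(p, q))
                if best is None or d < best_d:
                    best, best_d = p, d
                scanned += 1
                if scanned >= max_scan:
                    return best, best_d
        return best, best_d

rng = random.Random(1)
m = 32
points = [tuple(rng.randrange(2) for _ in range(m)) for _ in range(200)]
index = HammingLSH(m=m, k=8, num_tables=10)
for p in points:
    index.insert(p)

query = list(points[0])
query[5] ^= 1                      # flip one bit of a stored point
best, dist = index.query(tuple(query), max_scan=50)
print(dist)
```

In the analysis on the slides, `num_tables` plays the role of $1/p_1^k$ and `max_scan` the role of $4 n p_2^k / p_1^k$; a caller solving the (r,ε)-neighbor problem would additionally discard an answer whose distance exceeds $(1+\varepsilon) r$.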
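32 Analysis Total query time (each op takes time proportional to the dimension): $\frac{k}{p_1^k} + \frac{n p_2^k}{p_1^k}$. We want to choose k to minimize this. Balancing the two terms gives time $\le 2 \cdot \min$.
33 Analysis Total query time (each op takes time proportional to the dimension): $\frac{k}{p_1^k} + \frac{n p_2^k}{p_1^k}$. We want to choose k to minimize this: $k = n p_2^k \Rightarrow n = k (1/p_2)^k \Rightarrow k = \log_{1/p_2}(n) - \Theta(\log\log n)$
34 Summary Total query time: $\frac{k}{p_1^k} + \frac{n p_2^k}{p_1^k}$. Put $k = \log_{1/p_2}(n) - \Theta(\log\log n)$. Query time: $O(n^{\rho} \log_{1/p_2} n)$. Total space: $n \cdot n^{\rho}$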
35 What is ρ? Query time: $O(n^{\rho} \log_{1/p_2} n)$. Total space: $n \cdot n^{\rho}$. $\rho = \frac{\log(1/p_1)}{\log(1/p_2)} = \frac{\log(1 - r/m)}{\log(1 - (1+\varepsilon) r/m)} \le \frac{1}{1+\varepsilon}$
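To get a feel for the numbers, the sketch below computes $p_1$, $p_2$, $\rho$, a value of k, and the number of tables $1/p_1^k$ for the Hamming case. It uses the simpler choice $k = \lceil \log_{1/p_2} n \rceil$ (dropping the $\Theta(\log\log n)$ correction), which is an assumption made here for readability; the function name and the example parameters are illustrative.

```python
import math

def lsh_parameters(n, m, r, eps):
    """Parameters for bit-sampling LSH on m-bit strings, n points."""
    if not 0 < (1 + eps) * r < m:
        raise ValueError("need 0 < (1 + eps) * r < m")
    p1 = 1 - r / m                    # collision prob. at distance r
    p2 = 1 - (1 + eps) * r / m        # collision prob. at distance (1+eps) r
    rho = math.log(1 / p1) / math.log(1 / p2)
    # Simplified choice: smallest k with n * p2**k <= 1.
    k = math.ceil(math.log(n) / math.log(1 / p2))
    tables = math.ceil((1 / p1) ** k)  # about n**rho tables
    return {"p1": p1, "p2": p2, "rho": rho, "k": k, "tables": tables}

params = lsh_parameters(n=10**6, m=128, r=8, eps=1.0)
print(params)
```

With this choice of k the expected number of far points per table is at most one, and the number of tables grows roughly like $n^{\rho}$, which is sublinear because $\rho \le 1/(1+\varepsilon) < 1$.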
36 (1+ε)-approximate NN Given q find p such that $d(q,p) \le (1+\varepsilon)\, d(q,p')$ for every p' in P. We can use our solution to the (r,ε)-neighbor problem.
37 (1+ε)-approximate NN vs (r,ε)-neighbor problem If we know $r_{min}$ and $r_{max}$ we can find a (1+ε)-approximate NN using $\log(r_{max}/r_{min})$ $(r,\ \varepsilon' \approx \varepsilon/2)$-neighbor problems.
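One way to picture the reduction: build one (r,ε')-neighbor structure per radius on a geometric grid between $r_{min}$ and $r_{max}$, and answer with the first radius that reports a point. In the sketch below an exact brute-force search over 1-D points stands in for the LSH structure; `radii`, `brute_neighbor`, and `approx_nn` are hypothetical names, and the linear scan over radii could be replaced by a binary search.

```python
import math

def radii(r_min, r_max, eps):
    """Geometric grid r_min, r_min(1+eps), ... reaching at least r_max."""
    count = math.ceil(math.log(r_max / r_min) / math.log(1 + eps)) + 1
    return [r_min * (1 + eps) ** i for i in range(count)]

def brute_neighbor(points, q, r):
    """Stand-in for an (r, eps)-neighbor structure on 1-D points:
    return some point within distance r of q, or None."""
    for p in points:
        if abs(p - q) <= r:
            return p
    return None

def approx_nn(q, grid, neighbor_query):
    """Try radii from smallest to largest; return the first hit."""
    for r in grid:
        p = neighbor_query(q, r)
        if p is not None:
            return p
    return None

pts = [13.0, 10.0, 40.0]
grid = radii(1.0, 64.0, 0.5)
result = approx_nn(0.0, grid, lambda q, r: brute_neighbor(pts, q, r))
print(result)
```

If the first successful radius is r, the previous radius r/(1+ε') found nothing, so the true nearest neighbor is farther than r/(1+ε') and the returned point is within a (1+ε') factor of it (with an exact oracle; an approximate oracle adds another such factor, hence ε' ≈ ε/2).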
38 LSH using p-stable distributions Definition: A distribution D is 2-stable if when $X_1, \dots, X_d$ are drawn from D, $\sum_i v_i X_i = \|v\| X$ (equality in distribution) where X is drawn from D. So what do we do with this? $h(p) = \sum_i p_i X_i$, so $h(p) - h(q) = \sum_i p_i X_i - \sum_i q_i X_i = \sum_i (p_i - q_i) X_i = \|p - q\| X$
39 LSH using p-stable distributions Definition: A distribution D is 2-stable if when $X_1, \dots, X_d$ are drawn from D, $\sum_i v_i X_i = \|v\| X$ (equality in distribution) where X is drawn from D. So what do we do with this? $h(p) = \lfloor (p \cdot X + b)/r \rfloor$. Pick r to minimize ρ
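A minimal sketch of such a hash for Euclidean distance, using the standard Gaussian as the 2-stable distribution. Drawing the offset b uniformly from [0, r) is an assumption of this example (the slides only name b), as are the function names and the test vectors.

```python
import math
import random

def make_hash(dim, r, seed=0):
    """h(p) = floor((p . X + b) / r) with Gaussian X (2-stable)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, r)   # assumed: random offset in [0, r)

    def h(p):
        projection = sum(xi * pi for xi, pi in zip(x, p))
        return math.floor((projection + b) / r)

    return h

def collision_rate(p, q, r=4.0, trials=2000):
    """Fraction of independently drawn hash functions with h(p) = h(q)."""
    hits = 0
    for s in range(trials):
        h = make_hash(len(p), r, seed=s)
        hits += h(p) == h(q)
    return hits / trials

origin = [0.0, 0.0, 0.0]
close_rate = collision_rate(origin, [0.1, 0.0, 0.0])
far_rate = collision_rate(origin, [20.0, 0.0, 0.0])
print(close_rate, far_rate)
```

Because $h(p) - h(q)$ depends on the points only through $\|p - q\| X$, the collision probability decreases as the Euclidean distance grows, which is what the two printed rates illustrate.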
40 Bibliography
M. Charikar: Similarity estimation techniques from rounding algorithms. STOC 2002.
P. Indyk, R. Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
A. Gionis, P. Indyk, R. Motwani: Similarity Search in High Dimensions via Hashing. VLDB 1999.
M. R. Henzinger: Finding near-duplicate web pages: a large-scale evaluation of algorithms. SIGIR 2006.
G. S. Manku, A. Jain, A. Das Sarma: Detecting near-duplicates for web crawling. WWW 2007.
More informationPulsON RangeNet / ALOHA Guide to Optimal Performance. Brandon Dewberry, CTO
TIME DOMAIN PulsON RangeNet / ALOHA Guide to Optimal Performance Brandon Dewberry, CTO 320-0318A November 2013 4955 Corporate Drive, Suite 101, Huntsville, Alabama 35805 Phone: 256.922.9229 Fax: 256.922.0387
More informationStorage Basics Architecting the Storage Supplemental Handout
Storage Basics Architecting the Storage Sulemental Handout INTRODUCTION With digital data growing at an exonential rate it has become a requirement for the modern business to store data and analyze it
More informationComparison of Standard and Zipf-Based Document Retrieval Heuristics
Comparison of Standard and Zipf-Based Document Retrieval Heuristics Benjamin Hoffmann Universität Stuttgart, Institut für Formale Methoden der Informatik Universitätsstr. 38, D-70569 Stuttgart, Germany
More informationRANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING
= + RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING Stefan Savev Berlin Buzzwords June 2015 KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data sparsity:
More informationMACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
More informationHigh Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University
High Performance Spatial Queries and Analytics for Spatial Big Data Fusheng Wang Department of Biomedical Informatics Emory University Introduction Spatial Big Data Geo-crowdsourcing:OpenStreetMap Remote
More information