An Algorithm For Fast Median Search

Size: px
Start display at page:

Download "An Algorithm For Fast Median Search"

Transcription

1 An Algorithm For Fast Median Search A. Juan and E. Vidal Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia, Spain. ajuan@dsic.upv.es Abstract VII National Symposium on Pattern Recognition and Image Analysis, proc. pp , 1997 Searching for a median of a set of patterns is a well-known technique to model the set or to aid searching for more accurate models such as a generalized median or a k-median. While medians and generalized medians are (relatively) easy to compute in the case of Euclidean representation spaces, this is unfortunately no longer true when more complex distance measures are to be used, which often happens in practical situations. In these cases, a direct method to perform median search is not feasible for large sets of patterns. In order to cope with this computational problem, an algorithm for median search is proposed which is faster than the direct method in terms of distance computations. Key Words: Median, Fast Search, Distance, Metric Spaces, Clustering, Nearest-Neighbour techniques. 1 Introduction Assume that we are given a set of patterns in an arbitrary (non-vector) representation space and a distance function to measure the dissimilarity between any pair of them. A well-known technique for modelling the given set consists of finding a pattern of the set whose sum of distances to all patterns is minimum. Such a pattern is called a median. The concept of median is related to a more general concept known as generalized median. This generalization arises when the search is not constrained to the given set, but extended to the whole pattern representation space. Although a generalized median constitutes a more accurate model than a median, finding such a model is often a difficult task that requires complex and specific (approximate) algorithms [2, 1]. Obviously, here we are excluding trivial cases like that of computing the generalized median of a set of patterns in a Euclidean representation space. In these cases where such computation is not trivial, a median can be used as an approximate generalized median or as a starting point for finding better ones [6]. Another generalization of the concept of median is known as k-median. As its name suggests, searching for a -median consists of finding a subset of patterns (models) such that the sum of distances between all the patterns and their corresponding nearest models is minimum. This problem, well-known as a prototype location problem, is NP-Hard [8]. On the other hand, it can be seen as a clustering problem since each possible subset of patterns induces a partition of the set into clusters and a measure of its quality. Good partitions are generally found using a simple k-means-like algorithm which starts on a given initial partition and then loops searching for a median of each cluster and reclassifying patterns according to their nearest medians [5]. Thus, the concept of median also turns out to be useful in finding more accurate models. Searching for a median of a set of Ò patterns is a relatively easy task since it suffices to compute ½ ¾ Ò Ò ½µ pairwise distances between patterns. However, this naive approach is no longer applicable when large sets of patterns are compared through complex distance functions like, for instance, those provided by algorithms of Sequence Comparison [9]. In order to cope with this computational problem, algorithms for fast median search are required but, unfortunately, available techniques are only useful in very particular cases [3, 7].

2 In [4] we proposed an algorithm for fast median search whose complexity is smaller than that of the direct approach both, measured in terms of distance computations, and measured in terms of time not alloted to distance computing (overhead). This algorithm was tested on artificial data and also embedded in a -means-like algorithm which was applied to a corpus of speech data [4]. Although it was found effective in reducing the number of computed distances, greater savings are desirable in practice. Here we propose a faster algorithm in terms of distance computations. A simulation experiment is reported supporting this claim. 2 The Algorithm Let Á ½ Ò be the set of indices associated with a set of Ò different patterns Ü ½ Ü ¾ Ü Ò in a (given) representation space. A metric or distance function is assumed to be given; that is, a function Á Á Ê that satisfies the following conditions µ ¼ (1) µ ¼ if (2) µ µ (3) µ µ µ (4) for all ¾ Á. The sum of distances between a pattern and all the patterns of a given set Ë Á is denoted as Ë µ ¼ ¾Ë if Ë µ if Ë (5) This value will be called the sum on Ë of or simply the sum of if Ë Á. In order to introduce the key idea which enables fast median search, let us suppose that the sums of some patterns have already been calculated. These patterns will be referred as the set Í of used data. As a by-product of this calculation, it is also assumed that the distances between used and unused patterns have been saved in a matrix, and that a pattern ¾ Í whose sum is minimum among those calculated so far has been chosen. Applying the triangle inequality to all ¾ Á we have µ µ µ µ µ µ µ µ µ µ Then, a lower bound of the sum of any unused pattern, Í µ, can be computed as Í µ Í µ ¾Á Í Í µ (6) where ¼ Í µ ÑÜ µ µ ¾Í if Í if Í (7) Notice that all distances involved in the calculation of (6) are available (have already been computed in previous steps). Obviously, we can avoid the calculation of the sum of any unused pattern whose corresponding lower bound is greater than or equals to.

3 Three issues have to be addressed before developing an algorithm based on (6). First, it is plain that we have to distinguish between the unused patterns whose lower bounds are smaller than and the rest of unused patterns. Only the former must be taken into account for sum calculation, so we call them the set of alive data. The rest of unused patterns will be referred as the set of eliminated data. An interesting question is which technique should we follow so as to calculate (6) without extra distance computations and as fast as possible in terms of overhead. It has been assumed that this calculation is carried out with the aid of a matrix where all distances between used and unused patterns are stored. Unfortunately, this technique is very expensive since the size of the required matrix is Í µ ¾ Ç Ò ¾ µ and each calculation of (6) takes Í µí ¾ Ç Ò ¾ µ steps. Let us see an alternative technique which dims the computational burden devoted to this calculation. Assume that we have an array ¾ R Ò where the sums on Í of all alive patterns are stored, and a matrix À ¾ R Ò Ò which contains the estimates provided by (7) between each alive and unused patterns. If a new pattern is selected for sum calculation, the distances involved in the computation of this sum can be used to calculate (6) for each alive pattern as Í µ µ ¾ ÑÜ À µ µµ Although the use of À keeps the space complexity at Ç Ò ¾ µ, this technique is more efficient since each calculation of (6) now takes ¾ Ç Òµ steps. The third issue we are concerned with is about choosing a criterion to select an alive pattern for sum calculation. Since we are interested in quickly choosing a median, it is clear that a natural criterion consists of selecting an alive pattern whose associated lower bound is minimum. The algorithm described in figure 1 searches for a median on the basis of the ideas just discussed. Its space complexity is Ò ¾ µ due to the use of the matrix À. Notice that this matrix can be implemented as a triangular array of size ½ Ò Ò ½µ since (7) fulfils conditions (1) and (3). ¾ Obviously, the time complexity of the algorithm depends on the nature of the distance function adopted. A worst case arises when the discrete metric is used, µ ¼ ½ if if In this case, the estimates given by (7) are the most pessimistic (zero), (6) reduces to sums on Í and no alive patterns are eliminated. As a result, the process ends computing as many distances as the direct method ( ½ ¾ Ò Ò ½µ) and the overhead becomes Ç Ò µ. Suppose now that patterns are represented as Ò different real numbers which are compared through the usual distance, µ Ü Ü where Ü Ü ¾ R ½ Ò. If the first pattern selected for sum calculation is the index of either the minimum or the maximum of these numbers, exact estimates will be given by (7) and thereby (6). Next, a median will be selected making alive patterns into eliminated and, consequently, directly leading to the end of the algorithm. This computational behaviour is optimal since it can be shown that the algorithm can not stop after the calculation of a single sum if Ò ¾. Therefore, although its overhead is Å Ò ¾ µ, it is worth noting that the algorithm can find a median by computing as little as ¾Ò distances. 3 A Simulation Experiment A simulation experiment was carried out on a ¼ (½¼¼ MHz clocked) PC so as to empirically test the computational behaviour of the algorithm. The experiment was based on the generation of sets of Ò

4 Algorithm for Fast Median Search Input A data set Á ½ Ò and a metric Á Á Ê. Output A median ¾ Á and its associated sum ¾ Ê. Variables Í Á; ¾ Ê Ò ; À ¾ Ê Ò Ò ; and ¾ Á. Method Initialize ½, Í, Á, ¼ and À ¼. while Î ¼ do Select Ö ÑÒ ¾ µ and transfer it from to Í. for all ¾ do Compute µ and update and. if then Update and. end if for all ¾ do Let. for all ¾ do Update À ÑÜ À µ and À. if then Transfer from to. end if end while Figure 1: Algorithm for Fast Median Search. artificial patterns (Ò ½ ¾ ½¼¾) randomly drawn from a uniform distribution in the unit - dimensional hypercube ( ¾ ½¼), and compared through the Euclidean distance. Although the algorithm is not intended to deal with patterns in a Euclidean space, we chose such kind of patterns since they are easy to generate. On the other hand, the algorithm never took advantage of their coordinates, but only the distances were used. For each value of Ò and, the algorithm was tested on ½¼¼ independently generated sets of patterns and the resulting numbers of distance computations were averaged so as to get an estimate of the mean number of distances computed by the algorithm. A total of estimates were obtained for different values of Ò and. From these estimates, the percentages with respect to ½ Ò Ò ½µ (number of distances ¾ computed by the direct method) are plotted in figure 2 together with their corresponding ± confidence intervals. In terms of distance computations, these results are better than those (not reported here) of our previous fast algorithm [4] tested on the very same conditions. Although for large dimensions only modest savings are obtained, it should be pointed out that these savings tend to increase as Ò increases and they can be greater than ¼± for low dimensions. On the other hand, the overhead of the new algorithm is significantly higher than that of the previous algorithm [4], which makes the new version only appropriate for computationally expensive distance functions.

5 100 Ʊ ½¼ ¾ Ò Figure 2: Average number of distances computed by the new algorithm (together with its corresponding ± confidence interval), expressed as a percentage of the number of distances computed by the direct method (Ʊ). Each value was obtained by running the new algorithm on ½¼¼ independent sets of Ò artificial patterns randomly drawn from a uniform distribution in the unit -dimensional hypercube, and compared through the Euclidean distance. 4 Conclusions An algorithm for median search has been proposed which is faster than a previously proposed algorithm [4] in terms of distance computations. Its main application is to be used as a tool for modelling a set of patterns which are compared through a computationally expensive distance function. References [1] F. Casacuberta and M. D. de Antonio. A greedy algorithm for computing approximate Median Strings. In revision, [2] F. Casacuberta and C. de la Higuera. The topology of strings: two NP-Complete problems. In revision, [3] E. Horowitz and S. Sahni. Fundamentals of Computers Algorithms. Computer Science, [4] A. Juan and E. Vidal. An optimized Ã-means-like algorithm. In 4th Portuguese Conference on Pattern Recognition, pages 11 16, Coimbra, Portugal, March [5] A. Juan and E. Vidal. Fast -Means-like Clustering in Metric Spaces. Pattern Recognition Letters, 15(1):19 25, [6] T. Kohonen. Median Strings. Pattern Recognition Letters, 3: , [7] P. B. Mirchandani. The Ô-Median Problem and Generalizations. In Mirchandani and Francis [8], pages [8] P. B. Mirchandani and R. L. Francis, editors. Discrete Location Theory. Wiley, [9] D. Sankoff and J. B. Kruskal, editors. Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.

Efficient Recovery of Secrets

Efficient Recovery of Secrets Efficient Recovery of Secrets Marcel Fernandez Miguel Soriano, IEEE Senior Member Department of Telematics Engineering. Universitat Politècnica de Catalunya. C/ Jordi Girona 1 i 3. Campus Nord, Mod C3,

More information

Virtual Landmarks for the Internet

Virtual Landmarks for the Internet Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

Visualization of large data sets using MDS combined with LVQ.

Visualization of large data sets using MDS combined with LVQ. Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

INTERACTIVE DATA EXPLORATION USING MDS MAPPING

INTERACTIVE DATA EXPLORATION USING MDS MAPPING INTERACTIVE DATA EXPLORATION USING MDS MAPPING Antoine Naud and Włodzisław Duch 1 Department of Computer Methods Nicolaus Copernicus University ul. Grudziadzka 5, 87-100 Toruń, Poland Abstract: Interactive

More information

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

Visualization of General Defined Space Data

Visualization of General Defined Space Data International Journal of Computer Graphics & Animation (IJCGA) Vol.3, No.4, October 013 Visualization of General Defined Space Data John R Rankin La Trobe University, Australia Abstract A new algorithm

More information

Competitive Analysis of On line Randomized Call Control in Cellular Networks

Competitive Analysis of On line Randomized Call Control in Cellular Networks Competitive Analysis of On line Randomized Call Control in Cellular Networks Ioannis Caragiannis Christos Kaklamanis Evi Papaioannou Abstract In this paper we address an important communication issue arising

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

Self Organizing Maps: Fundamentals

Self Organizing Maps: Fundamentals Self Organizing Maps: Fundamentals Introduction to Neural Networks : Lecture 16 John A. Bullinaria, 2004 1. What is a Self Organizing Map? 2. Topographic Maps 3. Setting up a Self Organizing Map 4. Kohonen

More information

Instituto de Engenharia de Sistemas e Computadores de Coimbra Institute of Systems Engineering and Computers INESC Coimbra

Instituto de Engenharia de Sistemas e Computadores de Coimbra Institute of Systems Engineering and Computers INESC Coimbra Instituto de Engenharia de Sistemas e Computadores de Coimbra Institute of Systems Engineering and Computers INESC Coimbra João Clímaco and Marta Pascoal A new method to detere unsupported non-doated solutions

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Efficient Data Structures for Decision Diagrams

Efficient Data Structures for Decision Diagrams Artificial Intelligence Laboratory Efficient Data Structures for Decision Diagrams Master Thesis Nacereddine Ouaret Professor: Supervisors: Boi Faltings Thomas Léauté Radoslaw Szymanek Contents Introduction...

More information

Metric Spaces. Chapter 7. 7.1. Metrics

Metric Spaces. Chapter 7. 7.1. Metrics Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Euclidean Minimum Spanning Trees Based on Well Separated Pair Decompositions Chaojun Li. Advised by: Dave Mount. May 22, 2014

Euclidean Minimum Spanning Trees Based on Well Separated Pair Decompositions Chaojun Li. Advised by: Dave Mount. May 22, 2014 Euclidean Minimum Spanning Trees Based on Well Separated Pair Decompositions Chaojun Li Advised by: Dave Mount May 22, 2014 1 INTRODUCTION In this report we consider the implementation of an efficient

More information

ENUMERATION OF INTEGRAL TETRAHEDRA

ENUMERATION OF INTEGRAL TETRAHEDRA ENUMERATION OF INTEGRAL TETRAHEDRA SASCHA KURZ Abstract. We determine the numbers of integral tetrahedra with diameter d up to isomorphism for all d 1000 via computer enumeration. Therefore we give an

More information

How To Identify Noisy Variables In A Cluster

How To Identify Noisy Variables In A Cluster Identification of noisy variables for nonmetric and symbolic data in cluster analysis Marek Walesiak and Andrzej Dudek Wroclaw University of Economics, Department of Econometrics and Computer Science,

More information

Confidence Intervals for One Standard Deviation Using Standard Deviation

Confidence Intervals for One Standard Deviation Using Standard Deviation Chapter 640 Confidence Intervals for One Standard Deviation Using Standard Deviation Introduction This routine calculates the sample size necessary to achieve a specified interval width or distance from

More information

Management of Software Projects with GAs

Management of Software Projects with GAs MIC05: The Sixth Metaheuristics International Conference 1152-1 Management of Software Projects with GAs Enrique Alba J. Francisco Chicano Departamento de Lenguajes y Ciencias de la Computación, Universidad

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

More information

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm. Approximation Algorithms 11 Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of three

More information

3D Distance from a Point to a Triangle

3D Distance from a Point to a Triangle 3D Distance from a Point to a Triangle Mark W. Jones Technical Report CSR-5-95 Department of Computer Science, University of Wales Swansea February 1995 Abstract In this technical report, two different

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Optimal Planning of Closed Loop Supply Chains: A Discrete versus a Continuous-time formulation

Optimal Planning of Closed Loop Supply Chains: A Discrete versus a Continuous-time formulation 17 th European Symposium on Computer Aided Process Engineering ESCAPE17 V. Plesu and P.S. Agachi (Editors) 2007 Elsevier B.V. All rights reserved. 1 Optimal Planning of Closed Loop Supply Chains: A Discrete

More information

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances It is possible to construct a matrix X of Cartesian coordinates of points in Euclidean space when we know the Euclidean

More information

Chapter 19. General Matrices. An n m matrix is an array. a 11 a 12 a 1m a 21 a 22 a 2m A = a n1 a n2 a nm. The matrix A has n row vectors

Chapter 19. General Matrices. An n m matrix is an array. a 11 a 12 a 1m a 21 a 22 a 2m A = a n1 a n2 a nm. The matrix A has n row vectors Chapter 9. General Matrices An n m matrix is an array a a a m a a a m... = [a ij]. a n a n a nm The matrix A has n row vectors and m column vectors row i (A) = [a i, a i,..., a im ] R m a j a j a nj col

More information

Movie Classification Using k-means and Hierarchical Clustering

Movie Classification Using k-means and Hierarchical Clustering Movie Classification Using k-means and Hierarchical Clustering An analysis of clustering algorithms on movie scripts Dharak Shah DA-IICT, Gandhinagar Gujarat, India dharak_shah@daiict.ac.in Saheb Motiani

More information

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

Load balancing in a heterogeneous computer system by self-organizing Kohonen network Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

Chapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

Chapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one

More information

IE 680 Special Topics in Production Systems: Networks, Routing and Logistics*

IE 680 Special Topics in Production Systems: Networks, Routing and Logistics* IE 680 Special Topics in Production Systems: Networks, Routing and Logistics* Rakesh Nagi Department of Industrial Engineering University at Buffalo (SUNY) *Lecture notes from Network Flows by Ahuja, Magnanti

More information

B490 Mining the Big Data. 2 Clustering

B490 Mining the Big Data. 2 Clustering B490 Mining the Big Data 2 Clustering Qin Zhang 1-1 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern

More information

Supervised and unsupervised learning - 1

Supervised and unsupervised learning - 1 Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

More information

These axioms must hold for all vectors ū, v, and w in V and all scalars c and d.

These axioms must hold for all vectors ū, v, and w in V and all scalars c and d. DEFINITION: A vector space is a nonempty set V of objects, called vectors, on which are defined two operations, called addition and multiplication by scalars (real numbers), subject to the following axioms

More information

Intersection of a Line and a Convex. Hull of Points Cloud

Intersection of a Line and a Convex. Hull of Points Cloud Applied Mathematical Sciences, Vol. 7, 213, no. 13, 5139-5149 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/1.12988/ams.213.37372 Intersection of a Line and a Convex Hull of Points Cloud R. P. Koptelov

More information

Clustering and scheduling maintenance tasks over time

Clustering and scheduling maintenance tasks over time Clustering and scheduling maintenance tasks over time Per Kreuger 2008-04-29 SICS Technical Report T2008:09 Abstract We report results on a maintenance scheduling problem. The problem consists of allocating

More information

Personalized Hierarchical Clustering

Personalized Hierarchical Clustering Personalized Hierarchical Clustering Korinna Bade, Andreas Nürnberger Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany {kbade,nuernb}@iws.cs.uni-magdeburg.de

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

New Hash Function Construction for Textual and Geometric Data Retrieval

New Hash Function Construction for Textual and Geometric Data Retrieval Latest Trends on Computers, Vol., pp.483-489, ISBN 978-96-474-3-4, ISSN 79-45, CSCC conference, Corfu, Greece, New Hash Function Construction for Textual and Geometric Data Retrieval Václav Skala, Jan

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

HDDVis: An Interactive Tool for High Dimensional Data Visualization

HDDVis: An Interactive Tool for High Dimensional Data Visualization HDDVis: An Interactive Tool for High Dimensional Data Visualization Mingyue Tan Department of Computer Science University of British Columbia mtan@cs.ubc.ca ABSTRACT Current high dimensional data visualization

More information

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

A Non-Linear Schema Theorem for Genetic Algorithms

A Non-Linear Schema Theorem for Genetic Algorithms A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

K-Means Clustering Tutorial

K-Means Clustering Tutorial K-Means Clustering Tutorial By Kardi Teknomo,PhD Preferable reference for this tutorial is Teknomo, Kardi. K-Means Clustering Tutorials. http:\\people.revoledu.com\kardi\ tutorial\kmean\ Last Update: July

More information

TOOLS FOR 3D-OBJECT RETRIEVAL: KARHUNEN-LOEVE TRANSFORM AND SPHERICAL HARMONICS

TOOLS FOR 3D-OBJECT RETRIEVAL: KARHUNEN-LOEVE TRANSFORM AND SPHERICAL HARMONICS TOOLS FOR 3D-OBJECT RETRIEVAL: KARHUNEN-LOEVE TRANSFORM AND SPHERICAL HARMONICS D.V. Vranić, D. Saupe, and J. Richter Department of Computer Science, University of Leipzig, Leipzig, Germany phone +49 (341)

More information

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm. Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of

More information

8 Visualization of high-dimensional data

8 Visualization of high-dimensional data 8 VISUALIZATION OF HIGH-DIMENSIONAL DATA 55 { plik roadmap8.tex January 24, 2005} 8 Visualization of high-dimensional data 8. Visualization of individual data points using Kohonen s SOM 8.. Typical Kohonen

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

Vector Quantization and Clustering

Vector Quantization and Clustering Vector Quantization and Clustering Introduction K-means clustering Clustering issues Hierarchical clustering Divisive (top-down) clustering Agglomerative (bottom-up) clustering Applications to speech recognition

More information

A Procedure for Classifying New Respondents into Existing Segments Using Maximum Difference Scaling

A Procedure for Classifying New Respondents into Existing Segments Using Maximum Difference Scaling A Procedure for Classifying New Respondents into Existing Segments Using Maximum Difference Scaling Background Bryan Orme and Rich Johnson, Sawtooth Software March, 2009 Market segmentation is pervasive

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

More information

17. Inner product spaces Definition 17.1. Let V be a real vector space. An inner product on V is a function

17. Inner product spaces Definition 17.1. Let V be a real vector space. An inner product on V is a function 17. Inner product spaces Definition 17.1. Let V be a real vector space. An inner product on V is a function, : V V R, which is symmetric, that is u, v = v, u. bilinear, that is linear (in both factors):

More information

LEARNING OBJECTIVES FOR THIS CHAPTER

LEARNING OBJECTIVES FOR THIS CHAPTER CHAPTER 2 American mathematician Paul Halmos (1916 2006), who in 1942 published the first modern linear algebra book. The title of Halmos s book was the same as the title of this chapter. Finite-Dimensional

More information

Fast Sequential Summation Algorithms Using Augmented Data Structures

Fast Sequential Summation Algorithms Using Augmented Data Structures Fast Sequential Summation Algorithms Using Augmented Data Structures Vadim Stadnik vadim.stadnik@gmail.com Abstract This paper provides an introduction to the design of augmented data structures that offer

More information

CONTINUED FRACTIONS AND FACTORING. Niels Lauritzen

CONTINUED FRACTIONS AND FACTORING. Niels Lauritzen CONTINUED FRACTIONS AND FACTORING Niels Lauritzen ii NIELS LAURITZEN DEPARTMENT OF MATHEMATICAL SCIENCES UNIVERSITY OF AARHUS, DENMARK EMAIL: niels@imf.au.dk URL: http://home.imf.au.dk/niels/ Contents

More information

24. The Branch and Bound Method

24. The Branch and Bound Method 24. The Branch and Bound Method It has serious practical consequences if it is known that a combinatorial problem is NP-complete. Then one can conclude according to the present state of science that no

More information

Shortcut sets for plane Euclidean networks (Extended abstract) 1

Shortcut sets for plane Euclidean networks (Extended abstract) 1 Shortcut sets for plane Euclidean networks (Extended abstract) 1 J. Cáceres a D. Garijo b A. González b A. Márquez b M. L. Puertas a P. Ribeiro c a Departamento de Matemáticas, Universidad de Almería,

More information

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Near Optimal Solutions

Near Optimal Solutions Near Optimal Solutions Many important optimization problems are lacking efficient solutions. NP-Complete problems unlikely to have polynomial time solutions. Good heuristics important for such problems.

More information

UNIVERSITY OF BOLTON SCHOOL OF ENGINEERING MS SYSTEMS ENGINEERING AND ENGINEERING MANAGEMENT SEMESTER 1 EXAMINATION 2015/2016 INTELLIGENT SYSTEMS

UNIVERSITY OF BOLTON SCHOOL OF ENGINEERING MS SYSTEMS ENGINEERING AND ENGINEERING MANAGEMENT SEMESTER 1 EXAMINATION 2015/2016 INTELLIGENT SYSTEMS TW72 UNIVERSITY OF BOLTON SCHOOL OF ENGINEERING MS SYSTEMS ENGINEERING AND ENGINEERING MANAGEMENT SEMESTER 1 EXAMINATION 2015/2016 INTELLIGENT SYSTEMS MODULE NO: EEM7010 Date: Monday 11 th January 2016

More information

A Novel Binary Particle Swarm Optimization

A Novel Binary Particle Swarm Optimization Proceedings of the 5th Mediterranean Conference on T33- A Novel Binary Particle Swarm Optimization Motaba Ahmadieh Khanesar, Member, IEEE, Mohammad Teshnehlab and Mahdi Aliyari Shoorehdeli K. N. Toosi

More information

Disjunction of Non-Binary and Numeric Constraint Satisfaction Problems

Disjunction of Non-Binary and Numeric Constraint Satisfaction Problems Disjunction of Non-Binary and Numeric Constraint Satisfaction Problems Miguel A. Salido, Federico Barber Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia Camino

More information

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,

More information

Reliable Systolic Computing through Redundancy

Reliable Systolic Computing through Redundancy Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/

More information

USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS

USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS Koua, E.L. International Institute for Geo-Information Science and Earth Observation (ITC).

More information

Dynamic Programming. Lecture 11. 11.1 Overview. 11.2 Introduction

Dynamic Programming. Lecture 11. 11.1 Overview. 11.2 Introduction Lecture 11 Dynamic Programming 11.1 Overview Dynamic Programming is a powerful technique that allows one to solve many different types of problems in time O(n 2 ) or O(n 3 ) for which a naive approach

More information

An Impossibility Theorem for Clustering

An Impossibility Theorem for Clustering An Impossibility Theorem for Clustering Jon Kleinberg Department of Computer Science Cornell University Ithaca NY 14853 Abstract Although the study of clustering is centered around an intuitively compelling

More information

MODEL DRIVEN DEVELOPMENT OF BUSINESS PROCESS MONITORING AND CONTROL SYSTEMS

MODEL DRIVEN DEVELOPMENT OF BUSINESS PROCESS MONITORING AND CONTROL SYSTEMS MODEL DRIVEN DEVELOPMENT OF BUSINESS PROCESS MONITORING AND CONTROL SYSTEMS Tao Yu Department of Computer Science, University of California at Irvine, USA Email: tyu1@uci.edu Jun-Jang Jeng IBM T.J. Watson

More information

Statistical tests for SPSS

Statistical tests for SPSS Statistical tests for SPSS Paolo Coletti A.Y. 2010/11 Free University of Bolzano Bozen Premise This book is a very quick, rough and fast description of statistical tests and their usage. It is explicitly

More information

Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico

Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

More information

Spatial Data Analysis

Spatial Data Analysis 14 Spatial Data Analysis OVERVIEW This chapter is the first in a set of three dealing with geographic analysis and modeling methods. The chapter begins with a review of the relevant terms, and an outlines

More information

A Novel Hyperbolic Smoothing Algorithm for Clustering Large Data Sets

A Novel Hyperbolic Smoothing Algorithm for Clustering Large Data Sets A Novel Hyperbolic Smoothing Algorithm for Clustering Large Data Sets Abstract Adilson Elias Xavier Vinicius Layter Xavier Dept. of Systems Engineering and Computer Science Graduate School of Engineering

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Neural Computation : Lecture 12 John A. Bullinaria, 2015 1. Recurrent Neural Network Architectures 2. State Space Models and Dynamical Systems 3. Backpropagation Through Time

More information

1 The Brownian bridge construction

1 The Brownian bridge construction The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof

More information

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed

More information

Elements of probability theory

Elements of probability theory 2 Elements of probability theory Probability theory provides mathematical models for random phenomena, that is, phenomena which under repeated observations yield di erent outcomes that cannot be predicted

More information

An Optimization Approach for Cooperative Communication in Ad Hoc Networks

An Optimization Approach for Cooperative Communication in Ad Hoc Networks An Optimization Approach for Cooperative Communication in Ad Hoc Networks Carlos A.S. Oliveira and Panos M. Pardalos University of Florida Abstract. Mobile ad hoc networks (MANETs) are a useful organizational

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

A New Approach to Cutting Tetrahedral Meshes

A New Approach to Cutting Tetrahedral Meshes A New Approach to Cutting Tetrahedral Meshes Menion Croll August 9, 2007 1 Introduction Volumetric models provide a realistic representation of three dimensional objects above and beyond what traditional

More information

Feature Selection vs. Extraction

Feature Selection vs. Extraction Feature Selection In many applications, we often encounter a very large number of potential features that can be used Which subset of features should be used for the best classification? Need for a small

More information

Private Record Linkage with Bloom Filters

Private Record Linkage with Bloom Filters To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,

More information

Risk Management for IT Security: When Theory Meets Practice

Risk Management for IT Security: When Theory Meets Practice Risk Management for IT Security: When Theory Meets Practice Anil Kumar Chorppath Technical University of Munich Munich, Germany Email: anil.chorppath@tum.de Tansu Alpcan The University of Melbourne Melbourne,

More information

Topological Data Analysis Applications to Computer Vision

Topological Data Analysis Applications to Computer Vision Topological Data Analysis Applications to Computer Vision Vitaliy Kurlin, http://kurlin.org Microsoft Research Cambridge and Durham University, UK Topological Data Analysis quantifies topological structures

More information

Dynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks

Dynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks Dynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks Benjamin Schiller and Thorsten Strufe P2P Networks - TU Darmstadt [schiller, strufe][at]cs.tu-darmstadt.de

More information

Sub-class Error-Correcting Output Codes

Sub-class Error-Correcting Output Codes Sub-class Error-Correcting Output Codes Sergio Escalera, Oriol Pujol and Petia Radeva Computer Vision Center, Campus UAB, Edifici O, 08193, Bellaterra, Spain. Dept. Matemàtica Aplicada i Anàlisi, Universitat

More information

An optimisation framework for determination of capacity in railway networks

An optimisation framework for determination of capacity in railway networks CASPT 2015 An optimisation framework for determination of capacity in railway networks Lars Wittrup Jensen Abstract Within the railway industry, high quality estimates on railway capacity is crucial information,

More information

Performance advantages of resource sharing in polymorphic optical networks

Performance advantages of resource sharing in polymorphic optical networks R. J. Durán, I. de Miguel, N. Merayo, P. Fernández, R. M. Lorenzo, E. J. Abril, I. Tafur Monroy, Performance advantages of resource sharing in polymorphic optical networks, Proc. of the 0th European Conference

More information

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea Proceedings of the 211 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. A BINARY PARTITION-BASED MATCHING ALGORITHM FOR DATA DISTRIBUTION MANAGEMENT Junghyun

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information