An Algorithm For Fast Median Search
|
|
- Emerald Carter
- 7 years ago
- Views:
Transcription
1 An Algorithm For Fast Median Search A. Juan and E. Vidal Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia, Spain. ajuan@dsic.upv.es Abstract VII National Symposium on Pattern Recognition and Image Analysis, proc. pp , 1997 Searching for a median of a set of patterns is a well-known technique to model the set or to aid searching for more accurate models such as a generalized median or a k-median. While medians and generalized medians are (relatively) easy to compute in the case of Euclidean representation spaces, this is unfortunately no longer true when more complex distance measures are to be used, which often happens in practical situations. In these cases, a direct method to perform median search is not feasible for large sets of patterns. In order to cope with this computational problem, an algorithm for median search is proposed which is faster than the direct method in terms of distance computations. Key Words: Median, Fast Search, Distance, Metric Spaces, Clustering, Nearest-Neighbour techniques. 1 Introduction Assume that we are given a set of patterns in an arbitrary (non-vector) representation space and a distance function to measure the dissimilarity between any pair of them. A well-known technique for modelling the given set consists of finding a pattern of the set whose sum of distances to all patterns is minimum. Such a pattern is called a median. The concept of median is related to a more general concept known as generalized median. This generalization arises when the search is not constrained to the given set, but extended to the whole pattern representation space. Although a generalized median constitutes a more accurate model than a median, finding such a model is often a difficult task that requires complex and specific (approximate) algorithms [2, 1]. Obviously, here we are excluding trivial cases like that of computing the generalized median of a set of patterns in a Euclidean representation space. In these cases where such computation is not trivial, a median can be used as an approximate generalized median or as a starting point for finding better ones [6]. Another generalization of the concept of median is known as k-median. As its name suggests, searching for a -median consists of finding a subset of patterns (models) such that the sum of distances between all the patterns and their corresponding nearest models is minimum. This problem, well-known as a prototype location problem, is NP-Hard [8]. On the other hand, it can be seen as a clustering problem since each possible subset of patterns induces a partition of the set into clusters and a measure of its quality. Good partitions are generally found using a simple k-means-like algorithm which starts on a given initial partition and then loops searching for a median of each cluster and reclassifying patterns according to their nearest medians [5]. Thus, the concept of median also turns out to be useful in finding more accurate models. Searching for a median of a set of Ò patterns is a relatively easy task since it suffices to compute ½ ¾ Ò Ò ½µ pairwise distances between patterns. However, this naive approach is no longer applicable when large sets of patterns are compared through complex distance functions like, for instance, those provided by algorithms of Sequence Comparison [9]. In order to cope with this computational problem, algorithms for fast median search are required but, unfortunately, available techniques are only useful in very particular cases [3, 7].
2 In [4] we proposed an algorithm for fast median search whose complexity is smaller than that of the direct approach both, measured in terms of distance computations, and measured in terms of time not alloted to distance computing (overhead). This algorithm was tested on artificial data and also embedded in a -means-like algorithm which was applied to a corpus of speech data [4]. Although it was found effective in reducing the number of computed distances, greater savings are desirable in practice. Here we propose a faster algorithm in terms of distance computations. A simulation experiment is reported supporting this claim. 2 The Algorithm Let Á ½ Ò be the set of indices associated with a set of Ò different patterns Ü ½ Ü ¾ Ü Ò in a (given) representation space. A metric or distance function is assumed to be given; that is, a function Á Á Ê that satisfies the following conditions µ ¼ (1) µ ¼ if (2) µ µ (3) µ µ µ (4) for all ¾ Á. The sum of distances between a pattern and all the patterns of a given set Ë Á is denoted as Ë µ ¼ ¾Ë if Ë µ if Ë (5) This value will be called the sum on Ë of or simply the sum of if Ë Á. In order to introduce the key idea which enables fast median search, let us suppose that the sums of some patterns have already been calculated. These patterns will be referred as the set Í of used data. As a by-product of this calculation, it is also assumed that the distances between used and unused patterns have been saved in a matrix, and that a pattern ¾ Í whose sum is minimum among those calculated so far has been chosen. Applying the triangle inequality to all ¾ Á we have µ µ µ µ µ µ µ µ µ µ Then, a lower bound of the sum of any unused pattern, Í µ, can be computed as Í µ Í µ ¾Á Í Í µ (6) where ¼ Í µ ÑÜ µ µ ¾Í if Í if Í (7) Notice that all distances involved in the calculation of (6) are available (have already been computed in previous steps). Obviously, we can avoid the calculation of the sum of any unused pattern whose corresponding lower bound is greater than or equals to.
3 Three issues have to be addressed before developing an algorithm based on (6). First, it is plain that we have to distinguish between the unused patterns whose lower bounds are smaller than and the rest of unused patterns. Only the former must be taken into account for sum calculation, so we call them the set of alive data. The rest of unused patterns will be referred as the set of eliminated data. An interesting question is which technique should we follow so as to calculate (6) without extra distance computations and as fast as possible in terms of overhead. It has been assumed that this calculation is carried out with the aid of a matrix where all distances between used and unused patterns are stored. Unfortunately, this technique is very expensive since the size of the required matrix is Í µ ¾ Ç Ò ¾ µ and each calculation of (6) takes Í µí ¾ Ç Ò ¾ µ steps. Let us see an alternative technique which dims the computational burden devoted to this calculation. Assume that we have an array ¾ R Ò where the sums on Í of all alive patterns are stored, and a matrix À ¾ R Ò Ò which contains the estimates provided by (7) between each alive and unused patterns. If a new pattern is selected for sum calculation, the distances involved in the computation of this sum can be used to calculate (6) for each alive pattern as Í µ µ ¾ ÑÜ À µ µµ Although the use of À keeps the space complexity at Ç Ò ¾ µ, this technique is more efficient since each calculation of (6) now takes ¾ Ç Òµ steps. The third issue we are concerned with is about choosing a criterion to select an alive pattern for sum calculation. Since we are interested in quickly choosing a median, it is clear that a natural criterion consists of selecting an alive pattern whose associated lower bound is minimum. The algorithm described in figure 1 searches for a median on the basis of the ideas just discussed. Its space complexity is Ò ¾ µ due to the use of the matrix À. Notice that this matrix can be implemented as a triangular array of size ½ Ò Ò ½µ since (7) fulfils conditions (1) and (3). ¾ Obviously, the time complexity of the algorithm depends on the nature of the distance function adopted. A worst case arises when the discrete metric is used, µ ¼ ½ if if In this case, the estimates given by (7) are the most pessimistic (zero), (6) reduces to sums on Í and no alive patterns are eliminated. As a result, the process ends computing as many distances as the direct method ( ½ ¾ Ò Ò ½µ) and the overhead becomes Ç Ò µ. Suppose now that patterns are represented as Ò different real numbers which are compared through the usual distance, µ Ü Ü where Ü Ü ¾ R ½ Ò. If the first pattern selected for sum calculation is the index of either the minimum or the maximum of these numbers, exact estimates will be given by (7) and thereby (6). Next, a median will be selected making alive patterns into eliminated and, consequently, directly leading to the end of the algorithm. This computational behaviour is optimal since it can be shown that the algorithm can not stop after the calculation of a single sum if Ò ¾. Therefore, although its overhead is Å Ò ¾ µ, it is worth noting that the algorithm can find a median by computing as little as ¾Ò distances. 3 A Simulation Experiment A simulation experiment was carried out on a ¼ (½¼¼ MHz clocked) PC so as to empirically test the computational behaviour of the algorithm. The experiment was based on the generation of sets of Ò
4 Algorithm for Fast Median Search Input A data set Á ½ Ò and a metric Á Á Ê. Output A median ¾ Á and its associated sum ¾ Ê. Variables Í Á; ¾ Ê Ò ; À ¾ Ê Ò Ò ; and ¾ Á. Method Initialize ½, Í, Á, ¼ and À ¼. while Î ¼ do Select Ö ÑÒ ¾ µ and transfer it from to Í. for all ¾ do Compute µ and update and. if then Update and. end if for all ¾ do Let. for all ¾ do Update À ÑÜ À µ and À. if then Transfer from to. end if end while Figure 1: Algorithm for Fast Median Search. artificial patterns (Ò ½ ¾ ½¼¾) randomly drawn from a uniform distribution in the unit - dimensional hypercube ( ¾ ½¼), and compared through the Euclidean distance. Although the algorithm is not intended to deal with patterns in a Euclidean space, we chose such kind of patterns since they are easy to generate. On the other hand, the algorithm never took advantage of their coordinates, but only the distances were used. For each value of Ò and, the algorithm was tested on ½¼¼ independently generated sets of patterns and the resulting numbers of distance computations were averaged so as to get an estimate of the mean number of distances computed by the algorithm. A total of estimates were obtained for different values of Ò and. From these estimates, the percentages with respect to ½ Ò Ò ½µ (number of distances ¾ computed by the direct method) are plotted in figure 2 together with their corresponding ± confidence intervals. In terms of distance computations, these results are better than those (not reported here) of our previous fast algorithm [4] tested on the very same conditions. Although for large dimensions only modest savings are obtained, it should be pointed out that these savings tend to increase as Ò increases and they can be greater than ¼± for low dimensions. On the other hand, the overhead of the new algorithm is significantly higher than that of the previous algorithm [4], which makes the new version only appropriate for computationally expensive distance functions.
5 100 Ʊ ½¼ ¾ Ò Figure 2: Average number of distances computed by the new algorithm (together with its corresponding ± confidence interval), expressed as a percentage of the number of distances computed by the direct method (Ʊ). Each value was obtained by running the new algorithm on ½¼¼ independent sets of Ò artificial patterns randomly drawn from a uniform distribution in the unit -dimensional hypercube, and compared through the Euclidean distance. 4 Conclusions An algorithm for median search has been proposed which is faster than a previously proposed algorithm [4] in terms of distance computations. Its main application is to be used as a tool for modelling a set of patterns which are compared through a computationally expensive distance function. References [1] F. Casacuberta and M. D. de Antonio. A greedy algorithm for computing approximate Median Strings. In revision, [2] F. Casacuberta and C. de la Higuera. The topology of strings: two NP-Complete problems. In revision, [3] E. Horowitz and S. Sahni. Fundamentals of Computers Algorithms. Computer Science, [4] A. Juan and E. Vidal. An optimized Ã-means-like algorithm. In 4th Portuguese Conference on Pattern Recognition, pages 11 16, Coimbra, Portugal, March [5] A. Juan and E. Vidal. Fast -Means-like Clustering in Metric Spaces. Pattern Recognition Letters, 15(1):19 25, [6] T. Kohonen. Median Strings. Pattern Recognition Letters, 3: , [7] P. B. Mirchandani. The Ô-Median Problem and Generalizations. In Mirchandani and Francis [8], pages [8] P. B. Mirchandani and R. L. Francis, editors. Discrete Location Theory. Wiley, [9] D. Sankoff and J. B. Kruskal, editors. Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.
Efficient Recovery of Secrets
Efficient Recovery of Secrets Marcel Fernandez Miguel Soriano, IEEE Senior Member Department of Telematics Engineering. Universitat Politècnica de Catalunya. C/ Jordi Girona 1 i 3. Campus Nord, Mod C3,
More informationVirtual Landmarks for the Internet
Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser
More information. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns
Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties
More informationVisualization of large data sets using MDS combined with LVQ.
Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk
More informationAn Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
More informationINTERACTIVE DATA EXPLORATION USING MDS MAPPING
INTERACTIVE DATA EXPLORATION USING MDS MAPPING Antoine Naud and Włodzisław Duch 1 Department of Computer Methods Nicolaus Copernicus University ul. Grudziadzka 5, 87-100 Toruń, Poland Abstract: Interactive
More informationUNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable
More informationVisualization of General Defined Space Data
International Journal of Computer Graphics & Animation (IJCGA) Vol.3, No.4, October 013 Visualization of General Defined Space Data John R Rankin La Trobe University, Australia Abstract A new algorithm
More informationCompetitive Analysis of On line Randomized Call Control in Cellular Networks
Competitive Analysis of On line Randomized Call Control in Cellular Networks Ioannis Caragiannis Christos Kaklamanis Evi Papaioannou Abstract In this paper we address an important communication issue arising
More informationChapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
More informationInformation Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding
More information6.2.8 Neural networks for data mining
6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural
More informationSelf Organizing Maps: Fundamentals
Self Organizing Maps: Fundamentals Introduction to Neural Networks : Lecture 16 John A. Bullinaria, 2004 1. What is a Self Organizing Map? 2. Topographic Maps 3. Setting up a Self Organizing Map 4. Kohonen
More informationInstituto de Engenharia de Sistemas e Computadores de Coimbra Institute of Systems Engineering and Computers INESC Coimbra
Instituto de Engenharia de Sistemas e Computadores de Coimbra Institute of Systems Engineering and Computers INESC Coimbra João Clímaco and Marta Pascoal A new method to detere unsupported non-doated solutions
More information15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
More informationEfficient Data Structures for Decision Diagrams
Artificial Intelligence Laboratory Efficient Data Structures for Decision Diagrams Master Thesis Nacereddine Ouaret Professor: Supervisors: Boi Faltings Thomas Léauté Radoslaw Szymanek Contents Introduction...
More informationMetric Spaces. Chapter 7. 7.1. Metrics
Chapter 7 Metric Spaces A metric space is a set X that has a notion of the distance d(x, y) between every pair of points x, y X. The purpose of this chapter is to introduce metric spaces and give some
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More informationEuclidean Minimum Spanning Trees Based on Well Separated Pair Decompositions Chaojun Li. Advised by: Dave Mount. May 22, 2014
Euclidean Minimum Spanning Trees Based on Well Separated Pair Decompositions Chaojun Li Advised by: Dave Mount May 22, 2014 1 INTRODUCTION In this report we consider the implementation of an efficient
More informationENUMERATION OF INTEGRAL TETRAHEDRA
ENUMERATION OF INTEGRAL TETRAHEDRA SASCHA KURZ Abstract. We determine the numbers of integral tetrahedra with diameter d up to isomorphism for all d 1000 via computer enumeration. Therefore we give an
More informationHow To Identify Noisy Variables In A Cluster
Identification of noisy variables for nonmetric and symbolic data in cluster analysis Marek Walesiak and Andrzej Dudek Wroclaw University of Economics, Department of Econometrics and Computer Science,
More informationConfidence Intervals for One Standard Deviation Using Standard Deviation
Chapter 640 Confidence Intervals for One Standard Deviation Using Standard Deviation Introduction This routine calculates the sample size necessary to achieve a specified interval width or distance from
More informationManagement of Software Projects with GAs
MIC05: The Sixth Metaheuristics International Conference 1152-1 Management of Software Projects with GAs Enrique Alba J. Francisco Chicano Departamento de Lenguajes y Ciencias de la Computación, Universidad
More informationDATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
More informationMultivariate Analysis of Ecological Data
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
More informationAnalysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
More informationLecture 3: Finding integer solutions to systems of linear equations
Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture
More information! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.
Approximation Algorithms 11 Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of three
More information3D Distance from a Point to a Triangle
3D Distance from a Point to a Triangle Mark W. Jones Technical Report CSR-5-95 Department of Computer Science, University of Wales Swansea February 1995 Abstract In this technical report, two different
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationOptimal Planning of Closed Loop Supply Chains: A Discrete versus a Continuous-time formulation
17 th European Symposium on Computer Aided Process Engineering ESCAPE17 V. Plesu and P.S. Agachi (Editors) 2007 Elsevier B.V. All rights reserved. 1 Optimal Planning of Closed Loop Supply Chains: A Discrete
More informationTorgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances
Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances It is possible to construct a matrix X of Cartesian coordinates of points in Euclidean space when we know the Euclidean
More informationChapter 19. General Matrices. An n m matrix is an array. a 11 a 12 a 1m a 21 a 22 a 2m A = a n1 a n2 a nm. The matrix A has n row vectors
Chapter 9. General Matrices An n m matrix is an array a a a m a a a m... = [a ij]. a n a n a nm The matrix A has n row vectors and m column vectors row i (A) = [a i, a i,..., a im ] R m a j a j a nj col
More informationMovie Classification Using k-means and Hierarchical Clustering
Movie Classification Using k-means and Hierarchical Clustering An analysis of clustering algorithms on movie scripts Dharak Shah DA-IICT, Gandhinagar Gujarat, India dharak_shah@daiict.ac.in Saheb Motiani
More informationLoad balancing in a heterogeneous computer system by self-organizing Kohonen network
Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.
More informationOffline sorting buffers on Line
Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com
More informationChapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling
Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one
More informationIE 680 Special Topics in Production Systems: Networks, Routing and Logistics*
IE 680 Special Topics in Production Systems: Networks, Routing and Logistics* Rakesh Nagi Department of Industrial Engineering University at Buffalo (SUNY) *Lecture notes from Network Flows by Ahuja, Magnanti
More informationB490 Mining the Big Data. 2 Clustering
B490 Mining the Big Data 2 Clustering Qin Zhang 1-1 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern
More informationSupervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
More informationThese axioms must hold for all vectors ū, v, and w in V and all scalars c and d.
DEFINITION: A vector space is a nonempty set V of objects, called vectors, on which are defined two operations, called addition and multiplication by scalars (real numbers), subject to the following axioms
More informationIntersection of a Line and a Convex. Hull of Points Cloud
Applied Mathematical Sciences, Vol. 7, 213, no. 13, 5139-5149 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/1.12988/ams.213.37372 Intersection of a Line and a Convex Hull of Points Cloud R. P. Koptelov
More informationClustering and scheduling maintenance tasks over time
Clustering and scheduling maintenance tasks over time Per Kreuger 2008-04-29 SICS Technical Report T2008:09 Abstract We report results on a maintenance scheduling problem. The problem consists of allocating
More informationPersonalized Hierarchical Clustering
Personalized Hierarchical Clustering Korinna Bade, Andreas Nürnberger Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany {kbade,nuernb}@iws.cs.uni-magdeburg.de
More informationData Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
More informationNew Hash Function Construction for Textual and Geometric Data Retrieval
Latest Trends on Computers, Vol., pp.483-489, ISBN 978-96-474-3-4, ISSN 79-45, CSCC conference, Corfu, Greece, New Hash Function Construction for Textual and Geometric Data Retrieval Václav Skala, Jan
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationHDDVis: An Interactive Tool for High Dimensional Data Visualization
HDDVis: An Interactive Tool for High Dimensional Data Visualization Mingyue Tan Department of Computer Science University of British Columbia mtan@cs.ubc.ca ABSTRACT Current high dimensional data visualization
More informationINDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS
INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem
More informationData Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent
More informationA Non-Linear Schema Theorem for Genetic Algorithms
A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland
More informationLocal outlier detection in data forensics: data mining approach to flag unusual schools
Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential
More informationK-Means Clustering Tutorial
K-Means Clustering Tutorial By Kardi Teknomo,PhD Preferable reference for this tutorial is Teknomo, Kardi. K-Means Clustering Tutorials. http:\\people.revoledu.com\kardi\ tutorial\kmean\ Last Update: July
More informationTOOLS FOR 3D-OBJECT RETRIEVAL: KARHUNEN-LOEVE TRANSFORM AND SPHERICAL HARMONICS
TOOLS FOR 3D-OBJECT RETRIEVAL: KARHUNEN-LOEVE TRANSFORM AND SPHERICAL HARMONICS D.V. Vranić, D. Saupe, and J. Richter Department of Computer Science, University of Leipzig, Leipzig, Germany phone +49 (341)
More information! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.
Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of
More information8 Visualization of high-dimensional data
8 VISUALIZATION OF HIGH-DIMENSIONAL DATA 55 { plik roadmap8.tex January 24, 2005} 8 Visualization of high-dimensional data 8. Visualization of individual data points using Kohonen s SOM 8.. Typical Kohonen
More informationLoad Balancing on a Non-dedicated Heterogeneous Network of Workstations
Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department
More informationVector Quantization and Clustering
Vector Quantization and Clustering Introduction K-means clustering Clustering issues Hierarchical clustering Divisive (top-down) clustering Agglomerative (bottom-up) clustering Applications to speech recognition
More informationA Procedure for Classifying New Respondents into Existing Segments Using Maximum Difference Scaling
A Procedure for Classifying New Respondents into Existing Segments Using Maximum Difference Scaling Background Bryan Orme and Rich Johnson, Sawtooth Software March, 2009 Market segmentation is pervasive
More informationVector and Matrix Norms
Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty
More informationApplied Algorithm Design Lecture 5
Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design
More information17. Inner product spaces Definition 17.1. Let V be a real vector space. An inner product on V is a function
17. Inner product spaces Definition 17.1. Let V be a real vector space. An inner product on V is a function, : V V R, which is symmetric, that is u, v = v, u. bilinear, that is linear (in both factors):
More informationLEARNING OBJECTIVES FOR THIS CHAPTER
CHAPTER 2 American mathematician Paul Halmos (1916 2006), who in 1942 published the first modern linear algebra book. The title of Halmos s book was the same as the title of this chapter. Finite-Dimensional
More informationFast Sequential Summation Algorithms Using Augmented Data Structures
Fast Sequential Summation Algorithms Using Augmented Data Structures Vadim Stadnik vadim.stadnik@gmail.com Abstract This paper provides an introduction to the design of augmented data structures that offer
More informationCONTINUED FRACTIONS AND FACTORING. Niels Lauritzen
CONTINUED FRACTIONS AND FACTORING Niels Lauritzen ii NIELS LAURITZEN DEPARTMENT OF MATHEMATICAL SCIENCES UNIVERSITY OF AARHUS, DENMARK EMAIL: niels@imf.au.dk URL: http://home.imf.au.dk/niels/ Contents
More information24. The Branch and Bound Method
24. The Branch and Bound Method It has serious practical consequences if it is known that a combinatorial problem is NP-complete. Then one can conclude according to the present state of science that no
More informationShortcut sets for plane Euclidean networks (Extended abstract) 1
Shortcut sets for plane Euclidean networks (Extended abstract) 1 J. Cáceres a D. Garijo b A. González b A. Márquez b M. L. Puertas a P. Ribeiro c a Departamento de Matemáticas, Universidad de Almería,
More informationTHREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationNear Optimal Solutions
Near Optimal Solutions Many important optimization problems are lacking efficient solutions. NP-Complete problems unlikely to have polynomial time solutions. Good heuristics important for such problems.
More informationUNIVERSITY OF BOLTON SCHOOL OF ENGINEERING MS SYSTEMS ENGINEERING AND ENGINEERING MANAGEMENT SEMESTER 1 EXAMINATION 2015/2016 INTELLIGENT SYSTEMS
TW72 UNIVERSITY OF BOLTON SCHOOL OF ENGINEERING MS SYSTEMS ENGINEERING AND ENGINEERING MANAGEMENT SEMESTER 1 EXAMINATION 2015/2016 INTELLIGENT SYSTEMS MODULE NO: EEM7010 Date: Monday 11 th January 2016
More informationA Novel Binary Particle Swarm Optimization
Proceedings of the 5th Mediterranean Conference on T33- A Novel Binary Particle Swarm Optimization Motaba Ahmadieh Khanesar, Member, IEEE, Mohammad Teshnehlab and Mahdi Aliyari Shoorehdeli K. N. Toosi
More informationDisjunction of Non-Binary and Numeric Constraint Satisfaction Problems
Disjunction of Non-Binary and Numeric Constraint Satisfaction Problems Miguel A. Salido, Federico Barber Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia Camino
More informationAn Analysis on Density Based Clustering of Multi Dimensional Spatial Data
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,
More informationReliable Systolic Computing through Redundancy
Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/
More informationUSING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS
USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS Koua, E.L. International Institute for Geo-Information Science and Earth Observation (ITC).
More informationDynamic Programming. Lecture 11. 11.1 Overview. 11.2 Introduction
Lecture 11 Dynamic Programming 11.1 Overview Dynamic Programming is a powerful technique that allows one to solve many different types of problems in time O(n 2 ) or O(n 3 ) for which a naive approach
More informationAn Impossibility Theorem for Clustering
An Impossibility Theorem for Clustering Jon Kleinberg Department of Computer Science Cornell University Ithaca NY 14853 Abstract Although the study of clustering is centered around an intuitively compelling
More informationMODEL DRIVEN DEVELOPMENT OF BUSINESS PROCESS MONITORING AND CONTROL SYSTEMS
MODEL DRIVEN DEVELOPMENT OF BUSINESS PROCESS MONITORING AND CONTROL SYSTEMS Tao Yu Department of Computer Science, University of California at Irvine, USA Email: tyu1@uci.edu Jun-Jang Jeng IBM T.J. Watson
More informationStatistical tests for SPSS
Statistical tests for SPSS Paolo Coletti A.Y. 2010/11 Free University of Bolzano Bozen Premise This book is a very quick, rough and fast description of statistical tests and their usage. It is explicitly
More informationCluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico
Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from
More informationSpatial Data Analysis
14 Spatial Data Analysis OVERVIEW This chapter is the first in a set of three dealing with geographic analysis and modeling methods. The chapter begins with a review of the relevant terms, and an outlines
More informationA Novel Hyperbolic Smoothing Algorithm for Clustering Large Data Sets
A Novel Hyperbolic Smoothing Algorithm for Clustering Large Data Sets Abstract Adilson Elias Xavier Vinicius Layter Xavier Dept. of Systems Engineering and Computer Science Graduate School of Engineering
More informationRecurrent Neural Networks
Recurrent Neural Networks Neural Computation : Lecture 12 John A. Bullinaria, 2015 1. Recurrent Neural Network Architectures 2. State Space Models and Dynamical Systems 3. Backpropagation Through Time
More information1 The Brownian bridge construction
The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof
More informationA Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed
More informationElements of probability theory
2 Elements of probability theory Probability theory provides mathematical models for random phenomena, that is, phenomena which under repeated observations yield di erent outcomes that cannot be predicted
More informationAn Optimization Approach for Cooperative Communication in Ad Hoc Networks
An Optimization Approach for Cooperative Communication in Ad Hoc Networks Carlos A.S. Oliveira and Panos M. Pardalos University of Florida Abstract. Mobile ad hoc networks (MANETs) are a useful organizational
More informationUsing Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
More informationA New Approach to Cutting Tetrahedral Meshes
A New Approach to Cutting Tetrahedral Meshes Menion Croll August 9, 2007 1 Introduction Volumetric models provide a realistic representation of three dimensional objects above and beyond what traditional
More informationFeature Selection vs. Extraction
Feature Selection In many applications, we often encounter a very large number of potential features that can be used Which subset of features should be used for the best classification? Need for a small
More informationPrivate Record Linkage with Bloom Filters
To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,
More informationRisk Management for IT Security: When Theory Meets Practice
Risk Management for IT Security: When Theory Meets Practice Anil Kumar Chorppath Technical University of Munich Munich, Germany Email: anil.chorppath@tum.de Tansu Alpcan The University of Melbourne Melbourne,
More informationTopological Data Analysis Applications to Computer Vision
Topological Data Analysis Applications to Computer Vision Vitaliy Kurlin, http://kurlin.org Microsoft Research Cambridge and Durham University, UK Topological Data Analysis quantifies topological structures
More informationDynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks
Dynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks Benjamin Schiller and Thorsten Strufe P2P Networks - TU Darmstadt [schiller, strufe][at]cs.tu-darmstadt.de
More informationSub-class Error-Correcting Output Codes
Sub-class Error-Correcting Output Codes Sergio Escalera, Oriol Pujol and Petia Radeva Computer Vision Center, Campus UAB, Edifici O, 08193, Bellaterra, Spain. Dept. Matemàtica Aplicada i Anàlisi, Universitat
More informationAn optimisation framework for determination of capacity in railway networks
CASPT 2015 An optimisation framework for determination of capacity in railway networks Lars Wittrup Jensen Abstract Within the railway industry, high quality estimates on railway capacity is crucial information,
More informationPerformance advantages of resource sharing in polymorphic optical networks
R. J. Durán, I. de Miguel, N. Merayo, P. Fernández, R. M. Lorenzo, E. J. Abril, I. Tafur Monroy, Performance advantages of resource sharing in polymorphic optical networks, Proc. of the 0th European Conference
More informationJunghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea
Proceedings of the 211 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. A BINARY PARTITION-BASED MATCHING ALGORITHM FOR DATA DISTRIBUTION MANAGEMENT Junghyun
More informationClustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
More information