An Adaptive Index Structure for High-Dimensional Similarity Search
|
|
- Flora Murphy
- 1 years ago
- Views:
Transcription
1 An Adaptive Index Structure for High-Dimensional Similarity Search Abstract A practical method for creating a high dimensional index structure that adapts to the data distribution and scales well with the database size, is presented. Typical media descriptors, such as texture features, are high dimensional and are not uniformly distributed in the feature space. The performance of many existing methods degrade if the data is not uniformly distributed. The proposed method offers an efficient solution to this problem. First, the data s marginal distribution along each dimension is characterized using a Gaussian mixture model. The parameters of this model are estimated using the well known Expectation- aximization (E) method. These model parameters can also be estimated sequentially for on-line updating. Using the marginal distribution information, each of the data dimensions can be partitioned such that each bin contains approximately an equal number of objects. Experimental results on a real image texture data set are presented. Comparisons with existing techniques, such as the well known VA-File, demonstrate a significant overall improvement. 1 Introduction P. Wu, B. S. anjunath and S. Chandrasekaran Department of Electrical and Computer Engineering University of California, Santa Barbara, CA {peng, manj, Typical audio-visual descriptors are high dimensional vectors and not uniformly distributed [6]. These descriptors are useful in content based image/video retrieval, data mining and knowledge discovery. To index high dimensional feature vectors, various index structures such as R*-tree, X-tree, TV-tree, etc., have been proposed. A good overview and analysis of these techniques can be found from [5]. The study in [5] also argued that typical tree based index methods are outperformed by a linear search when the search dimensions exceed 10. This has motivated the introduction of approximation methods [, 5] to speed up the linear search. Approximation based methods have certain advantages. First, they support different distance measures. This is an important property especially for learning and concept mining related applications. Secondly, the construction of approximation can be made adaptive to the dimensionality of data. However, the approximation method [5] is sensitive to data s distribution. does not perform well when feature vectors are not uniformly distributed. In [5], the approximation is constructed by partitioning the feature space into hyper rectangles. The grids on each dimension are equally spaced. We refer to such uniform partitioning as "regular approximation" in the following discussion. In [], the feature space is first transformed using the KL-transform to reduce the correlation of data at different dimensions. Secondly, in the transformed space, the data in each dimension is clustered into a pre-assigned number of grids using the Lloyd s algorithm. But in [3], it is reported that for high dimensional data, transforming the data space by rotation does not result in a significant decorrelation of the data. This implies that global statistics, such as the second order statistics used in [], may not be able to characterize the data distribution effectively for high dimensional spaces. In this work, we propose an effective and practical solution to adapt the design of the approximation based index structure to the data s distribution. The main idea of the pro-
2 posed method is to model the marginal distribution of the data using a mixture of Gaussians and use the estimated parameters of the model to partition the data space. By adapting the construction of approximations to data s marginal distribution, the proposed method overcomes the sensitivity of index performance to data s distribution, thus resulting in a significant improvement compared with regular the VA-File [, 5]. In the next section, we summarize the construction of regular approximation and its associated indexing. We also discuss the limitations of using regular approximation. Our proposed method is presented in Section 3. Experimental results are provided in Section 4. Regular Approximation.1 Construction of regular approximation [5] In regular approximation methods, such as the VA-File approach [5], the range of the feature vectors in each dimension is uniformly partitioned. Let D denote the total number of dimensions of data space, and let B i bits ( i 1,, D ) be allocated to each of the dimensions. Then the range of feature values on dimension i is segmented to partitions by a set of boundary points, denoted as c i [ k] ( k 0 B i,, ) with equal length, each partition is uniquely identified by a binary string of length. The high dimensional space is in turn segmented into D dimensional hyper cells. Each of them can be uniquely identified by a binary string of length B ( B ). For a feature vector v[ i] [ j], j 0,,, wherein is the total number of object in a database, its approximation is such a binary string of length B to indicate which hyper cell it is contained in. If falls into a partition bounded by [ c i [ k 1], c i [ k] ], it satisfies c i [ k 1] v[ i] [ j] < c i [ k] (1) So the boundary points provide an approximation of the value of v[ i] [ j]. As in [5], a lower bound and a upper bound of the distance between any vector with a query vector can be computed using this property. Figure 1 gives an illustrative example of constructing the regular approximation for two dimension data. The 1856 image objects are collected from the Brodatz album [4]. Figure 1 shows the first two components of the 60 dimensional feature vectors computed in [4]. The feature distribution is clearly not uniform.. Indexing based on regular approximation Approximation based nearest neighbor search can be considered as a two phase filtering process. In the first phase, the set of all approximations is scanned sequentially and lower and upper bounds on the distances of each object in the database to the query object are computed. In this phase, if an approximation is encountered such that its lower bound is larger than the k -th smallest upper bound found so far, the corresponding feature vector can be skipped since at least k better candidates exist. At the end of the first phase filtering, the set of vectors that are not skipped are collected as candidates for the second phase filtering. Denote the number of candidates to be 1. In the second phase filtering, the actual 1 feature vectors are examined. The feature vectors are visited in increasing order of their lower bounds and then exact distances to the query vector are computed. If a lower bound is reached that is larger than the k -th actual nearest neighbor distance encountered so far, there is no need to visit the remaining candidates. Let denote the number of feature vectors visited before the thresholding lower bound is encountered. B i B i B i v[ i] [ j]
3 A Fig 1. Using regular approximation to uniformly partition a two dimensional space. 1856, B 1 B 3. For approximation based methods, the index performance is measured by 1 and (see [5]). Smaller values of 1 and indicate a better performance..3 Limitations The effectiveness of using VA-File based indexing structure is sensitive to data s distribution. For the example given in Figure 1, the data on each dimension is not uniformly distributed. As a result, if a query vector happens to be one that falls into the cell A in Figure 1, in which approximately 30% of elements are contained, we will have 1 > 0.3 and > 0.3, which indicates a poor block selectivity and a high I/O cost, see [5]. This example illustrates one of the shortcomings of the regular approximation. In such cases, the first phase filtering still results in a large number of items to search. Our proposed method specifically addresses this problem. 3 Approximation Based on arginal Distribution Densely populated cells can potentially degrade the indexing performance. For this reason, we propose an approach to adaptively construct the approximation of feature vectors. The general idea is to first estimate data s marginal distribution in each dimension. Secondly, individual axes are partitioned such that the data has equal probability of falling into any partition. The approximation such constructed reduces the possibility of having densely populated cells, which in turn improves the indexing performance. 3.1 pdf modeling using mixture of Gaussians Denote p i ( x) to be the pdf of data on dimension i. The algorithm introduced below is applied to data in each dimension independently. For notation simplicity, we denote p( x) to be the pdf of an one dimensional signal. The one-dimensional pdf is modeled using a mixture of Gaussians, represented as where p( x j) p( x) p( x j)p( j) j 1 1 ( x µ j ) exp πσ j σ j () (3)
4 The coefficients P( j) are called the mixing parameters, which satisfies P( j) 1 and 0 P( j) 1. The task of estimating the pdf is then converted to be a problem of parameter estimation. The parameters we need to estimate are { P( j), µ j, σ j }, for j 1,,. 3. Parameter estimation using the E algorithm The classical maximum likelihood (L) approach is used to estimate the parameters. The task is to find φ j, j 1,,, to maximize where v[ l], l 1,,, are the given data set. A simple and practical method of solving this optimization problem is to use the Expectation-aximization algorithm [1]. Given data v[ l] available as the input for the estimation, E algorithm estimates the parameters iteratively using all the data in each iteration. Let t denote the iteration number. Then the following equations are used to update the parameters t + 1 µ j p( j v[ l] ) t v[ l] p( j v[ l] ) t 1 (5) Using Bayes theorem [1], the where 3.3 Sequential updating of parameters is computed as Using the formulas given in equations (5), (6) and (7), the parameters can be estimated given that the data are available. For a large database, is usually only a small portion of the total number of elements in the database. In practice, it is desirable to have an incremental pdf update scheme that can track the changes of the data distribution. The E algorithm can be modified for sequential updating [1]. Given that P( j) {, µ j, ( σj ) } is the parameter set estimated form using the data v[ l], the updated parameter set, when there is a new data v[ + 1] coming in, can be computed as where Φ( φ 1,, φ ) p( v[ l] ( φ 1,, φ )) l 1 l 1 φ j l 1 ( σ j ) t + 1 p( j v[ l] ) t t ( v[ l] µ j ) l 1 P( j) t p ( j v [ l ]) t p( j v[ l] ) t l 1 p( j v[ l] ) t 1 l 1 p( j v[ l] ) t ( p( v[ l] j) t P( j) t ) p( v[ l] j) t P( j) t 1 ( σ j ) 1 j 1 p( v[ l] j) t 1 t ( v[ l] µ j ) exp π( σ j ) t ( σ j ) t + 1 µ j + σ j µ j θ j ( v[ + 1] µ j ) ( ) + 1 θ j ( v[ + 1] µ j ) + [ ( σ j ) ] P( j) + 1 P ( j ) ( p( j v[ + 1] ) P( j) ) + 1 j 1,, (4) (6) (7) (8) (9) (10) (11) (1)
5 + 1 ( θ j ) 1 p( j v[ ] ) p ( ( j v[ + 1] ) θ ) j The conditional probability p( j v[ + 1] ) is computed as in (8) and (9). 3.4 Bit allocation (13) Denote the estimated pdf to be pˆ ( x) as the approximation of p( x). The objective of nonlinear quantization is to segment the pdf into grids of equal area. If the boundary points are denoted by c[ l], l 0,, b, b is the number of bits allocated, the boundary points should satisfy c[ l + 1] pˆ ( x) dx c[ b ] pˆ ( x ) dx (14) c[ l] Using this criterion, the boundary points can be determined efficiently from a single scan of the estimated pdf. 3.5 Approximation updating For dynamic databases, the pdf estimates are to be updated periodically. In our implementation, we update the estimates whenever a certain number of new data items are added. The approximation and pdf quantization are updated only when the new pdf, denoted as pˆ new( x), differs significantly from the current pdf, denoted as pˆ old( x). We use the following measure to quantify this change ρ The approximation on that dimension is updated when ρ is larger than a certain threshold. 4 Experiments and Discussions b Our evaluation is performed on a database containing 75,465 aerial photo images. For each image, the method developed in [4] is used to extract a 60 dimensional texture feature descriptor. Initially, around 10% of total number of image objects are used to initialize the pdf estimation using the algorithms proposed in Section 3.. Based on the estimated pdf, the approximation is constructed to support nearest neighbor search of the whole database. eanwhile, an updating strategy is developed to take the rest of the data as input to the on-line estimation. For each dimension, when the change of the estimated pdf, which is measured by (15), is beyond the threshold ρ > 0.15, the approximation is adjusted accordingly. The approach that uses the regular approximation, VA-File, is also implemented for comparison purpose. The number of candidate 1 and the number of visited feature vectors are used to evaluate the performance. We tested using 3, 4, 5 and 6 bits for each dimension to construct the approximation. For each approximation, we consider the queries to be all image items in the database. For each query, the 10 nearest neighbor search is performed and results in a 1 and a. The average performances of 1 and are computed by averaging 1 and from all queries. The results are shown in Figure (a) and (b) for 1 and, respectively. ote that the figures are plotted on a logarithmic Y- axis. To normalize the results into the same range for displaying, the maximum value of 1 ( ) of VA-File is used to normalize all the values of 1 ( ). As observed, 1 for the VA-File is about 3 to 0 times more than the proposed adaptive method (Figure (a)). After the second phase filtering, the VA-File visits 16 to 60 times more feature vectors than the proposed method (Figure (b). One can conclude that the adaptive pdf quantization results in an order of magnitude performance improvement over the regular VA-File. We have also investigated the indexing performance as the size of the database grows. For this purpose, we construct 4 databases, including 10%, 5%, 50% and 100% percent of the total 75,645 image objects. For each dimension, 6 bits are assigned to construct the approximation. In measuring the scalability, we are mainly concerned with as this c[ 0] ( ( pˆ old( x) pˆ new( x) ) dx ) ( pˆ old ( x ) dx) 1 1 (15)
6 (a) (b) Fig. Comparison between VA-File and the proposed method: (a) umber of candidates (1); (b) umber of visited feature vectors () after the second phase filtering.the proposed methods offers a significant reduction in 1 and compared to the VA files.
7 Fig 3. The increase of the number of candidates (1) vs. the size of databases, the solid line is plotted to illustrate a linear increase of 1 as the database grows. relates directly to the I/O cost of the index structure. Figure 3 illustrates that as the size of the database grows, how fast the number of candidates 1 increase for the VA-File and the proposed approach. As can be observed, in terms of the scalability with the size of databases, the proposed method tends to maintain a much better sub-linear behavior as the database grows. We have presented a novel adaptive indexing scheme for high-dimensional feature vectors. The method is adaptive to the data distribution. The indexing performance scales well with the size of the database. Experiments demonstrate a significant improvement over the VA-file data structures for high dimensional datasets. Ackowledgement: This research was in part supported by the following grants/awards: LLL/ISCR award #0108, SF-IRI , SF Instrumentation #EIA , SF Infrastructure SF#EIA , and by Samsung Electronics. References [1] Christopher. Bishop, eural networks for pattern recognition, Oxford: Clarendon Press, [] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, A. E. Abbadi, Vector approximation based indexing for non-uniform high dimensional data sets, Proc. Int l Conf. Information and Knowledge anagement (CIK), pp. 0-09, Washington, DC, USA. ovember 000. [3] S. Kaski, Dimensionality reduction by random mapping: Fast similarity computation for clustering, Proc. Int l Joint Conference on eural etworks, volume 1, pages IEEE Service Center, Piscataway, J, [4] B. S. anjunath and W. Y. a, Texture features for browsing and retrieval of image data, IEEE Trans. on Pattern Analysis and achine Intelligence, 18(8), pp , [5] R. Webber, J.-J. Schek and S. Blott, A quantitative analysis and performance study for similarity-search methods in high-dimensional space, Proc. Int l Conf. Very Large Data bases, pp , ew York City, ew York, August [6] ISO/IEC JTC1/SC9/WG11, PEG-7 Working Draft 4.0, Beijing, July, 000.
Why the Normal Distribution?
Why the Normal Distribution? Raul Rojas Freie Universität Berlin Februar 2010 Abstract This short note explains in simple terms why the normal distribution is so ubiquitous in pattern recognition applications.
Determining optimal window size for texture feature extraction methods
IX Spanish Symposium on Pattern Recognition and Image Analysis, Castellon, Spain, May 2001, vol.2, 237-242, ISBN: 84-8021-351-5. Determining optimal window size for texture feature extraction methods Domènec
Chapter 4: Non-Parametric Classification
Chapter 4: Non-Parametric Classification Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination
L10: Probability, statistics, and estimation theory
L10: Probability, statistics, and estimation theory Review of probability theory Bayes theorem Statistics and the Normal distribution Least Squares Error estimation Maximum Likelihood estimation Bayesian
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants
R-Trees: A Dynamic Index Structure For Spatial Searching A. Guttman R-trees Generalization of B+-trees to higher dimensions Disk-based index structure Occupancy guarantee Multiple search paths Insertions
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: macario.cordel@dlsu.edu.ph
Statistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano
Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
Face Recognition using SIFT Features
Face Recognition using SIFT Features Mohamed Aly CNS186 Term Project Winter 2006 Abstract Face recognition has many important practical applications, like surveillance and access control.
Unsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall
Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin
Challenges in Finding an Appropriate Multi-Dimensional Index Structure with Respect to Specific Use Cases
Challenges in Finding an Appropriate Multi-Dimensional Index Structure with Respect to Specific Use Cases Alexander Grebhahn grebhahn@st.ovgu.de Reimar Schröter rschroet@st.ovgu.de David Broneske dbronesk@st.ovgu.de
Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011
Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning
Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.
Multimedia Databases Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Previous Lecture 13 Indexes for Multimedia Data 13.1
Principles of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
A Computational Framework for Exploratory Data Analysis
A Computational Framework for Exploratory Data Analysis Axel Wismüller Depts. of Radiology and Biomedical Engineering, University of Rochester, New York 601 Elmwood Avenue, Rochester, NY 14642-8648, U.S.A.
Supervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
A Study of Web Log Analysis Using Clustering Techniques
A Study of Web Log Analysis Using Clustering Techniques Hemanshu Rana 1, Mayank Patel 2 Assistant Professor, Dept of CSE, M.G Institute of Technical Education, Gujarat India 1 Assistant Professor, Dept
Machine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand
Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard
Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Mean Shift Based Clustering in High Dimensions: A Texture Classification Example
Mean Shift Based Clustering in High Dimensions: A Texture Classification Example Bogdan Georgescu µ Ilan Shimshoni µ Peter Meer ¾µ Computer Science µ Electrical and Computer Engineering ¾µ Rutgers University,
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
A Clustering Model for Mining Evolving Web User Patterns in Data Stream Environment
A Clustering Model for Mining Evolving Web User Patterns in Data Stream Environment Edmond H. Wu,MichaelK.Ng, Andy M. Yip,andTonyF.Chan Department of Mathematics, The University of Hong Kong Pokfulam Road,
Going Big in Data Dimensionality:
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
Email Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
Performance of KDB-Trees with Query-Based Splitting*
Performance of KDB-Trees with Query-Based Splitting* Yves Lépouchard Ratko Orlandic John L. Pfaltz Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science University of Virginia Illinois
Bisecting K-Means for Clustering Web Log data
Bisecting K-Means for Clustering Web Log data Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining
Clustering through Decision Tree Construction in Geology
Nonlinear Analysis: Modelling and Control, 2001, v. 6, No. 2, 29-41 Clustering through Decision Tree Construction in Geology Received: 22.10.2001 Accepted: 31.10.2001 A. Juozapavičius, V. Rapševičius Faculty
The Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
Semantic Video Annotation by Mining Association Patterns from Visual and Speech Features
Semantic Video Annotation by Mining Association Patterns from and Speech Features Vincent. S. Tseng, Ja-Hwung Su, Jhih-Hong Huang and Chih-Jen Chen Department of Computer Science and Information Engineering
Environmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
Lecture 3: Linear Programming Relaxations and Rounding
Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can
Binary Image Analysis
Binary Image Analysis Segmentation produces homogenous regions each region has uniform gray-level each region is a binary image (0: background, 1: object or the reverse) more intensity values for overlapping
FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
L15: statistical clustering
Similarity measures Criterion functions Cluster validity Flat clustering algorithms k-means ISODATA L15: statistical clustering Hierarchical clustering algorithms Divisive Agglomerative CSCE 666 Pattern
Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang
Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental
Character Image Patterns as Big Data
22 International Conference on Frontiers in Handwriting Recognition Character Image Patterns as Big Data Seiichi Uchida, Ryosuke Ishida, Akira Yoshida, Wenjie Cai, Yaokai Feng Kyushu University, Fukuoka,
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel k-means Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE
Large Databases. mjf@inesc-id.pt, jorgej@acm.org. Abstract. Many indexing approaches for high dimensional data points have evolved into very complex
NB-Tree: An Indexing Structure for Content-Based Retrieval in Large Databases Manuel J. Fonseca, Joaquim A. Jorge Department of Information Systems and Computer Science INESC-ID/IST/Technical University
Geometric Image Transformations
Geometric Image Transformations Part One 2D Transformations Spatial Coordinates (x,y) are mapped to new coords (u,v) pixels of source image -> pixels of destination image Types of 2D Transformations Affine
Signature Segmentation from Machine Printed Documents using Conditional Random Field
2011 International Conference on Document Analysis and Recognition Signature Segmentation from Machine Printed Documents using Conditional Random Field Ranju Mandal Computer Vision and Pattern Recognition
THE concept of Big Data refers to systems conveying
EDIC RESEARCH PROPOSAL 1 High Dimensional Nearest Neighbors Techniques for Data Cleaning Anca-Elena Alexandrescu I&C, EPFL Abstract Organisations from all domains have been searching for increasingly more
Chapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
Multivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
Context-Aware Online Traffic Prediction
Context-Aware Online Traffic Prediction Jie Xu, Dingxiong Deng, Ugur Demiryurek, Cyrus Shahabi, Mihaela van der Schaar University of California, Los Angeles University of Southern California J. Xu, D.
Lecture 20: Clustering
Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,
Feature Selection with Decision Tree Criterion
Feature Selection with Decision Tree Criterion Krzysztof Grąbczewski and Norbert Jankowski Department of Computer Methods Nicolaus Copernicus University Toruń, Poland kgrabcze,norbert@phys.uni.torun.pl
Data Warehousing und Data Mining
Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data
Estimation with Minimum Mean Square Error
C H A P T E R 8 Estimation with Minimum Mean Square Error INTRODUCTION A recurring theme in this text and in much of communication, control and signal processing is that of making systematic estimates,
TRANSFORMATIONS OF RANDOM VARIABLES
TRANSFORMATIONS OF RANDOM VARIABLES 1. INTRODUCTION 1.1. Definition. We are often interested in the probability distributions or densities of functions of one or more random variables. Suppose we have
Digital Image Processing
Digital Image Processing Using MATLAB Second Edition Rafael C. Gonzalez University of Tennessee Richard E. Woods MedData Interactive Steven L. Eddins The MathWorks, Inc. Gatesmark Publishing A Division
Cluster Analysis for Optimal Indexing
Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference Cluster Analysis for Optimal Indexing Tim Wylie, Michael A. Schuh, John Sheppard, and Rafal A.
Machine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm
R. Sridevi et al Int. Journal of Engineering Research and Applications RESEARCH ARTICLE OPEN ACCESS Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm R. Sridevi,*
Component Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
Local features and matching. Image classification & object localization
Overview Instance level search Local features and matching Efficient visual recognition Image classification & object localization Category recognition Image classification: assigning a class label to
Load Balancing and Switch Scheduling
EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load
Fig. 1 A typical Knowledge Discovery process [2]
Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering
An Energy-Based Vehicle Tracking System using Principal Component Analysis and Unsupervised ART Network
Proceedings of the 8th WSEAS Int. Conf. on ARTIFICIAL INTELLIGENCE, KNOWLEDGE ENGINEERING & DATA BASES (AIKED '9) ISSN: 179-519 435 ISBN: 978-96-474-51-2 An Energy-Based Vehicle Tracking System using Principal
Visualization of large data sets using MDS combined with LVQ.
Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk
Feature Selection vs. Extraction
Feature Selection In many applications, we often encounter a very large number of potential features that can be used Which subset of features should be used for the best classification? Need for a small
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques
Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
FAST APPROXIMATE NEAREST NEIGHBORS WITH AUTOMATIC ALGORITHM CONFIGURATION
FAST APPROXIMATE NEAREST NEIGHBORS WITH AUTOMATIC ALGORITHM CONFIGURATION Marius Muja, David G. Lowe Computer Science Department, University of British Columbia, Vancouver, B.C., Canada mariusm@cs.ubc.ca,
Lecture Notes for ECE 361. Fall 1995
Introduction to Digital Communication Systems Lecture Notes for ECE 361 Fall 1995 Dilip V. Sarwate Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign Urbana, Illinois
Density Evolution. Telecommunications Laboratory. Alex Balatsoukas-Stimming. Technical University of Crete. October 21, 2009
Density Evolution Telecommunications Laboratory Alex Balatsoukas-Stimming Technical University of Crete October 21, 2009 Telecommunications Laboratory (TUC) Density Evolution October 21, 2009 1 / 27 Outline
2D GEOMETRIC SHAPE AND COLOR RECOGNITION USING DIGITAL IMAGE PROCESSING
2D GEOMETRIC SHAPE AND COLOR RECOGNITION USING DIGITAL IMAGE PROCESSING Sanket Rege 1, Rajendra Memane 2, Mihir Phatak 3, Parag Agarwal 4 UG Student, Dept. of E&TC Engineering, PVG s COET, Pune, Maharashtra,
Neural Networks Lesson 5 - Cluster Analysis
Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29
Accessing Private Network via Firewall Based On Preset Threshold Value
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 3, Ver. V (May-Jun. 2014), PP 55-60 Accessing Private Network via Firewall Based On Preset Threshold
A Data Mining Study of Weld Quality Models Constructed with MLP Neural Networks from Stratified Sampled Data
A Data Mining Study of Weld Quality Models Constructed with MLP Neural Networks from Stratified Sampled Data T. W. Liao, G. Wang, and E. Triantaphyllou Department of Industrial and Manufacturing Systems
A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images
A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images Małgorzata Charytanowicz, Jerzy Niewczas, Piotr A. Kowalski, Piotr Kulczycki, Szymon Łukasik, and Sławomir Żak Abstract Methods
Visualization by Linear Projections as Information Retrieval
Visualization by Linear Projections as Information Retrieval Jaakko Peltonen Helsinki University of Technology, Department of Information and Computer Science, P. O. Box 5400, FI-0015 TKK, Finland jaakko.peltonen@tkk.fi
Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices
Proc. of Int. Conf. on Advances in Computer Science, AETACS Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices Ms.Archana G.Narawade a, Mrs.Vaishali Kolhe b a PG student, D.Y.Patil
Canny Edge Detection
Canny Edge Detection 09gr820 March 23, 2009 1 Introduction The purpose of edge detection in general is to significantly reduce the amount of data in an image, while preserving the structural properties
Authors. Data Clustering: Algorithms and Applications
Authors Data Clustering: Algorithms and Applications 2 Contents 1 Grid-based Clustering 1 Wei Cheng, Wei Wang, and Sandra Batista 1.1 Introduction................................... 1 1.2 The Classical
Calculation of Minimum Distances. Minimum Distance to Means. Σi i = 1
Minimum Distance to Means Similar to Parallelepiped classifier, but instead of bounding areas, the user supplies spectral class means in n-dimensional space and the algorithm calculates the distance between
Maximum Likelihood Estimation of ADC Parameters from Sine Wave Test Data. László Balogh, Balázs Fodor, Attila Sárhegyi, and István Kollár
Maximum Lielihood Estimation of ADC Parameters from Sine Wave Test Data László Balogh, Balázs Fodor, Attila Sárhegyi, and István Kollár Dept. of Measurement and Information Systems Budapest University
Efficient Data Recovery scheme in PTS-Based OFDM systems with MATRIX Formulation
Efficient Data Recovery scheme in PTS-Based OFDM systems with MATRIX Formulation Sunil Karthick.M PG Scholar Department of ECE Kongu Engineering College Perundurau-638052 Venkatachalam.S Assistant Professor
Anomaly Detection using multidimensional reduction Principal Component Analysis
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 1, Ver. II (Jan. 2014), PP 86-90 Anomaly Detection using multidimensional reduction Principal Component
Students will understand 1. use numerical bases and the laws of exponents
Grade 8 Expressions and Equations Essential Questions: 1. How do you use patterns to understand mathematics and model situations? 2. What is algebra? 3. How are the horizontal and vertical axes related?
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed
Operating Systems. Virtual Memory
Operating Systems Virtual Memory Virtual Memory Topics. Memory Hierarchy. Why Virtual Memory. Virtual Memory Issues. Virtual Memory Solutions. Locality of Reference. Virtual Memory with Segmentation. Page
EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES
ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com
Biometric Authentication using Online Signatures
Biometric Authentication using Online Signatures Alisher Kholmatov and Berrin Yanikoglu alisher@su.sabanciuniv.edu, berrin@sabanciuniv.edu http://fens.sabanciuniv.edu Sabanci University, Tuzla, Istanbul,
Statistiek (WISB361)
Statistiek (WISB361) Final exam June 29, 2015 Schrijf uw naam op elk in te leveren vel. Schrijf ook uw studentnummer op blad 1. The maximum number of points is 100. Points distribution: 23 20 20 20 17
Document Image Retrieval using Signatures as Queries
Document Image Retrieval using Signatures as Queries Sargur N. Srihari, Shravya Shetty, Siyuan Chen, Harish Srinivasan, Chen Huang CEDAR, University at Buffalo(SUNY) Amherst, New York 14228 Gady Agam and
Strategic Online Advertising: Modeling Internet User Behavior with
2 Strategic Online Advertising: Modeling Internet User Behavior with Patrick Johnston, Nicholas Kristoff, Heather McGinness, Phuong Vu, Nathaniel Wong, Jason Wright with William T. Scherer and Matthew
Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode Value
IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode
Colour Image Encryption and Decryption by using Scan Approach
Colour Image Encryption and Decryption by using Scan Approach, Rinkee Gupta,Master of Engineering Scholar, Email: guptarinki.14@gmail.com Jaipal Bisht, Asst. Professor Radharaman Institute Of Technology