EDIC RESEARCH PROPOSAL

High Dimensional Nearest Neighbors Techniques for Data Cleaning

Anca-Elena Alexandrescu
I&C, EPFL

Abstract

Organisations from all domains have been searching for ever more insight from the available information in order to add as much value as possible to their business decisions. As they grew, their demand for knowledge and the associated data grew exponentially, resulting in the Big Data era, which brings data sets so large and complex that they are impractical to manage with traditional software tools. Given the increase in the number of connected devices, dealing with the uncertainty of the data becomes a necessary evil, as sources often contain redundant data in different representations. Accurate results require processing on consistent data, which makes data cleaning increasingly important. This paper focuses on techniques for duplicate information retrieval. We first illustrate the usage of similarity search based on nearest neighbor techniques and the limitations of such approaches as the dimensionality of the data increases. In order to expand the range of applicability to big data, we then consider approximate nearest neighbors techniques. Finally, current work is presented and future research directions are discussed.

Index Terms: Data cleaning, Big Data, k-nearest neighbor, R-tree, Locality Sensitive Hashing

Proposal submitted to committee: June 4th, 2014
Candidacy exam date: June 11th, 2014
Candidacy exam committee: Prof. Christoph Koch, Prof. Anastasia Ailamaki, Prof. Willy Zwaenepoel

This research plan has been approved:
Date:
Doctoral candidate: (name and signature)
Thesis director: (name and signature)
Thesis co-director (if applicable): (name and signature)
Doct. prog. director (B. Falsafi): (signature)

EDIC-ru/

I. INTRODUCTION

The concept of Big Data refers to systems conveying large amounts of heterogeneous data that is constantly changing and growing; it is characterized by the three Vs: Volume, Variety and Velocity. The multitude of existing applications leads to a variety of data formats, which need to be processed together to get the most out of the collected data. In the past, data files were required to abide by a predefined structure, and a certain amount of time was spent on making data files compliant. In the Big Data era this is no longer possible, since great amounts of data are constantly collected and need to be consumed without delay.

In order to derive accurate insights from the collected data, processing has to be done on consistent information. The diversity of sources generating data can often lead to incomplete or duplicate records which, left unaddressed, will affect the quality of the results. Therefore, data cleaning becomes more and more important in the process of preparing the data, as it ensures the veracity of the data, which has recently been defined as the 4th V of Big Data.

A. Data Cleaning

Data cleaning is the process of detecting and removing errors and inconsistencies in the data, which results in improved data veracity. There are numerous sources of data accuracy problems, for example plain human error, missing information, and corruption of data during transmission. Organizations working with Big Data have to integrate multiple data sources. Given that different sources may base their data collection process on different sets of rules and representations of the information, the need for data cleaning increases significantly. Each data source can contain erroneous data as well as different data formats, since the collected information comes from multiple applications.
Furthermore, attributes that are split into categories are prone to misinterpretation, since the sources may use different representations of the data (for example, marital status values) or different meanings for the same value (for example, temperature in degrees Fahrenheit vs. degrees Celsius).

A very important problem in cleaning data from multiple sources is identifying overlapping records, which is often referred to as duplicate detection. The probability that two records overlap is proportional to the similarity between them. Duplicate detection based on similarity involves identifying close records, where closeness is evaluated using a similarity function chosen to suit the specifics of the application, for example the Euclidean distance between two records.

B. Objectives

Data cleaning becomes more and more important, but since the existing approaches were developed before the Big Data explosion, their performance is no longer considered sufficient. Stimulated by the growth of storage capacity, the increase in data volume is due not only to the fact that more and more devices generate data, but also to the fact that the number of attributes generated per record has been increased in order to collect as much information as possible, even if in the end not all of it is necessary. The information is gathered in vectors whose dimension corresponds to the number of collected attributes, and as such the dimensionality of the data sets that need to be processed keeps growing.

This paper focuses on nearest neighbors techniques which have been successfully applied to data in order to detect duplicates. The performance of the discussed techniques is evaluated and their limitations are highlighted in order to identify improvement opportunities. To this end, we first describe two hierarchical data structures based on the R-tree [3], the state-of-the-art approach for nearest neighbors search. Limitations of the two approaches are explored and, based on the fact that approximate nearest neighbors approaches are more time-efficient without a great loss of result accuracy, we propose Locality Sensitive Hashing as an alternative for duplicate detection in the context of Big Data.

The remainder of this paper is structured as follows: section II introduces the nearest neighbor join based on the Multipage Index (MuX).
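As a small illustration of the similarity-based duplicate detection described in section I-A, the following sketch flags pairs of records whose Euclidean distance falls below a threshold. It assumes records are already encoded as numeric vectors of equal length; the function name `candidate_duplicates` and the threshold value are hypothetical, and the quadratic scan is only a baseline, not one of the indexed techniques discussed in this proposal.

```python
import math
from itertools import combinations


def candidate_duplicates(records, threshold):
    """Return index pairs (i, j), i < j, whose Euclidean distance is <= threshold.

    Brute-force baseline: every pair of records is compared once.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if math.dist(a, b) <= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical 3-attribute records; the first two differ only slightly.
records = [(1.0, 2.0, 3.0), (1.0, 2.0, 3.1), (9.0, 9.0, 9.0)]
print(candidate_duplicates(records, threshold=0.5))  # [(0, 1)]
```

The number of comparisons grows quadratically with the number of records, which is one reason index-based nearest neighbor techniques are of interest for large data sets.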
The X-tree approach is described and evaluated in section III, followed in section IV by the exploration of Locality Sensitive Hashing. Finally, section V briefly presents current work and the future research proposal.

II. NEAREST NEIGHBOR JOIN

A very important operation in data cleaning, the similarity join has received a lot of attention with regard to duplicate detection. The similarity join between two data sets outputs all pairs of similar records between the two sets of input data. There are two well-known types of similarity join:

1) distance range join, which returns all pairs of records from the two data sets where the distance between the objects does not exceed a value ɛ, received as input.

2) k-distance join, which orders all pairs of records from the two data sets by increasing distance between the objects and returns the first k pairs, based on the value of k received as input.

In addition to these two types, a third kind of similarity join is introduced, the k-nearest neighbor similarity join (k-nn join, for short). In contrast to the two similarity joins presented above, which take into consideration all the pairs resulting from joining the two data sets, the k-nn join combines each record in one data set with its k nearest neighbors from the other data set.

A. The k-nearest neighbor join

The k-nearest neighbor join [1] uses a new index structure, the Multipage Index [2], which consists of large I/O pages supported by an additional R-tree structure to speed up main memory operations. The index is a height-balanced tree containing directory pages and data pages and is depicted in figure 1.

The secondary search structure is a modified R-tree consisting of a flat directory, called the page directory, and a constant number of leaves, the accommodated buckets. The page directory consists of an array of MBRs and pointers to the corresponding accommodated buckets.
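For reference, the semantics of the k-nn join, and its contrast with the distance range join, can be written as a brute-force sketch. This is only an illustration of what the joins return, assuming Euclidean distance and in-memory lists of points; it does not reproduce the Multipage Index or the algorithm of [1], and the function names are hypothetical.

```python
import heapq
import math


def distance_range_join(outer, inner, eps):
    """All pairs (p, q) with p in outer, q in inner and distance(p, q) <= eps."""
    return [(p, q) for p in outer for q in inner if math.dist(p, q) <= eps]


def knn_join(outer, inner, k):
    """Map the index of each outer point to its k nearest points of inner."""
    return {
        i: heapq.nsmallest(k, inner, key=lambda q: math.dist(p, q))
        for i, p in enumerate(outer)
    }


outer = [(0.0, 0.0), (10.0, 10.0)]
inner = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0), (5.0, 5.0)]

# The range join returns a variable number of pairs per outer point,
# whereas the k-nn join returns exactly k partners for each outer point
# (as long as inner holds at least k points).
print(len(distance_range_join(outer, inner, eps=2.0)))  # 3
print(knn_join(outer, inner, k=2))
```

The sketch makes the key difference visible: the size of the k-nn join result depends only on k and the size of the outer set, not on a distance threshold.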
Both data and directory pages are assigned to a rectilinear region in main memory and to a block on disk. The pages are I/O-optimized and are called hosting pages. If a hosting page is a data page, its accommodated buckets are data buckets containing data records; if it is a directory page, its accommodated buckets are directory buckets storing pairs of an MBR and a pointer to another hosting page.

Fig. 1: Multipage Index

In order to compute the similarity join, the hosting pages of both relations are processed in two nested loops, with each hosting page of the outer set being accessed exactly once. For each point in the current page of the outer set, an array is allocated to hold the candidates until the requested nearest neighbors have been confirmed.

Given that the k-nn join simultaneously searches for the nearest neighbors of all the points inside a hosting page, it is very important to exclude as many hosting pages and accommodated buckets of the inner data set as possible. To this end, a page quality measure has been defined that takes into account both the distance to the current buckets and the distance to the last candidate point, used as the pruning distance. Based on this quality measure, a loading strategy ensures that the next page to load is the one which brings the highest gain. Moreover, a processing strategy is defined to find the best processing order for the accommodated buckets that are already loaded in main memory. Similar to the page quality mentioned above, a quality measure for bucket pairs has been defined to give more importance to buckets within a smaller distance.

B. Performance evaluation

The k-nearest neighbor join algorithm has been evaluated on both synthetic and real data sets of varying sizes and dimensions, and has been compared to the nested block loop join and to a conventional non-join technique, the k-nn algorithm by Hjaltason and Samet [11].

Results on real 9-dimensional data for varying sizes of the data set have shown a maximum speed-up factor of 17 for the k-nn join over the nested block loop join. Tests performed on 16-dimensional real data show that the k-nn join still obtains better results than the nested block loop join, but reaches a speed-up of only 1.3 for the 80,000 point set. The performance obtained by the k-nn join on 16-dimensional real data is depicted in figure 2.

Fig. 2: Results for 16-dimensional real data

C. K-nn Join Discussion

Data cleaning techniques often use k-nearest neighbor queries to gain knowledge about the processed data sets, which in most cases means running a k-nearest neighbor query for each point of the data set. The authors aim to offer an alternative by replacing the large number of k-nearest neighbor queries with the proposed k-nearest neighbor join.

The results show that the proposed algorithm using the Multipage Index obtains satisfactory results and is efficient for at least 9 dimensions. The fact that the speed-up obtained for 16-dimensional data is much lower than the one for 9-dimensional data shows the limitation of the algorithm's performance due to increasing data dimensionality.

With the increase in data volume and dimensionality brought by Big Data, k-nearest neighbors algorithms which perform well on high-dimensional data have become a necessity. In order to achieve the required performance, algorithms must be designed to overcome the curse of dimensionality, which is due to the exponential increase of data volume associated with the addition of extra dimensions to a data set.

III. X-TREE APPROACH

Many existing applications are already collecting huge amounts of data consisting of millions of objects with tens to a few hundred dimensions. In order to process such large amounts of information, it is mandatory to use appropriate algorithms and indexing structures which provide efficient access to high-dimensional data.

Current algorithms are based on different variants of the state-of-the-art approach in k-nearest neighbor search, the R-tree, which is known to suffer from overlap and dead space. Overlap (figure 3) in an R-tree structure represents the percentage of space covered by more than one hyperrectangle. Because overlapping nodes in an R-tree force a query to follow multiple paths to compute its answer, overlap directly affects query performance. This behaviour gets worse as data dimensionality increases, since more hyperrectangles overlap.

Fig. 3: Overlap in 2-dimensional data

The R*-tree [5] is a variant of the R-tree which tries to reduce overlap using a combination of a specialized split algorithm and forced reinsertion at node overflow. Results of the R*-tree evaluation on real data showed that the performance degrades very rapidly with the growing dimensionality of the data. Figure 4 illustrates these results.

Fig. 4: Performance of R-tree depending on data dimension (real data)
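To make the notion of overlap between hyperrectangles more concrete, the sketch below computes the volume shared by two axis-aligned minimum bounding rectangles (MBRs). It is a building block only: the overlap measure described above is a percentage over a whole directory, whereas this hypothetical helper `overlap_volume` handles a single pair. The `(lows, highs)` representation of an MBR is an assumption made for the example.

```python
def overlap_volume(a, b):
    """Volume of the intersection of two axis-aligned hyperrectangles.

    Each rectangle is a (lows, highs) pair of equal-length coordinate tuples.
    Returns 0.0 when the rectangles are disjoint or only touch.
    """
    volume = 1.0
    for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1]):
        side = min(a_hi, b_hi) - max(a_lo, b_lo)
        if side <= 0:
            return 0.0  # no extent along this dimension, hence no overlap
        volume *= side
    return volume


r1 = ((0.0, 0.0), (2.0, 2.0))
r2 = ((1.0, 1.0), (3.0, 3.0))
r3 = ((5.0, 5.0), (6.0, 6.0))
print(overlap_volume(r1, r2))  # 1.0
print(overlap_volume(r1, r3))  # 0.0
```

A query region that intersects the shared area of `r1` and `r2` would have to descend into both subtrees, which is the multiple-path effect described above.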

In spite of the R*-tree's optimizations aimed at minimizing overlap, a detailed investigation of important characteristics of the R*-tree has revealed that this behaviour is caused by the fact that overlap in the directory increases very rapidly with the growth of data dimensionality.

A. The X-tree Structure

Based on the insight regarding the effect of data overlap in high-dimensional data, the X-tree [4] avoids overlap through the use of supernodes. The supernodes are directory nodes which are extended beyond the usual block size in order to avoid index degeneration. The structure of the X-tree is presented in figure 5.

Fig. 5: Structure of the X-tree

It is important to notice that the X-tree differs from an R-tree with a larger block size, since the X-tree extends the nodes only when this is needed to avoid introducing overlapping nodes into the tree. As expected, the structure of the X-tree changes during index updates, since updates can lead to the reorganization of internal tree nodes. In order to obtain the most suitable configuration for the X-tree, specialized insertion and split algorithms aimed at minimizing overlap are applied.

The insert algorithm determines the structure of the X-tree and aims to avoid directory splits which could lead to overlap between the nodes. If an internal node of the X-tree has to be split during the insert procedure, the split algorithm takes into account properties of the MBRs such as dead-space partitioning and extension of the MBRs, and tries to find a split which does not introduce overlap or, if that is not possible, introduces the minimal amount of overlap.

B. Performance evaluation

The performance of the X-tree has been evaluated in comparison with the R*-tree on both synthetic and real data. The results have shown that the X-tree outperforms the R*-tree by up to orders of magnitude for both point and nearest neighbors queries on both types of data. Figure 6 shows that the X-tree obtains a maximum speed-up of 20 for 16-dimensional real data on nearest neighbors queries with k = 10.

Fig. 6: Speed-up of X-tree over R*-tree on Real Point Data for k = 10

C. Result interpretation

The X-tree has been created based on the insight that data overlap is the reason for the low performance of R-tree based approaches. The structure of the X-tree, together with specially designed insertion and split algorithms, contributes to achieving an increased performance level over the R*-tree for data with as many as 16 dimensions. However, since the supernodes become quite large as the number of data dimensions increases, the time needed to linearly scan their contents also keeps increasing, causing the X-tree to quickly reach its limits.

IV. LOCALITY SENSITIVE HASHING

Despite decades of research efforts, current solutions for finding the k-nearest neighbors in high-dimensional data do not provide the necessary performance. Although both approaches presented before achieve better performance than the state-of-the-art R-tree, they still have a dimension threshold beyond which their performance degrades. The authors of [12] observe that there are many applications of nearest neighbors search where the exact answer is not really needed and an approximate answer is good enough. Based on this insight and the assumption that approximate similarity search can be performed much faster than the exact one, a new technique relying on Locality Sensitive Hashing [6] is proposed.

Fig. 7: Locality Sensitive Hashing

Locality Sensitive Hashing (figure 7) provides an efficient approximate nearest neighbor search algorithm. In the first step, Locality Sensitive Hashing defines L randomly chosen hash functions, where L is received as input, and applies them, one at a time, to all points in the data set, mapping each point to a bucket in each of the L hash tables. In order to answer a query, the query point is also hashed with the L functions. Then, for each hash table, the data points falling into the same bucket as the query point are retrieved, and the answer to the query is among them.

A. Locality Sensitive Hashing-based algorithm

This new approach allows fast retrieval of an approximate answer, which will probably be good enough for most cases, followed by a slower but accurate computation for the few cases which require an exact answer.

The idea behind the algorithm is that the probability of two points p and q being hashed to the same bucket is closely related to the distance between the points. The algorithm uses two levels of hashing: a Locality Sensitive Hashing function, which maps a point p ∈ P to a bucket, as briefly described above; and a standard hash function, which maps the contents of the buckets into a hash table.

In order to answer a query, the buckets are processed until either a sufficient number of points has been found or all the buckets have been searched. For approximate k-nearest neighbor search, the k closest points to the query are returned. If fewer than k points have been encountered, the output will also contain fewer than k results.

B. Performance evaluation

The performance of the Locality Sensitive Hashing-based algorithm has been evaluated and compared to the performance of the SR-tree [8] on real data sets using two performance measures: speed, for both approaches, and accuracy of the Locality Sensitive Hashing-based algorithm.
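As an aside to section IV-A, the two-level scheme can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm of [12]: the hash family used here is random hyperplanes (sign of a dot product), chosen only because it is short to write, and the class name `LSHIndex` and its parameters `num_tables` (the L of the text) and `bits_per_key` are hypothetical. Python's built-in dictionary plays the role of the standard second-level hash.

```python
import math
import random
from collections import defaultdict


class LSHIndex:
    """Toy LSH index: L hash tables, each keyed by a tuple of sign bits."""

    def __init__(self, dim, num_tables, bits_per_key, seed=0):
        rng = random.Random(seed)
        # Assumed hash family: each table uses bits_per_key random hyperplanes.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits_per_key)]
            for _ in range(num_tables)
        ]
        # Second level: a standard hash table (dict) from bucket key to points.
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, t, point):
        # One bit per hyperplane: which side of the plane the point lies on.
        return tuple(
            sum(w * x for w, x in zip(plane, point)) >= 0
            for plane in self.planes[t]
        )

    def add(self, point):
        for t, table in enumerate(self.tables):
            table[self._key(t, point)].append(point)

    def query(self, q, k):
        # Collect the points sharing a bucket with q in any table, then rank
        # only those candidates by exact distance. May return fewer than k.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, q), []))
        return sorted(candidates, key=lambda p: math.dist(p, q))[:k]


index = LSHIndex(dim=2, num_tables=4, bits_per_key=3, seed=1)
for p in [(1.0, 1.0), (1.1, 0.9), (-5.0, 2.0), (0.5, -3.0)]:
    index.add(p)
print(index.query((1.0, 1.0), k=2))
```

Points are stored as tuples so that they can be collected in a set. Because only candidates from matching buckets are ranked, the result is approximate, and it contains fewer than k points when too few candidates are encountered, as described in section IV-A.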
The SR-tree is an extension of the R*-tree which combines bounding spheres and bounding rectangles, aiming to improve the performance of nearest neighbors queries by reducing the region's volume and diameter.

Given that the SR-tree outputs the exact nearest neighbors while Locality Sensitive Hashing returns a list of approximate nearest neighbors, the algorithm for the SR-tree has been modified to run only on a random sample of the data. Therefore, the modified SR-tree also outputs approximate k-nearest neighbors, and it achieves a speed-up due to the fact that the data set used for execution is smaller.

As figure 8 shows, the improvement obtained by the Locality Sensitive Hashing algorithm over the modified SR-tree is up to an order of magnitude on a data set containing 270,000 points. The Locality Sensitive Hashing-based algorithm scales really well with the increase in data dimensionality, as the number of disk accesses for ɛ = 0.1 grows by 2 when the dimensionality increases from 8 to 64. As expected, the SR-tree's performance degrades rapidly with the increase in data dimensionality. The results are illustrated in figure 9.

Fig. 8: Performance vs error

Fig. 9: Approximate 10-NNS, α=2

C. Discussion

A new indexing method based on Locality Sensitive Hashing has been proposed as an alternative to hierarchical data structures. Although the Locality Sensitive Hashing algorithm returns approximate nearest neighbors, the extensive evaluation has shown that allowing a small loss of accuracy considerably improves execution time.

On the other hand, a major drawback of Locality Sensitive Hashing is the fact that randomly choosing the hash functions can lead to a poor space partitioning.
Therefore, choosing more appropriate hash functions could be taken into consideration and, given that the Locality Sensitive Hashing-based approach already scales really well for high-dimensional data, it could be a good candidate for the next generation of k-nearest neighbors data cleaning techniques.

V. FUTURE WORK

An important factor in achieving big data success is having the appropriate hardware and software resources for processing the accumulated data. With recent improvements in hardware allowing a constant increase in data volume, investing time and money into data curation is no longer an option; dealing with the uncertainty of the data is therefore now a necessary evil.

Although quite a large number of tools of varying functionality support the data cleaning process, a significant portion of the cleaning work has to be done either manually or by low-level programs that are difficult to write and maintain [7]. This kind of approach becomes increasingly difficult as the number of dimensions in the data set grows.

The focus of current research has been on efficient methods for in-memory processing of k-nearest neighbor queries, and a new indexing approach which avoids the issues of hierarchical structures has been investigated. In the context of the NoDB/ViDa project, which enables efficient queries on raw, heterogeneous data without pre-formatting it or loading it into a database, future research will focus on designing algorithms aimed at improving the quality of raw data, so that the results obtained by the new processing techniques do not lose accuracy by processing low-quality data.

A. Current Research

When processing data stored on disk, the majority of the time was spent transferring data back and forth to the CPU. As such, the research on and improvements to the k-nearest neighbors approaches have focused on reducing the time spent on data transfer. Given that disk I/O is considerably slower than memory access, the time spent actually computing distances between the query point and the objects in the index was hidden by the transfer time. As in-memory approaches gain popularity due to the increase of available memory in computer systems, the optimizations applied to these indexing methods prove to be less effective and the need for new ideas emerges.

The SNAIL algorithm is an in-memory k-nearest neighbor technique which avoids hierarchical data structures by using a grid-type structure, which reduces the time spent traversing the index. The main idea of the algorithm is to find as many groups of points as possible that can be added to the result set without individually computing the distance from each point to the query.
The groups we are using are the grid cells, and the intuition is that if, for a cell C, the total number of points in its neighboring cells is lower than the number of searched neighbors, we can conclude that all the points in cell C can be added directly to the result set. As expected, at some point the neighboring cells will contain more points than the number of searched neighbors, and the computation of individual distances will be necessary.

SNAIL has been evaluated on 3-dimensional synthetic data sets with sizes up to 5 GB, and the preliminary results have shown that greater resolution values result in greater computation time, as the number of objects that need to be individually checked grows. The best execution times have been obtained with a grid resolution of 50 cells per dimension.

Current results show that the SNAIL building process is almost 20 times more efficient than the building of the R-tree. Given that SNAIL computes the individual distance for a large number of objects in the data set, for a high enough value of k, SNAIL's execution time outperforms that of the R-tree approach.

B. Future Directions

Due to the big data effect, the Curse of Dimensionality has been studied on several problems, such as clustering, indexing and nearest neighbors search, and it seems that the tendency of data to become sparse in high-dimensional space is only part of the problem.

Recent results show that in high-dimensional space the concept of proximity or nearest neighbor may not be quantitatively meaningful, and fractional distance metrics, that is, L_p norms where p is allowed to be a fraction smaller than 1, are shown to be more accurate [10]. More precisely, the distance computation will be made using the following fractional distance metric:

\[ \mathrm{dist}_d^f(x, y) = \left[ \sum_{i=1}^{d} |x_i - y_i|^f \right]^{1/f} \]

Given that SNAIL shows great potential for efficiently processing large data sets, the possibility of adapting it for high-dimensional data will be explored.
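The fractional distance metric given above can be evaluated directly. The sketch below is a straightforward transcription of the formula, assuming points are given as equal-length numeric sequences; the function name `fractional_distance` is hypothetical. With f = 2 it reduces to the Euclidean distance and with f = 1 to the Manhattan distance, while values of f below 1 give the fractional case discussed in [10].

```python
def fractional_distance(x, y, f):
    """L_f distance between x and y; f may be a fraction smaller than 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    if f <= 0:
        raise ValueError("f must be positive")
    # Sum the f-th powers of the absolute coordinate differences,
    # then take the (1/f)-th power of the sum.
    return sum(abs(a - b) ** f for a, b in zip(x, y)) ** (1.0 / f)


x, y = (0.0, 0.0), (3.0, 4.0)
print(fractional_distance(x, y, 2))    # Euclidean distance
print(fractional_distance(x, y, 1))    # Manhattan distance
print(fractional_distance(x, y, 0.5))  # fractional metric with f = 0.5
```

Note that for f < 1 the function no longer satisfies the triangle inequality, which is something an index relying on metric properties would have to take into account.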
One of the challenges that comes up is the choice of the distance metric, and the use of fractional distance will be taken into consideration in order to decide whether it is appropriate for duplicate detection. Also, since SNAIL has been designed as an in-memory algorithm and the data volume is constantly growing, another challenge will be to adapt it for on-disk data processing.

Data cleaning is a vast research area, and while we proposed an algorithm that helps detect duplicates, other quality problems also exist in data files. Given that the aim is to minimize the time spent processing data in order to correct it, another challenge is to explore the possibility of designing algorithms that can address more than one of these issues at a time.

REFERENCES

[1] C. Böhm, F. Krebs, High Performance Data Mining Using the Nearest Neighbor Join, IEEE International Conference on Data Mining (ICDM), 2002
[2] C. Böhm, H. P. Kriegel, A Cost Model and Index Architecture for the Similarity Join, IEEE International Conference on Data Engineering (ICDE), 2001
[3] A. Guttman, R-trees: A Dynamic Index Structure for Spatial Searching, ACM SIGMOD International Conference on Management of Data, 1984
[4] S. Berchtold, D. A. Keim, H. P. Kriegel, The X-tree: An Index Structure for High-Dimensional Data, International Conference on Very Large Data Bases, 1996
[5] N. Beckmann, H. P. Kriegel, R. Schneider, B. Seeger, The R*-tree: An Efficient and Robust Access Method for Points and Rectangles, ACM SIGMOD International Conference on Management of Data, 1990
[6] P. Indyk, R. Motwani, Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality, ACM Symposium on Theory of Computing, 1998
[7] E. Rahm, H. H. Do, Data Cleaning: Problems and Current Approaches, IEEE Data Engineering Bulletin, 2000
[8] N. Katayama, S. Satoh, The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries, ACM SIGMOD International Conference on Management of Data, 1997
[9] S. Chaudhuri, V. Ganti, R. Kaushik, A Primitive Operator for Similarity Joins in Data Cleaning, IEEE International Conference on Data Engineering (ICDE), 2006
[10] C. C. Aggarwal, A. Hinneburg, D. A. Keim, On the Surprising Behavior of Distance Metrics in High Dimensional Space, International Conference on Database Theory (ICDT), 2001
[11] G. R. Hjaltason, H. Samet, Ranking in Spatial Databases, International Symposium on Large Spatial Databases (SSD), 1995
[12] A. Gionis, P. Indyk, R. Motwani, Similarity Search in High Dimensions via Hashing, International Conference on Very Large Data Bases, 1999