
EDIC RESEARCH PROPOSAL

High Dimensional Nearest Neighbors Techniques for Data Cleaning

Anca-Elena Alexandrescu
I&C, EPFL

Abstract: Organisations from all domains have been searching for ever more insights from the available information in order to add as much value as possible to their business decisions. As they grew over time, the demand for knowledge and the associated data grew exponentially, resulting in the Big Data era, which brings data sets so large and complex that they are impractical to manage with traditional software tools. Given the increase in the number of connected devices, dealing with the uncertainty of the data becomes a necessary evil, as sources often contain redundant data in different representations. In order to get accurate results, processing has to be done on consistent data, which makes data cleaning increasingly important. This paper focuses on techniques for retrieving duplicate information. We first illustrate the usage of similarity search based on nearest neighbor techniques and the limitations of such approaches as the dimensionality of the data increases. In order to expand the range of applicability to big data, we then consider using approximate nearest neighbors techniques. Finally, current work is presented and future research directions are discussed.

Index Terms: Data cleaning, Big Data, k-nearest neighbor, R-tree, Locality Sensitive Hashing

Proposal submitted to committee: June 4th, 2014; Candidacy exam date: June 11th, 2014; Candidacy exam committee: Prof. Christoph Koch, Prof. Anastasia Ailamaki, Prof. Willy Zwaenepoel.

This research plan has been approved:
Date:
Doctoral candidate: (name and signature)
Thesis director: (name and signature)
Thesis co-director: (if applicable) (name and signature)
Doct. prog. director: (B. Falsafi) (signature)
EDIC-ru/

I. INTRODUCTION

The concept of Big Data refers to systems conveying large amounts of heterogeneous data which is always changing and growing. It is characterized by the three Vs: Volume, Variety and Velocity. The multitude of existing applications leads to a variety of data formats which need to be processed together to get the most out of the collected data. In the past, data files were required to abide by a predefined structure, and a certain amount of time was spent on making data files compliant. In the Big Data era, this is no longer possible, since great amounts of data are constantly collected and need to be consumed without delay.

In order to derive accurate insights from the collected data, processing has to be done on consistent information. The diversity of sources generating data can often lead to incomplete or duplicate records which, left unaddressed, will affect the quality of the results. Therefore, data cleaning becomes more and more important in the process of preparing the data, as it ensures the veracity of the data, which has recently been defined as the 4th V of Big Data.

A. Data Cleaning

Data cleaning is the process of detecting and removing errors and inconsistencies in the data, which results in improved data veracity. There are numerous sources of data accuracy problems, for example plain human error, missing information, and corruption of data during transmission. Organizations working with Big Data have to integrate multiple data sources. Given that different sources may base their data collection process on different sets of rules and representations of the information, the need for data cleaning increases significantly. Each data source can contain erroneous data as well as different data formats, since the collected information comes from multiple applications.
Furthermore, categorical attributes are prone to misinterpretation, since the sources may use different representations of the data (for example, marital status values) or different meanings for the same value (for example, temperature in degrees Fahrenheit vs. degrees Celsius).

A very important problem in cleaning data from multiple sources is identifying overlapping records, which is often referred to as duplicate detection. The probability that two records are overlapping is proportional to the similarity between them. Duplicate detection based on similarities involves the identification of close records, where closeness is evaluated with a similarity function chosen to suit the specifics of the application, for example the Euclidean distance between two records.

B. Objectives

Data cleaning becomes more and more important, but given that the existing approaches were developed before the Big Data explosion, their performance is no longer considered sufficient. Stimulated by the growth of storage capacity, the increase in data volume is due not only to the fact that more and more devices generate data, but also to the fact that the number of attributes generated per record has increased in order to collect as much information as possible, even if in the end not all of it is necessary. The information is gathered in vectors whose dimension corresponds to the number of collected attributes, and as such the dimensionality of the data sets that need to be processed keeps growing.

This paper focuses on nearest neighbors techniques which have been successfully applied to data in order to detect duplicates. The performance of the discussed techniques is evaluated and their limitations are highlighted in order to detect improvement opportunities. To this end, we first describe two hierarchical data structures based on the R-tree [3], the state-of-the-art approach for nearest neighbors search. Limitations of the two approaches are explored and, based on the fact that approximate nearest neighbors approaches are more time-efficient without a great loss of result accuracy, we propose Locality Sensitive Hashing as an alternative for duplicate detection in the context of Big Data.

The remainder of this paper is structured as follows: section II introduces the nearest neighbor join based on the Multipage Index (MuX).
The X-tree approach is described and evaluated in section III, followed in section IV by the exploration of Locality Sensitive Hashing. Finally, section V briefly presents current work and the future research proposal.

II. NEAREST NEIGHBOR JOIN

A very important operation in data cleaning, the similarity join has received a lot of attention with regard to duplicate detection. The similarity join between two data sets outputs all pairs of similar records between the two sets of input data. There are two well-known types of similarity join:

1) distance range join, which returns all pairs of records from the two data sets where the distance between the objects does not exceed a threshold ε given as input.
2) k-distance join, which orders all pairs of records from the two data sets by increasing distance between the objects and returns the first k pairs, for a value of k given as input.

In addition to these two types, a third kind of similarity join is introduced, the k-nearest neighbor similarity join (k-nn join, for short). In contrast to the two similarity joins presented above, which take into consideration all the pairs resulting from joining the two data sets, the k-nn join combines each record in one data set with its k nearest neighbors from the other data set.

A. The k-nearest neighbor join

The k-nearest neighbor join [1] uses a new index structure, the Multipage Index [2], which consists of large I/O pages supported by an additional R-tree structure to speed up the main memory operations. The index is a height-balanced tree containing directory pages and data pages and is depicted in figure 1. The secondary search structure is a modified R-tree consisting of a flat directory, called the page directory, and a constant number of leaves, the accommodated buckets. The page directory consists of an array of MBRs (minimum bounding rectangles) and pointers to the corresponding accommodated buckets.
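The three join types defined at the start of this section can be illustrated with brute-force versions. The sketch below is only meant to make the definitions concrete: it does not use the Multipage Index, the function names are hypothetical, and records are assumed to be tuples of floats compared with the Euclidean distance.

```python
import heapq
import math
from itertools import product


def distance_range_join(R, S, eps):
    """All pairs (r, s) whose Euclidean distance does not exceed eps."""
    return [(r, s) for r, s in product(R, S) if math.dist(r, s) <= eps]


def k_distance_join(R, S, k):
    """The k pairs (r, s) with the smallest distances over the whole cross product."""
    return heapq.nsmallest(k, product(R, S), key=lambda pair: math.dist(*pair))


def knn_join(R, S, k):
    """For every record r in R, its k nearest neighbours taken from S."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}


# Tiny illustrative data sets (made up for this example).
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (1.0, 1.0), (5.0, 6.0), (9.0, 9.0)]

range_pairs = distance_range_join(R, S, 1.0)
closest_pairs = k_distance_join(R, S, 2)
neighbours = knn_join(R, S, 1)
```

Note the difference in output size: the first two joins select from all |R| x |S| pairs, whereas the k-nn join always returns exactly k partners per record of R (or fewer if S holds fewer than k records).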
Both data and directory pages are assigned to a rectilinear region in main memory and to a block on disk. The pages are I/O optimized and are called hosting pages. If a hosting page is a data page, then the accommodated buckets are data buckets containing data records; if it is a directory page, then the accommodated buckets are directory buckets storing pairs of an MBR and a pointer to another hosting page.

Fig. 1: Multipage Index

In order to compute the similarity join, the hosting pages of both relations are processed in two nested loops, with each hosting page of the outer set being accessed exactly once. For each point in the current page of the outer set, an array is allocated to hold the candidates until the requested nearest neighbors have been confirmed. Given that the k-nn join simultaneously searches for the nearest neighbors of all the points inside a hosting page, it is very important to exclude as many hosting pages and accommodated buckets of the inner data set as possible. To this end, a page quality measure has been defined which takes into account both the distance to the current buckets and the distance to the last candidate point as pruning distance. Based on this quality measure, a loading strategy is used to ensure that the next page to load is the one which brings the highest gain. Moreover, a processing strategy is defined to find the best processing order for the accommodated buckets that are already loaded in main memory. Similar to the page quality mentioned above, a quality measure for bucket pairs has been defined to give more importance to buckets within a smaller distance.

B. Performance evaluation

The k-nearest neighbor join algorithm has been evaluated on both synthetic and real data sets of varying sizes and dimensions and has been compared to the nested block loop join and to a conventional non-join technique, the k-nn algorithm by Hjaltason and Samet [11]. Results on real 9-dimensional data for varying sizes of the data set have shown a maximum speed-up factor of 17 for the k-nn join over the nested block loop join. Tests performed on 16-dimensional real data show that the k-nn join still obtains better results than the nested block loop join, but reaches a speed-up of only 1.3 for the 80,000 point set. The performance obtained by the k-nn join on 16-dimensional real data is depicted in figure 2.

Fig. 2: Results for 16-dimensional real data

C. K-nn Join Discussion

Data cleaning techniques often use k-nearest neighbor queries to gain knowledge about the processed data sets, which most of the time means that a k-nearest neighbor query is run for each point of the data set. The authors aim to offer an alternative by replacing the large number of k-nearest neighbor queries with the proposed k-nearest neighbor join. The results show that the proposed algorithm using the Multipage Index obtains satisfactory results and is efficient for at least 9 dimensions. The fact that the speed-up obtained for 16-dimensional data is much lower than the one for 9-dimensional data shows that the algorithm's performance is limited by increasing data dimensionality.

With the increase in data volume and dimensionality brought by Big Data, k-nearest neighbors algorithms which perform well on high-dimensional data have become a necessity. In order to achieve the required performance, algorithms must be designed to overcome the curse of dimensionality, which is due to the exponential increase of data volume associated with the addition of extra dimensions to a data set.

III. X-TREE APPROACH

Many existing applications already collect huge amounts of data consisting of millions of objects with tens to a few hundred dimensions. In order to process large amounts of information, it is mandatory to use appropriate algorithms and indexing structures which provide efficient access to high-dimensional data. Current algorithms are based on different variants of the state-of-the-art approach in k-nearest neighbor search, the R-tree, which is known to suffer from overlap and dead space.

Overlap (figure 3) in an R-tree structure is the percentage of space covered by more than one hyperrectangle. Since overlapping nodes force a query to follow multiple paths in the tree to compute its answer, overlap directly affects query performance. This behaviour also gets worse as data dimensionality increases, since this leads to overlap between more hyperrectangles.

Fig. 3: Overlap in 2-dimensional data

The R*-tree [5] is a variant of the R-tree which tries to reduce overlap using a combination of a specialized split algorithm and forced reinsertion on node overflow. Results of the R*-tree evaluation on real data showed that the performance degrades very rapidly with the growing dimensionality of data. Figure 4 illustrates these results.

Fig. 4: Performance of R-tree depending on data dimension (real data)
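To make the notion of overlap used in this section concrete, the following sketch computes, for a small set of axis-aligned hyperrectangles, the fraction of the covered space that is covered by more than one of them. It is an illustration only: the function name and the box representation as (lower corner, upper corner) pairs are assumptions, and the cell enumeration is exponential in the number of dimensions, so it is suitable for toy inputs rather than real directories.

```python
from itertools import product


def overlap_fraction(boxes):
    """Volume covered by at least two boxes divided by volume covered by at least one.

    boxes: list of (lo, hi) pairs, where lo and hi are coordinate tuples.
    """
    dims = len(boxes[0][0])
    # The sorted distinct box boundaries per dimension define an irregular grid;
    # every grid cell is either fully inside or fully outside each box.
    cuts = [
        sorted({box[side][d] for box in boxes for side in (0, 1)})
        for d in range(dims)
    ]
    covered = shared = 0.0
    for cell in product(*(range(len(c) - 1) for c in cuts)):
        lo = [cuts[d][i] for d, i in enumerate(cell)]
        hi = [cuts[d][i + 1] for d, i in enumerate(cell)]
        volume = 1.0
        for a, b in zip(lo, hi):
            volume *= b - a
        # Count the boxes that contain this cell.
        count = sum(
            all(b_lo[d] <= lo[d] and hi[d] <= b_hi[d] for d in range(dims))
            for b_lo, b_hi in boxes
        )
        if count >= 1:
            covered += volume
        if count >= 2:
            shared += volume
    return shared / covered if covered else 0.0


# Two 2 x 2 squares sharing a 1 x 1 corner: union area 7, shared area 1.
two_squares = overlap_fraction([((0.0, 0.0), (2.0, 2.0)), ((1.0, 1.0), (3.0, 3.0))])
```

A query point falling into the shared region has to descend into both subtrees, which is why a higher fraction translates into more paths followed per query.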

In spite of the R*-tree's optimizations aimed at minimizing overlap, a detailed investigation of important characteristics of the R*-tree has revealed that this behaviour is caused by the fact that overlap in the directory increases very rapidly with the growth of data dimensionality.

A. The X-tree Structure

Based on the insight regarding the effect of overlap in high-dimensional data, the X-tree [4] avoids overlap through the use of supernodes. The supernodes are directory nodes which are extended beyond the usual block size in order to avoid index degeneration. The structure of the X-tree is presented in figure 5.

Fig. 5: Structure of the X-tree

It is important to notice that the X-tree differs from an R-tree with a larger block size, since the X-tree extends a node only when this is needed to avoid inserting overlapping nodes into the tree. As expected, the structure of the X-tree changes during index updates, since these can lead to a reorganization of internal tree nodes. In order to obtain the most suitable configuration for the X-tree, specialized insertion and split algorithms aimed at minimizing overlap are applied.

The insert algorithm determines the structure of the X-tree and aims to avoid directory splits which could lead to overlap between the nodes. If an internal node of the X-tree has to be split during the insert procedure, the split algorithm takes into account properties of the MBRs such as dead-space partitioning and extension of the MBRs, and tries to find a split which does not introduce overlap or, if that is not possible, introduces the minimal amount of overlap.

B. Performance evaluation

The performance of the X-tree has been evaluated in comparison with the R*-tree on both synthetic and real data. The results have shown that the X-tree outperforms the R*-tree by up to orders of magnitude for both point and nearest neighbors queries on both types of data. Figure 6 shows that the X-tree obtains a maximum speed-up of 20 for 16-dimensional real data on nearest neighbors queries with k = 10.

Fig. 6: Speed-up of X-tree over R*-tree on Real Point Data for k = 10

C. Result interpretation

The X-tree has been created based on the insight that overlap is the reason for the low performance of R-tree based approaches. The structure of the X-tree, together with specially designed insertion and split algorithms, contributes to an increased performance level over the R*-tree for data with as many as 16 dimensions. However, since the supernodes become quite large as the number of data dimensions increases, the time needed to linearly scan their contents also keeps increasing, causing the X-tree to quickly reach its limits.

IV. LOCALITY SENSITIVE HASHING

Despite decades of research efforts, current solutions for finding the k-nearest neighbors in high-dimensional data do not provide the necessary performance. Although both approaches presented before achieve better performance than the state-of-the-art R-tree, they still have a dimension threshold beyond which their performance degrades. The authors of [12] observe that there are many applications of nearest neighbors search where the exact answer is not really needed and an approximate answer is good enough. Based on this insight and the assumption that approximate similarity search can be performed much faster than exact search, a new technique relying on Locality Sensitive Hashing [6] is proposed.

Fig. 7: Locality Sensitive Hashing

Locality Sensitive Hashing (figure 7) provides an efficient approximate nearest neighbor search algorithm. In the first step, Locality Sensitive Hashing defines L randomly chosen hash functions, where L is received as input. These are then applied, one at a time, to all points in the data set, mapping each point to a bucket in each of the L hash tables. In order to answer a query, the query point is also hashed with the L functions. Then, for each hash table, the data points in the same bucket as the query point are retrieved, and the answer to the query is among them.

A. Locality Sensitive Hashing-based algorithm

This new approach allows fast retrieval of an approximate answer, which will probably be good enough in most cases, followed by a slower but accurate computation for the few cases which require an exact answer. The idea behind the algorithm is that the probability of two points p and q being hashed to the same bucket is closely related to the distance between the points. The algorithm uses two levels of hashing: a Locality Sensitive Hashing function, which maps a point p ∈ P to a bucket, as briefly described above; and a standard hash function, which maps the contents of the buckets into a hash table.

In order to answer a query, the buckets are processed until either a sufficient number of points has been found or all the buckets have been searched. For approximate k-nearest neighbor search, the k closest points to the query are returned. If fewer than k points have been encountered, then the output will also contain fewer than k results.

B. Performance evaluation

The performance of the Locality Sensitive Hashing-based algorithm has been evaluated and compared to the performance of the SR-tree [8] on real data sets using two performance measures: speed, for both approaches, and accuracy, for the Locality Sensitive Hashing-based algorithm.
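As an illustration of the scheme described in section IV-A, the sketch below builds L hash tables, hashes every point into one bucket per table, and answers a query by ranking only the points that share a bucket with it. Several choices here are assumptions made for the example and are not taken from [12]: the hash family is the random-projection form h(v) = floor((a·v + b) / w), which targets Euclidean distance; the class name `LSHIndex` and the parameter defaults are hypothetical; and Python dictionaries stand in for the second-level standard hash function.

```python
import math
import random
from collections import defaultdict


class LSHIndex:
    """Minimal LSH index over tuples of floats (illustrative sketch only)."""

    def __init__(self, dim, num_tables=8, hashes_per_table=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One composite hash function per table; each is a list of (a, b) pairs
        # with a drawn from a Gaussian and b uniform in [0, width).
        self.funcs = [
            [
                ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
                for _ in range(hashes_per_table)
            ]
            for _ in range(num_tables)
        ]
        # Second level: a dict maps each bucket key to the points stored in it.
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, table_id, point):
        return tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / self.width)
            for a, b in self.funcs[table_id]
        )

    def add(self, point):
        for t, table in enumerate(self.tables):
            table[self._key(t, point)].append(point)

    def query(self, q, k):
        # Collect every point that shares a bucket with q in at least one table.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, q), ()))
        # Rank only the candidates; fewer than k results are possible.
        return sorted(candidates, key=lambda p: math.dist(p, q))[:k]


index = LSHIndex(dim=3)
for p in [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (10.0, 10.0, 10.0)]:
    index.add(p)
nearest = index.query((10.0, 10.0, 10.0), 1)
```

Increasing the number of tables raises the chance that a true neighbor shares a bucket with the query at the price of more memory, while more hash functions per table make buckets smaller and the candidate set shorter.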
The SR-tree is an extension of the R*-tree which combines bounding spheres and bounding rectangles, aiming to improve the performance of nearest neighbors queries by reducing the region's volume and diameter. Given that the SR-tree outputs the exact nearest neighbors while Locality Sensitive Hashing returns a list of approximate nearest neighbors, the algorithm for the SR-tree has been modified to run only on a random sample of the data. Therefore, the modified SR-tree also outputs approximate k-nearest neighbors, and it achieves a speed-up because the data set used for execution is smaller.

As figure 8 shows, the improvement obtained by the Locality Sensitive Hashing algorithm over the modified SR-tree is up to an order of magnitude on a data set containing 270,000 points. The Locality Sensitive Hashing-based algorithm scales really well with the increase in data dimensionality, as the number of disk accesses for ε = 0.1 grows by 2 for a dimensional increase from 8 to 64. As expected, the SR-tree's performance degrades rapidly with the increase of data dimensionality. The results are illustrated in figure 9.

Fig. 8: Performance vs error

Fig. 9: Approximate 10-NNS, α=2

C. Discussion

A new indexing method based on Locality Sensitive Hashing has been proposed as an alternative to hierarchical data structures. Although the Locality Sensitive Hashing algorithm returns approximate nearest neighbors, the extensive evaluation has shown that allowing a small accuracy loss considerably improves execution time. On the other hand, a major drawback of Locality Sensitive Hashing is that randomly choosing the hash functions can lead to a poor space partitioning.
Therefore, choosing more appropriate hash functions could be taken into consideration and, given that the Locality Sensitive Hashing-based approach already scales really well for high-dimensional data, it could be a good candidate for the next generation of k-nearest neighbors data cleaning techniques.

V. FUTURE WORK

An important factor in achieving big data success is having the appropriate hardware and software resources for processing the accumulated data. With the recent improvements in hardware which allow a constant increase in data volume, investing time and money into data curation is no longer an option, and dealing with the uncertainty of the data is now a necessary evil. Although quite a large number of tools of varying functionality support the data cleaning process, a significant portion of the cleaning work has to be done either manually or by low-level programs that are difficult to write and maintain [7]. This kind of approach becomes increasingly difficult as the number of dimensions in the data set grows.

The focus of current research has been on efficient methods for in-memory processing of k-nearest neighbor queries, and a new indexing approach which avoids the issues of hierarchical structures has been investigated. In the context of the NoDB/ViDa project, which enables efficient queries on raw, heterogeneous data without pre-formatting it or loading it into a database, future research will focus on designing algorithms aimed at improving the quality of raw data, so that the results obtained by the new processing techniques do not lose accuracy by processing low-quality data.

A. Current Research

When processing data stored on disk, the majority of the time is spent transferring data to and from the CPU. As such, the research on and improvements to k-nearest neighbors approaches have focused on reducing the time spent on data transfer. Given that disk I/O is considerably slower than memory access, the time spent actually computing distances between the query point and the objects in the index was hidden by the transfer time. As in-memory approaches gain popularity due to the increase of available memory in computer systems, the optimizations applied to these indexing methods prove to be less effective and the need for new ideas emerges.

The SNAIL algorithm is an in-memory k-nearest neighbor technique which avoids hierarchical data structures by using a grid-type structure which reduces the time spent traversing the index. The main idea of the algorithm is to find as many groups of points as possible that can be added to the result set without individually computing the distance from each point to the query.
The groups we are using are the grid cells, and the intuition is that if, for a cell C, the total number of points in its neighboring cells is lower than the number of searched neighbors, we can conclude that all the points in cell C can be added directly to the result set. As expected, at some point the neighboring cells will contain more points than the number of searched neighbors, and the computation of individual distances will be necessary.

SNAIL has been evaluated on 3-dimensional synthetic data sets with sizes up to 5 GB, and the preliminary results have shown that greater resolution values result in greater computation time, as the number of objects that need to be individually checked grows. The best execution times have been obtained with a grid resolution of 50 cells per dimension. Current results show that the SNAIL building process is almost 20 times more efficient than the building of the R-tree. Given the fact that SNAIL computes the individual distance for a large number of objects in the data set, for a high enough value of k, SNAIL execution time outperforms the R-tree approach.

B. Future Directions

Due to the big data effect, the curse of dimensionality has been studied on several problems, such as clustering, indexing and nearest neighbors search, and it seems that the sparsity of data in high-dimensional space is only part of the problem. Recent results show that in high-dimensional space the concept of proximity or nearest neighbor may not be quantitatively meaningful, and fractional distance metrics, that is, L_p-norms where p is allowed to be a fraction smaller than 1, are shown to be more accurate [10]. More precisely, the distance computation will be made using the following fractional distance metric, where d is the number of dimensions and f is the fractional exponent:

$$\mathrm{dist}_d^f(x, y) = \left[ \sum_{i=1}^{d} |x_i - y_i|^f \right]^{1/f}$$

Given that SNAIL shows great potential for efficiently processing large data sets, the possibility of adapting it for high-dimensional data will be explored.
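The fractional distance metric given above is straightforward to compute. The following sketch is a minimal illustration, with a hypothetical function name and points assumed to be equal-length sequences of numbers; setting f = 2 recovers the Euclidean distance and f = 1 the Manhattan distance.

```python
def fractional_distance(x, y, f):
    """Distance (sum_i |x_i - y_i|^f)^(1/f) for an exponent f > 0.

    For 0 < f < 1 this is the fractional distance; the triangle inequality
    does not hold in that range, so it is not a metric in the strict sense.
    """
    if f <= 0:
        raise ValueError("f must be positive")
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) ** f for a, b in zip(x, y)) ** (1.0 / f)


origin, point = (0.0, 0.0), (3.0, 4.0)
manhattan = fractional_distance(origin, point, 1.0)
euclidean = fractional_distance(origin, point, 2.0)
fractional = fractional_distance(origin, point, 0.5)
```

With f below 1, a difference spread over several coordinates weighs more than the same total difference concentrated in one coordinate, which is the opposite of what the Euclidean distance does.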
One of the challenges that come up is the choice of the distance metric, and the use of a fractional distance will be taken into consideration in order to decide whether it is appropriate for duplicate detection. Also, since SNAIL has been designed as an in-memory algorithm and the data volume is constantly growing, another challenge will be to adapt it for on-disk data processing.

Data cleaning is a vast research area, and while we proposed an algorithm that helps detect duplicates, other quality problems also exist in data files. Given that the aim is to minimize the time spent processing data in order to correct it, another challenge is to explore the possibility of designing algorithms that can address more than one of these issues at a time.

REFERENCES

[1] C. Böhm, F. Krebs, High Performance Data Mining Using the Nearest Neighbor Join, IEEE International Conference on Data Mining (ICDM), 2002
[2] C. Böhm, H. P. Kriegel, A Cost Model and Index Architecture for the Similarity Join, IEEE International Conference on Data Engineering (ICDE), 2001
[3] A. Guttman, R-trees: A Dynamic Index Structure for Spatial Searching, ACM SIGMOD International Conference on Management of Data, 1984
[4] S. Berchtold, D. A. Keim, H. P. Kriegel, The X-tree: An Index Structure for High-Dimensional Data, International Conference on Very Large Data Bases, 1996
[5] N. Beckmann, H. P. Kriegel, R. Schneider, B. Seeger, The R*-tree: An Efficient and Robust Access Method for Points and Rectangles, ACM SIGMOD International Conference on Management of Data, 1990
[6] P. Indyk, R. Motwani, Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality, ACM Symposium on Theory of Computing, 1998
[7] E. Rahm, H. H. Do, Data Cleaning: Problems and Current Approaches, IEEE Data Engineering Bulletin, 2000
[8] N. Katayama, S. Satoh, The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries, ACM SIGMOD International Conference on Management of Data, 1997
[9] S. Chaudhuri, V. Ganti, R. Kaushik, A Primitive Operator for Similarity Joins in Data Cleaning, IEEE International Conference on Data Engineering (ICDE), 2006
[10] C. C. Aggarwal, A. Hinneburg, D. A. Keim, On the Surprising Behavior of Distance Metrics in High Dimensional Space, International Conference on Database Theory (ICDT), 2001
[11] G. R. Hjaltason, H. Samet, Ranking in Spatial Databases, International Symposium on Large Spatial Databases (SSD), 1995
[12] A. Gionis, P. Indyk, R. Motwani, Similarity Search in High Dimensions via Hashing, International Conference on Very Large Data Bases, 1999


More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Comp 5311 Database Management Systems. 16. Review 2 (Physical Level)

Comp 5311 Database Management Systems. 16. Review 2 (Physical Level) Comp 5311 Database Management Systems 16. Review 2 (Physical Level) 1 Main Topics Indexing Join Algorithms Query Processing and Optimization Transactions and Concurrency Control 2 Indexing Used for faster

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

Cluster Analysis for Optimal Indexing

Cluster Analysis for Optimal Indexing Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference Cluster Analysis for Optimal Indexing Tim Wylie, Michael A. Schuh, John Sheppard, and Rafal A.

More information

QuickDB Yet YetAnother Database Management System?

QuickDB Yet YetAnother Database Management System? QuickDB Yet YetAnother Database Management System? Radim Bača, Peter Chovanec, Michal Krátký, and Petr Lukáš Radim Bača, Peter Chovanec, Michal Krátký, and Petr Lukáš Department of Computer Science, FEECS,

More information

2) What is the structure of an organization? Explain how IT support at different organizational levels.

2) What is the structure of an organization? Explain how IT support at different organizational levels. (PGDIT 01) Paper - I : BASICS OF INFORMATION TECHNOLOGY 1) What is an information technology? Why you need to know about IT. 2) What is the structure of an organization? Explain how IT support at different

More information

Voronoi Treemaps in D3

Voronoi Treemaps in D3 Voronoi Treemaps in D3 Peter Henry University of Washington phenry@gmail.com Paul Vines University of Washington paul.l.vines@gmail.com ABSTRACT Voronoi treemaps are an alternative to traditional rectangular

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

Effective Complex Data Retrieval Mechanism for Mobile Applications

Effective Complex Data Retrieval Mechanism for Mobile Applications , 23-25 October, 2013, San Francisco, USA Effective Complex Data Retrieval Mechanism for Mobile Applications Haeng Kon Kim Abstract While mobile devices own limited storages and low computational resources,

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

HELP DESK SYSTEMS. Using CaseBased Reasoning

HELP DESK SYSTEMS. Using CaseBased Reasoning HELP DESK SYSTEMS Using CaseBased Reasoning Topics Covered Today What is Help-Desk? Components of HelpDesk Systems Types Of HelpDesk Systems Used Need for CBR in HelpDesk Systems GE Helpdesk using ReMind

More information

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

More information

Ins+tuto Superior Técnico Technical University of Lisbon. Big Data. Bruno Lopes Catarina Moreira João Pinho

Ins+tuto Superior Técnico Technical University of Lisbon. Big Data. Bruno Lopes Catarina Moreira João Pinho Ins+tuto Superior Técnico Technical University of Lisbon Big Data Bruno Lopes Catarina Moreira João Pinho Mo#va#on 2 220 PetaBytes Of data that people create every day! 2 Mo#va#on 90 % of Data UNSTRUCTURED

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Physical Data Organization

Physical Data Organization Physical Data Organization Database design using logical model of the database - appropriate level for users to focus on - user independence from implementation details Performance - other major factor

More information

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

More information

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #2610771

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #2610771 ENHANCEMENTS TO SQL SERVER COLUMN STORES Anuhya Mallempati #2610771 CONTENTS Abstract Introduction Column store indexes Batch mode processing Other Enhancements Conclusion ABSTRACT SQL server introduced

More information

Performance evaluation of Web Information Retrieval Systems and its application to e-business

Performance evaluation of Web Information Retrieval Systems and its application to e-business Performance evaluation of Web Information Retrieval Systems and its application to e-business Fidel Cacheda, Angel Viña Departament of Information and Comunications Technologies Facultad de Informática,

More information

2 Associating Facts with Time

2 Associating Facts with Time TEMPORAL DATABASES Richard Thomas Snodgrass A temporal database (see Temporal Database) contains time-varying data. Time is an important aspect of all real-world phenomena. Events occur at specific points

More information

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com Common Patterns and Pitfalls for Implementing Algorithms in Spark Hossein Falaki @mhfalaki hossein@databricks.com Challenges of numerical computation over big data When applying any algorithm to big data

More information

Big Data Racing and parallel Database Technology

Big Data Racing and parallel Database Technology EFFICIENT DATA ANALYSIS SCHEME FOR INCREASING PERFORMANCE IN BIG DATA Mr. V. Vivekanandan Computer Science and Engineering, SriGuru Institute of Technology, Coimbatore, Tamilnadu, India. Abstract Big data

More information

InfiniteGraph: The Distributed Graph Database

InfiniteGraph: The Distributed Graph Database A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

More information

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Storage Management for Files of Dynamic Records

Storage Management for Files of Dynamic Records Storage Management for Files of Dynamic Records Justin Zobel Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia. jz@cs.rmit.edu.au Alistair Moffat Department of Computer Science

More information

Investigating the Effects of Spatial Data Redundancy in Query Performance over Geographical Data Warehouses

Investigating the Effects of Spatial Data Redundancy in Query Performance over Geographical Data Warehouses Investigating the Effects of Spatial Data Redundancy in Query Performance over Geographical Data Warehouses Thiago Luís Lopes Siqueira Ricardo Rodrigues Ciferri Valéria Cesário Times Cristina Dutra de

More information

CHAPTER-24 Mining Spatial Databases

CHAPTER-24 Mining Spatial Databases CHAPTER-24 Mining Spatial Databases 24.1 Introduction 24.2 Spatial Data Cube Construction and Spatial OLAP 24.3 Spatial Association Analysis 24.4 Spatial Clustering Methods 24.5 Spatial Classification

More information

Whitepaper. Innovations in Business Intelligence Database Technology. www.sisense.com

Whitepaper. Innovations in Business Intelligence Database Technology. www.sisense.com Whitepaper Innovations in Business Intelligence Database Technology The State of Database Technology in 2015 Database technology has seen rapid developments in the past two decades. Online Analytical Processing

More information

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM 152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented

More information

Secure Similarity Search on Outsourced Metric Data

Secure Similarity Search on Outsourced Metric Data International Journal of Computer Trends and Technology (IJCTT) volume 6 number 5 Dec 213 Secure Similarity Search on Outsourced Metric Data P.Maruthi Rao 1, M.Gayatri 2 1 (M.Tech Scholar,Department of

More information

Overview of Storage and Indexing

Overview of Storage and Indexing Overview of Storage and Indexing Chapter 8 How index-learning turns no student pale Yet holds the eel of science by the tail. -- Alexander Pope (1688-1744) Database Management Systems 3ed, R. Ramakrishnan

More information

Going Big in Data Dimensionality:

Going Big in Data Dimensionality: LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für

More information

Indexing Spatio-Temporal archive As a Preprocessing Alsuccession

Indexing Spatio-Temporal archive As a Preprocessing Alsuccession The VLDB Journal manuscript No. (will be inserted by the editor) Indexing Spatio-temporal Archives Marios Hadjieleftheriou 1, George Kollios 2, Vassilis J. Tsotras 1, Dimitrios Gunopulos 1 1 Computer Science

More information

When Is Nearest Neighbor Meaningful?

When Is Nearest Neighbor Meaningful? When Is Nearest Neighbor Meaningful? Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft CS Dept., University of Wisconsin-Madison 1210 W. Dayton St., Madison, WI 53706 {beyer, jgoldst,

More information

Storage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann

Storage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Storage Systems Autumn 2009 Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Scaling RAID architectures Using traditional RAID architecture does not scale Adding news disk implies

More information

Manual for BEAR Big Data Ensemble of Adaptations for Regression Version 1.0

Manual for BEAR Big Data Ensemble of Adaptations for Regression Version 1.0 Manual for BEAR Big Data Ensemble of Adaptations for Regression Version 1.0 Vahid Jalali David Leake August 9, 2015 Abstract BEAR is a case-based regression learner tailored for big data processing. It

More information

CUBE INDEXING IMPLEMENTATION USING INTEGRATION OF SIDERA AND BERKELEY DB

CUBE INDEXING IMPLEMENTATION USING INTEGRATION OF SIDERA AND BERKELEY DB CUBE INDEXING IMPLEMENTATION USING INTEGRATION OF SIDERA AND BERKELEY DB Badal K. Kothari 1, Prof. Ashok R. Patel 2 1 Research Scholar, Mewar University, Chittorgadh, Rajasthan, India 2 Department of Computer

More information

Fuzzy Duplicate Detection on XML Data

Fuzzy Duplicate Detection on XML Data Fuzzy Duplicate Detection on XML Data Melanie Weis Humboldt-Universität zu Berlin Unter den Linden 6, Berlin, Germany mweis@informatik.hu-berlin.de Abstract XML is popular for data exchange and data publishing

More information

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea Proceedings of the 211 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. A BINARY PARTITION-BASED MATCHING ALGORITHM FOR DATA DISTRIBUTION MANAGEMENT Junghyun

More information

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Ag + -tree: an Index Structure for Range-aggregation Queries in Data Warehouse Environments

Ag + -tree: an Index Structure for Range-aggregation Queries in Data Warehouse Environments Ag + -tree: an Index Structure for Range-aggregation Queries in Data Warehouse Environments Yaokai Feng a, Akifumi Makinouchi b a Faculty of Information Science and Electrical Engineering, Kyushu University,

More information

Supporting Software Development Process Using Evolution Analysis : a Brief Survey

Supporting Software Development Process Using Evolution Analysis : a Brief Survey Supporting Software Development Process Using Evolution Analysis : a Brief Survey Samaneh Bayat Department of Computing Science, University of Alberta, Edmonton, Canada samaneh@ualberta.ca Abstract During

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of

More information

File Management. Chapter 12

File Management. Chapter 12 Chapter 12 File Management File is the basic element of most of the applications, since the input to an application, as well as its output, is usually a file. They also typically outlive the execution

More information

Overview of Storage and Indexing. Data on External Storage. Alternative File Organizations. Chapter 8

Overview of Storage and Indexing. Data on External Storage. Alternative File Organizations. Chapter 8 Overview of Storage and Indexing Chapter 8 How index-learning turns no student pale Yet holds the eel of science by the tail. -- Alexander Pope (1688-1744) Database Management Systems 3ed, R. Ramakrishnan

More information

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015 Clustering Big Data Efficient Data Mining Technologies J Singh and Teresa Brooks June 4, 2015 Hello Bulgaria (http://hello.bg/) A website with thousands of pages... Some pages identical to other pages

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India

More information

Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model

Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model Subramanyam. RBV, Sonam. Gupta Abstract An important problem that appears often when analyzing data involves

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps Yu Su, Yi Wang, Gagan Agrawal The Ohio State University Motivation HPC Trends Huge performance gap CPU: extremely fast for generating

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel Yahoo! Research New York, NY 10018 goel@yahoo-inc.com John Langford Yahoo! Research New York, NY 10018 jl@yahoo-inc.com Alex Strehl Yahoo! Research New York,

More information

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1 Slide 13-1 Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible

More information

AsicBoost A Speedup for Bitcoin Mining

AsicBoost A Speedup for Bitcoin Mining AsicBoost A Speedup for Bitcoin Mining Dr. Timo Hanke March 31, 2016 (rev. 5) Abstract. AsicBoost is a method to speed up Bitcoin mining by a factor of approximately 20%. The performance gain is achieved

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

Some Computer Organizations and Their Effectiveness. Michael J Flynn. IEEE Transactions on Computers. Vol. c-21, No.

Some Computer Organizations and Their Effectiveness. Michael J Flynn. IEEE Transactions on Computers. Vol. c-21, No. Some Computer Organizations and Their Effectiveness Michael J Flynn IEEE Transactions on Computers. Vol. c-21, No.9, September 1972 Introduction Attempts to codify a computer have been from three points

More information

Indexing the Trajectories of Moving Objects in Networks

Indexing the Trajectories of Moving Objects in Networks Indexing the Trajectories of Moving Objects in Networks Victor Teixeira de Almeida Ralf Hartmut Güting Praktische Informatik IV Fernuniversität Hagen, D-5884 Hagen, Germany {victor.almeida, rhg}@fernuni-hagen.de

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Oracle8i Spatial: Experiences with Extensible Databases

Oracle8i Spatial: Experiences with Extensible Databases Oracle8i Spatial: Experiences with Extensible Databases Siva Ravada and Jayant Sharma Spatial Products Division Oracle Corporation One Oracle Drive Nashua NH-03062 {sravada,jsharma}@us.oracle.com 1 Introduction

More information

PARALLEL REAL-TIME OLAP ON CLOUD PLATFORMS

PARALLEL REAL-TIME OLAP ON CLOUD PLATFORMS PARALLEL REAL-TIME OLAP ON CLOUD PLATFORMS by Xiaoyun Zhou A thesis proposal submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of MASTER

More information

High-performance XML Storage/Retrieval System

High-performance XML Storage/Retrieval System UDC 00.5:68.3 High-performance XML Storage/Retrieval System VYasuo Yamane VNobuyuki Igata VIsao Namba (Manuscript received August 8, 000) This paper describes a system that integrates full-text searching

More information

Lecture 1: Data Storage & Index

Lecture 1: Data Storage & Index Lecture 1: Data Storage & Index R&G Chapter 8-11 Concurrency control Query Execution and Optimization Relational Operators File & Access Methods Buffer Management Disk Space Management Recovery Manager

More information

Data Structure and Network Searching

Data Structure and Network Searching Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence A Unified Approximate Nearest Neighbor Search Scheme by Combining Data Structure and Hashing Debing Zhang Genmao

More information

The assignment of chunk size according to the target data characteristics in deduplication backup system

The assignment of chunk size according to the target data characteristics in deduplication backup system The assignment of chunk size according to the target data characteristics in deduplication backup system Mikito Ogata Norihisa Komoda Hitachi Information and Telecommunication Engineering, Ltd. 781 Sakai,

More information

Performance Tuning for the Teradata Database

Performance Tuning for the Teradata Database Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

New Hash Function Construction for Textual and Geometric Data Retrieval

New Hash Function Construction for Textual and Geometric Data Retrieval Latest Trends on Computers, Vol., pp.483-489, ISBN 978-96-474-3-4, ISSN 79-45, CSCC conference, Corfu, Greece, New Hash Function Construction for Textual and Geometric Data Retrieval Václav Skala, Jan

More information

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

Chapter 13 Disk Storage, Basic File Structures, and Hashing. Chapter 13 Disk Storage, Basic File Structures, and Hashing. Copyright 2004 Pearson Education, Inc. Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Chapter 13. Disk Storage, Basic File Structures, and Hashing

Chapter 13. Disk Storage, Basic File Structures, and Hashing Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible Hashing

More information

Distributed Apriori in Hadoop MapReduce Framework

Distributed Apriori in Hadoop MapReduce Framework Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse

Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse Bitmap Index as Effective Indexing for Low Cardinality Column in Data Warehouse Zainab Qays Abdulhadi* * Ministry of Higher Education & Scientific Research Baghdad, Iraq Zhang Zuping Hamed Ibrahim Housien**

More information

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

More information

Graph Database Proof of Concept Report

Graph Database Proof of Concept Report Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

More information