EDIC RESEARCH PROPOSAL

High-Dimensional Nearest Neighbors Techniques for Data Cleaning

Anca-Elena Alexandrescu, I&C, EPFL

Abstract: Organisations from all domains have been searching for increasingly more insights from the available information in order to add as much value as possible to their business decisions. As organisations grew over time, the demand for knowledge and the associated data grew exponentially, resulting in the Big Data era, which brings data sets so large and complex that they are impractical to manage with traditional software tools. Given the increase in the number of connected devices, dealing with the uncertainty of the data becomes a necessary evil, as sources often contain redundant data in different representations. In order to get accurate results, processing has to be performed on consistent data, which makes data cleaning gain more and more importance in the process. This paper focuses on techniques for duplicate information retrieval. We first illustrate the usage of similarity search based on nearest neighbor techniques and the limitations of such approaches as the dimensionality of the data increases. In order to expand the range of applicability to big data, we then consider approximate nearest neighbors techniques. Finally, current work is presented and future research directions are discussed.

Index Terms: Data cleaning, Big Data, k-nearest neighbor, R-tree, Locality Sensitive Hashing

Proposal submitted to committee: June 4th, 2014; Candidacy exam date: June 11th, 2014; Candidacy exam committee: Prof. Christoph Koch, Prof. Anastasia Ailamaki, Prof. Willy Zwaenepoel.

This research plan has been approved:
Date:
Doctoral candidate: (name and signature)
Thesis director: (name and signature)
Thesis co-director: (if applicable) (name and signature)
Doct. prog. director: (B. Falsafi) (signature)
I. INTRODUCTION

The concept of Big Data refers to systems conveying large amounts of heterogeneous data which is constantly changing and growing, and it is characterized by the three Vs: Volume, Variety and Velocity. The multitude of existing applications leads to a variety of data formats which need to be processed together to get the most out of the collected data. In the past, data files were required to abide by a predefined structure, and a certain amount of time was spent on making data files compliant. In the Big Data era, this is no longer possible, since great amounts of data are constantly collected and need to be consumed without delay. In order to derive accurate insights from the collected data, processing has to be done on consistent information. The diversity of sources generating data can often lead to incomplete or duplicate records, which, left unaddressed, will affect the quality of the results. Therefore, data cleaning becomes more and more important in the process of preparing the data, as it ensures the veracity of the data, which has recently been defined as the 4th V of Big Data.

A. Data Cleaning

Data cleaning is the process dealing with the detection and removal of errors and inconsistencies in the data, which results in improved data veracity. There are numerous sources of data accuracy problems, for example plain human error, missing information, corruption of data during transmission, etc. In the case of organizations working with Big Data, multiple data sources have to be integrated. Given that different sources may guide their data collection process by different sets of rules and representations of the information, the need for data cleaning increases significantly. Each data source can contain erroneous data as well as different data formats, since the collected information comes from multiple applications.
Furthermore, attributes split into categories are prone to misinterpretation, since it is possible that the sources use different representations of the data (for example, marital status values) or different meanings for the same value (for example, temperatures in degrees Fahrenheit vs. degrees Celsius). A very important problem in cleaning data from multiple sources is to identify overlapping records, which is often referred to as duplicates detection. The probability that two
records are overlapping is proportional to the similarity between them. Duplicates detection based on similarities involves the identification of close records, where closeness is evaluated based on a similarity function chosen to suit the specificity of the application, for example the Euclidean distance between two records.

B. Objectives

Data cleaning becomes more and more important, but given that the existing approaches were developed before the Big Data explosion, their performance is no longer considered sufficient. Stimulated by the growth of storage capacity, the increase in data volume is due not only to the fact that more and more devices generate data, but also to the fact that the number of attributes generated per record has increased in order to collect as much information as possible, even if in the end not all of it is necessary. The information is gathered in vectors whose dimension corresponds to the number of collected attributes, and as such the dimensionality of the data sets that need to be processed keeps on growing. This paper focuses on nearest neighbors techniques which have been successfully applied to data in order to detect duplicates. The performance of the discussed techniques is evaluated and their limitations are highlighted in order to detect improvement opportunities. To this end, we first describe two hierarchical data structures based on the R-tree [3], the state-of-the-art approach for nearest neighbors search. The limitations of the two approaches are explored and, based on the fact that approximate nearest neighbors approaches are more time-efficient without a great loss of result accuracy, we propose Locality Sensitive Hashing as an alternative to deal with duplicates detection in the context of Big Data. The remainder of this paper is structured as follows: section II introduces the nearest neighbor join based on the Multipage Index (MuX).
The X-tree approach is described and evaluated in section III, followed in section IV by the exploration of Locality Sensitive Hashing. Finally, section V briefly presents current work and the future research proposal.

II. NEAREST NEIGHBOR JOIN

A very important operation in data cleaning, the similarity join has received a lot of attention with regard to duplicates detection. The similarity join between two data sets outputs all pairs of similar records between the two sets of input data. There are two well-known types of similarity join:

1) the distance range join, which returns all pairs containing records from the two data sets where the distance between the objects does not exceed a value ɛ, received as input;

2) the k-distance join, which orders all pairs containing records from the two data sets by increasing distance between the objects and returns the first k pairs, based on the value of k received as input.

In addition to these two types of similarity join, a third kind is introduced, the k-nearest neighbor similarity join (knn join for short). In contrast to the two similarity joins presented above, which take into consideration all the pairs resulting from joining the two data sets, the knn join combines each record in one data set with its k nearest neighbors from the other data set.

A. The k-nearest neighbor join

The k-nearest neighbor join [1] uses a new index structure, the Multipage Index [2], which consists of large I/O pages supported by an additional R-tree structure to speed up the main memory operations. The index is a height-balanced tree containing directory pages and data pages and is depicted in figure 1. The secondary search structure is a modified R-tree consisting of a flat directory called the page directory and a constant number of leaves, the accommodated buckets. The page directory consists of an array of MBRs and pointers to the corresponding accommodated buckets.
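As a point of reference, the knn join defined above can be sketched as a naive nested loop over the two data sets. This is an illustrative baseline only, not the MuX-based algorithm of [1]; all names are our own and Euclidean distance is assumed as the similarity function.

```python
import math

def knn_join(R, S, k):
    # Naive knn join: pair each record in R with its k nearest
    # neighbors in S, using Euclidean distance.
    result = {}
    for i, r in enumerate(R):
        by_dist = sorted(S, key=lambda s: math.dist(r, s))
        result[i] = by_dist[:k]
    return result

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.1, 0.0), (4.9, 5.0), (9.0, 9.0)]
print(knn_join(R, S, k=1))  # {0: [(0.1, 0.0)], 1: [(4.9, 5.0)]}
```

This baseline computes every distance, which is exactly the cost that the Multipage Index is designed to avoid by pruning hosting pages and buckets.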
Both data and directory pages are assigned to a rectilinear region in main memory and to a block on disk. The pages are I/O optimized and are called hosting pages. If a hosting page is a data page, then the accommodated buckets are data buckets containing data records; if it is a directory page, then the accommodated buckets are directory buckets storing pairs of an MBR and a pointer to another hosting page.

Fig. 1: Multipage Index

In order to compute the similarity join, the hosting pages of both relations are processed in two nested loops, with each hosting page of the outer set being accessed exactly once. For each point in the current page of the outer set, an array is allocated to hold the candidates until the requested nearest neighbors have been confirmed. Given that the knn join is simultaneously searching for nearest neighbors for all the points inside a hosting page, it is very important to exclude as many hosting pages and accommodated buckets of the inner data set as possible. To this end, a page quality measure has been defined to take into account both the distance to the current buckets and the distance to the last candidate point as pruning distance. Based on this quality measure, a loading strategy is used to ensure that the next page to load is the one which brings the highest
gain. Moreover, a processing strategy is defined to find the best processing order for the accommodated buckets that are already loaded in main memory. Similar to the page quality mentioned above, a quality measure for bucket pairs has been defined to give more importance to buckets within smaller distance.

B. Performance evaluation

The k-nearest neighbor join algorithm has been evaluated on both synthetic and real data sets of varying sizes and dimensions, and has been compared to the nested block loop join and a conventional non-join technique, the knn algorithm by Hjaltason and Samet [11]. Results on real 9-dimensional data for varying sizes of the data set have shown a maximum speedup factor of the knn join over the nested block loop join of 17. Tests performed on 16-dimensional real data show that the knn join still obtains better results than the nested block loop join, but reaches a speedup of only 1.3 for the 80,000-point set. The performance obtained by the knn join on 16-dimensional real data is depicted in figure 2.

Fig. 2: Results for 16-dimensional real data

C. Knn Join Discussion

Data cleaning techniques often make use of k-nearest neighbor queries to gain knowledge on the processed data sets, most of the time meaning that a k-nearest neighbor query is run for each point of the data set. The authors aim to offer an alternative by replacing the large number of k-nearest neighbor queries with the proposed k-nearest neighbor join. The results show that the proposed algorithm using the Multipage Index obtains satisfactory results and is efficient for at least 9 dimensions. The fact that the speedup obtained for 16-dimensional data is a lot lower than the one for 9-dimensional data shows the limitation of the algorithm's performance with increasing data dimensionality. With the increase in data volume and dimensionality brought by Big Data, the existence of k-nearest neighbor algorithms which perform well on high-dimensional data has become a necessity. In order to achieve the required performance, algorithms must be designed to overcome the curse of dimensionality, which is due to the exponential increase of data volume associated with the addition of extra dimensions to a data set.

III. X-TREE APPROACH

Many of the existing applications are already collecting huge amounts of data consisting of millions of objects with tens to a few hundred dimensions. In order to be able to process large amounts of information, it is mandatory to use appropriate algorithms and indexing structures which provide efficient access to high-dimensional data. Current algorithms are based on different variants of the state-of-the-art approach in k-nearest neighbor search, the R-tree, which is known to suffer from overlap and dead space. Overlap (figure 3) in an R-tree structure represents the percentage of space covered by more than one hyperrectangle. Given that overlapping nodes in an R-tree force queries to follow multiple paths when computing an answer, overlap directly affects query performance. This behaviour gets worse as data dimensionality increases, since higher dimensionality leads to overlap between more hyperrectangles.

Fig. 3: Overlap in 2-dimensional data

The R*-tree [5] is a variant of the R-tree which tries to reduce overlap using a combination of a specialized split algorithm and forced reinsertion at node overflow. Results of R*-tree evaluation on real data showed that the performance degrades very rapidly with the growing dimensionality of data. Figure 4 illustrates these results.

Fig. 4: Performance of R-tree depending on data dimension (real data)
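The notion of overlap discussed above reduces to a per-axis intersection test between minimum bounding rectangles (MBRs). Below is a minimal sketch, with an MBR represented as a (min_corner, max_corner) pair; this representation and the function name are illustrative, not taken from [3] or [5].

```python
def mbrs_overlap(a, b):
    # Two MBRs overlap iff their extents intersect along every axis.
    # Each MBR is a (min_corner, max_corner) pair of equal-length tuples.
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(lo1 <= hi2 and lo2 <= hi1
               for lo1, hi1, lo2, hi2 in zip(a_lo, a_hi, b_lo, b_hi))

# A query region overlapping several directory MBRs forces the search
# down multiple paths of the tree, which is what degrades performance.
print(mbrs_overlap(((0, 0), (2, 2)), ((1, 1), (3, 3))))  # True
print(mbrs_overlap(((0, 0), (1, 1)), ((2, 2), (3, 3))))  # False
```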
In spite of the R*-tree's optimizations aimed to minimize overlap, a detailed investigation of important characteristics of the R*-tree has revealed that this behaviour is caused by the fact that overlap in the directory increases very rapidly with the growth of data dimensionality.

A. The X-tree Structure

Based on the insight regarding the effect of data overlap in the case of high-dimensional data, the X-tree [4] avoids overlap through the use of supernodes. The supernodes are directory nodes which are extended over the usual block size in order to avoid index degeneration. The structure of the X-tree is presented in figure 5.

Fig. 5: Structure of the X-tree

It is important to notice that the X-tree differs from an R-tree with a larger block size, since the X-tree extends its nodes only when needed to avoid inserting overlapping nodes into the tree. As expected, the structure of the X-tree changes during index updates, since they can lead to reorganization of internal tree nodes. In order to obtain the most suitable configuration for the X-tree, specialized insertion and split algorithms aimed to minimize overlap are applied. The insert algorithm determines the structure of the X-tree and aims to avoid directory splits which could lead to overlap between the nodes. In case an internal node of the X-tree has to be split during the insert procedure, the split algorithm takes into account properties of the MBRs, like dead-space partitioning, extension of the MBRs, etc., and tries to find a split which does not introduce overlap or, if that is not possible, introduces the minimal amount of overlap.

B. Performance evaluation

The performance of the X-tree has been evaluated in comparison with the R*-tree on both synthetic and real data. The results have shown that the X-tree outperforms the R*-tree by up to orders of magnitude for both point and nearest neighbors queries on both types of data. Figure 6 shows that the X-tree obtains a maximum speedup of 20 for 16-dimensional real data on nearest neighbors queries with k = 10.

Fig. 6: Speedup of X-tree over R*-tree on real point data for k = 10

C. Result interpretation

The X-tree has been created based on the insight that data overlap is the reason for the low performance of R-tree-based approaches. The structure of the X-tree together with specially designed insertion and split algorithms contribute to achieving an increased performance level over the R*-tree for data with as many as 16 dimensions. However, since the supernodes become quite large with the increase in the number of data dimensions, the time needed to linearly scan their contents will also keep increasing, causing the X-tree to quickly reach its limits.

IV. LOCALITY SENSITIVE HASHING

Despite decades of research efforts, current solutions for finding the k-nearest neighbors in high-dimensional data do not provide the necessary performance. Although both approaches presented before achieve better performance than the state-of-the-art R-tree, they still have a dimension threshold for their performance. The authors of [12] observe that there are many applications of nearest neighbors search where the exact answer is not really needed and an approximate answer is good enough. Based on this insight, and on the assumption that approximate similarity search can be performed much faster than exact search, a new technique relying on Locality Sensitive Hashing [6] is proposed.

Fig. 7: Locality Sensitive Hashing

Locality Sensitive Hashing (figure 7) provides an efficient
approximate nearest neighbor search algorithm. In the first step, Locality Sensitive Hashing defines L (received as input) randomly chosen hash functions, which are then applied, one at a time, to all points in the data set, mapping them to buckets in all the hash tables. In order to answer a query, the query point is also hashed with the L functions. Then, for each hash table, the data points falling in the same bucket as the query point are retrieved, and the answer to the query is among them.

A. Locality Sensitive Hashing-based algorithm

This new approach allows the fast retrieval of an approximate answer, which will probably be good enough for most cases, followed by a slower but accurate computation for the few cases which require an exact answer. The idea behind the algorithm is that the probability of two points p and q being hashed to the same bucket is closely related to the distance between the points. The algorithm uses two levels of hashing: a Locality Sensitive Hashing function, which maps a point p ∈ P to a bucket, as briefly described above; and a standard hash function, which maps the contents of the buckets into a hash table. In order to answer a query, the buckets are processed until either a sufficient number of points or all the buckets have been searched. For approximate k-nearest neighbor search, the k closest points to the query are returned. If fewer than k points have been encountered, then the output will also contain fewer than k results.

B. Performance evaluation

The performance of the Locality Sensitive Hashing-based algorithm has been evaluated and compared to the performance of the SR-tree [8] on real data sets using two performance measures: speed, for both approaches, and accuracy, for the Locality Sensitive Hashing-based algorithm.
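The bucketing scheme described above can be sketched as follows. The hash family used here (random hyperplane signatures) is one common choice picked purely for illustration, not necessarily the family used in [6] and [12], and all function names are our own.

```python
import random

def make_lsh(dim, n_bits, rng):
    # One LSH function: a point's signature is the pattern of signs of
    # its projections onto n_bits random vectors, so nearby points tend
    # to receive the same signature.
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def h(p):
        return tuple(int(sum(x * w for x, w in zip(p, plane)) >= 0)
                     for plane in planes)
    return h

def build_index(points, dim, L=5, n_bits=6, seed=0):
    # L independent hash tables; each point goes into one bucket per
    # table (Python's dict plays the role of the second-level hash).
    rng = random.Random(seed)
    tables = []
    for _ in range(L):
        h = make_lsh(dim, n_bits, rng)
        buckets = {}
        for i, p in enumerate(points):
            buckets.setdefault(h(p), []).append(i)
        tables.append((h, buckets))
    return tables

def candidates(tables, q):
    # Union of the query's buckets over all tables; the approximate
    # nearest neighbors are searched only among these candidates.
    out = set()
    for h, buckets in tables:
        out.update(buckets.get(h(q), []))
    return out

points = [(1.0, 0.0), (0.99, 0.05), (-1.0, 0.0)]
tables = build_index(points, dim=2)
print(candidates(tables, (1.0, 0.0)))  # point 0 is always among the candidates
```

Since a point identical to the query always hashes to the same buckets, it is guaranteed to be retrieved; for merely nearby points the retrieval is probabilistic, which is exactly the approximation traded for speed.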
The SR-tree is an extension of the R*-tree which combines the use of bounding spheres and bounding rectangles, aiming to improve the performance of nearest neighbors queries by reducing the region's volume and diameter. Given that the SR-tree outputs the exact nearest neighbors while Locality Sensitive Hashing returns a list of approximate nearest neighbors, the algorithm for the SR-tree has been modified to run only on a random sample of the data. Therefore, the modified SR-tree also outputs approximate k nearest neighbors, and it achieves a speedup due to the fact that the data set used for execution is smaller. As figure 8 shows, the improvement obtained by the Locality Sensitive Hashing algorithm over the modified SR-tree is up to an order of magnitude on a data set containing 270,000 points. The Locality Sensitive Hashing-based algorithm scales very well with the increase in data dimensionality, as the number of disk accesses for ɛ = 0.1 grows only by a factor of 2 for a dimensionality increase from 8 to 64. As expected, the SR-tree's performance degrades rapidly with the increase of data dimensionality. The results are illustrated in figure 9.

Fig. 8: Performance vs. error
Fig. 9: Approximate 10-NNS, α = 2

C. Discussion

A new indexing method based on Locality Sensitive Hashing has been proposed as an alternative to hierarchical data structures. Although the Locality Sensitive Hashing algorithm returns approximate nearest neighbors, the extensive evaluation has shown that allowing a small accuracy loss results in considerably improved execution time. On the other hand, a major drawback of Locality Sensitive Hashing is the fact that randomly choosing the hash functions can lead to a poor space partitioning.
Therefore, choosing more appropriate hash functions could be taken into consideration and, given that the Locality Sensitive Hashing-based approach already scales very well for high-dimensional data, it could be a good candidate for the next generation of k-nearest neighbors data cleaning techniques.

V. FUTURE WORK

An important factor in being able to achieve big data success is having the appropriate hardware and software resources for processing the accumulated data. With the recent improvements in hardware equipment, which allow a constant increase in data volume, investing time and money into data curation is no longer optional; dealing with the uncertainty of the data is now a necessary evil that has to be dealt with. Although quite a large number of tools of varying functionality support the data cleaning process, a significant portion of the cleaning work has to be done either manually or by low-level programs that are difficult to write and maintain [7]. This kind
of approach becomes increasingly difficult with the growth of the number of dimensions in the data set. The focus of current research has been on efficient methods for in-memory processing of k-nearest neighbor queries, and a new indexing approach which avoids the issues of hierarchical structures has been investigated. In the context of the NoDB/ViDa project, which enables efficient queries on raw, heterogeneous data without pre-formatting or loading it into a database, future research will focus on designing algorithms aimed at improving the quality of raw data, so that the results obtained by the new processing techniques will not lose accuracy by processing low-quality data.

A. Current Research

When processing data stored on disk, the majority of the time is spent transferring data back and forth to the CPU. As such, the research and improvements brought to k-nearest neighbors approaches have focused on reducing the time spent on data transfer. Given that disk I/O is considerably slower than memory access, the time spent actually computing distances between the query point and the objects in the index was hidden by the transfer time. While in-memory approaches gain popularity due to the increase of available memory in computer systems, the optimizations applied to these indexing methods prove to be less efficient, and the need for new ideas emerges. The SNAIL algorithm is an in-memory k-nearest neighbor technique which avoids hierarchical data structures by using a grid-type structure which reduces the time spent traversing the index. The main idea of the algorithm is to find as many groups of points as possible that can be added to the result set without individually computing the distance from each point to the query.
The groups we are using are the grid cells, and the intuition is that, if for a cell C the total number of points in its neighboring cells is lower than the number of searched neighbors, we can conclude that all the points in cell C can be added directly to the result set. As expected, at some point the neighboring cells will contain more points than the number of searched neighbors, and the computation of individual distances will become necessary. SNAIL has been evaluated on 3-dimensional synthetic data sets with sizes up to 5 GB, and the preliminary results have shown that greater resolution values result in greater computation time, as the number of objects that need to be individually checked grows. The best execution times have been obtained with a grid resolution of 50 cells per dimension. Current results show that the SNAIL building process is almost 20 times more efficient than the building of the R-tree. Given that SNAIL computes the individual distance for a large number of objects in the data set, for a high enough value of k, SNAIL's execution time outperforms the R-tree approach.

B. Future Directions

Due to the big data effect, the Curse of Dimensionality has been studied for several problems, such as clustering, indexing and nearest neighbors search, and it seems that the fact that in high-dimensional space the data tends to become sparse is only part of the problem. Recent results show that in high-dimensional space the concept of proximity or nearest neighbor may not be quantitatively meaningful, and the use of fractional distance metrics (the L_p norm where p is allowed to be a fraction smaller than 1) is shown to be more accurate [10]. More precisely, the distance computation will be made using the following fractional distance metric:

dist_d^f(x, y) = ( Σ_{i=1}^{d} |x_i − y_i|^f )^{1/f}

Given that SNAIL shows great potential for efficiently processing large data sets, the possibility of adapting it for high-dimensional data will be explored.
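The fractional distance metric above can be computed directly; a minimal sketch follows (the function name is illustrative). Note that f = 1 recovers the Manhattan distance and f = 2 the Euclidean distance, while [10] argues for 0 &lt; f &lt; 1 in high dimensions.

```python
def fractional_dist(x, y, f=0.5):
    # L_f distance with a fractional exponent f, as suggested in [10]:
    # dist(x, y) = ( sum_i |x_i - y_i|^f )^(1/f)
    return sum(abs(a - b) ** f for a, b in zip(x, y)) ** (1.0 / f)

print(fractional_dist((0, 0), (1, 1), f=0.5))  # (1 + 1)^2 = 4.0
print(fractional_dist((0, 0), (1, 1), f=1.0))  # Manhattan distance: 2.0
```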
One of the challenges that comes up is the choice of the distance metric, and the use of fractional distances will be taken into consideration in order to decide if they are appropriate for duplicates detection. Also, since SNAIL has been designed as an in-memory algorithm and the data volume is constantly growing, another challenge will be to adapt it for on-disk data processing. Data cleaning is a vast research area, and while we proposed an algorithm that helps detect duplicates, other quality problems also exist in data files. Given that the aim is to minimize the time spent processing data in order to correct it, another challenge is to explore the possibilities of designing algorithms that can address more than one of these issues at a time.

REFERENCES

[1] C. Böhm, F. Krebs, "High Performance Data Mining Using the Nearest Neighbor Join", IEEE International Conference on Data Mining (ICDM), 2002
[2] C. Böhm, H.-P. Kriegel, "A Cost Model and Index Architecture for the Similarity Join", IEEE International Conference on Data Engineering (ICDE), 2001
[3] A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching", ACM SIGMOD International Conference on Management of Data, 1984
[4] S. Berchtold, D. A. Keim, H.-P. Kriegel, "The X-tree: An Index Structure for High-Dimensional Data", International Conference on Very Large Data Bases (VLDB), 1996
[5] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", ACM SIGMOD International Conference on Management of Data, 1990
[6] P. Indyk, R. Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality", ACM Symposium on Theory of Computing (STOC), 1998
[7] E. Rahm, H. Hai Do, "Data Cleaning: Problems and Current Approaches", IEEE Data Engineering Bulletin, 2000
[8] N. Katayama, S. Satoh, "The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries", ACM SIGMOD International Conference on Management of Data, 1997
[9] S. Chaudhuri, V. Ganti, R. Kaushik, "A Primitive Operator for Similarity Joins in Data Cleaning", IEEE International Conference on Data Engineering (ICDE), 2006
[10] C. C. Aggarwal, A. Hinneburg, D. A. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space", International Conference on Database Theory (ICDT), 2001
[11] G. R. Hjaltason, H. Samet, "Ranking in Spatial Databases", International Symposium on Large Spatial Databases (SSD), 1995
[12] A. Gionis, P. Indyk, R. Motwani, "Similarity Search in High Dimensions via Hashing", International Conference on Very Large Data Bases (VLDB), 1999
EM Clustering Approach for MultiDimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
More informationQuickDB Yet YetAnother Database Management System?
QuickDB Yet YetAnother Database Management System? Radim Bača, Peter Chovanec, Michal Krátký, and Petr Lukáš Radim Bača, Peter Chovanec, Michal Krátký, and Petr Lukáš Department of Computer Science, FEECS,
More informationSPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.4044. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
More information2) What is the structure of an organization? Explain how IT support at different organizational levels.
(PGDIT 01) Paper  I : BASICS OF INFORMATION TECHNOLOGY 1) What is an information technology? Why you need to know about IT. 2) What is the structure of an organization? Explain how IT support at different
More informationDATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDDLAB ISTI CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
More informationIMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria
More informationCommon Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein Falaki @mhfalaki hossein@databricks.com
Common Patterns and Pitfalls for Implementing Algorithms in Spark Hossein Falaki @mhfalaki hossein@databricks.com Challenges of numerical computation over big data When applying any algorithm to big data
More informationVoronoi Treemaps in D3
Voronoi Treemaps in D3 Peter Henry University of Washington phenry@gmail.com Paul Vines University of Washington paul.l.vines@gmail.com ABSTRACT Voronoi treemaps are an alternative to traditional rectangular
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationClustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance Knearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
More informationPhysical Data Organization
Physical Data Organization Database design using logical model of the database  appropriate level for users to focus on  user independence from implementation details Performance  other major factor
More informationKEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS
ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,
More informationIndexing Spatiotemporal Archives
The VLDB Journal manuscript No. (will be inserted by the editor) Indexing Spatiotemporal Archives Marios Hadjieleftheriou 1, George Kollios 2, Vassilis J. Tsotras 1, Dimitrios Gunopulos 1 1 Computer Science
More informationKnowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19  Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19  Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.standrews.ac.uk twk@standrews.ac.uk Tom Kelsey ID505919B &
More informationStorage Management for Files of Dynamic Records
Storage Management for Files of Dynamic Records Justin Zobel Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia. jz@cs.rmit.edu.au Alistair Moffat Department of Computer Science
More informationENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #2610771
ENHANCEMENTS TO SQL SERVER COLUMN STORES Anuhya Mallempati #2610771 CONTENTS Abstract Introduction Column store indexes Batch mode processing Other Enhancements Conclusion ABSTRACT SQL server introduced
More informationWhitepaper. Innovations in Business Intelligence Database Technology. www.sisense.com
Whitepaper Innovations in Business Intelligence Database Technology The State of Database Technology in 2015 Database technology has seen rapid developments in the past two decades. Online Analytical Processing
More informationHELP DESK SYSTEMS. Using CaseBased Reasoning
HELP DESK SYSTEMS Using CaseBased Reasoning Topics Covered Today What is HelpDesk? Components of HelpDesk Systems Types Of HelpDesk Systems Used Need for CBR in HelpDesk Systems GE Helpdesk using ReMind
More informationPerformance evaluation of Web Information Retrieval Systems and its application to ebusiness
Performance evaluation of Web Information Retrieval Systems and its application to ebusiness Fidel Cacheda, Angel Viña Departament of Information and Comunications Technologies Facultad de Informática,
More informationIns+tuto Superior Técnico Technical University of Lisbon. Big Data. Bruno Lopes Catarina Moreira João Pinho
Ins+tuto Superior Técnico Technical University of Lisbon Big Data Bruno Lopes Catarina Moreira João Pinho Mo#va#on 2 220 PetaBytes Of data that people create every day! 2 Mo#va#on 90 % of Data UNSTRUCTURED
More informationAPPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM
152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented
More information2 Associating Facts with Time
TEMPORAL DATABASES Richard Thomas Snodgrass A temporal database (see Temporal Database) contains timevarying data. Time is an important aspect of all realworld phenomena. Events occur at specific points
More informationInfiniteGraph: The Distributed Graph Database
A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086
More informationProtein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks YoungRae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationA Novel Density based improved kmeans Clustering Algorithm Dbkmeans
A Novel Density based improved kmeans Clustering Algorithm Dbkmeans K. Mumtaz 1 and Dr. K. Duraiswamy 2, 1 Vivekanandha Institute of Information and Management Studies, Tiruchengode, India 2 KS Rangasamy
More informationInvestigating the Effects of Spatial Data Redundancy in Query Performance over Geographical Data Warehouses
Investigating the Effects of Spatial Data Redundancy in Query Performance over Geographical Data Warehouses Thiago Luís Lopes Siqueira Ricardo Rodrigues Ciferri Valéria Cesário Times Cristina Dutra de
More informationEMC DATA DOMAIN SISL SCALING ARCHITECTURE
EMC DATA DOMAIN SISL SCALING ARCHITECTURE A Detailed Review ABSTRACT While tape has been the dominant storage medium for data protection for decades because of its low cost, it is steadily losing ground
More informationBM + tree: A Hyperplanebased Index Method for HighDimensional Metric Spaces
BM + tree: A Hyperplanebased Index Method for HighDimensional Metric Spaces Xiangmin Zhou 1, Guoren Wang 1, Xiaofang Zhou and Ge Yu 1 1 College of Information Science and Engineering, Northeastern University,
More informationCopyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 131
Slide 131 Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible
More informationInSitu Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University
InSitu Bitmaps Generation and Efficient Data Analysis based on Bitmaps Yu Su, Yi Wang, Gagan Agrawal The Ohio State University Motivation HPC Trends Huge performance gap CPU: extremely fast for generating
More informationFile Management. Chapter 12
Chapter 12 File Management File is the basic element of most of the applications, since the input to an application, as well as its output, is usually a file. They also typically outlive the execution
More informationGoing Big in Data Dimensionality:
LUDWIG MAXIMILIANS UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für
More informationOverview of Storage and Indexing
Overview of Storage and Indexing Chapter 8 How indexlearning turns no student pale Yet holds the eel of science by the tail.  Alexander Pope (16881744) Database Management Systems 3ed, R. Ramakrishnan
More informationDISTRIBUTED INDEX FOR MATCHING MULTIMEDIA OBJECTS
DISTRIBUTED INDEX FOR MATCHING MULTIMEDIA OBJECTS by Ahmed Abdelsadek B.Sc., Cairo University, 2010 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in
More informationBenchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flashbased Storage Arrays Version 1.0 Abstract
More informationManual for BEAR Big Data Ensemble of Adaptations for Regression Version 1.0
Manual for BEAR Big Data Ensemble of Adaptations for Regression Version 1.0 Vahid Jalali David Leake August 9, 2015 Abstract BEAR is a casebased regression learner tailored for big data processing. It
More informationCUBE INDEXING IMPLEMENTATION USING INTEGRATION OF SIDERA AND BERKELEY DB
CUBE INDEXING IMPLEMENTATION USING INTEGRATION OF SIDERA AND BERKELEY DB Badal K. Kothari 1, Prof. Ashok R. Patel 2 1 Research Scholar, Mewar University, Chittorgadh, Rajasthan, India 2 Department of Computer
More informationEFFICIENT DATA ANALYSIS SCHEME FOR INCREASING PERFORMANCE IN BIG DATA
EFFICIENT DATA ANALYSIS SCHEME FOR INCREASING PERFORMANCE IN BIG DATA Mr. V. Vivekanandan Computer Science and Engineering, SriGuru Institute of Technology, Coimbatore, Tamilnadu, India. Abstract Big data
More informationAn Analysis on Density Based Clustering of Multi Dimensional Spatial Data
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,
More informationWhen Is Nearest Neighbor Meaningful?
When Is Nearest Neighbor Meaningful? Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft CS Dept., University of WisconsinMadison 1210 W. Dayton St., Madison, WI 53706 {beyer, jgoldst,
More informationLecture 1: Data Storage & Index
Lecture 1: Data Storage & Index R&G Chapter 811 Concurrency control Query Execution and Optimization Relational Operators File & Access Methods Buffer Management Disk Space Management Recovery Manager
More informationChapter 13 Disk Storage, Basic File Structures, and Hashing.
Chapter 13 Disk Storage, Basic File Structures, and Hashing. Copyright 2004 Pearson Education, Inc. Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT Inmemory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationJunghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 3731 Kuseongdong, Yuseonggu Daejoen, Korea
Proceedings of the 211 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. A BINARY PARTITIONBASED MATCHING ALGORITHM FOR DATA DISTRIBUTION MANAGEMENT Junghyun
More informationA Unified Approximate Nearest Neighbor Search Scheme by Combining Data Structure and Hashing
Proceedings of the TwentyThird International Joint Conference on Artificial Intelligence A Unified Approximate Nearest Neighbor Search Scheme by Combining Data Structure and Hashing Debing Zhang Genmao
More informationChapter 13. Disk Storage, Basic File Structures, and Hashing
Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible Hashing
More informationEFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES
ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com
More informationPerformance Tuning for the Teradata Database
Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting  i  Document Changes Rev. Date Section Comment 1.0 20101026 All Initial document
More informationAg + tree: an Index Structure for Rangeaggregation Queries in Data Warehouse Environments
Ag + tree: an Index Structure for Rangeaggregation Queries in Data Warehouse Environments Yaokai Feng a, Akifumi Makinouchi b a Faculty of Information Science and Electrical Engineering, Kyushu University,
More informationEchidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis
Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of
More informationSupporting Software Development Process Using Evolution Analysis : a Brief Survey
Supporting Software Development Process Using Evolution Analysis : a Brief Survey Samaneh Bayat Department of Computing Science, University of Alberta, Edmonton, Canada samaneh@ualberta.ca Abstract During
More informationFig. 1 A typical Knowledge Discovery process [2]
Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering
More information8. Query Processing. Query Processing & Optimization
ECS165A WQ 11 136 8. Query Processing Goals: Understand the basic concepts underlying the steps in query processing and optimization and estimating query processing cost; apply query optimization techniques;
More informationAsicBoost A Speedup for Bitcoin Mining
AsicBoost A Speedup for Bitcoin Mining Dr. Timo Hanke March 31, 2016 (rev. 5) Abstract. AsicBoost is a method to speed up Bitcoin mining by a factor of approximately 20%. The performance gain is achieved
More informationOverview of Storage and Indexing. Data on External Storage. Alternative File Organizations. Chapter 8
Overview of Storage and Indexing Chapter 8 How indexlearning turns no student pale Yet holds the eel of science by the tail.  Alexander Pope (16881744) Database Management Systems 3ed, R. Ramakrishnan
More informationThe primary goal of this thesis was to understand how the spatial dependence of
5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial
More informationCHAPTER24 Mining Spatial Databases
CHAPTER24 Mining Spatial Databases 24.1 Introduction 24.2 Spatial Data Cube Construction and Spatial OLAP 24.3 Spatial Association Analysis 24.4 Spatial Clustering Methods 24.5 Spatial Classification
More informationA Dynamic Load Balancing Strategy for Parallel Datacube Computation
A Dynamic Load Balancing Strategy for Parallel Datacube Computation Seigo Muto Institute of Industrial Science, University of Tokyo 7221 Roppongi, Minatoku, Tokyo, 1068558 Japan +81334026231 ext.
More informationSecure Similarity Search on Outsourced Metric Data
International Journal of Computer Trends and Technology (IJCTT) volume 6 number 5 Dec 213 Secure Similarity Search on Outsourced Metric Data P.Maruthi Rao 1, M.Gayatri 2 1 (M.Tech Scholar,Department of
More informationFPGA area allocation for parallel C applications
1 FPGA area allocation for parallel C applications VladMihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University
More informationPredictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel Yahoo! Research New York, NY 10018 goel@yahooinc.com John Langford Yahoo! Research New York, NY 10018 jl@yahooinc.com Alex Strehl Yahoo! Research New York,
More informationA Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed
More informationEnergy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototypebased clustering Densitybased clustering Graphbased
More informationMapReduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model
MapReduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model Subramanyam. RBV, Sonam. Gupta Abstract An important problem that appears often when analyzing data involves
More informationClustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015
Clustering Big Data Efficient Data Mining Technologies J Singh and Teresa Brooks June 4, 2015 Hello Bulgaria (http://hello.bg/) A website with thousands of pages... Some pages identical to other pages
More informationSome Computer Organizations and Their Effectiveness. Michael J Flynn. IEEE Transactions on Computers. Vol. c21, No.
Some Computer Organizations and Their Effectiveness Michael J Flynn IEEE Transactions on Computers. Vol. c21, No.9, September 1972 Introduction Attempts to codify a computer have been from three points
More informationIndexing the Trajectories of Moving Objects in Networks
Indexing the Trajectories of Moving Objects in Networks Victor Teixeira de Almeida Ralf Hartmut Güting Praktische Informatik IV Fernuniversität Hagen, D5884 Hagen, Germany {victor.almeida, rhg}@fernunihagen.de
More informationStorage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann
Storage Systems Autumn 2009 Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Scaling RAID architectures Using traditional RAID architecture does not scale Adding news disk implies
More informationClustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
More informationLoad Balancing in Distributed Data Base and Distributed Computing System
Load Balancing in Distributed Data Base and Distributed Computing System Lovely Arya Research Scholar Dravidian University KUPPAM, ANDHRA PRADESH Abstract With a distributed system, data can be located
More informationFuzzy Duplicate Detection on XML Data
Fuzzy Duplicate Detection on XML Data Melanie Weis HumboldtUniversität zu Berlin Unter den Linden 6, Berlin, Germany mweis@informatik.huberlin.de Abstract XML is popular for data exchange and data publishing
More informationHighperformance XML Storage/Retrieval System
UDC 00.5:68.3 Highperformance XML Storage/Retrieval System VYasuo Yamane VNobuyuki Igata VIsao Namba (Manuscript received August 8, 000) This paper describes a system that integrates fulltext searching
More informationData Warehouse: Introduction
Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of base and data mining group,
More informationNew Hash Function Construction for Textual and Geometric Data Retrieval
Latest Trends on Computers, Vol., pp.483489, ISBN 9789647434, ISSN 7945, CSCC conference, Corfu, Greece, New Hash Function Construction for Textual and Geometric Data Retrieval Václav Skala, Jan
More informationBUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE
BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE Alex Lin Senior Architect Intelligent Mining alin@intelligentmining.com Outline Predictive modeling methodology knearest Neighbor
More information