MineBench: A Benchmark Suite for Data Mining Workloads
|
|
|
- Bryan Eustace Kennedy
- 9 years ago
- Views:
Transcription
1 MineBench: A Benchmark Suite for Data Mining Workloads Ramanathan Narayanan Berkin Özıṣıkyılmaz Joseph Zambreno Gokhan Memik Alok Choudhary Electrical Engineering and Computer Science Electrical and Computer Engineering Northwestern University Iowa State University Evanston, IL 60208, USA Ames, IA 50011, USA {ran310, boz283, memik, choudhar}@eecs.northwestern.edu [email protected] Jayaprakash Pisharath Architecture Performance and Projections Group Intel Corporation Santa Clara, CA [email protected] Abstract Data mining constitutes an important class of scientific and commercial applications. Recent advances in data extraction techniques have created vast data sets, which require increasingly complex data mining algorithms to sift through them to generate meaningful information. The disproportionately slower rate of growth of computer systems has led to a sizeable performance gap between data mining systems and algorithms. The first step in closing this gap is to analyze these algorithms and understand their bottlenecks. With this knowledge, current computer architectures can be optimized for data mining applications. In this paper, we present MineBench, a publicly available benchmark suite containing fifteen representative data mining applications belonging to various categories such as clustering, classification, and association rule mining. We believe that MineBench will be of use to those looking to characterize and accelerate data mining workloads. I. INTRODUCTION Data mining, a technique to understand and convert raw data into useful information, is increasingly being used in a variety of fields like marketing, business intelligence, scientific discoveries, biotechnology, internet searches, and multimedia. In data mining, the process of extracting information from raw data is automated, and mostly predictive. This is the aspect that makes data mining different from common database techniques. Data mining is essential to automate the finding of useful information from such large input data. For instance, a grocery chain would apply association rule mining to find and analyze local buying patterns. Recent advances in data extraction techniques have resulted in massive data sets for data mining applications. A survey done by University of California, Berkeley indicates that an average person collects 800MB of data a year [1]. As a result, increasingly complex data mining algorithms are being used to extract information from these vast data sets. However, limitations in overall system performance will ultimately result in prohibitive execution times for these crucial applications. Hence there is a need to redesign modern architectures to optimize their performance on data mining workloads. This process involves two steps: First, since data mining is a relatively new application area, there is a need to understand the performance characteristics of various representative applications. Second, high performance computer architectures optimized for data mining systems need to be developed and their performance on commonly used workloads needs to be evaluated. For these purposes, there is a need for a benchmark that encompasses various representative data mining algorithms. In this paper, we present MineBench, which contains fifteen representative data mining workloads from various categories. The workloads chosen represent the heterogeneity of algorithms and methods used in data mining. Applications from clustering, classification, association rule mining and optimization categories are included in MineBench. The codes are full fledged implementations of entire data mining applications, as opposed to stand-alone algorithmic kernels. We provide OpenMP parallelized codes for twelve of the fifteen applications. An important aspect of data mining applications are the data sets used. For most of the applications, we provide three categories of datasets with varying sizes: small, medium, and large. In addition, we provide source code, information for compiling the applications using various compilers, and command line arguments for all of the applications. The remainder of this paper is organized as follows. In the following section, we provide a brief overview of the related work in this area. In Section 3, we discuss the data mining applications included in our benchmark suite. Section 4 details the various input data sets available. In Section 5, we provide command line arguments for the applications, discuss compiler compatibility, and simulation methods available. Finally the
2 TABLE I OVERVIEW OF THE MINEBENCH DATA MINING BENCHMARK SUITE Application Category Description ScalParC Classification Decision tree classification Naive Bayesian Classification Simple statistical classifier K-means Clustering Mean-based data partitioning method Fuzzy K-means Clustering Fuzzy logic-based data partitioning method HOP Clustering Density-based grouping method BIRCH Clustering Hierarchical Clustering method Eclat ARM Vertical database, Lattice transversal techniques used Apriori ARM Horizontal database, level-wise mining based on Apriori property Utility ARM Utility-based association rule mining SNP Classification Hill-climbing search method for DNA dependency extraction GeneNet Structure Learning Gene relationship extraction using microarray-based method SEMPHY Structure Learning Gene sequencing using phylogenetic tree-based method Rsearch Classification RNA sequence search using stochastic Context-Free Grammars SVM-RFE Classification Gene expression classifier using recursive feature elimination PLSA Optimization DNA sequence alignment using Smith-Waterman optimization method paper is concluded in Section 6 with a summary. II. RELATED WORK Benchmarks play a major role in all domains. SPEC [2] benchmarks have been well accepted and used by several processor manufacturers and researchers to measure the effectiveness of their design. Other fields have popular benchmarking suites designed for the specific application domain: TPC [3] for database systems, SPLASH [4] for parallel architectures, and MediaBench [5] for media and communication processors. Several bioinformatics benchmark suites have been developed recently. BioInfoMark [6], BioBench [7], and BioPerf [8] all contain several applications in common, including Blast, FASTA, Clustalw, and Hmmer. The bioinformatics applications presented in MineBench differ from those included in these benchmark suites. BioInfoMark and BioBench contain only serial workloads. In Bioperf, a few applications have been parallelized, unlike in MineBench which contains full fledged OpenMP parallelized codes of all bioinformatics workloads. Compared to the above benchmarks, MineBench covers a wider spectrum of data mining algorithms, in terms of quantity and diversity. III. BENCHMARK SUITE OVERVIEW Data mining applications can be broadly classified into association rule mining, classification, clustering, data visualization, sequence mining, similarity search, optimization, and text mining, among others. Each domain contains unique algorithmic features. In establishing MineBench, we based our selection of categories on how commonly these applications are used in industry, and how likely they are to be used in the future. The fifteen applications that currently comprise MineBench are listed in Table I, and are described in more detail in the following sections. A. Classification Workloads A classification problem has an input dataset called the training set, which consists of example records with a number of attributes. The objective of a classification algorithm is to use this training dataset to build a model such that the model can be used to assign unclassified records into one of the defined classes [9]. ScalParC is an efficient and scalable variation of decision tree classification [10]. The decision tree model is built by recursively splitting the training dataset based on an optimality criterion until all records belonging to each of the partitions bear the same class label. Decision tree based models are relatively inexpensive to construct, easy to interpret and easy to integrate with commercial database systems. ScalParC uses a parallel hashing paradigm to achieve scalability during the splitting phase. This approach makes it scalable in both runtime and memory requirements. The Naive Bayesian classifier [11], a simple statistical classifier, uses an input training dataset to build a predictive model (containing classes of records) such that the model can be used to assign unclassified records into one of the defined classes. Naive Bayes classifiers are based on probability models that incorporate strong independence assumptions which often have no bearing in reality, hence the term naive. They exhibit high accuracy and speed when applied to large databases. Single nucleotide polymorphisms (SNPs), are DNA sequence variations that occur when a single nucleotide is altered in a genome sequence. Understanding the importance of the many recently identified SNPs in human genes has become a primary goal of human genetics. The SNP [12] benchmark uses the hill climbing search method, which selects an initial starting point (an initial Bayesian Network structure) and searches that point s nearest neighbors. The neighbor that has the highest score is then made the new current point. This procedure iterates until it reaches a local maximum score. Recent advances in DNA microarray technologies have made it possible to measure expression patterns of all the genes in an organism, thereby necessitating algorithms that are able to handle thousands of variables simultaneously. By represent-
3 ing each gene as a variable of a Bayesian Network (BN), the gene expression data analysis problem can be formulated as a BN structure learning problem. GeneNet [12] uses a similar hill climbing algorithm as in SNP, the main difference being that the input data is more complex, requiring much additional computation during the learning process. Moreover, unlike the SNP application, the number of variables runs into thousands, but only hundreds of training cases are available. GeneNet has been parallelized using a node level parallelization paradigm, where in each step, the nodes of the BN are distributed to different processors. SEMPHY [12] is a structure learning algorithm that is based on phylogenetic trees. Phylogenetic trees represent the genetic relationship of a species, with closely related species placed in nearby branches. Phylogenetic tree inference is a high performance computing problem as biological data size increases exponentially. SEMPHY uses the structural expectation maximization algorithm, to efficiently search for maximum likelihood phylogenetic trees. The computation in SEMPHY scales quadratically with the input data size, necessitating parallelization. The computation intensive kernels in the algorithm are identified and parallelized using OpenMP. Typically, RNA sequencing problems involve searching the gene database for homologous RNA sequences. Rsearch [12] uses a grammar-based approach to achieve this goal. Stochastic context-free grammars are used to build and represent a single RNA sequence, and a local alignment algorithm is used to search the database for homologous RNAs. Rsearch is parallelized using a dynamic load-balancing mechanism based on partitioning the variable length database sequence to fixed length chunks, with specific domain knowledge. SVM-RFE [12], or Support Vector Machines - Recursive Feature Elimination, is a feature selection method. SVM- RFE is used extensively in disease finding (gene expression problem). The selection is obtained by a recursive feature elimination process - at each RFE step, a gene is discarded from the active variables of a SVM classification model, according to some support criteria. Vector multiplication is the computation intensive kernel of SVM-RFE, and data parallelism, using OpenMP, is utilized to parallelize the algorithm. B. Clustering Workloads Clustering is the process of discovering the groups of similar objects from a database to characterize the underlying data distribution [9]. It has wide applications in market or customer segmentation, pattern recognition, and spatial data analysis. The first clustering application in MineBench is K- means [13]. K-means represents a cluster by the mean value of all objects contained in it. Given the user-provided parameter k, the initial k cluster centers are randomly selected from the database. Then, each object is assigned a nearest cluster center based on a similarity function. Once the new assignments are completed, new centers are found by finding the mean of all the objects in each cluster. This process is repeated until some convergence criteria is met. K-means tries to minimize the total intra-cluster variance. The clusters provided by the K-means algorithm are sometimes called hard clusters, since any data object either is or is not a member of a particular cluster. The Fuzzy K- means algorithm [14] relaxes this condition by assuming that a data object can have a degree of membership in each cluster. Compared to the similarity function used in K-means, the calculation for fuzzy membership results in a higher computational cost. However, the flexibility of assigning objects to multiple clusters might be necessary to generate better clustering qualities. Both K-means and Fuzzy K-means are parallelized by distributing the input objects among the processors. At the end of each iteration, extra communication is necessary to synchronize the clustering process. HOP [15], originally proposed in astrophysics, is a typical density-based clustering method. After assigning an estimation of the density for each particle, HOP associates each particle with its densest neighbor. The assignment process continues until the densest neighbor of a particle is itself. All particles reaching the same such particle are clustered into the same group. The advantage of HOP over other density based clustering methods is that it is spatially adaptive, coordinate-free and numerically straightforward. HOP is parallelized using a three dimensional KD tree data structure, which allows each thread to process only a subset of the particles, thereby reducing communication cost significantly. In the next version of MineBench there will be more clustering applications incorporated from Clusbench [16]. This clustering benchmark consists of single threaded versions of K-means online, K-means batch, Self Organizing Map- 1D (SOM-1D), SOM-2D, Hierarchical K-means online and Hierarchical SOM-1D algorithms. K-means batch algorithm is similiar to the implementation available in MineBench. However, in K-means online algorithm, the cluster center positions are updated during each assignment of the objects. Hence it is an incremental version of batch K-means. The selforganizing map is a type of artficial neural network which uses unsupervised learning to reduce the dimensions of the data. The algorithm randomizes the input weight vectors for the input data. Then for each object, Euclidean distance is used to find similarity between the object and each node s weight vector in the map. Finally, the nodes in the neighbourhood of the nearest map node are updated by pulling them closer to the input object. The difference between SOM-1D and SOM2-D is the way the original nodes weight vectors are connected in space. Hierarchical versions of this algorithm employ recursive calls of the original clustering algorithm. More information regarding these applications and their usage can be found in [16]. BIRCH [17] is an incremental and hierarchical clustering algorithm for large databases. It employs a hierarchical tree to represent the closeness of data objects. BIRCH scans the database to build a clustering-feature (CF) tree to summarize
4 TABLE II SUMMARY OF MINEBENCH EXECUTION COMMANDS Application ScalParC Naive Bayesian K-means Fuzzy K-means HOP BIRCH Eclat Apriori Utility SNP GeneNet SEMPHY Rsearch SVM-RFE PLSA Execution Command scalparc <datafile> <# records> <# attributes> <# classes> <# threads> bci [options] -d -h <hdrfile> <tabfile> <bcfile> example [options] -i <input file> example [options] -f -i <input file> para hop <num particles> <particle file> <nsmooth> <bucket size> <nhop> <nthreads> birch <parafile> <schefile> <projfile> <datafile> eclat -i <input file> -o <output file> -s <support> no output apriori -i <input file> -f <offset file> -s <support> -n <# threads> utility mine <transaction file> <offset file> <profit file> <utility threshold> <# threads> snp <input file> genenet.icc <input file> semphy.mt -s <input file> -f <input format> -m <model type> -G <alpha value> rsearch [options] <query sequence file> <data file> svm mkl <data file> <# cases> <# genes> <# iterations> parasw.mt <sequence 1> <sequence 2> <block height> <block width> <grid rows> <grid columns> <# alignments> <# threads> the cluster representation. For a large database, BIRCH can achieve good performance and scalability. It is also effective for incremental clustering of incoming data objects, or when an input data stream has to be clustered. C. ARM Workloads The goal of Association Rule Mining (ARM) is to find interesting relationships hidden in large data sets. More specifically, it attempts to find the set of all subsets of items or attributes that frequently occur in database records [9]. In addition, the uncovered relationships can be represented in the form of association rules, which state how a given subset of items influence the presence of another subset. Apriori [18] is the first ARM algorithm that pioneered the use of support-based pruning to systematically control the exponential growth of the search space. It is a level-wise algorithm that employs a generate-and-test strategy for finding frequent itemsets. It is based on the property that all non-empty subsets of a frequent itemset must all be frequent (the socalled Apriori property). For determining frequent items in a fast manner, the algorithm uses a hash tree to store candidate itemsets. Note: This hash tree has item sets at the leaves and hash tables at internal nodes. The Apriori algorithm is parallelized by distributing chunks of the input data among the processors. The master processor then gathers the local candidate itemsets, and generates globally frequent itemsets. Utility mining [19] is another association rule-based mining technique where the assumption of uniformity among items is discarded. Higher utility itemsets are identified from a database by considering different values for individual items. The goal of utility mining is to restrict the size of the candidate set so as to simplify the total number of computations required to calculate the value of items. It uses the transactionweighted downward closure property to prune the search space. The parallelization paradigm applied to Utility mining is the same as in Apriori, described above. Eclat [20] uses a vertical database format. It can determine the support of any k-itemset by simply intersecting the idlist of the first two (k-1)-length subsets that share a common prefix. It breaks the search space into small, independent, and manageable chunks. Efficient lattice traversal techniques are used to identify all the true maximal frequent itemsets. D. Optimization Workloads Sequence alignment is an important problem in bioinformatics used to align DNA, RNA or protein primary sequences so as to emphasize their regions of similarity. It is used to identify the similar and diverged regions between two sequences, which may indicate functional and evolutionary relationships between them. PLSA [12] uses a dynamic programming approach to solve this sequence matching problem. It is based on the algorithm proposed by Smith and Waterman, which uses the local alignment to find the longest common substring in sequences. PLSA uses special data structures to intelligently segment the whole sequence alignment problem into several independent subproblems, which dramatically reduce the computation necessary, thus providing more parallelism than previous sequence alignment algorithms. IV. INPUT DATASETS Input data is an integral part of data mining applications. The data used in our experiments are either real-world data obtained from various fields or widely-accepted synthetic data generated using existing tools that are used in scientific and statistical simulations. During evaluation, multiple data sizes were used to investigate the characteristics of the MineBench applications. For the non-bioinformatics applications, the input datasets were classified into three different sizes: small, medium, and large. For the ScalParC and Naive Bayesian applications, three synthetic datasets were generated by the IBM Quest data generator [21]. Apriori also uses three synthetic datasets from the IBM Quest data generator with a varying
5 TABLE III RUNTIMES OF MINEBENCH APPLICATIONS WITH DIFFERENT COMPILERS (IN SECONDS) Application ICC 7.0 ICC 8.1 ICC 9.1 GCC 2.96 GCC 3.2 GCC ScalParC Naive Bayesian K-means Fuzzy K-means HOP Birch C.E. C.E. C.E C.E. C.E. Eclat C.E Apriori Utility SNP C.E. C.E. C.E. C.E. GeneNet C.E. C.E. C.E * SEMPHY C.E Rsearch SVM-RFE C.E PLSA C.E.: Compilation Error * Compiled using GCC3.3.4 instead of GCC3.4.5 number of transactions, average transaction size, and average size of the maximal large itemsets. For HOP and Birch, three sets of real data were extracted from a cosmology application, ENZO [22], each having particles, particles and particles. A section of the real image database distributed by Corel Corporation is used for K-means and Fuzzy K-means. This database consists of scenery pictures. Each picture is represented by two features: color and edge. The color feature is a vector of 9 floating points while the edge feature is a vector of size 18. Since the clustering quality of K-means methods highly depends on the input parameter k, both K-means were executed with 10 different k values ranging from 4 to 13. Utility mining uses both real as well as synthetic datasets. The synthetic data consists of two databases generated using the IBM Quest data generator. The first synthetic dataset is a dense database, where the average transaction size is 10; the other is a relatively sparse database, where average transaction size is 20. The average size of the potentially frequent itemsets is 6 in both sets of databases. In both sets of databases, the number of transactions varies from 1000K to 8000K and the number of items varies from 1K to 8K. The real dataset consists of only one database of size 73MB, where the average transaction length is 7.2. Bioinformatics applications use datasets obtained from different biological databases. SNP uses the Human Genic Bi- Allelic Sequences (HGBASE) database containing 616,179 SNPs sequences. For GeneNet, Yeast microarray data was used as the input dataset. SEMPHY considers three datasets from Pfam database. The dataset for Rsearch, obtained from Washington University at St. Louis [23], uses a sequence of length 97 to search a database of size 100KB. SVM-RFE uses a benchmark microarray data set on ovarian cancer. This dataset contains 253 (tissue samples) x (genes) expression values, including 91 control and 162 ovarian cancer tissues with early stage cancer samples. For PLSA, nucleotides ranging from 30K to 900K length are chosen as test sequences [12]. V. USING MINEBENCH The applications in MineBench are written using C/C++ and the codes have been parallelized using OpenMP. Table II provides a summary of MineBench executables and command line arguments. The details about the data sets have been given in the previous section, and the detailed commandline usage is provided in the source distribution. In the current version, we make use of GCC v2.96, and Intel C/C++ (ICC) compiler version 7.0 to compile serial and parallel codes, respectively. We have also tried compiling the codes using different versions of ICC and GCC (for ICC v7.0, v8.1, v9.1 and GCC v2.95, v3.2, v3.4.5), on Redhat Linux, Redhat Linux Enterprise, Fedora Core and Suse distributions. Most of the applications compile properly, requiring only slight changes to header files and makefiles. The execution times using each compiler are presented in Table III. To obtain the runtimes, we have executed the MineBench applications on a dual-processor Intel Xeon 2.8GHz machine using medium sized data sets. Amongst the GCC compilers, we see that the newest version performs the best for a vast majority of the applications. On the other hand, ICC v9.1 produces the fastest running times for around half the applications. A. Simulation of MineBench Applications Because of the way these applications have been parallelized using OpenMP, it is difficult to cross-compile and simulate on most widely-used computer architecture simulators. In addition, the long execution times of these applications prohibit detailed simulations. An architectural simulator that can be used to evaluate single-threaded workloads is PTLSim [4]. On this x86-64 cycle-accurate simulator, it is possible to run executables created with ICC, using the -openmp stubs flag to create a single-threaded application binary. Another important
6 feature of the simulator is its ability to switch between simulation and native execution modes using built-in function calls. By examining the applications call graphs and inserting appropriate PTLSim function calls, we have been able to perform an in-depth study of the computationally intensive kernels. By doing so, we reduce the simulation time by several orders of magnitude. The binaries containing PTLSim function calls are also provided on our center s website [24]. VI. CONCLUSION Data mining applications are gaining significance as computationally intensive workloads, necessitating applicationspecific architectures to optimize their performance. A new data mining benchmark is required for workload characterization and architecture evaluation. In this paper, we present MineBench, a diverse benchmark suite that covers a wide spectrum of data mining applications. More information about benchmark characteristics and workload characterization can be found in [25], [26]. We have provided optimized source codes, data sets of various sizes, and information for compiling and using the MineBench applications on various platforms. MineBench is freely available at our center s website [24]. MineBench will continually evolve to reflect changing trends in the data mining field. We believe MineBench is a useful tool in computer architecture, and systems research. ACKNOWLEDGMENTS This work was supported in part by National Science Foundation (NSF) under grants NGS CNS , IIS /002, CNS , CCF , CCF , and NSF/CARP ST-HEC program under grant CCF , and in part by the Department of Energy s (DOE) SCiDAC program (Scientific Data Management Center), number DE- FC02-01ER25485, DOE grants DE-FG02-05ER25683, and DE-FG02-05ER25691, and in part by Intel Corporation. REFERENCES [1] University of California, Berkeley, The how much information? project, Available at [2] Standard Performance Evaluation Corporation, SPEC CPU2000 V1.2, CPU Benchmarks, Available at [3] Transaction Processing Performance Council, TPC-H Benchmark Revision 2.0.0, [4] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, The SPLASH-2 programs: Characterization and methodological considerations, in Proceedings of the 22nd International Symposium on Computer Architecture (ISCA), June 1995, pp [5] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, MediaBench: A tool for evaluating and synthesizing multimedia and communications systems, in Proceedings of 30th Annual International Symposium on Microarchitecture (MICRO), Dec. 1997, pp [6] Y. Li, T. Li, T. Kahveci, and J. Fortes, Workload characterization of bioinformatics applications, in Proceedings of the 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Sept. 2005, pp [7] K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin, B. Jacob, C. Tseng, and D. Yeung, BioBench: A benchmark suite of bioinformatics applications, in Proceedings of The 5th International Symposium on Performance Analysis of Systems and Software (ISPASS), Mar [8] D. Bader, Y. Li, T. Li, and V. Sachdeva, BioPerf: A benchmark suite to evaluate high-performance computer architecture on bioinformatics applications, in Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), Oct [9] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, Aug [10] M. Joshi, G. Karypis, and V. Kumar, ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets, in Proceedings of the 11th International Parallel Processing Symposium (IPPS), [11] P. Domingos and M. Pazzani, Beyond independence: Conditions for optimality of the simple bayesian classifier, in Proceedings of the International Conference on Machine Learning, [12] Y. Chen, Q. Diao, C. Dulong, W. Hu, C. Lai, E. Li, W. Li, T. Wang, and Y. Zhang, Performance scalability of data-mining workloads in bioinformatics, Intel Technology Journal, vol. 09, no. 12, pp , May [13] J. MacQueen, Some methods for classification and analysis of multivariate observations, in Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, [14] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, [15] D. Eisenstein and P. Hut, Hop: A new group finding algorithm for N-body simulations, Journal of Astrophysics, no. 498, pp , [16] O. Altun, N. Dursunoglu, and M. F. Amasyali, Clustering application benchmark, in To Appear in Proceedings of the International Symposium on Workload Characterization (IISWC), Oct [17] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: an efficient data clustering method for very large databases, in SIGMOD, [18] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo, Fast discovery of association rules, Advances in Knowledge Discovery and Data Mining, pp , [19] Y. Liu, W. Liao, and A. Choudhary, A two-phase algorithm for fast discovery of high utility itemsets, in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), May [20] M. Zaki, Parallel and distributed association mining: A survey, IEEE Concurrency, Special Issue on Parallel Mechanisms for Data Mining, vol. 7, no. 4, pp , Dec [21] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, and R. Srikant, The Quest data mining system, in Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining, Aug [22] M. Norman, J. Shalf, S. Levy, and G. Daues, Diving deep: Data management and visualization strategies for adaptive mesh refinement simulations, Computing in Science and Engineering, vol. 1, no. 4, pp , [23] Sean Eddy s Lab, Rsearch software repository, [24] The Center for Ultra-scale Computing and Information Security (CU- CIS) at Northwestern University, NU-Minebench version 2.0, Available at [25] B. Ozisikyilmaz, R. Narayanan, J. Zambreno, G. Memik, and A. Choudhary, An architectural characterization study of data mining and bioinformatics workloads, in To Appear in Proceedings of International Symposium on Workload Characterization (IISWC), Oct [26] J. Zambreno, B. Ozisikyilmaz, J. Pisharath, G. Memik, and A. Choudhary, Performance characterization of data mining applications using MineBench, in 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), Feb
PERFORMANCE EVALUATION AND CHARACTERIZATION OF SCALABLE DATA MINING ALGORITHMS
PERFORMANCE EVALUATION AND CHARACTERIZATION OF SCALABLE DATA MINING ALGORITHMS Ying Liu, Jayaprakash Pisharath, Wei-keng Liao, Gokhan Memik, Alok Choudhary, Pradeep Dubey * Department of Electrical and
Random forest algorithm in big data environment
Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
Introduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
Comparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College
Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Jianqiang Dong, Fei Wang and Bo Yuan Intelligent Computing Lab, Division of Informatics Graduate School at Shenzhen,
Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will
Information Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
Healthcare Measurement Analysis Using Data mining Techniques
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains
A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains Dr. Kanak Saxena Professor & Head, Computer Application SATI, Vidisha, [email protected] D.S. Rajpoot Registrar,
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Fast Multipole Method for particle interactions: an open source parallel library component
Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,
CAS CS 565, Data Mining
CAS CS 565, Data Mining Course logistics Course webpage: http://www.cs.bu.edu/~evimaria/cs565-10.html Schedule: Mon Wed, 4-5:30 Instructor: Evimaria Terzi, [email protected] Office hours: Mon 2:30-4pm,
Static Data Mining Algorithm with Progressive Approach for Mining Knowledge
Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
Principles of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
ASSOCIATION RULE MINING ON WEB LOGS FOR EXTRACTING INTERESTING PATTERNS THROUGH WEKA TOOL
International Journal Of Advanced Technology In Engineering And Science Www.Ijates.Com Volume No 03, Special Issue No. 01, February 2015 ISSN (Online): 2348 7550 ASSOCIATION RULE MINING ON WEB LOGS FOR
Learning outcomes. Knowledge and understanding. Competence and skills
Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges
Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India
Volume 5, Issue 6, June 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Multiple Pheromone
Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm
SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH
MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH M.Rajalakshmi 1, Dr.T.Purusothaman 2, Dr.R.Nedunchezhian 3 1 Assistant Professor (SG), Coimbatore Institute of Technology, India, [email protected]
Dr. U. Devi Prasad Associate Professor Hyderabad Business School GITAM University, Hyderabad Email: [email protected]
96 Business Intelligence Journal January PREDICTION OF CHURN BEHAVIOR OF BANK CUSTOMERS USING DATA MINING TOOLS Dr. U. Devi Prasad Associate Professor Hyderabad Business School GITAM University, Hyderabad
Binary Coded Web Access Pattern Tree in Education Domain
Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: [email protected] M. Moorthi
Classification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
Grid Density Clustering Algorithm
Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2
Bioinformatics Grid - Enabled Tools For Biologists.
Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier
Data Mining: Concepts and Techniques Jiawei Han Micheline Kamber Simon Fräser University К MORGAN KAUFMANN PUBLISHERS AN IMPRINT OF Elsevier Contents Foreword Preface xix vii Chapter I Introduction I I.
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
Comparison of Data Mining Techniques for Money Laundering Detection System
Comparison of Data Mining Techniques for Money Laundering Detection System Rafał Dreżewski, Grzegorz Dziuban, Łukasz Hernik, Michał Pączek AGH University of Science and Technology, Department of Computer
SPMF: a Java Open-Source Pattern Mining Library
Journal of Machine Learning Research 1 (2014) 1-5 Submitted 4/12; Published 10/14 SPMF: a Java Open-Source Pattern Mining Library Philippe Fournier-Viger [email protected] Department
International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET
DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand
The basic data mining algorithms introduced may be enhanced in a number of ways.
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,
Email Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
Detecting E-mail Spam Using Spam Word Associations
Detecting E-mail Spam Using Spam Word Associations N.S. Kumar 1, D.P. Rana 2, R.G.Mehta 3 Sardar Vallabhbhai National Institute of Technology, Surat, India 1 [email protected] 2 [email protected]
A Grid-based Clustering Algorithm using Adaptive Mesh Refinement
Appears in the 7th Workshop on Mining Scientific and Engineering Datasets 2004 A Grid-based Clustering Algorithm using Adaptive Mesh Refinement Wei-keng Liao Ying Liu Alok Choudhary Abstract Clustering
Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca
Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 [email protected] What is Learning? "Learning denotes changes in a system that enable
How To Balance In Cloud Computing
A Review on Load Balancing Algorithms in Cloud Hareesh M J Dept. of CSE, RSET, Kochi hareeshmjoseph@ gmail.com John P Martin Dept. of CSE, RSET, Kochi [email protected] Yedhu Sastri Dept. of IT, RSET,
Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang
Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental
Top Top 10 Algorithms in Data Mining
ICDM 06 Panel on Top Top 10 Algorithms in Data Mining 1. The 3-step identification process 2. The 18 identified candidates 3. Algorithm presentations 4. Top 10 algorithms: summary 5. Open discussions ICDM
MS1b Statistical Data Mining
MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to
Top 10 Algorithms in Data Mining
Top 10 Algorithms in Data Mining Xindong Wu ( 吴 信 东 ) Department of Computer Science University of Vermont, USA; 合 肥 工 业 大 学 计 算 机 与 信 息 学 院 1 Top 10 Algorithms in Data Mining by the IEEE ICDM Conference
131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis
Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis Abdun Mahmood, Christopher Leckie, Parampalli Udaya Department of Computer Science and Software Engineering University of
Data Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based
A Comparative Study of clustering algorithms Using weka tools
A Comparative Study of clustering algorithms Using weka tools Bharat Chaudhari 1, Manan Parikh 2 1,2 MECSE, KITRC KALOL ABSTRACT Data clustering is a process of putting similar data into groups. A clustering
EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH
EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH SANGITA GUPTA 1, SUMA. V. 2 1 Jain University, Bangalore 2 Dayanada Sagar Institute, Bangalore, India Abstract- One
Distributed Data Mining Algorithm Parallelization
Distributed Data Mining Algorithm Parallelization B.Tech Project Report By: Rishi Kumar Singh (Y6389) Abhishek Ranjan (10030) Project Guide: Prof. Satyadev Nandakumar Department of Computer Science and
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan
Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking
Analysis of Memory Sensitive SPEC CPU2006 Integer Benchmarks for Big Data Benchmarking Kathlene Hurt and Eugene John Department of Electrical and Computer Engineering University of Texas at San Antonio
IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS
IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS V.Sudhakar 1 and G. Draksha 2 Abstract:- Collective behavior refers to the behaviors of individuals
Sanjeev Kumar. contribute
RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledgee discovery is emerging as a
How To Cluster Of Complex Systems
Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving
Analytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
Classification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
TOWARD BIG DATA ANALYSIS WORKSHOP
TOWARD BIG DATA ANALYSIS WORKSHOP 邁 向 巨 量 資 料 分 析 研 討 會 摘 要 集 2015.06.05-06 巨 量 資 料 之 矩 陣 視 覺 化 陳 君 厚 中 央 研 究 院 統 計 科 學 研 究 所 摘 要 視 覺 化 (Visualization) 與 探 索 式 資 料 分 析 (Exploratory Data Analysis, EDA)
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
Map/Reduce Affinity Propagation Clustering Algorithm
Map/Reduce Affinity Propagation Clustering Algorithm Wei-Chih Hung, Chun-Yen Chu, and Yi-Leh Wu Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology,
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
Chapter 20: Data Analysis
Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification
A Review of Data Mining Techniques
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
A Survey on Outlier Detection Techniques for Credit Card Fraud Detection
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. VI (Mar-Apr. 2014), PP 44-48 A Survey on Outlier Detection Techniques for Credit Card Fraud
Mining an Online Auctions Data Warehouse
Proceedings of MASPLAS'02 The Mid-Atlantic Student Workshop on Programming Languages and Systems Pace University, April 19, 2002 Mining an Online Auctions Data Warehouse David Ulmer Under the guidance
Big Data with Rough Set Using Map- Reduce
Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,
Customer Classification And Prediction Based On Data Mining Technique
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
SPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013
Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013 James Maltby, Ph.D 1 Outline of Presentation Semantic Graph Analytics Database Architectures In-memory Semantic Database Formulation
Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms
Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms Y.Y. Yao, Y. Zhao, R.B. Maguire Department of Computer Science, University of Regina Regina,
Mathematical Models of Supervised Learning and their Application to Medical Diagnosis
Genomic, Proteomic and Transcriptomic Lab High Performance Computing and Networking Institute National Research Council, Italy Mathematical Models of Supervised Learning and their Application to Medical
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
