Parallel Methods for Scaling Data Mining Algorithms to Large Data Sets
ROBERT GROSSMAN
Magnify, Inc. & Laboratory for Advanced Computing/National Center for Data Mining, University of Illinois at Chicago, USA

YIKE GUO
Department of Computing, Imperial College, University of London, UK

This is a draft of the following publication: R. L. Grossman and Yike Guo, Parallel Methods for Scaling Data Mining Algorithms to Large Data Sets, Handbook on Data Mining and Knowledge Discovery, Jan M. Zytkow, editor, Oxford University Press, 2002.

In this chapter, we describe some approaches and specific techniques for scaling data mining algorithms to large data sets through parallel processing. We then analyse in more detail three core algorithms that can be scaled to large data sets: building decision trees, discovering association rules, and creating clusters.

C Introduction

A fundamental challenge is to extend data mining to large data sets. In this chapter, we introduce some of the basic approaches and techniques that have proved successful and describe in some detail work on scaling three fundamental data mining algorithms: trees, clustering algorithms and association rules.

Section C introduces computational models for working with large data sets. By large we mean data that does not fit into the memory of a single processor. Parallel RAM computational models describe algorithms which are distributed between several processors. Hierarchical memory computational models describe algorithms that require working with data both in memory and on disk. Parallel disk computational models describe algorithms in which data is distributed over several processors and disks.

Section C surveys some of the basic approaches to scaling data mining algorithms. The most basic approach is to manipulate the data until it fits into memory. Another fundamental technique is to use specialized data structures to work with data which is disk resident.
We also describe techniques for distributing algorithms between several processors, precomputing various quantities, and intelligently reducing the amount of data. Sections C5.10.4, C and C describe work on scaling tree-based algorithms, association rules, and clustering algorithms to large data sets.

C Computational and Programming Models

In this section, we briefly describe some of the different computational and programming models that are used in high performance and parallel computing. We begin by discussing the cost of computation, describing four models: a RAM model, a parallel RAM model, a disk model and a parallel disk model. Next, we describe two basic distinctions between the various programming models used in high performance computing. The first distinction is whether the data itself is used to determine the parallelism (data parallelism) or whether the parallelism is determined explicitly by the programmer (task parallelism). The second distinction is how different processors communicate: this
can be done with shared memory, with message passing, or with remote memory operations.

Computational Models. The standard model for measuring the complexity of an algorithm is the random access machine (RAM) model (Aho 1974). A RAM model has a single processor with unlimited memory, which can store and access data with unit cost. With the RAM model, sorting N records has cost O(N log N).

Parallel computers exploit multiple processors. Shared memory parallel computers allow more than one processor to share the same memory space. With the P-RAM model, different processors may simultaneously read the same memory location, but may not simultaneously write to the same memory location. In distributed memory parallel computers, each processor has its own memory and processors communicate by explicitly sending messages to each other over an interconnection network. See (Kumar 1994) for more details about shared memory and distributed memory parallel computers.

In practice, accessing data from disk can affect the running time of an algorithm by one or two orders of magnitude, so that an O(N^3) algorithm effectively becomes an O(N^5) algorithm. To model this, the most basic I/O model assumes that data is either in memory or on disk, and that data in memory can be accessed uniformly with unit cost, while data on disk can be accessed uniformly, but at a higher cost.

On a parallel computer, there will usually be several disks which can read and write blocks in parallel. With the Parallel Disk Model (Vitter 1994), B data records (a block of data) can be read from disk into memory at unit cost, and D blocks of data can be read or written at once. A typical algorithm reads up to M records into memory (DB records with each parallel I/O), computes with them, and writes any necessary information back to disk. An external memory algorithm is designed to work with N > M records, so that several memory loads of M records must be used to examine all of the data.
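As a small worked example of this cost model, the sketch below counts the parallel reads needed to scan a data set once. The sizes are hypothetical and the function name is our own; the point is only that the scan cost scales as N/(DB).

```python
import math

def parallel_reads_per_scan(n_records, block_size, n_disks):
    """Parallel I/O operations needed to scan N records once, when each
    parallel read fetches D blocks of B records (the basic PDM cost)."""
    return math.ceil(n_records / (block_size * n_disks))

# Hypothetical sizes: 10**8 records, blocks of 10**4 records, 4 disks.
# One full scan then costs 10**8 / (10**4 * 4) = 2500 parallel reads.
print(parallel_reads_per_scan(10**8, 10**4, 4))   # 2500
```

Doubling the number of disks D halves the number of parallel reads for a scan, which is why laying data out across several disks pays off for scan-heavy algorithms.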
With the parallel disk model, sorting has cost (Vitter 1994) O((N/DB) log(N/B) / log(M/B)), where:

N - number of input records
P - number of processors
M - number of records that fit into the aggregate internal memories of the P processors
B - number of records that can be transferred in a single block
D - number of disks, and more generally the number of disk blocks that can be transferred with one parallel read or parallel write operation

Communication Primitives. Certain communication patterns in parallel algorithms are very common, and special hardware and software is typically provided to support them. We describe three of them here. Scatter takes a value at one processor and sends it to all the other processors. Gather takes values at all the processors and brings them to a common processor. Reduction takes values at all of the processors, computes the sum, and places the sum in each of the processors. Reduction can also be used for computing the max, min, and similar operations.

Data Parallelism. With data parallelism, data is divided into different partitions, the same program is run on each partition, and the results are combined. Finding the maximum value in a list of N elements has an easy data parallel solution. If the list is divided into P sublists and one sublist is stored in the local memory of each processor, then each processor can determine its local maximum and send it to a central processor or place it in common shared memory. The global maximum is then the largest of the P values. As another example, a data parallel approach to growing a tree splits the data into P partitions, grows a single tree on each partition, and then produces an ensemble of P trees
(Grossman, Bodek et al. 1996).

Task Parallelism. Task parallelism is specified explicitly by the programmer. For example, a task parallel approach to growing a tree uses P processors to speed up the computation of locating the best split for a single node in the tree. In the simplest task parallel approach, the data is distributed evenly between the P processors and, for each attribute, each processor computes the class distribution information for that attribute using its local data. See Section C for an example of class distribution information. Reduction is used to exchange local class distribution information with each of the other P-1 processors to compute global class distribution information. A single split value is computed and scattered to each of the other P-1 processors. Using this split value, the data is distributed between the two nodes produced by the split and the process repeats. Notice that, unlike the ensemble based approach described above, the different processors need to communicate class distribution information before a split can be determined. On the other hand, a single tree is computed, whereas the ensemble based approach yields a collection of trees.

Shared Memory. The simplest way for different processors to communicate is for each to share some global memory. Locking is used to control conflicts when different processors write to the same memory location. As the number of processors grows, it becomes more difficult to design machines in which all the global memory can be accessed uniformly. Some architectures allow each processor access to global memory, but different processors may require different amounts of time to read and write the common shared memory. A variant is for each processor to have some local memory and some global memory.

Message Passing.
With message passing, each processor has its own memory, and different processors communicate by explicitly sending and receiving messages between them with send and receive commands. Messages are simply buffers of data of specified length.

Remote Memory Operations. With remote memory operations, a processor can explicitly access the memory of other processors, but different operations are used for accessing local and remote memory. For example, local memory access is implicit, while remote memory access requires explicit get or put commands. Unlike message passing, in which the remote processor must explicitly receive the message, with remote memory operations all the work is done by the local processor.

C Five Basic Approaches for Scaling Data Intensive Computing

Approach 1. Manipulate the data so that it fits into memory. We begin with the most common approach. There are four basic variants. The first is to sample the data until the number of records N is smaller than the memory of a single processor. The second is to select features until the amount of data is smaller than the memory of a single processor. The third is to partition the data so that, although it doesn't fit into the memory of a single processor, it does fit into the aggregate memory M of the processors. The fourth is to summarize the data in some fashion so that the summarized or partly summarized data can fit into memory. These four techniques can be used in any combination with each other. Broadly speaking, these techniques arose from the statistical community.

Approach 2. Reduce the time to access out of memory data. Special care is required when accessing data from disk. No more time is required to access all B records in a block on disk than is required to access any single one of them. Three basic techniques are common. The first uses specialized data structures to access data on disk.
The most familiar is the B+-tree, which uses a tree structure to determine which block contains a desired record and which has efficient operations for adding new blocks and merging existing blocks (Ramakrishnan 1997). The second technique is to lay out the data on disk to benefit from block reads. For example, some algorithms proceed faster if data is organized by record and others if data is organized by attribute. The third technique organizes data on disk to benefit from parallel block reads. A basic example is provided by matrix transpose. Instead of organizing the disk by row or column, a slightly more complicated organization can cut the number of reads by a factor of 2 (Shriver 1996). Broadly speaking, these techniques arose from the database community.
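As an illustration of benefiting from block reads, the sketch below scans a file of fixed-size records one block at a time, so that each disk access fetches a full block of records rather than a single record. The record layout, block size, and function names are our own assumptions, not part of any particular system.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<d")   # assumed layout: one float64 per record
BLOCK_RECORDS = 1024           # records fetched per block read

def block_mean(path):
    """Mean of all records, touching the file one block at a time."""
    total, count = 0.0, 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(RECORD.size * BLOCK_RECORDS)   # one block read
            if not buf:
                break
            block = [v for (v,) in RECORD.iter_unpack(buf)]
            total += sum(block)
            count += len(block)
    return total / count

# Write 10,000 records (the values 0.0 .. 9999.0) and scan them in blocks.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(10_000):
        f.write(RECORD.pack(float(i)))
result = block_mean(f.name)
os.remove(f.name)
print(result)   # 4999.5
```

Only one block of records is ever held in memory, so the same scan works no matter how much larger the file is than memory.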
Approach 3. Use several processors. One of the easiest ways to speed up algorithms on large data sets is to use more than one processor. The success depends upon how easy it is to break the problem into sub-problems which can be assigned to the different processors. As described above, there are two basic techniques. The first is data parallelism: essentially the same program is applied to different partitions of the data. The second is task parallelism: the program itself is broken into sub-tasks, which are distributed among the available processors. We will examine several examples of both techniques later in this chapter. Broadly speaking, these techniques come from the high performance computing community.

Approach 4. Precompute. The most expensive part of building tree based classifiers on continuous attributes is sorting. For example, the first table below contains the class distribution information for the continuous attribute 8. Computing the best split value for the tree requires sorting the data by the attribute's values as indicated. Precomputing these sorts reduces the cost of the algorithm (Mehta 1996). For efficiency, specialized data structures are usually employed (Shafer, Agrawal and Mehta 1996). Intermediate computations can sometimes be shared across algorithms. For example, cross-tab tables, such as the second table below for ordinal attribute 5, are of intrinsic interest and are also used by different algorithms, including trees. Precomputing such tables can often save significant time. A closely related approach is to provide very efficient implementations of certain basic operations, such as computing statistics on columns, which can be shared across algorithms. Sometimes these are known as data mining primitives. Broadly speaking, these techniques were developed by the data mining system implementation community.
Class Distribution Information for Attribute 8

  Attribute Value | Fraud | No Fraud
  ...             | ...   | ...

Class Distribution Information for Attribute 5

  Attribute Value | Fraud | No Fraud
  0 (codes 0-1)   | ...   | ...
  (codes 2-5)     | ...   | ...
  (codes 5-9)     | ...   | ...
  (codes >9)      | ...   | ...

Approach 5. Reduce the amount of data. This approach is very similar to Approach 1, except that there is no expectation that the data will fit into memory. Three of the techniques mentioned in Approach 1 apply here without change: sampling, selecting features, and summarizing data. We also mention three more specialized techniques, which can sometimes be used in Approach 1 as well. Discrete data points can be smoothed and replaced by a continuous approximation specified by one or more parameters. For example, a set of points can be replaced by its center, a measure of dispersion, the number of points, the sum of the errors, and the sum of the squared errors. Data can also be compressed and computations done directly on the compressed data. Finally, data can be transformed with more complicated transformations which reduce the size of the data, and variants of
algorithms can be applied directly to the transformed data. For example, data can be reduced with a principal components analysis.

C Parallel Tree Induction

In this section we describe some of the ways that have been used to scale tree algorithms. For simplicity, we describe these approaches in the context of the C4.5 system (Quinlan 1993). C4.5 attempts to find the simplest classification tree that describes the structure of the data by applying search heuristics based on information theory. At any given node in the tree, the algorithm chooses the most suitable attribute with which to expand it further, based on the concept of information gain, a measure of the ability of an attribute to minimise the information needed to classify the cases in the resulting subtrees. The algorithm constructs a tree recursively using a depth first divide-and-conquer approach. Other tree induction algorithms share a similar computational structure. See Section C.

There are three main approaches for building trees in parallel.

Move Class Distribution Information. This approach is based on dividing the initial data set evenly among the P processors. The processors leave the data in place but move the class distribution information (C5.10.3) using reduction in order to compute the splitting values, as described in the section above on task parallelism. In more detail, consider the expansion of a single node into its children using splitting values. Each processor computes the class distribution information for each attribute using its local data. Each processor then uses reduction with the other P-1 processors to compute the global class distribution information. The processors then simultaneously compute the splitting criteria and scatter the value of the attribute with the best split value. Using these splitting values, the data is then assigned to the children and the process continues. The main advantage of this approach is that no data needs to be moved.
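A minimal sketch of the local-count-plus-reduction step just described, with in-process lists standing in for the partitions held by different processors; the data, labels, and function names are invented for illustration, and the subsequent search for the best split value is omitted.

```python
from collections import Counter
from functools import reduce

def local_class_distribution(rows):
    """One processor's counts of (attribute value, class label) pairs
    over its local partition of the data."""
    return Counter((x, y) for x, y in rows)

def merge(c1, c2):
    """The reduction step: sum local counts into global counts."""
    return c1 + c2

# Two hypothetical data partitions of (attribute value, class) pairs.
parts = [
    [(0, "fraud"), (0, "ok"), (1, "ok")],
    [(0, "fraud"), (1, "ok"), (1, "ok")],
]
global_counts = reduce(merge, (local_class_distribution(p) for p in parts))
print(global_counts[(0, "fraud")])   # 2
print(global_counts[(1, "ok")])      # 3
```

Only these small count tables cross between processors; the rows themselves stay where they are, which is the whole point of the approach.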
On the other hand, moving the class distribution information can have a high communication cost and can result in load imbalance. In particular, the deeper the tree, the less the data at each node, and the greater the relative overhead of the communication.

Move Data. The advantage of this approach is that, when possible, different processors can work on different nodes at the same time. The basic idea is simple. Assume a group of processors is assigned to a node. The processors work together to compute the split value as described for the Move Class Distribution Information approach above. Assume that the number of children produced by the split is less than the number of processors available. Split the processors between the children and then distribute the data of each child to the processors assigned to it. This partitions the underlying data between several processors so that they can work simultaneously. The processors are next used to compute the split values of the children. Processors assigned to different nodes can proceed independently. The case in which the number of children is greater than the number of processors is handled similarly. Kumar (1998) gives a performance formula for this approach. This method is also referred to as search parallelisation (Provost and Kolluri 1999), since the search space is divided among the processors so that different processors search different portions of the space in parallel. A disadvantage of this approach is that moving data can result in a high communication overhead. Another disadvantage is that the work load can become unbalanced. On the other hand, an advantage of this approach is that once a single processor is assigned to a node, it can compute the subtree without any communication overhead. Zaki (1998) applies this approach to parallel tree induction by taking advantage of a shared memory multiprocessor architecture.

Ensemble-based Methods.
With this approach, the data is divided into partitions, perhaps overlapping, and one or more processors are used to build a separate tree for each partition (Grossman et al. 1996, Grossman and Poor 1996). This produces an ensemble of trees, which can be combined
using a variety of methods. An ensemble is a collection of statistical models, together with a rule for combining the models into a single model. For example, the models may be combined with a voting function or a function which averages the various values produced by the separate models. See (Dietterich 1997) for additional information about ensembles in data mining.

Two or more of these approaches may be combined to produce hybrid algorithms. For example, Kumar et al. (1998) describe an algorithm which starts by exploiting the approach of moving class distribution information. When the inter-processor communication and synchronisation requirements increase past a certain threshold, the implementation switches to exploiting a mixture of approaches, involving moving both data and class distribution information. This method is also referred to as parallel matching (Provost and Kolluri 1999). This approach has been adopted in the work of Provost and Aronis (1996) and in the parallelisation of the SPRINT algorithm (Shafer, Agrawal and Mehta 1996), as well as in the recently proposed ScalParC parallel tree induction system (Joshi 1998). Other examples are given by Pearson (1994), who uses a vertical partitioning strategy, and Han (1999), who uses a horizontal partitioning strategy. Ensemble based methods have also been combined with approaches that move both data and class distribution information (Grossman et al. 1996).

It should be noted that after generating a classification tree with an algorithm such as C4.5, several post-processing steps might still be required. These are applied in order to simplify the tree and to translate it into a set of production rules (Quinlan 1993). Kufrin (1997) has noted that these post-processing steps may require more computation time than the actual tree generation phases and has described how such steps can be parallelised.

C Parallel Association Rule Discovery

Algorithms for uncovering associations were introduced in Agrawal et
al. (1993). The well known Apriori algorithm (Agrawal 1993) generates association rules by computing frequent item sets (C5.2.3). Frequent item sets of length 1 are simply singleton sets. Given the frequent item sets of length n, there are efficient algorithms for computing the frequent item sets of length n+1 (Agrawal 1994). The Apriori algorithm uses a hash tree to maintain the item sets of length n while it computes the frequent item sets of length n+1. The hash tree is required to remain in main memory, although the transaction data set is not. Association rules can be read off easily from frequent item sets.

There are two steps to construct the frequent item sets of length n+1. In the first step, a set of candidate frequent item sets is created. In the second step, the entire database is scanned to count the number of transactions that contain each candidate set. Concurrency can be used in both steps: parallel processing can be used to speed up the creation of candidate frequent item sets and to speed up the counting of transactions.

Data parallel approaches which distribute the transaction data among several processors and count the transactions in parallel have been proposed by Park et al. (1995) and Agrawal et al. (1996). The Count Distribution (CD) algorithm of Agrawal et al. is an adaptation of the Apriori algorithm. At each iteration, the algorithm generates the candidate sets at each local processor by applying the same generation function as that used in the Apriori algorithm. Each processor then computes the local support counts of all the candidate sets and uses a reduction to compute the global frequent item sets for that iteration. In this way, the CD algorithm scales linearly with the number of transactions. On the other hand, since the CD algorithm, like Apriori, requires the hash tree to fit into the memory of a single processor, it does not scale as the number of candidates in the frequent item sets increases.
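The counting step of the CD scheme can be sketched in a few lines, with lists standing in for the transaction partitions held by different processors and a sum of counters standing in for the reduction. The transactions and the minimum support are invented for illustration, and plain sets replace the hash tree.

```python
from collections import Counter
from itertools import combinations

def local_counts(transactions, candidates):
    """One processor's support counts for the candidate item sets,
    using only its own partition of the transactions."""
    counts = Counter()
    for t in transactions:
        for cand in candidates:
            if cand <= t:            # candidate contained in transaction
                counts[cand] += 1
    return counts

# Two hypothetical transaction partitions and the length-2 candidates.
parts = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"a", "b"}, {"b", "c"}],
]
candidates = [frozenset(p) for p in combinations(("a", "b", "c"), 2)]

# Reduction: sum the local counts to obtain the global support counts.
global_support = sum((local_counts(p, candidates) for p in parts), Counter())
frequent = {c for c in candidates if global_support[c] >= 2}   # minsup = 2
print(sorted(sorted(c) for c in frequent))   # [['a', 'b'], ['b', 'c']]
```

Every processor scans only its own transactions, so the work of the counting pass scales down linearly with the number of partitions, exactly as the text claims for CD.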
To scale as the number of candidates in the frequent item sets increases, the frequent item sets themselves can be distributed among the processors. In this case, a simple hash tree fitting into the memory of a single processor can no longer be used. Simple implementations of this idea require moving all the data to each of the processors in order to compute the counts. Sometimes this is called
the Data Distribution (DD) method. Simple DD algorithms do not perform well due to communication overhead, but more complex implementations have been developed with better performance. CD style algorithms scale to large transaction data sets since the transactions are partitioned. DD style algorithms scale to problems with large candidate sets since the candidates are partitioned. Some algorithms combine these two approaches to achieve scalability along both dimensions (Han 1997).

Cheung (1996) observed that a globally frequent item set must be a locally frequent item set for some processor. With this property, much smaller candidate sets can be generated in parallel at each processor. Moreover, local pruning can be applied by removing those sets that are not locally large. The communication required for exchanging support counts is thereby reduced from the O(P^2) of directly parallelising Apriori to O(P), where P is the number of distributed processors or computers. This type of algorithm can be extended easily to parallelise any pattern discovery algorithm which employs a level by level monotonic search component like that of Apriori. Speeding up association rules through sampling is discussed in Lee et al. (1998).

In a recent paper (Pei 2000), Pei, Han et al. proposed the CLOSET algorithm for computing association rules. Instead of generating all frequent item sets, the algorithm computes a much smaller set of candidates. Their algorithm also employs a compact representation of association rules. This algorithm is based on a memorisation mechanism which avoids redundant computations. A partition-based approach can be used to scale this algorithm to large data sets. At this time, parallel versions of this algorithm have not been studied in detail.

C Parallel Clustering Algorithms

Clustering algorithms can be broadly divided into three types: distance-based clustering, hierarchical clustering and density-based clustering (C5.5).
In general, clustering algorithms employ a two-stage search: an outer loop over the possible numbers of clusters and an inner loop to fit the best possible clustering for a given number of clusters.

With distance-based clustering, n clusters are constructed by computing a locally optimal solution which minimises the sum of the distances within the data clusters. This is done either by starting from scratch and constructing a new solution, or by using a valid cluster solution as a starting point for improvements. A common distance-based algorithm is the K-Means algorithm, which minimises the sum of the distances between each data point and its nearest cluster centre (Selim and Ismail 1984). Parallelism in distance-based clustering methods can be exploited both at the outer level, by trying different cluster numbers concurrently, and at the inner level, by computing the distance metrics in parallel.

Hierarchical clustering groups data with a given similarity measurement into a sequence of nested partitions. Two different approaches can be employed. One is to start with each data point as a single cluster and then, in each step, merge pairs of clusters together. This is known as the agglomerative approach. The alternative is to start with all data points in one cluster and then divide one cluster into two at each step. This is the divisive approach. For both methods, O(N^2) algorithms are known. Recent attempts have been made to develop parallel algorithms for hierarchical clustering using several distance metrics in parallel (Olson 1995).

With the density-based clustering approach, clustering is done by postulating a hidden density model indicating the cluster membership. The data is assumed to be generated from a mixture model with hidden cluster identifiers. The clustering problem is then one of finding parameters for each individual cluster which maximise the likelihood of the data set given the mixture model.
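For concreteness, here is a minimal single-processor sketch of the distance-based K-Means iteration described above, on 1-D data; the data and function name are our own. In a data parallel version, each processor would compute the assignments for its own partition and per-cluster sums would be combined with a reduction.

```python
def kmeans_step(points, centres):
    """One K-Means iteration on 1-D data: assign each point to its
    nearest centre, then recompute each centre as its cluster mean."""
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
        clusters[nearest].append(p)
    # Empty clusters keep their old centre.
    return [sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)]

points = [1.0, 2.0, 10.0, 11.0]
centres = [0.0, 5.0]
for _ in range(5):           # iterate; this toy example stabilises quickly
    centres = kmeans_step(points, centres)
print(centres)   # [1.5, 10.5]
```

The distance computations in the assignment loop are independent across points, which is what makes the inner level of the search easy to parallelise.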
A typical density-based clustering method is the EM algorithm, which employs an iterative search procedure to find the best parameters of a mixture model to fit the data. The iteration procedure comprises the following steps:
1. Initialise the model parameters, producing a current model.
2. Decide memberships of the data items to clusters, assuming that the current model is correct.
3. Re-estimate the parameters of the current model assuming that the data memberships obtained in step 2 are correct, producing a new model.
4. If the current model and the new model are sufficiently close to each other, then terminate; else go to step 2.

This procedure has the same structure as the K-Means method, where the only model parameter is the distance between assumed cluster centres and data points. The search hierarchy in EM algorithms includes the outermost-level search on cluster numbers, the middle-level search for functional forms and the inner-level search for parameter values. The rich inherent parallelism of the algorithm may be exploited by combining the decomposition of loops (task parallelism) and the partitioning of data (data parallelism). Subramonian (1998) presents a parallelisation of the EM algorithm which employs three different methods for parallelising the three levels of search loops: vectorise the computation of parameters (inner-level search); exploit data parallelism in computing the cluster model given the cluster number (middle-level search); and concurrently search cluster numbers using parallel machine clusters. This method provides a general framework for parallelising iterative clustering procedures.

C Discussion and Summary

Broadly speaking, techniques for scaling data mining algorithms can be divided into five basic categories: 1) manipulating the data so that it fits into memory, 2) using specialized data structures to manage out of memory data, 3) distributing the computation so that it exploits several processors, 4) precomputing intermediate quantities of interest, and 5) reducing the amount of data mined.
During the past several years, there have been successes in scaling tree based predictors and association rules to large data sets which do not fit into memory, but rather fill the memories of several processors, spill onto disks, or both. More recently, techniques have been introduced which scale clustering algorithms. These successes have typically involved combining two or more of the approaches described above.

References

Agrawal, R., Faloutsos, C., and Swami, A. (1993). "Efficient Similarity Search in Sequence Databases." In Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms.

Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining Association Rules Between Sets of Items in Large Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, page 207. ACM.

Agrawal, R. and Srikant, R. (1994). Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the Twentieth International Conference on Very Large Databases.

Agrawal, R. and Shafer, J. (1996). "Parallel Mining of Association Rules." IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6.

Aho, A., Hopcroft, J. and Ullman, J. (1974). The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Massachusetts.

Almuallim, H., Akiba, Y., and Kaneda, S. (1995). On Handling Tree-Structured Attributes in Decision Tree Learning. In Proceedings of the Twelfth International Conference on Machine Learning.
Bradley, P., Fayyad, U. and Reina, C. (1998). "Scaling Clustering Algorithms to Large Databases." In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Belmont: Wadsworth.

Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2).

Catlett, J. (1991). Megainduction: Machine Learning on Very Large Databases. PhD Thesis, University of Sydney.

Chan, P. and Stolfo, S. (1997). "On the Accuracy of Meta-Learning for Scalable Data Mining." Journal of Intelligent Information Systems, 8.

Chattratichat, J., Darlington, J., Ghanem, M., Guo, Y., Huning, H., Kohler, M., Sutiwaraphun, J., To, H. and Yang, D. (1997). "Large Scale Data Mining: Challenges and Responses." In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining.

Chattratichat, J., Darlington, J., Guo, Y., Hedvall, S., Kohler, M. and Syed, J. (1998). "An Architecture for Distributed Enterprise Data Mining." In Proceedings of the Seventh International Conference on High-Performance Networking and Computing (HPCN Europe).

Cheung, D.W., Han, J., Ng, V.T. and Wong, C.Y. (1996). "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique." In Proceedings of the International Conference on Data Engineering.

Craven, M.W. (1996). "Extracting Comprehensible Models from Trained Neural Networks." PhD Thesis, University of Wisconsin, Technical Report No. 1326.

Dietterich, T. G. (1997). Machine Learning Research. AI Magazine, 18.

Domingos, P. (1997). "Knowledge Acquisition from Examples via Multiple Models." In Proceedings of the Fourteenth International Conference on Machine Learning (ICML-97).

Freitas, A. and Lavington, S. (1998). Mining Very Large Databases with Parallel Processing. Kluwer Academic Publishers.

Bennett, K., Fayyad, U. and Geiger, D. (1999).
"Density-Based Indexing for Approximate Nearest- Neighbor Queries." In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. Freund, Y. and Schapire, R. E. (1995). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Proceedings of the Second European Conference on Computational Learning Theory, pages Berlin: Springer-Verlag. Graefe, G., Fayyad, U. and Chaudhuri, S. (1998). "On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases." In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. Grossman, R. L., Bodek, H., Northcutt, D., and Poor, H. V. (1996). Data Mining and Tree-based Optimization. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, E. Simoudis, J. Han and U. Fayyad, editors, pages Menlo Park, California: AAAI Press. Grossman, R. L. and Poor, H. V. (1996). Optimization Driven Data Mining and Credit Scoring. In Proceedings of the IEEE/IAFE 1996 Conference on Computational Intelligence for Financial Engineering (CIFEr). Piscataway: IEEE, pages Guo, Y. and Sutiwaraphun, J. (1998). "Knowledge Probing in Distributed Data Mining." In Working
10 Notes of the KDD-97 Workshop on Distributed Data Mining, pp Han, E., Karypis, G., and Kuma, V. (1997). Scalable parallel data mining for association rules. In Proceedings of the 1997 ACM-SIGMOD International Conference on the Management of Data. Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B. and Zaiane, O.(1996) "DBMiner : A System for Mining Knowledge in Large Relational Databases" In Proceedings of the Second International Conference on Data Mining and Knowledge Discovery. Han, S. et al. (1998) "Parallel Formulations of Decision-Tree Classification Algorithms." In Proc. of the 1998 International Conference on Parallel Processing. John, G. (1997). Enhancements to the Data Mining Process. PhD Thesis, Stanford University. John, G. and Langley, P. (1996). "Static Versus Dynamic Sampling for Data Mining." In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Joshi, M.V., Karypis, G. and Kumar, V (1998) " ScalParc : A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets." In Proceedings of the International Parallel Processing Symposium.. Kohavi, R. and John, G. (1997) "Wrappers for Feature Subset Selection." In Artificial Intelligence 97(1-2) pp Kononenko, I. (1994). "Estimating Attributes: Analysis and Extensions of RELIEF." In Proceedings of the European Conference on Machine Learning, Kufrin, R. (1997) "Generating C4.5 Production Rules in Parallel." In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), AAAI-Press. Kumar, V., Grama, A., Karypis, G Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin Cummings Publishing Company. Lee, S. D., Cheung, D. W., Kao, B. (1998). Is Sampling Useful in Data Mining? A Case Study in the Maintenance of Discovered Association Rules. Data Mining and Knowledge Discovery: 2: Mehta, M., Agrawal, R., and Rissanen, J (1996). 
SLIQ: A Fast Scalable Classifier for Data Mining. In Proceedings of the Fifth International Conference on Extending Database Technology. Avignon France. Olson, C.F. (1995) "Parallel Algorithms for Hierarchical Clustering." In Parallel Computing, 21(8): Pei, J, Han, J, and Mao, R (2000 ) `` CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets (PDF) '', In Proc ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00)}, Dallas, TX Park, J.S., Chen, M and Yu. P.S. (1995). "An effective Hash-based Algorithm for Mining Association Rules" In Proceedings of the ACM SIGMOD International Conference on Management of Data. Pearson, R. A. (1994). "A Coase Grained Parallel Induction Heuristic." In H. Kitano, V. Kumar and C.B. Sutter, editors, Parallel Processing for Artificial Intelligence 2, Pages , Elsevier Science. Provost, F. and Aronis, J. (1996) "Scaling up Inductive Learning with Massive Parallelism." In
11 Machine Learning 23, Provost, F., Jensen, D. and Oates, T. (1999). "Efficient Progressive Sampling." In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. Provost, F. and Kolluri, V. (1999). "A Survey of Methods for Scaling Up Inductive Algorithms." To appear in Data Mining and Knowledge Discovery Journal. Quinlan, J. (1993). C4.5 Programs for Machine Learning. Morgan Kaufmann. Ramakrishnan, Ramu (1997). Database Management Systems. McGraw-Hill, New York. Sarawagi, S., Thomas, S. and Agrawal, R. (1998). "Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications." In Proceedings of the ACM SIGMOD International Conference on Management of Data Selim, S.Z. and Ismail, M.A. (1984). "K-Means-Type Algorithms : A Generalized Convergence Theorem and Characterization of Local Optimality" In IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6:81-87 Shafer, J., Agrawal, R. and Mehta, M. (1996). "SPRINT: A Scalable Parallel Classifier for Data Mining." In Proceedings of the 22 nd International Conference on Very Large Databases (VLDB 1996). Subramonian, R. and Parthasarathy, S. (1998) " A Framework for Distributed Data Mining" In Proceedings of KDD98 Workshop on Distributed Data Mining. Thomas, S. and Sarawagi, S. (1998). "Mining Generalized Association Rules and Sequential Patterns Using SQL Queries." In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. Vitter, J. S. and Shriver, E. A. M. (1994). Algorithms for Parallel Memory I: Two Level Memories. Algorithmica. Volume 12, pages Wolpert, D.H. (1992) "Stacked Generalization" In Journal of Neural Networks, Vol 5. Zaki, M.J., C. Ho, and R. Agrawal (1999). "Scalable Parallel Classification for Data Mining on Shared Memory Multiprocessors" In Proceedings of IEEE International Conference on Data Engineering.