UNPCC: A Novel Unsupervised Classification Scheme for Network Intrusion Detection
Zongxing Xie, Thiago Quirino, Mei-Ling Shyu
Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL 33124, USA
[email protected], [email protected], [email protected]

Shu-Ching Chen
Distributed Multimedia Information System Laboratory, School of Computing and Information Sciences, Florida International University, Miami, FL, USA
[email protected]

LiWu Chang
Center for High Assurance Computer Systems, Naval Research Laboratory, Washington, DC, USA
[email protected]

Abstract

The development of effective classification techniques, particularly unsupervised classification, is important for real-world applications, since information about the training data is relatively unknown before classification. In this paper, a novel unsupervised classification algorithm is proposed to meet the increasing demand in the domain of network intrusion detection. The proposed UNPCC (Unsupervised Principal Component Classifier) algorithm is a multi-class unsupervised classifier with absolutely no requirement for any a priori class-related information (e.g., the number of classes or the maximum number of instances belonging to each class), together with an inherently natural supervised classification scheme; both present high detection rates and several operational advantages (e.g., lower training time, lower classification time, lower processing power requirements, and lower memory requirements). Experiments have been conducted with the KDD Cup 99 data and with network traffic data simulated on our private network testbed, and the promising results demonstrate that the UNPCC algorithm outperforms several well-known supervised and unsupervised classification algorithms.

1 Introduction

Nowadays, very few effective unsupervised classification methods capable of adapting to various domains have been proposed and developed.
Unsupervised classification usually requires a combination of clustering and supervised classification algorithms, since information about the training data is relatively unknown before classification. It poses a challenge that contrasts with supervised classification: the goal is to determine, from instances belonging to the same class, a measure that is large enough to reject instances belonging to other classes, while at the same time low enough to accommodate the variability found among instances of the same class. However, most existing unsupervised classification algorithms, especially those which do not require any known class-related information such as the number of classes and the maximum number of instances in each class, suffer from a lack of high classification accuracy [4][5] and from limited effectiveness across applications whose data sets exhibit different inter-class variability. Recently, network intrusion detection has become more and more important in order to safeguard network systems and the crucial information stored and manipulated through the Internet, and several intrusion detection algorithms using advanced machine learning techniques have been developed for this purpose [2][7][9]. Unsupervised classification is particularly useful for detecting previously unobserved attacks in the network intrusion detection domain, since new attacks on computers and networks can occur at any time. Furthermore, unsupervised classification algorithms are usually employed after a misuse detection classifier [1] fails to identify an instance as belonging to a known
attack type, in order to uncover new types of intrusions. At the same time, a robust unsupervised classification algorithm could possibly eliminate the need for a human analyst in the assignment of labels to unknown attack types by fully automating the labeling process. Moreover, it is important to recognize the two main issues associated with pre-label processing in intrusion detection, as concluded in [6]: it can be extremely difficult or impossible to obtain labels, and one can never be completely sure that a set of available labeled examples reflects all possible existing attacks in a real-world application. In this paper, in an attempt to meet the above increasing demands, a novel unsupervised classification algorithm called UNPCC (Unsupervised Principal Component Classifier), which is based on Principal Component Analysis (PCA) [12], is proposed. Various training and testing data sets composed of KDD Cup 99 data [3] and of real network traffic simulated through the Relative Assumption Model [16] were used to evaluate the performance of UNPCC and its inherently natural supervised classification stage, in comparison with other well-known unsupervised and supervised classification algorithms. The experimental results demonstrate that (i) the supervised classification stage of the UNPCC algorithm performs better than the C4.5 decision tree, Decision Table (DT), NN, and KNN, and (ii) the unsupervised classification performance of the UNPCC algorithm outperforms K-Means, Expectation Maximization (EM), Farthest First, Cobweb, and Density Based Clustering with K-Means. This paper is organized as follows. The motivations and concepts associated with the proposed unsupervised classification algorithm are presented in Section 2. Section 3 introduces the proposed UNPCC algorithm. Section 4 narrates the experimental setup and illustrates the experimental results. Finally, in Section 5, we conclude our paper.
2 Motivation

Principal Component Analysis (PCA), the core of the PCC (Principal Component Classifier) anomaly detection algorithm that we developed [10][11], is employed to reduce the dimension of a training data set, allowing for further data analysis and easier exploration of the statistical information present in the data set. Principal components are particular linear combinations of the original variables with two important properties:

1. The first principal component is the vector pointing in the direction of the largest variability of the original features, the second principal component is the vector pointing in the direction of the second largest variability, and so on.

2. The total variation represented by the principal components is equal to the total variation present in all original variables.

The distribution of the original data features contributes to the direction of the principal components, so it is necessary in PCA to use all original features to generate principal components which account for all the variability of the original training data set. Once the principal component set is found, only a few principal components and their related eigenvalues and scores [12] need to be stored and employed to represent a large percentage of the variance in the original training data. Moreover, suitable selection and combination of principal components with certain specific features is required to represent specific characteristics of the training data set. PCC is a good example of the uses of the major components, which represent the overall structure of the training data set, and the minor components, which capture observations that do not conform to the training data correlation structure. In PCC, two distance measures C1 and C2, based on the major and minor components respectively, are defined as classification threshold values and produced promising classification accuracy with low false alarm rates.
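These two properties can be illustrated with a short NumPy sketch (not taken from the paper; the toy data and all variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a network-traffic feature matrix: 200 instances, 5 features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Standardize the features, then eigendecompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]        # Property 1: sort by variability
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                     # projections onto the eigenspace

# Property 2: the principal components preserve the total variation,
# i.e., the eigenvalues sum to the trace of the correlation matrix.
assert np.isclose(eigvals.sum(), np.trace(R))
```

Only the leading eigenvectors, their eigenvalues, and the instance scores need to be retained afterwards, which is what keeps a PCC-style classifier lightweight.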
Inspired by the concept of similarity representation of the training data set in PCC, which is assumed to be composed of a single class with a known label, and also by the promising performance of the combination of the distance measures C1 and C2 in PCC, it is highly possible that a suitable selection and combination of certain principal components could be used to represent the dissimilarity present in a training data set composed of several classes with unknown labels. For this purpose, a distance measure, denoted as C0, capable of partitioning various classes of data sets based fully on their dissimilarity features in the transformed principal component space is defined. Our main idea is validated through a series of operations in the transformed principal component space:

1. Plotting and observing, through MATLAB, the principal component score arrays of the KDD Cup 1999 data [3], including various combinations of four network attack types: teardrop, back, neptune, and smurf;

2. Selecting certain principal components with high variability through graphical analysis;

3. Defining the dissimilarity measure C0 of an instance using Equation (1), where i indexes the selected representative principal components, and λ_i and y_i correspond to the eigenvalue of the i-th principal component and the score of the i-th feature in the principal component space, respectively. The term score means the projection of an instance onto the eigenspace composed of all principal components acquired from the training data set:

    C0 = Σ_i (y_i^2 / λ_i)    (1)

4. Plotting, in MATLAB, the distribution of the C0 measure, given by the vector C0 = {C0(k)}, where k = 1, 2, ..., N, and N is the number of unlabeled instances in a training data set.

Figure 1. Sorted C0 array of mixed two attacks (neptune and smurf)
Figure 2. Sorted C0 array of mixed three attacks (teardrop, neptune, and back)
Figure 3. Sorted C0 array of mixed four attacks (neptune, teardrop, back, and smurf)

Figures 1, 2, and 3 correspond to the plots of the C0 distributions of the training data sets with two, three, and four different attack classes, respectively. An interesting observation from these three figures is that each of the classes composing the training data set matches a distinct, approximately horizontal segment along the axis of the C0 distance measure in the principal component space. That is, these figures exhibit distinguishable differences among the C0 values of different classes. In Figure 1, the steep slope separating the two nearly horizontal lines depicts the large difference between the C0 scores of the instances belonging to the two different classes. The same pattern can also be observed in Figures 2 and 3, where a steep slope separates nearly horizontal sections representing the scores of instances belonging to the same class type. Therefore, it can be concluded from these results that the distinct C0 segments, which can be delimited by suitable C0 threshold values, can be employed to cluster the related class instances into nearly horizontal subsections of C0 values. In addition, the distinct C0 threshold values acquired from the unsupervised classification process, which designate different class types, can be employed in a supervised classification scheme.
That is, if the C0 score value calculated for a certain testing instance falls within one of the distinct C0 segments categorized by a certain selected C0 threshold value, then it can be concluded that the testing instance in question belongs to the corresponding clustered class. All these characteristics make the proposed classification scheme a complete unsupervised classification method consisting of a natural combination of clustering and supervised classification algorithms.
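As a concrete illustration of Equation (1), the following hypothetical NumPy sketch (not code from the paper; the score and eigenvalue numbers are made up) computes C0 for a handful of instances and shows the segment pattern of Figures 1-3:

```python
import numpy as np

def c0_scores(scores, eigvals, selected):
    """C0 of Equation (1): sum of y_i^2 / lambda_i over the selected
    principal components, computed for every instance at once."""
    y = scores[:, selected]              # (N, |selected|) score columns
    lam = eigvals[selected]
    return (y ** 2 / lam).sum(axis=1)

# Four instances projected onto three selected components; the last two
# instances are deliberately far from the first two in score space.
scores = np.array([[1.0, 0.5, 0.1],
                   [0.9, 0.4, 0.2],
                   [5.0, 3.0, 1.0],
                   [5.2, 2.8, 0.9]])
eigvals = np.array([2.0, 1.0, 0.5])
c0 = np.sort(c0_scores(scores, eigvals, [0, 1, 2]))
# Sorting C0 exposes the pattern of Figures 1-3: two nearly level small
# values, then a steep jump to two nearly level large values.
```

The steep jump between the second and third sorted values is exactly the kind of fracture zone that the threshold-discovery step exploits.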
Motivated by these observations, a novel unsupervised classification scheme is proposed, leaving two challenges to be studied: (i) How to identify and select the representative principal components in an automated manner, using the statistical information present in the data set, thus replacing the manual effort of graphical analysis? and (ii) How to automatically identify suitable C0 threshold values to represent the different classes, a step which is vital to both the clustering and supervised classification phases?

3 The Proposed UNPCC Algorithm

Based on the previously presented theoretical analysis and the graphical presentation of Figures 1, 2, and 3, the proposed UNPCC algorithm is defined as an ordered combination of the following processes:

1. Normalize the data in the training data set with multiple classes to minimize the issue of scaling features across multivariate data.

2. Apply PCA to the whole training data set to obtain the principal components, their eigenvalues, and the data instance scores, as was done in [10][11]. Define Y = {y_ij}, i = 1, 2, ..., p and j = 1, 2, ..., N, as the training data score matrix. Y is a p × N-dimensional normalized projection of the unlabeled training data matrix onto the p-dimensional eigenspace, where the term score matrix means the projection of each of the N unlabeled training instances onto the p principal components attained from the training data set through PCA [10][11].

3. Automatically select the principal components which effectively capture the dissimilarities among the various classes in the training data set through a threshold function. This function is based on the standard deviation values of the features in the principal component space, which reflect the dissimilarity degree of a score array. Assume that SCORE_i = (y_i1, y_i2, ..., y_iN), i = 1, 2, ..., p, is the score row vector corresponding to the i-th feature in the eigenspace.
First, we refine the number of score row vectors by selecting those which satisfy Equation (2) and discarding all others, in order to prevent the impact of score arrays possessing very low standard deviation values, which typically correspond to very low eigenvalues in the PCA projection:

    STD(SCORE_υ) > φ    (2)

where:
- φ is an adjustable coefficient whose value is set to 0.01 as the default, based on our empirical studies;
- STD(SCORE_υ) is the standard deviation of the score row vector satisfying the refinement equation and corresponding to the υ-th principal component; and
- υ ∈ V, where V is defined as the refined principal component space containing all score row vectors satisfying Equation (2).

After that, the principal components whose respective score row vectors satisfy the selection function defined in Equation (3) are selected:

    STD(SCORE_j) > a × Mean_STD(SCORE_υ)    (3)

where:
- a is the weighting coefficient. Its value is set to 3 in all the experiments, based on our empirical studies;
- Mean_STD(SCORE_υ) is the average of all the standard deviation values of the score arrays in the refined principal component space V;
- STD(SCORE_j) is the standard deviation of the selected score array satisfying the selection function and corresponding to the j-th principal component; and
- j ∈ J, where J, the selected principal component space, is composed of those principal components whose corresponding score row vectors satisfy both the refinement and selection functions given by Equations (2) and (3), respectively.
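The refinement and selection rules of Equations (2) and (3) can be sketched as follows (a hypothetical NumPy illustration using the default φ = 0.01 and a = 3; the synthetic score rows are made up):

```python
import numpy as np

def select_components(scores, phi=0.01, a=3.0):
    """scores: (p, N) matrix with one score row vector per principal
    component. Returns indices of components whose rows pass both
    Equation (2) (STD > phi) and Equation (3) (STD > a * mean STD of V)."""
    std = scores.std(axis=1)
    refined = std > phi                         # Equation (2): refined space V
    mean_std = std[refined].mean()
    selected = refined & (std > a * mean_std)   # Equation (3)
    return np.flatnonzero(selected)

# Five synthetic score rows whose standard deviations are exactly
# 0.005, 0.1, 0.1, 0.1, and 10.0: row 0 fails Equation (2), and only
# row 4 stands out enough to pass Equation (3).
base = np.tile(np.array([-1.0, 1.0]), 50)       # std exactly 1
scores = np.outer(np.array([0.005, 0.1, 0.1, 0.1, 10.0]), base)
print(select_components(scores))                # [4]
```

Note that when all refined rows have similar standard deviations, Equation (3) with a = 3 selects nothing, which is the intended behavior: no component then stands out as carrying inter-class dissimilarity.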
Accordingly, these selected principal components are then used to calculate the C0 dissimilarity measure of all N unlabeled training instances using Equation (1), effectively yielding a distribution for C0 which is entirely based on the statistics of the original training data set and from which suitable C0 threshold values can be acquired to categorize the different data classes. The distribution of the C0 score measures is given by the sorted vector C0 for all N unlabeled training instances.

4. From the C0 vector, determine the C0 values to be used as the thresholds for both clustering and supervised classification purposes, through the proposed Automated-Cluster-Threshold-Discovery procedure. Let C0_k be the k-th sorted C0 value of the C0 distribution of unlabeled training instances, H_i be the expected threshold value of the i-th identified class, N be the total number of unlabeled training data instances, and Initial be a transition variable used to
change the selection condition. Also, let m be the index of the current C0 value (C0_m ∈ C0) in the analysis by the algorithm, T be the total number of thresholds generated for the T classes identified, and C0_m0 and C0_mmax be the 0.5th and 99.5th percentiles of the sorted distribution vector. Here, m_0 and m_max are indexes corresponding to the nearest integers to 0.005N and 0.995N, respectively, and are used to filter some extreme values from both ends of the C0 distribution vector. This filtering step is necessary because most data sets inevitably contain a few unusual observations which may impact the ability to find meaningful thresholds in the proposed approach. The Automated-Cluster-Threshold-Discovery procedure for automatically determining the threshold values is given below.

Automated-Cluster-Threshold-Discovery
    Initial = C0_m0;
    H_1 = Initial;
    i = 2;
    for (m = 1; m < m_max; m++) {
        if ((C0_m - Initial) / Initial > 1) {
            Initial = C0_(m+m0);
            H_i = Initial;
            i = i + 1;
        }
    }
    T = i - 1;

The threshold values yielded in vector H by the procedure above can be used for two independent tasks: (i) clustering the training data into the T different classes identified, and (ii) supervised classification of unlabeled testing instances, without any additional classifier training requirements. Please note that this clustering/classification scheme does not require any a priori knowledge of class-related information such as the number of classes in the training data set or the maximum number of instances in each class. Let us focus on the supervised classification task that naturally arises via the comparison of the C0 measure of an incoming testing instance, which is given by Equation (1), with the T different threshold values in vector H that were obtained through the Automated-Cluster-Threshold-Discovery procedure. As was previously elaborated and illustrated in Figures 1, 2, and 3, the values in vector C0 are sorted in ascending order.
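A simplified, runnable sketch of the procedure and of the threshold-based classification it enables is given below (hypothetical Python, not the authors' code; the index bookkeeping around m_0 is slightly simplified, while the relative-jump condition and the 0.5%/99.5% trimming follow the description above):

```python
import numpy as np

def discover_thresholds(c0, rel_jump=1.0, trim=0.005):
    """Sketch of Automated-Cluster-Threshold-Discovery: sort the C0
    values, trim the extreme 0.5% at each end, and open a new class
    threshold whenever the relative jump over 'Initial' exceeds 1."""
    c0 = np.sort(np.asarray(c0))
    n = len(c0)
    c0 = c0[round(trim * n):round((1 - trim) * n)]   # filter extremes
    initial = c0[0]
    thresholds = [initial]                           # H_1
    for value in c0[1:]:
        if (value - initial) / initial > rel_jump:
            initial = value                          # new H_i found
            thresholds.append(initial)
    return thresholds                                # T = len(thresholds)

def classify(c0_value, thresholds):
    """Supervised stage: the class is the last threshold H_i that the
    instance's C0 measure meets or exceeds."""
    cls = 0
    for i, h in enumerate(thresholds):
        if c0_value >= h:
            cls = i
    return cls

# Two well-separated synthetic classes in C0 space.
c0_train = np.concatenate([np.full(60, 1.0), np.full(40, 9.0)])
H = discover_thresholds(c0_train)    # two thresholds -> two classes found
print(len(H), classify(1.3, H), classify(9.5, H))    # 2 0 1
```

The `classify` helper is the same linear threshold search that the supervised stage performs on a testing instance's C0 measure, so no additional training is needed once H is known.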
As a result, vector H is inherently sorted in ascending order because the Automated-Cluster-Threshold-Discovery procedure traverses vector C0 in a linear fashion. With these concepts in mind, let us define the column vector Y = (y_1, y_2, ..., y_p) as the normalized projection of a p-dimensional testing instance onto the p-dimensional eigenspace, normalized using parameters such as the mean feature value and the feature standard deviation value obtained previously from the training data set matrix [10][11]. The class i of the testing instance will be identified through a linear search through vector H, starting at index i = 1 and increasing toward the maximum value i = T, corresponding to the threshold of the last cluster, until the last element H_i is found which satisfies Equation (4) below:

    Σ_{j∈J} ((y_j)^2 / λ_j) > H_i    (4)

where y_j is the j-th element (eigenspace feature) of Y, and H_i is the last ordered element in the threshold vector H whose value is less than the testing instance's computed C0 measure. In other words, class i corresponds to the last threshold value H_i which is less than the C0 measure of the testing instance.

4 Experiments

The experiments are conducted in two parts, corresponding to the two phases of an unsupervised classification algorithm. First, for the clustering phase (the core phase of UNPCC), the UNPCC algorithm is compared to methods such as K-Means, Expectation Maximization (EM), Farthest First (FF), Cobweb, and Density Based Clustering with K-Means. Second, for the supervised classification phase, the UNPCC algorithm is evaluated against various supervised classification algorithms such as the C4.5 decision tree, Decision Table (DT), NN, and KNN (K=5).
4.1 Experimental Setup

The KDD Cup 1999 [3] data set and three different types of network attacks (namely, attack1, attack2, and attack3), generated in our network test-bed [16] through the application of the Relative Assumption Model algorithm [16] with different parameter values, are employed to assess the performance of the proposed unsupervised scheme. Four groups of data sets with variable numbers of class types are used in the experiments to evaluate both the performance of the unsupervised clustering scheme and the performance of the supervised classification scheme, in comparison to some existing methods in the Waikato Environment for Knowledge Analysis (WEKA) package [13]. The four groups of training and testing data sets are as follows.

1. Group 1: Two types of network attacks from the KDD data set, namely back (441 training and 111 testing instances) and teardrop (194 training and 77 testing instances), for the two-class case.

2. Group 2: Three types of network attacks from the KDD data set, namely back (441 training and 111 testing instances), smurf (400 training and testing instances), and neptune (300 training and testing instances), for the three-class case.

3. Group 3: Four types of network attacks, namely neptune (300 training and testing instances), teardrop (194 training and 77 testing instances), back (441 training and 111 testing instances), and smurf (400 training and testing instances), for the four-class case.

4. Group 4: Three types of network attacks generated through our network testbed, for the three-class case: 1187 attack1 abnormal network connections (300 for training and the remaining ones for testing), 1093 attack2 abnormal network connections (300 for training and the remaining ones for testing), and 441 attack3 network connections (300 for training and the remaining ones for testing).

4.2 Selected Algorithms in Comparison

In our experiments, various clustering algorithms [8][14][15] in WEKA [13] are selected for the performance comparison. Some of them are further discussed as follows.

Expectation Maximization (EM): This algorithm finds the maximum likelihood estimate of the parameters of probabilistic models. In WEKA, the EM algorithm utilizes a finite Gaussian mixture model to learn the trends present in the training data sets. Some of the algorithm's drawbacks are its assumptions that all attributes are independent and normally distributed, and its high processing power requirements, which make it poorly suited for real-time applications with large data sets.

K-Means: This is a simple iterative algorithm. It randomly chooses the initial locations of the k clusters and assigns instances to the clusters based on their proximity to the center of each cluster. Next, new k-cluster means are calculated using the attribute values of the instances belonging to each cluster.
Finally, the process is repeated until no more instances need to be re-assigned to a different cluster. This algorithm has several drawbacks: it assumes that attributes are independent and normally distributed, it requires the number of clusters to be specified manually, and its training time can be very high for a large number of training instances.

Cobweb: In Cobweb, the clustering is built incrementally, updated on an instance-by-instance basis. The end result is a tree whose leaves represent the training instances, whose internal nodes represent the clusters and sub-clusters, and whose root node represents the whole data set. The major drawback of the algorithm is the impact that the order in which training instances are read as input may have on the formation of the tree.

Farthest First (FF): Similarly to K-Means, this algorithm requires the number of clusters to be specified.

Density Based Clustering with KNN algorithm (KNNDB): This method produces a probability distribution for each testing instance which estimates the membership of the instance in each of the clusters estimated by the K-Means algorithm.

4.3 Experimental Results

Table 1 compares the overall clustering accuracy of the UNPCC, K-Means, Expectation Maximization (EM), Cobweb, Farthest First (FF), and KNN Density Based (KNNDB) algorithms when each of the training and testing data groups is employed. Table 1 indicates that our proposed UNPCC algorithm outperforms all the other unsupervised classification methods, in addition to maintaining an accuracy rate of over 90% in the experiments with all four groups of training data sets. Another observation that can be drawn from the experimental results in Table 1 is that the classification accuracy of some methods varies significantly among the data groups employed.
For example, the accuracy of the K-Means algorithm ranges from 79.16% to a single occurrence of 100%, and the accuracy of the EM algorithm ranges from 56.06% to 78.01%. These results indicate that the predictive models generated do not always work well across the various types of training data sets used in the experiments. On the other hand, our proposed UNPCC algorithm maintains a high accuracy of over 90% for all the data groups, indicating that its predictive models work well for all these different types of data sets. Furthermore, it has been observed that UNPCC has lower training time, lower classification time, and lower memory and storage requirements to process and store the components required by the classification stage than all the other algorithms. In particular, the instance-based methods such as K-Means and FF require large memory and storage when hundreds of training data set instances are involved. In order to assess the performance of the supervised classification stage without the cascading error effects of the previous unsupervised clustering stage, the class label set used to assign labels to the instances in the different clusters and generated by the UNPCC method (which has been
proven to possess the highest clustering accuracy in Table 1) is used in the experiments to compare with the other supervised classification algorithms, in order to set the same initial conditions for the other algorithms as for the UNPCC method.

Table 1. Overall clustering accuracy (%) comparison among UNPCC, K-Means, EM, Cobweb, FF, and KNNDB (K=5) with each group of data

Accuracy   Group 1   Group 2   Group 3   Group 4
UNPCC      100%      100%      92.81%    90.60%
K-Means    100%      92.44%    92.47%    79.16%
EM         56.06%    78.01%    67.10%    73.87%
Cobweb     24.88%    45.48%    34.90%    22.09%
FF         99.68%    99.80%    81.57%    89.85%
KNNDB      96.53%    90.38%    64.55%    83.45%

Table 2 shows a comparison of the overall supervised classification accuracy of the UNPCC, C4.5, Decision Table (DT), NN, and KNN (K=5) algorithms for each group of data. It can be clearly concluded from the presented results that the UNPCC algorithm always outperforms the other methods in the supervised classification stage, even when all the algorithms use the same high-accuracy clustering method. In addition, it has been observed that UNPCC has a lower classification time than all the other algorithms in the comparison.

Table 2. Overall supervised classification accuracy (%) comparison among UNPCC, C4.5, DT, NN, and KNN (K=5) with each group of data

Accuracy   Group 1   Group 2   Group 3   Group 4
UNPCC      100%      99.62%    98.55%    88.20%
C4.5                 99.55%    86.26%    87.95%
DT         100%      99.55%    76.00%    85.58%
NN         100%      99.55%    86.26%    83.30%
KNN        99.46%    95.98%    92.35%    76.04%

Based on the performance comparison results, UNPCC appears to offer a lightweight and highly accurate solution in the domain of unsupervised classification. In comparison with all the unsupervised classification algorithms in the experiments, our proposed UNPCC algorithm has four advantages:

An inherently natural combination of clustering and supervised classification: UNPCC can perform either clustering and/or supervised classification.
In fact, the supervised classification phase naturally arises after the clustering step, which saves additional processing effort in terms of extra training.

A purely unsupervised clustering and/or classification algorithm: The proposed method does not require any a priori class-related information such as the number of classes (clusters) or the maximum number of instances per class.

Lightweight characteristics: UNPCC does not require training instances to be stored for use in the classification phase. Instead, only the principal components, eigenvalues, and threshold values are stored for the classification phase. This amount of data is usually far less than hundreds of training data instances, which makes it a lightweight algorithm that is feasible to employ on computers with less processing power in most real-world applications.

Fast computation: Experimental results have demonstrated that UNPCC requires significantly less predictive model training time and testing time than all the algorithms in the performance comparison experiments, under the same operational conditions. This is due to the fact that, once the principal components are generated, the most costly computations required for clustering and supervised classification are based only on the single threshold measure C0, the eigenvalues, and the score matrix. This advantage makes it particularly useful and suitable for demanding real-time applications.

5 Conclusion

In the network intrusion detection application domain, great challenges and increasing demands coexist for unsupervised classification techniques. In this paper, we propose a novel unsupervised classification algorithm called UNPCC, with promising experimental results for network intrusion detection.
The proposed algorithm is based on the concepts of the robust PCC and uses an automated method to select representative principal components from the data set to generate a single classification measure, C0. It is a multi-class unsupervised classification algorithm with an inherently natural supervised classification stage, both of which present high classification accuracy, as the experimental results clearly indicate, along with various operational benefits such as lower memory and processing power requirements than the other supervised classification algorithms used in the experiments. These benefits are achieved because only a small amount of information about the principal components, eigenvalues, and threshold values has to be stored for the proper execution of the classification phase. Furthermore, UNPCC was observed to perform
significantly faster than algorithms such as K-Means and Expectation Maximization (EM), which are based on iterative processes potentially requiring prohibitive training and testing time. Various training and testing data sets were employed to evaluate the performance of the UNPCC algorithm and its inherently natural supervised classification stage, in comparison to some existing well-known unsupervised and supervised classification algorithms. As the experimental results have shown, the UNPCC algorithm outperforms all the other supervised and unsupervised methods in the performance comparison experiments, maintaining high classification accuracies throughout all the experiments with the four groups of training and testing data sets. From all the experimental results, it can also be clearly seen that the operational benefits of the proposed UNPCC algorithm make it suitable for real-time applications.

6 Acknowledgments

For Mei-Ling Shyu, this research was supported in part by NSF ITR (Medium) IIS. For Shu-Ching Chen, this research was supported in part by NSF EIA and NSF HRD.

References

[1] D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusions using Bayes estimator. In First SIAM Conference on Data Mining, Chicago, IL, USA, April.
[2] A. Ghosh, A. Schwartzbard, and M. Schatz. Learning program behavior profiles for intrusion detection. In The 1st USENIX Workshop on Intrusion Detection and Network Monitoring, pages 51-62, Santa Clara, USA.
[3] KDD. KDD Cup 1999 Data. html,
[4] R. Lark. A reappraisal of unsupervised classification. I: Correspondence between spectral and conceptual classes. International Journal of Remote Sensing, 16.
[5] R. Lark. A reappraisal of unsupervised classification. II: Optimal adjustment of the map legend and a neighbourhood approach for mapping legend units. International Journal of Remote Sensing, 16.
[6] P. Laskov, P. Dussel, C. Schafer, and K. Rieck.
Learning intrusion detection: supervised or unsupervised? In ICIAP, Cagliari, Italy, October.
[7] W. Lee, S. Stolfo, and K. Mok. A data mining framework for building intrusion detection models. In IEEE Symposium on Security and Privacy, Santa Clara, USA.
[8] MNSU. The EM algorithm for unsupervised clustering. Maximization-EM.html.
[9] S. Mukkamala, G. Janoski, and A. Sung. Intrusion detection using neural networks and support vector machines. In IEEE International Joint Conference on Neural Networks.
[10] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang. A novel anomaly detection scheme based on principal component classifier. In IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with ICDM '03, Melbourne, Florida, USA, November.
[11] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang. Principal component-based anomaly detection scheme. In T. Y. Lin, S. Ohsuga, C. J. Liau, and X. Hu, editors, Foundations and Novel Approaches in Data Mining, volume 9. Springer-Verlag.
[12] L. I. Smith. A tutorial on principal components analysis. shlens/pub/notes/pca.pdf, February.
[13] Weka.
[14] Weka. Class DensityBasedClusterer. zimek/diplomathesis/implementations/EHNDs/doc/weka/clusterers/DensityBasedClusterer.html.
[15] Wikipedia. Expectation-Maximization algorithm.
[16] Z. Xie, T. Quirino, M.-L. Shyu, S.-C. Chen, and L. Chang. A distributed agent-based approach to intrusion detection using the lightweight PCC anomaly detection classifier. In IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing (SUTC 2006), Taichung, Taiwan, R.O.C., June 2006.
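For concreteness, the small set of stored quantities and the C0-style classification measure described in the conclusion can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's exact algorithm: the function names, the 90% retained-variance cutoff, and the 95th-percentile threshold are all choices made for the sketch.

```python
import numpy as np

def fit_pcc(X, variance_kept=0.9, percentile=95):
    """Derive the pieces a PCC-style classifier must store: the mean,
    the selected principal components, their eigenvalues, and a score
    threshold estimated from the training scores."""
    mu = X.mean(axis=0)
    Xc = X - mu
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]                # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # keep the leading components explaining `variance_kept` of the variance
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), variance_kept)) + 1
    train_scores = ((Xc @ eigvecs[:, :k]) ** 2 / eigvals[:k]).sum(axis=1)
    threshold = np.percentile(train_scores, percentile)
    return mu, eigvecs[:, :k], eigvals[:k], threshold

def pcc_score(x, mu, components, eigvals):
    """Sum of squared, eigenvalue-normalized projections of x onto the
    retained principal components (a single C0-style measure)."""
    y = (np.asarray(x) - mu) @ components
    return float((y ** 2 / eigvals).sum())
```

Note that only the mean vector, the retained components, their eigenvalues, and the threshold need to be kept at classification time, which is what makes the memory and processing footprint of this family of classifiers small.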