Anomaly Detection using multidimensional reduction Principal Component Analysis
|
|
|
- Byron Austin
- 9 years ago
- Views:
Transcription
1 IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: , p- ISSN: Volume 16, Issue 1, Ver. II (Jan. 2014), PP Anomaly Detection using multidimensional reduction Principal Component Analysis Krushna S.Telangre, Prof.S.B.Sarkar Department Of Computer Engineering Stes s Skn-Sinhgad Institute Of Technology Lonavala,India Department Of Computer Engineering Stes s Skn-Sinhgad Institute Of Technology Lonavala,India Abstract: Anomaly detection has been an important research topic in data mining and machine learning. Many real-world applications such as intrusion or credit card fraud detection require an effective and efficient framework to identify deviated data instances. However, most anomaly detection methods are typically implemented in batch mode, and thus cannot be easily extended to large-scale problems without sacrificing computation and memory requirements. In this paper, we propose multidimensional reduction principal component analysis (MdrPCA) algorithm to address this problem, and we aim at detecting the presence of outliers from a large amount of data via an online updating technique. Unlike prior principal component analysis (PCA)-based approaches, we do not store the entire data matrix or covariance matrix, and thus our approach is especially of interest in online or large-scale problems. By using multidimensional reduction PCA the target instance and extracting the principal direction of the data, the proposed MdrPCA allows us to determine the anomaly of the target instance according to the variation of the resulting dominant eigenvector. Since our MdrPCA need not perform eigen analysis explicitly, the proposed framework is favored for online applications which have computation or memory limitations. Compared with the well-known power method for PCA and other popular anomaly detection algorithms Index Terms: Anomaly detection, online updating, least squares, oversampling, principal component analysis I. INTRODUCTION Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. These non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants in different application domains. Of these, anomalies and outliers are two terms used most commonly in the context of anomaly detection; sometimes interchangeably. Anomaly detection finds extensive use in a wide variety of applications such as fraud detection for credit cards, insurance or health care, intrusion detection for cyber-security, fault detection in safety critical systems, and military surveillance for enemy activities. The importance of anomaly detection is due to the fact that anomalies in data translate to significant (and often critical) actionable information in a wide variety of application domains. For example, an anomalous traffic pattern in a computer network could mean that a hacked computer is sending out sensitive data to an unauthorized destination. Despite the rareness of the deviated data, its presence might enormously affect the solution model such as the distribution or principal directions of the data. For example, the calculation of data mean or the least squares solution of the associated linear regression model is both sensitive to outliers. As a result, anomaly detection needs to solve an unsupervised yet unbalanced data learning problem. Similarly, we observe that removing (or adding) an abnormal data instance will affect the principal direction of the resulting data than removing (or adding) a normal one does. Using the above leave one out (LOO) strategy, we can calculate the principal direction of the data set without the target instance present and that of the original data set. Thus, the outlierness (or anomaly) of the data instance can be determined by the variation of the resulting principal directions. More precisely, the difference between these two eigenvectors will indicate the anomaly of the target instance. By ranking the difference scores of all data points, one can identify the outlier data by a predefined threshold or a predetermined portion of the data. We note that the above framework can be considered as a decremental PCA (dpca)-based approach for anomaly detection. While it works well for applications with moderate data set size, the variation of principal directions might not be significant when the size of the data set is large. In real-world anomaly detection problems dealing with a large amount of data, adding or removing one target instance only produces negligible difference in the resulting eigenvectors, and one cannot simply apply the dpca technique for anomaly detection. To address this practical problem, we advance the oversampling strategy to duplicate the target instance, and we perform an oversampling PCA (ospca) on such an oversampled data set. It is obvious that the effect of an outlier instance will be amplified due to its duplicates present in the principal component analysis (PCA) formulation, and this makes the detection of outlier data 86 Page
2 easier. However, this LOO anomaly detection procedure with an oversampling strategy will markedly increase the computational load. For each target instance, one always needs to create a dense covariance matrix and solves the associated PCA problem. This will prohibit the use of our proposed framework for real-world largescale applications. Although the well known power method is able to produce approximated PCA solutions, it requires the storage of the covariance matrix and cannot be easily extended to applications with streaming data or online settings. An online updating technique for ospca. This updating technique allows us to efficiently calculate the approximated dominant eigenvector without performing eigen analysis or storing the data covariance matrix. Compared to the power method or other popular anomaly detection algorithms, the required computational costs and memory requirements are significantly reduced, and thus our method is especially preferable in online, streaming data, or large-scale problems. Moreover, many learning algorithms encounter the curse of dimensionality problem in a extremely high-dimensional space. In ospca method, although we are able to handle high-dimensional data since we do not need to compute or to keep the covariance matrix, PCA might not be preferable in estimating the principal directions for such kind of data. So as proposed to solution to this problem is MdrPCA i.e. multi dimensional reduction PCA. We will MdrPCA as it not easy to use linear models such as PCA to estimate the data distribution if there exists multiple data clusters. Using MdrPCA we can handle multiple cluster, high dimensions, curse of dimensionality problem. II. LITERATURE SURVEY In past few years many anomaly detection algorithms has been proposed. These approaches are broadly divided into three categories:- Statistical Approach:- Statistical approaches [1], [3] assume that the data follows some standard or predetermined distributions, and this type of approach aims to find the outliers which deviate from such disributions. However, most distribution models are assumed univariate, and thus the lack of robustness for multidimensional data is a concern. Moreover, since these methods are typically implemented in the original data space directly, their solution models might suffer from the noise present in the data. Distance- Based Approach:- For distance-based methods [4], the distances between each data point of interest and its neighbors are calculated. If the result is above some predetermined threshold, the target instance will be considered as an outlier. While no prior knowledge on data distribution is needed, these approaches might encounter problems when the data distribution is complex (e.g., multiclustered structure). In such cases, this type of approach will result in determining improper neighbors, and thus outliers cannot be correctly identified. Density- Based Approach:- To alleviate the aforementioned problem, density-based methods are proposed [5]. One of the representatives of this type of approach is to use a density-based local outlier factor (LOF) to measure the outlierness of each data instance. Based on the local density of each data instance, the LOF determines the degree of outlierness, which provides suspicious ranking scores for all samples. The most important property of the LOF is the ability to estimate local data structure via density estimation. This allows users to identify outliers which are sheltered under a global data structure. However, it is worth noting that the estimation of local data density for each instance is very computationally expensive, especially when the size of the data set is large. Beside this some recently proposed approaches are given below:- Angle Based Outlier Detection (ABOD) Method:- A novel approach named ABOD (Angle-Based Outlier Detection) [6] and some variants assessing the variance in the angles between the difference vectors of a point to the other points. This way, the effects of the curse of dimensionality are alleviated compared to purely distance based approaches. A main advantage of this approach is that ABOD method does not rely on any parameter selection influencing the quality of the achieved ranking. As compare with well established distance-based method LOF performance of very well especially on high dimensional data. Consequently, a fast ABOD algorithm is proposed to generate an approximation of the original ABOD solution. The difference between the standard and the fast ABOD approaches is that the latter only considers the variance of the angles between the target instance and its k nearest neighbors. However, the search of the nearest neighbors still prohibits its extension to large scale problems (batch or online modes), since the user will need to keep all data instances to calculate the required angle information. 87 Page
3 Incremental Local Outlier Detection for Data Streams:- Incremental LOF (Local Outlier Factor) [7] algorithm provides equivalent detection performance as the iterated static LOF algorithm (applied after insertion of each data record), while requiring significantly less computational time. In addition, the incremental LOF algorithm also dynamically updates the profiles of data points. This is a very important property, since data profiles may change over time. Incremental LOF provides theoretical evidence that insertion of a new data point as well as deletion of an old data point influence only limited number of their closest neighbors and thus the number of updates per such insertion/deletion does not depend on the total number of points N in the data set. It is found that their computational cost or memory requirements might not always satisfy online detection scenarios. For example, while the incremental LOF in [7] is able to update the LOFs when receiving a new target instance, this incremental method needs to maintain a preferred (or filtered) data subset. Thus, the memory requirement for the incremental LOF is O (np), where n and p are the size and dimensionality of the data subset of interest, respectively. Online Anomaly Detection using KDE:- Large backbone networks are regularly affected by a range of anomalies. Online anomaly detection algorithm based on Kernel Density Estimates [8]. Algorithm sequentially and adaptively learns the definition of normality in the given application, assumes no prior knowledge regarding the underlying distributions, and then detects anomalies subject to a userset tolerance level for false alarms. Comparison with the existing methods of Geometric Entropy Minimization, Principal Component Analysis and One-Class Neighbor Machine demonstrates that the proposed method achieves superior performance with lower complexity. But an online kernel density estimation for anomaly detection algorithm requires at least O(np 2 +p 2 ) for computation complexity [8]. In online settings or large-scale data problems, the abovementioned method might not meet the online requirement, in which both computation complexity and memory requirement are as low as possible. Oversampling PCA (ospca):- The proposed ospca [9] scheme will duplicate the target instance multiple times, and the idea is to amplify the effect of outlier rather than that of normal data. While it might not be sufficient to perform anomaly detection simply based on the most dominant eigenvector and ignore the remaining ones, our online ospca method aims to efficiently determine the anomaly of each target instance without sacrificing computation and memory efficiency. More specifically, if the target instance is an outlier, this oversampling scheme allows us to overemphasize its effect on the most dominant eigenvector, and thus we can focus on extracting and approximating the dominant principal direction in an online fashion, instead of calculating multiple eigenvectors carefully. III. PROBLEM STATEMENT However, most anomaly detection methods are typically implemented in batch mode, and thus can not be easily extended to large-scale problems without sacrificing computation and memory requirements. In base paper, they propose an online oversampling principal component analysis (ospca) algorithm to address this problem, and they aim at detecting the presence of outliers from a large amount of data via an online updating technique. Unlike prior principal component analysis (PCA)-based approaches, we do not store the entire data matrix or covariance matrix, and thus our approach is especially of interest in online or large-scale problems. By oversampling the target instance and extracting the principal direction of the data, ospca allows us to determine the anomaly of the target instance according to the variation of the resulting dominant eigenvector. Since ospca need not perform eigen analysis explicitly, the proposed framework is favored for online applications which have computation or memory limitations. Limitations of Existing Methods 1. Normal data with multi clustering structure, and data in a extremely high dimensional space. For the former case, it is typically not easy to use linear models such as PCA to estimate the data distribution if there exists multiple data clusters. 2. Moreover, many learning algorithms encounter the curse of dimensionality problem in a extremely highdimensional space. 3. Although we are able to handle high-dimensional data since we do not need to compute or to keep the covariance matrix, PCA might not be preferable in estimating the principal directions for such kind of data. IV. OBJECTIVE In this project we have main aim is to present the extended method for Anomaly Detection using multi dimensional reduction Principal Component Analysis with improved reliability and performance: To present literature review different methods Principal Component Analysis. 88 Page
4 To present the present new framework and methods. To present the practical simulation of proposed algorithms and evaluate its performances. To present the comparative analysis of existing and proposed algorithms in order to claim the efficiency. V. SCOPE OF PROPOSED SYSTEM A major problem is the curse of dimensionality. If the data x lies in high dimensional space, then an enormous amount of data is required to learn distributions or decision rules. Example: 50 dimensions. Each dimension has 20 levels. This gives a total of cells. But the no. of data samples will be far less. There will not be enough data samples to learn. One way to deal with dimensionality is to assume that we know the form of the probability distribution. For example, a Gaussian model in N dimensions has N + N (N-1)/2 parameters to estimate. Requires data to learn reliably. This may be practical. One way to avoid the curse of dimensionality is by projecting the data onto a lowerdimensional space. Multidimensional reduction (Mdr) methods are projection techniques that tend to preserve, as much as possible, the distances among data. Therefore data that are close in the original data set should be projected in such a way that their projections, in the new space (output space), are still close. This depends only on the distances between data. When the rank order of the distances in the output space is the same as the rank order of the distances in the original data space, stress is zero. Stress is minimized by iteratively moving the data in the output space from their initially randomly chosen positions according to a gradient-descent algorithm. The intrinsic dimensionality is determined in the following way. The minimum stress for projections of different dimensionalities is computed. Then a plot of the minimum stress versus dimensionality of the output space is performed. ID is the dimensionality value for which there is a knee or a flattening of the curve. VI. METHODOLOGY Figure 1.Proposed Methodology Data Cleaning:- In the data cleaning phase, our goal is to filter out the most deviated data using our ospca before performing online anomaly detection. This data cleaning phase is done offline, and the percentage of the training normal data to be disregarded can be determined by the user. It can be anything as specified by user may be 5% or 10% as per the requirement of application. And that much amount of instances is removed from the original dataset. Normal Pattern Extraction:- In this phase normal pattern of remaining instances of previous phase is derived and Principal Direction is computed by using Principal Component Analysis. Online Anomaly Detection:- After ignoring of specific amount of the training normal data after this data cleaning process, here we use the smallest score of outlierness of the remaining training data instances as the threshold for outlier detection. More specifically, in this phase of online detection, we use this threshold to determine the anomaly of each received data point. If smallest score of a newly received data instance is above the threshold, it will be identified as an outlier; otherwise, it will be considered as a normal data point, and we will update our MdrPCA model accordingly.. 89 Page
5 VII. Proposed System Figure 2. Sequence Diagram of Proposed System Steps explanation- Step 1:- Data cleaning is done on contaminated data. In this the instances that are deflecting the principal direction in large amount is removed by using Oversampling Principal Component Analysis. And normal data is secured. Step 2:- Once we get a normal data, we instances i.e the principal direction is computed. Step 3:- Analyzing threshold value for detecting the outliers. Step 4:- Smallest score of outlierness for new instance is measured. Step 5:- Smallest score of outlierness of new instance is compared with the threshold value, and if the smallest score of outlierness is outlier and ignored to update the current PC. Step 6:- If the instance is normal i.e the value of smallest score of outlier is smaller than defined threshold value, it is detected as normal instance and will extract the normal pattern for the remaining greater than threshold value then the instance is detected as updated directly to the PC. VIII. CONCLUSION This paper proposed multidimensional reduction principal component analysis (MdrPCA).that is quiet different than ospca. Performance of MdrPCA is better than ospca because we have found that the ospca strategy amplifies the effect of outliers, and thus we can successfully use the variation of the dominant principal direction to identify the presence of rare but abnormal data. But we are extending the same strategy of ospca for the detection of outlier in data with high dimensional space using online updating technique with the help of MdrPCA. REFERENCES [1] D.M. Hawkins, Identification of Outliers. Chapman and Hall, [2] M. Breunig, H.-P. Kriegel, R.T. Ng, and J. Sander, LOF: Identifying Density-Based Local Outliers, Proc. ACM SIGMOD Int l Conf. Management of Data, [3] V. Chandola, A. Banerjee, and V. Kumar, Anomaly Detection: A Survey, ACM Computing Surveys, vol. 41, no. 3, pp. 15:1-15:58, [4] L. Huang, X. Nguyen, M. Garofalakis, M. Jordan, A.D. Joseph, and N. Taft, In-Network Pca and Anomaly Detection, Proc. Advances in Neural Information Processing Systems 19, [5] H.-P. Kriegel, M. Schubert, and A. Zimek, Angle-Based Outlier Detection in High-Dimensional Data, Proc. 14th ACM SIGKDD Int l Conf. Knowledge Discovery and data Mining, [6] A. Lazarevic, L. Erto z, V. Kumar, A. Ozgur, and J. Srivastava, A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection, Proc. Third SIAM Int l Conf. Data Mining, [7] X. Song, M. Wu, and C.J., and S. Ranka, Conditional Anomaly Detection, IEEE Trans. Knowledge and Data Eng., vol. 19, no. 5, pp , May [8] S. Rawat, A.K. Pujari, and V.P. Gulati, On the Use of Singular Value Decomposition for a Fast Intrusion Detection System, Electronic Notes in Theoretical Computer Science, vol. 142, no. 3, pp , [9] Yuh-Jye Lee, Yi-Ren Yeh, and Yu-Chiang Frank Wang, Anomaly Detection via Online Oversampling Principal Component Analysis IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 7, July Page
A Survey on Outlier Detection Techniques for Credit Card Fraud Detection
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. VI (Mar-Apr. 2014), PP 44-48 A Survey on Outlier Detection Techniques for Credit Card Fraud
Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm
Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm Markus Goldstein and Andreas Dengel German Research Center for Artificial Intelligence (DFKI), Trippstadter Str. 122,
An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus
An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus Tadashi Ogino* Okinawa National College of Technology, Okinawa, Japan. * Corresponding author. Email: [email protected]
Adaptive Anomaly Detection for Network Security
International Journal of Computer and Internet Security. ISSN 0974-2247 Volume 5, Number 1 (2013), pp. 1-9 International Research Publication House http://www.irphouse.com Adaptive Anomaly Detection for
Robust Outlier Detection Technique in Data Mining: A Univariate Approach
Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,
OUTLIER ANALYSIS. Data Mining 1
OUTLIER ANALYSIS Data Mining 1 What Are Outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism Ex.: Unusual credit card purchase,
Principal Component Analysis
Principal Component Analysis ERS70D George Fernandez INTRODUCTION Analysis of multivariate data plays a key role in data analysis. Multivariate data consists of many different attributes or variables recorded
Local Anomaly Detection for Network System Log Monitoring
Local Anomaly Detection for Network System Log Monitoring Pekka Kumpulainen Kimmo Hätönen Tampere University of Technology Nokia Siemens Networks [email protected] [email protected] Abstract
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Fuzzy Threat Assessment on Service Availability with Data Fusion Approach
Journal of Advances in Computer Research Quarterly pissn: 2345-606x eissn: 2345-6078 Sari Branch, Islamic Azad University, Sari, I.R.Iran (Vol. 5, No. 4, November 2014), Pages: 79-90 www.jacr.iausari.ac.ir
Unsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
Supervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,
Utility-Based Fraud Detection
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Utility-Based Fraud Detection Luis Torgo and Elsa Lopes Fac. of Sciences / LIAAD-INESC Porto LA University of
Local outlier detection in data forensics: data mining approach to flag unusual schools
Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential
Adaptive Framework for Network Traffic Classification using Dimensionality Reduction and Clustering
IV International Congress on Ultra Modern Telecommunications and Control Systems 22 Adaptive Framework for Network Traffic Classification using Dimensionality Reduction and Clustering Antti Juvonen, Tuomo
Accurate and robust image superresolution by neural processing of local image representations
Accurate and robust image superresolution by neural processing of local image representations Carlos Miravet 1,2 and Francisco B. Rodríguez 1 1 Grupo de Neurocomputación Biológica (GNB), Escuela Politécnica
Intrusion Detection via Machine Learning for SCADA System Protection
Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. [email protected] J. Jiang Department
OUTLIER ANALYSIS. Authored by CHARU C. AGGARWAL IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
OUTLIER ANALYSIS OUTLIER ANALYSIS Authored by CHARU C. AGGARWAL IBM T. J. Watson Research Center, Yorktown Heights, NY, USA Kluwer Academic Publishers Boston/Dordrecht/London Contents Preface Acknowledgments
Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm
KEITH LEHNERT AND ERIC FRIEDRICH
MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They
Credit Card Fraud Detection Using Self Organised Map
International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 13 (2014), pp. 1343-1348 International Research Publications House http://www. irphouse.com Credit Card Fraud
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model
Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programming Model Subramanyam. RBV, Sonam. Gupta Abstract An important problem that appears often when analyzing data involves
Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents
Email Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Dimensionality Reduction: Principal Components Analysis
Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Principal components analysis
CS229 Lecture notes Andrew Ng Part XI Principal components analysis In our discussion of factor analysis, we gave a way to model data x R n as approximately lying in some k-dimension subspace, where k
Conclusions and Future Directions
Chapter 9 This chapter summarizes the thesis with discussion of (a) the findings and the contributions to the state-of-the-art in the disciplines covered by this work, and (b) future work, those directions
Predict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
A Two-Step Method for Clustering Mixed Categroical and Numeric Data
Tamkang Journal of Science and Engineering, Vol. 13, No. 1, pp. 11 19 (2010) 11 A Two-Step Method for Clustering Mixed Categroical and Numeric Data Ming-Yi Shih*, Jar-Wen Jheng and Lien-Fu Lai Department
AUTONOMOUS NETWORK SECURITY FOR DETECTION OF NETWORK ATTACKS
AUTONOMOUS NETWORK SECURITY FOR DETECTION OF NETWORK ATTACKS Nita V. Jaiswal* Prof. D. M. Dakhne** Abstract: Current network monitoring systems rely strongly on signature-based and supervised-learning-based
Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression
Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate
Outlier Detection in Stream Data by Machine Learning and Feature Selection Methods
International Journal of Advanced Computer Science and Information Technology (IJACSIT) Vol. 2, No. 3, 2013, Page: 17-24, ISSN: 2296-1739 Helvetic Editions LTD, Switzerland www.elvedit.com Outlier Detection
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data
Application of Data Mining Techniques in Intrusion Detection
Application of Data Mining Techniques in Intrusion Detection LI Min An Yang Institute of Technology [email protected] Abstract: The article introduced the importance of intrusion detection, as well as
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor
SYMMETRIC EIGENFACES MILI I. SHAH
SYMMETRIC EIGENFACES MILI I. SHAH Abstract. Over the years, mathematicians and computer scientists have produced an extensive body of work in the area of facial analysis. Several facial analysis algorithms
Data Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based
Detecting Network Anomalies. Anant Shah
Detecting Network Anomalies using Traffic Modeling Anant Shah Anomaly Detection Anomalies are deviations from established behavior In most cases anomalies are indications of problems The science of extracting
Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis
IOSR Journal of Computer Engineering (IOSRJCE) ISSN: 2278-0661, ISBN: 2278-8727 Volume 6, Issue 5 (Nov. - Dec. 2012), PP 36-41 Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis
A Review of Anomaly Detection Techniques in Network Intrusion Detection System
A Review of Anomaly Detection Techniques in Network Intrusion Detection System Dr.D.V.S.S.Subrahmanyam Professor, Dept. of CSE, Sreyas Institute of Engineering & Technology, Hyderabad, India ABSTRACT:In
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Standardization and Its Effects on K-Means Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
Unsupervised and supervised dimension reduction: Algorithms and connections
Unsupervised and supervised dimension reduction: Algorithms and connections Jieping Ye Department of Computer Science and Engineering Evolutionary Functional Genomics Center The Biodesign Institute Arizona
APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
Efficient Detection of Ddos Attacks by Entropy Variation
IOSR Journal of Computer Engineering (IOSRJCE) ISSN: 2278-0661, ISBN: 2278-8727 Volume 7, Issue 1 (Nov-Dec. 2012), PP 13-18 Efficient Detection of Ddos Attacks by Entropy Variation 1 V.Sus hma R eddy,
Traffic Prediction in Wireless Mesh Networks Using Process Mining Algorithms
Traffic Prediction in Wireless Mesh Networks Using Process Mining Algorithms Kirill Krinkin Open Source and Linux lab Saint Petersburg, Russia [email protected] Eugene Kalishenko Saint Petersburg
A Fast Fraud Detection Approach using Clustering Based Method
pp. 33-37 Krishi Sanskriti Publications http://www.krishisanskriti.org/jbaer.html A Fast Detection Approach using Clustering Based Method Surbhi Agarwal 1, Santosh Upadhyay 2 1 M.tech Student, Mewar University,
A Survey on Pre-processing and Post-processing Techniques in Data Mining
, pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,
How To Perform An Ensemble Analysis
Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier
A MULTIVARIATE OUTLIER DETECTION METHOD
A MULTIVARIATE OUTLIER DETECTION METHOD P. Filzmoser Department of Statistics and Probability Theory Vienna, AUSTRIA e-mail: [email protected] Abstract A method for the detection of multivariate
A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique
A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique Aida Parbaleh 1, Dr. Heirsh Soltanpanah 2* 1 Department of Computer Engineering, Islamic Azad University, Sanandaj
Public Transportation BigData Clustering
Public Transportation BigData Clustering Preliminary Communication Tomislav Galba J.J. Strossmayer University of Osijek Faculty of Electrical Engineering Cara Hadriana 10b, 31000 Osijek, Croatia [email protected]
Random Forest Based Imbalanced Data Cleaning and Classification
Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem
HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK
HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK 1 K.RANJITH SINGH 1 Dept. of Computer Science, Periyar University, TamilNadu, India 2 T.HEMA 2 Dept. of Computer Science, Periyar University,
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering
Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool
Comparison and Analysis of Various Clustering Metho in Data mining On Education data set Using the weak tool Abstract:- Data mining is used to find the hidden information pattern and relationship between
Software Defect Prediction Modeling
Software Defect Prediction Modeling Burak Turhan Department of Computer Engineering, Bogazici University [email protected] Abstract Defect predictors are helpful tools for project managers and developers.
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du [email protected] University of British Columbia
Going Big in Data Dimensionality:
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für
Statistics in Retail Finance. Chapter 7: Fraud Detection in Retail Credit
Statistics in Retail Finance Chapter 7: Fraud Detection in Retail Credit 1 Overview > Detection of fraud remains an important issue in retail credit. Methods similar to scorecard development may be employed,
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Toshio Sugihara Abstract In this study, an adaptive
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Anomaly Detection for Univariate Time-Series Data
Anomaly Detection for Univariate Time-Series Data Krati Nayyar Saket Vishwasrao Saurabh Chakravarty Sina Dabiri [email protected] [email protected] [email protected] [email protected] Abstract Some of the biggest
Identifying erroneous data using outlier detection techniques
Identifying erroneous data using outlier detection techniques Wei Zhuang 1, Yunqing Zhang 2 and J. Fred Grassle 2 1 Department of Computer Science, Rutgers, the State University of New Jersey, Piscataway,
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation
An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation Shanofer. S Master of Engineering, Department of Computer Science and Engineering, Veerammal Engineering College,
How To Predict Web Site Visits
Web Site Visit Forecasting Using Data Mining Techniques Chandana Napagoda Abstract: Data mining is a technique which is used for identifying relationships between various large amounts of data in many
How To Cluster On A Search Engine
Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A REVIEW ON QUERY CLUSTERING
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center
Novelty Detection in image recognition using IRF Neural Networks properties
Novelty Detection in image recognition using IRF Neural Networks properties Philippe Smagghe, Jean-Luc Buessler, Jean-Philippe Urban Université de Haute-Alsace MIPS 4, rue des Frères Lumière, 68093 Mulhouse,
Visual Data Mining with Pixel-oriented Visualization Techniques
Visual Data Mining with Pixel-oriented Visualization Techniques Mihael Ankerst The Boeing Company P.O. Box 3707 MC 7L-70, Seattle, WA 98124 [email protected] Abstract Pixel-oriented visualization
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Client Perspective Based Documentation Related Over Query Outcomes from Numerous Web Databases
Beyond Limits...Volume: 2 Issue: 2 International Journal Of Advance Innovations, Thoughts & Ideas Client Perspective Based Documentation Related Over Query Outcomes from Numerous Web Databases B. Santhosh
Component Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
ASSOCIATION RULE MINING ON WEB LOGS FOR EXTRACTING INTERESTING PATTERNS THROUGH WEKA TOOL
International Journal Of Advanced Technology In Engineering And Science Www.Ijates.Com Volume No 03, Special Issue No. 01, February 2015 ISSN (Online): 2348 7550 ASSOCIATION RULE MINING ON WEB LOGS FOR
Adaptive Face Recognition System from Myanmar NRC Card
Adaptive Face Recognition System from Myanmar NRC Card Ei Phyo Wai University of Computer Studies, Yangon, Myanmar Myint Myint Sein University of Computer Studies, Yangon, Myanmar ABSTRACT Biometrics is
Chapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
Supervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
A Clustering Model for Mining Evolving Web User Patterns in Data Stream Environment
A Clustering Model for Mining Evolving Web User Patterns in Data Stream Environment Edmond H. Wu,MichaelK.Ng, Andy M. Yip,andTonyF.Chan Department of Mathematics, The University of Hong Kong Pokfulam Road,
