Anomaly Detection for Univariate Time-Series Data
Krati Nayyar, Saket Vishwasrao, Saurabh Chakravarty, Sina Dabiri

Abstract

Some of the biggest challenges in anomaly-based network intrusion detection systems lie in handling anomaly detection at huge scale, in real time. The incoming data stream is homogeneous, containing different anomalous patterns along with a large amount of normal data. We pose the problem as that of detecting anomalies in the data stream in real time. We define an approach that classifies the pattern in a sliding-window data stream and attempts to identify the class of anomaly a pattern belongs to. These classes comprise the normal class along with the anomalous ones. If the data in the sliding window is classified as belonging to one of the anomalous patterns, we dispatch it to the respective algorithmic handler, which predicts which data points in the window are anomalous based on the model trained for that handler. Experiments show that our methods achieve reasonably good accuracy. There is always a chance of generating false positives, and considering the problem domain, we conclude that it is better to raise a few false alarms than to stay silent on an actual anomaly. We use the F1 score and its constituent metrics, precision and recall, to measure our experimental results quantitatively.

CSx824/ECEx424 final project report. Copyright 2016 by the author(s).

1. Introduction

Anomaly detection refers to the problem of finding patterns in data that do not conform to normal behaviour. In other words, anomaly detection is a technique that seeks to specify a region representing normal behaviour and to declare any observation that does not belong to this region an anomaly. The problem is also referred to as novelty detection, outlier detection, one-class classification, or the detection of exceptions, aberrations, and surprises.
However, anomalies and outliers are the most common terms in the literature on anomaly-based intrusion detection in networks (Bhattacharyya & Kalita, 2013). Anomaly detection is applied to a broad spectrum of domains including IT, security, finance, vehicle tracking, health care, energy grid monitoring, and e-commerce. In the real world, several studies have investigated anomaly detection algorithms for exploring anomalous traffic patterns in a computer network, anomalous MRI images, anomalies in credit card transaction data, and anomalous readings from a spacecraft sensor (Chandola et al., 2009). Anomaly detection can also be applied to find unexpected patterns in time series data. A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. In this study, we strove to develop a framework for a univariate time series data set. We used a publicly available dataset released by Yahoo. A value and a label (0 for normal points and 1 for anomaly points) are assigned to each data point. For a specific formulation of the problem, we need to evaluate various aspects of our data, including the types of anomalies, the data labels, and the output of anomaly detection. In the literature, anomalies in time series are categorized into the following groups: (1) Point anomalies: if an
individual data instance can be considered anomalous with respect to the rest of the data, the instance is termed a point anomaly; (2) Contextual anomalies: if a data instance is anomalous in a specific context (but not otherwise), it is termed a contextual anomaly; (3) Collective anomalies: if a collection of related data instances is anomalous with respect to the entire data set, it is termed a collective anomaly. This classification indicates that extracting the different anomaly patterns in our data set is a vital step before developing any algorithms. As our training data is labeled as anomalous versus normal, we focus on supervised anomaly detection. The output of our anomaly detection is a set of labels that assign anomaly or normal to each test instance. In this project, we set out to develop a framework for detecting anomaly points in our data set. As the values of data points in different files are not on comparable scales, we scaled them to a common format. The scaled data made it possible to combine all the files into one data set. In the next step, we plotted the data set to visualize the different anomaly patterns. We classified each time-window of data into four categories: three anomaly groups and one normal group. For example, if there were no anomaly points in a portion of the data set, we assigned a normal label to that portion. Thus, we developed a multi-class classification step, which determines the category of each time-window. Then, for each anomalous time-window, we applied a specific algorithm to detect the anomaly points within it. In the last step, we reported the accuracy of the algorithms using two indexes: prediction accuracy and F-score. The latter index is more valuable as it takes into account the effects of false positive and false negative points.

2. Our Approach

Analysis of the dataset suggests that each window in which an anomaly occurs can be classified into 3 distinct patterns.
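The classify-then-dispatch framework described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the handler and class names are our own, and the paper assigns window labels manually rather than with a learned classifier.

```python
# Sketch of the classify-then-dispatch pipeline. Each window is routed to
# the handler for its (manually assigned) anomaly class; a normal window
# yields no anomalous points.

def detect_threshold(window):
    """Placeholder for the point anomaly handler (Section 4.1)."""
    return []

def detect_step_change(window):
    """Placeholder for the step-change handler (Section 4.2)."""
    return []

def detect_frequency(window):
    """Placeholder for the frequency-based handler (Section 4.3)."""
    return []

HANDLERS = {
    "threshold": detect_threshold,
    "step_change": detect_step_change,
    "frequency": detect_frequency,
}

def dispatch(window, label):
    """Return indices of anomalous points in `window`, given its class label."""
    if label == "normal":
        return []
    return HANDLERS[label](window)
```

In a complete system the `label` argument would come from the window classifier; here it stands in for the manual labeling step.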
Figure 1 shows a threshold-based pattern. In such patterns, the values of the anomalous points are significantly larger than the global average. Section 4.1 describes our approach to detecting anomalies in such windows. Figure 2 shows an anomaly detected due to a change from the past pattern to a completely new pattern; Section 4.2 describes this case. Figure 3 shows anomalous patterns where the anomaly occurs due to a change in frequency. Such patterns are detected using a sliding-window algorithm that correlates the spectra of successive windows, as described in Section 4.3. From our observations, we find that a single anomaly detection algorithm in general does not give good results, as different algorithms have different strengths. So we preprocess the data to identify which pattern a window belongs to. Due to time limitations we have not created a machine learning algorithm for classifying a window into a given class; instead we label each window manually. The rest of the paper is organized as follows. We discuss our strategy for scaling the data in Section 3.1 and then propose the algorithms outlined above. Section 5 provides a quantitative analysis of the results of the algorithms.

3. Dataset

We used a real world data set created by Yahoo! consisting of time series statistics of Web requests. This data set is a combination of real world and synthetic data; the real world portion was considered for this project. A value and a label (0 for normal points and 1 for anomaly points) are assigned to each data point. For a specific formulation of the problem, we evaluated various aspects of our data, including the types of anomalies, the data labels, and the output of anomaly detection.

Figure 1. Threshold-based anomaly points

Figure 2. Anomaly alerted due to step change
Figure 3. Frequency-based anomaly points

3.1. Scaling the data

The dataset consisted of 67 files, each containing a time series. As these files were not on the same scale of magnitude, they had to be scaled to a fixed magnitude in order to be processed together. The following steps were followed for scaling the Yahoo dataset:

- The time series in each file was centered about its mean
- The average value of all the local maxima was calculated
- The average value of all the local minima was calculated
- The scaling factor of each file was set to (maxima + minima)/2
- All files were concatenated into a single file for training

4. Algorithms Proposed

The classification algorithm classifies the data pattern into three classes. Depending on the classification, the algorithm corresponding to the class is run on the time-window data to detect the anomaly points. The algorithms for the classes are:

1. Point anomaly algorithm for threshold-based anomalies.
2. Step-change anomaly detection for anomalies alerted by a sudden transition in the mean.
3. Frequency-based algorithm for anomaly points caused by a change in the periodicity of the dataset.

4.1. Point anomaly algorithm

In this section, we focus only on the portions of the data set that contain threshold-based anomaly points. Many approaches in the literature fit a statistical model to the given data set and then apply a statistical inference test to determine whether an unseen instance belongs to this model. The main assumption of these approaches is that normal data instances occur in high-probability regions of the fitted statistical model, whereas anomalies occur in its low-probability regions (Chandola et al., 2009). Thus, we applied the Kolmogorov-Smirnov test to compare our normal data with a variety of well-known probability distributions.
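The distribution check above can be sketched with a one-sample Kolmogorov-Smirnov statistic. The version below, tested against an assumed Gaussian, is our own stdlib-only illustration; a real analysis would use a library routine such as scipy.stats.kstest and its p-value.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, mu, sigma):
    """One-sample KS statistic: the largest gap between the empirical CDF
    of `sample` and the assumed Gaussian CDF. A large value is evidence
    against the hypothesis that the data came from that Gaussian."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mu, sigma)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return d
```

A sample that matches the assumed Gaussian yields a small statistic; a badly mismatched sample pushes it toward 1, which is the regime the paper reports for all of its candidate distributions.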
The null hypothesis (H0) and the alternative hypothesis (Ha) are as follows: H0: our normal data comes from the assumed probability distribution; Ha: it does not. Although many statistical anomaly detection techniques assume that the data is generated from a Gaussian distribution (Grubbs, 1969; Stefansky, 1972), we also investigated other well-known probability distributions such as the Weibull and the Exponential. However, as the p-value for every assumed distribution was approximately zero, we rejected the null hypothesis in each case. Consequently, the algorithm for detecting threshold-based anomaly points needed to be independent of any assumed probability distribution for the data.

The main idea of the proposed algorithm is to extract the behavior of anomaly points relative to normal points. Like many machine-learning methods, the algorithm is divided into two parts: a training model and a test model. In the training model, the normal data is first separated from the anomaly data. Then the absolute distances of all anomaly points from the average of the normal data are calculated and stored in a vector. This vector is used to compute a score function for unseen test instances in the test model. To label an unseen test instance as normal or anomalous, a threshold value needs to be chosen. In this algorithm, the threshold value is the fraction of all anomaly points in the training data that are either less than the average of the local maxima or greater than the average of the local minima. The rationale behind this choice is to find the portion of anomaly points that behaves more like normal data than like anomaly data. The training model can be summarized in the following four steps:

1. Separate the normal data from the anomaly data.
2. Find the average of the normal data and store it in model.mean.
3. Find the absolute distance of each anomaly point to model.mean and store it in the vector model.anomalyDistance.
4. Find the threshold value α with the following computation:

X1 = number of positive anomaly points less than the average of the normal local maxima
X2 = number of negative anomaly points greater than the average of the normal local minima

α = (X1 + X2) / (number of anomaly points)    (1)

In the test model, we associate a score with each unseen test instance η and declare η anomalous if its score is greater than α. To find the score, we calculate the absolute distance of the unseen test instance to model.mean; we call this value TestDistance(η). For every value in model.anomalyDistance that TestDistance(η) exceeds, the test instance receives a penalty. The following score function maps the score value to [0, 1]:

F(η) = (1/N) · Σ_{i=1}^{N} I[ TestDistance(η) > model.anomalyDistance(i) ]    (2)

where N is the number of anomaly points in the training data and I is the indicator function. A larger value of F(η) increases the likelihood that η is an anomaly. The following three steps summarize the test model:

1. Find the absolute distance of the unseen test point η to model.mean and store it in TestDistance(η).
2. Compute the score function F(η) for the test instance.
3. If F(η) > α, then η is an anomaly; otherwise, η is normal.

4.2. Identifying anomalies through statistical methods

First attempt. This algorithm is designed for data with a sudden transition in the mean between two consecutive data windows; this transition is considered anomalous. It is a supervised algorithm in the spirit of k-nearest neighbors, in that it works with the mean of the k points in the time-window. The algorithm is divided into training the model and testing the model obtained.
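Before turning to the training procedure for the step-change detector, the point anomaly algorithm of Section 4.1 can be sketched in code. The naming is ours, and α is taken here as a given parameter; the paper derives it from the local extrema of the normal data as in equation (1).

```python
def train(values, labels):
    """Training model: the average of the normal points (model.mean) plus
    the absolute distances of the anomalous points from it."""
    normal = [v for v, l in zip(values, labels) if l == 0]
    anomalies = [v for v, l in zip(values, labels) if l == 1]
    mean = sum(normal) / len(normal)
    return {"mean": mean,
            "anomaly_distances": [abs(a - mean) for a in anomalies]}

def score(model, eta):
    """F(eta): the fraction of training-anomaly distances that the test
    point's distance from the normal mean exceeds (equation (2))."""
    d = abs(eta - model["mean"])
    hits = sum(1 for a in model["anomaly_distances"] if d > a)
    return hits / len(model["anomaly_distances"])

def is_anomaly(model, eta, alpha):
    """Declare eta anomalous when its score exceeds the threshold alpha."""
    return score(model, eta) > alpha
```

For example, with normal data at 0 and training anomalies at 10 and 20, a test point at 15 exceeds one of the two stored distances and scores 0.5.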
The training algorithm is as follows:

- The window size for the running time frame is initialized
- Set Model.threshold = max(dataPoints)
- For every window, calculate the average of the data points in the window
- If the new data point entering the window frame is an anomaly, take the difference ε between the present and previous averages
- If ε < Model.threshold, update the threshold to this new value
- Model.windowSize = the window size giving maximum accuracy

The prediction algorithm:

- A window frame of Model.windowSize is considered
- If the difference between two consecutive averages reaches Model.threshold, mark the point as an anomaly

To increase the accuracy of the algorithm:

- Take a weighted average of the data points in the window frame, e.g. squared-index weights or exponential weights
- A multiplier was used to obtain a better range for the threshold; the multiplier acts as a function of the variance, and we continue to tune this function

Based on the data patterns that we believed could be used for detecting anomalies, we started with a pure threshold-based approach. In training, we slide a window of size k over the data and compute the mean of the values in the window. For each anomalous point, we subtract the previous mean from the new mean and treat that as a candidate threshold. We vary the window size and select the lowest threshold. This approach did not give good results because it produced many false positives: since we selected the smallest threshold during training, normal variance in the test data frequently exceeded the threshold, and such points were predicted as anomalies when they were not. We concluded that a hard threshold would not work.
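The first-attempt procedure above can be sketched as follows. Function names are our own; the sketch uses the plain (unweighted) window mean and, per the text, keeps the smallest jump seen at an anomalous point as the hard threshold.

```python
def window_means(data, k):
    """Mean of each length-k sliding window over the series."""
    return [sum(data[i:i + k]) / k for i in range(len(data) - k + 1)]

def train_threshold(data, labels, k):
    """First-attempt training: the threshold is the smallest jump in the
    sliding-window mean observed when an anomalous point enters the window."""
    means = window_means(data, k)
    threshold = max(data)                 # initialisation from the text
    for i in range(1, len(means)):
        new_point = i + k - 1             # index of the point entering the window
        if labels[new_point] == 1:
            jump = abs(means[i] - means[i - 1])
            threshold = min(threshold, jump)
    return threshold

def predict(data, k, threshold):
    """Flag the entering point whenever consecutive window means differ by
    at least the trained threshold."""
    means = window_means(data, k)
    flagged = []
    for i in range(1, len(means)):
        if abs(means[i] - means[i - 1]) >= threshold:
            flagged.append(i + k - 1)
    return flagged
```

On a toy series with one labeled step, the detector flags the true transition point but also the point after it, illustrating exactly the false-positive problem that motivated the second attempt.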
Second attempt. After further research, we made the threshold a variable value that is a function of the variance of the data, treating the data in the sliding window as approximately normally distributed. The distribution of the data points does not need to be exactly normal, but the method works better if it is at least normal-like. Since the pattern we are targeting has the non-anomalous data in a roughly normal form, we can fit the data to a normal distribution and detect outliers statistically. A typical formulation is to find the mean and variance of the data in the window and then evaluate the probability density p(x) of a new data point that immediately follows the sliding window. An anomaly occurs whenever p(x) < ε.

Determining ε. The model is fit to negative examples (non-anomalies), but ε is determined from the cross-validation set and is typically selected as the value that gives the best F1-score. To compute the F1-score, we need to know what is anomalous and what is not; that is, true positives are cases where the system predicts an anomaly and it actually is one, false positives are predicted anomalies that actually are not, and so on. We determined the best ε by analyzing the anomalous points and finding a tolerance range, as a function of the variance, that determines whether a point is anomalous. Figure 4 illustrates this: it shows the distribution of normal and anomalous points from the perspective of probability densities. As the figure suggests, we need a variance-based threshold, trained as a model parameter, that we can use to predict anomalies. In our case, if the probability density of a new point falls outside the confidence limits, we mark that point as anomalous.

Figure 4. Relation between variance and confidence between data points (Gao)

4.3. Frequency-based anomaly detection

This class of anomaly detection techniques deals with time series patterns where the anomaly occurs due to a change in periodicity or a sudden change in the periodic trend of the data. One method of detecting such a change is to compute the correlation coefficient between the frequency spectra of successive windows: windows with similar frequency components will have a high correlation (Giacinto et al., 2008). Figure 5 shows the actual data and Figure 6 plots the correlation between successive windows.

Figure 5. Anomalies corresponding to a periodicity change, where blue represents the data points and red the anomalous points

Figure 6. Three out of four anomaly windows detected
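The spectral-correlation idea can be sketched as follows. This is our own stdlib-only illustration: a direct DFT stands in for the FFT a real implementation would use, and the correlation threshold `min_corr` is an assumed parameter.

```python
import cmath
import math

def spectrum(window):
    """Magnitude spectrum of a window via a direct DFT (fine for small n)."""
    n = len(window)
    return [abs(sum(window[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length spectra."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def anomalous_windows(data, size, step, min_corr):
    """Flag window start indices whose spectrum decorrelates from the
    previous window's spectrum; a drop below min_corr suggests a change
    in the frequency content of the series."""
    prev = None
    flagged = []
    for s in range(0, len(data) - size + 1, step):
        spec = spectrum(data[s:s + size])
        if prev is not None and correlation(prev, spec) < min_corr:
            flagged.append(s)
        prev = spec
    return flagged
```

For a series that switches from one pure sinusoid to another of twice the frequency, only the window at the switch decorrelates from its predecessor, matching the paper's observation that the method flags windows rather than individual points.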
5. Experimental Results

Accuracy measurement. We started by calculating accuracy based on whether our predictions matched the labeled anomalous points. We obtained quite high accuracy with this methodology, but it was flawed: it does not take into account the many misclassifications in which we predicted normal data points as anomalous. Such false positives are bound to reduce the effectiveness of the system in production, because a network operations administrator will start to ignore the alerts for anomalous data given their sheer number. We needed a metric that quantitatively accounts for misclassifications and penalizes the accuracy score accordingly. The F1 score, computed from the confusion matrix, is exactly such a metric. This method compares the results with the ground truth, slots them into the 4 buckets shown in Table 1, and computes the following metrics.

Table 1. Confusion matrix for the F1 score

                      Actual positive    Actual negative
Predicted positive    True positive      False positive
Predicted negative    False negative     True negative

Precision can be defined as the classifier's exactness; a low precision indicates a large number of false positives:

Precision = True Positives / (True Positives + False Positives)    (3)

Recall can be defined as the classifier's completeness; a low recall indicates a large number of false negatives:

Recall = True Positives / (True Positives + False Negatives)    (4)

The F1 score is defined as:

F1 = 2 · (Precision · Recall) / (Precision + Recall)    (5)

The confusion matrix in Table 1 shows the comparison between the predictions and the ground truth.

5.1. Point anomaly algorithm

This algorithm worked well on threshold-based anomalies, as expected. It provides the right anomalous points only for time windows classified into the threshold-based class. The accuracy of the algorithm was 0.98 and the F-score was 0.68 on this dataset.

5.2. Anomalies through statistical methods

The first part of the algorithm, step-based anomaly detection using a weighted average of the data points in the time window, gave an accuracy of 0.92 on the dataset with sudden transitions in the mean; owing to false positives at some points, the F1 score turned out lower.

5.3. Frequency-based anomaly detection

The evaluation here differs slightly from the other methods in that it does not give a particular anomalous point, but a small window within which the anomaly lies.
The window size used in evaluation is 50, and successive windows are 25 units apart (sliding window). There are no false positives, but this method detects anomalous windows with 51.7% accuracy, computed as (number of anomalous windows detected) / (total anomalous windows). The method performs well for highly periodic data with major changes in frequency, but cannot detect anomaly points where the data is not perfectly periodic. The selection of window size also matters: the closer it is to the period of the series, the better the performance of the algorithm.

6. Conclusion

This paper discusses algorithms for the different types of anomalies present in the Yahoo dataset, along with a quantitative analysis of the proposed algorithms. The multi-class classification step, combined with a separate algorithm for each type of anomaly detected in a time window, worked well for the real world dataset.

7. Future Work

In this project, we explored three anomaly patterns, although other types of anomalies may exist; further investigation of the available data could lead to detecting more anomalous patterns. On the other hand, applying a variety of algorithms to our data set results in a more complex model. One way forward is to fuse the various algorithms into a single algorithm able to detect more than one anomaly type. However, integrating algorithms may increase the chance of missing a portion of the anomalies. Thus, future research in this area could focus on finding the optimal trade-off between model complexity and the number of applied algorithms that yields higher accuracy. A better, but slightly more complicated, continuation of the frequency-based anomaly detection would be to use the spectrum to extract the top n harmonics from the window and use them as an n-dimensional vector for clustering.
Windows that fall in the largest cluster can be considered normal, while windows in the other clusters would be anomalous. We currently assume that the data is classified into one of the 3 classes by hand; an algorithm could be built that, given a window, classifies it into one of these classes automatically.
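The proposed harmonic-clustering extension could start from a feature extractor like the following sketch. The function name is ours, and the clustering step itself remains future work.

```python
def top_harmonics(spectrum, n):
    """Return the indices of the n largest-magnitude harmonics (excluding
    the DC component at index 0), for use as an n-dimensional feature
    vector when clustering windows."""
    ranked = sorted(range(1, len(spectrum)),
                    key=lambda k: spectrum[k], reverse=True)
    return sorted(ranked[:n])
```

Windows whose feature vectors land in the largest cluster would then be treated as normal, per the proposal above.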
Acknowledgments

This project would not have been possible without the constant guidance of Dr. Bert Huang, and we are very grateful for all the support he provided during the project.

References

Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey. ACM Computing Surveys, 41(3), 2009.

Grubbs, F. E. Procedures for detecting outlying observations in samples. Technometrics, 11, 1969.

Giacinto, G., Perdisci, R., Del Rio, M., and Roli, F. Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Information Fusion, 9(1), 2008.

Chuah, M. C. and Fu, F. ECG anomaly detection via time series analysis.

Gao, J. Anomaly detection. SUNY Buffalo.

Bhattacharyya, D. K. and Kalita, J. K. Network Anomaly Detection: A Machine Learning Perspective. CRC Press, 2013.

Portnoy, L., Eskin, E., and Stolfo, S. Intrusion detection with unlabeled data using clustering. In Proceedings of the ACM CSS Workshop on Data Mining Applied to Security, 2001.

Yahoo Labs. A benchmark dataset for time series anomaly detection.

Zhao, M. and Saligrama, V. Anomaly detection with score functions based on nearest neighbor graphs. In Advances in Neural Information Processing Systems, 2009.

Vlachos, M., Yu, P., and Castelli, V. On periodicity detection and structural periodic similarity. In Proceedings of the SIAM International Conference on Data Mining, 2005.

Stefansky, W. Rejecting outliers in factorial designs. Technometrics, 14, 1972.
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
Descriptive Statistics
Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize
End-to-End Sentiment Analysis of Twitter Data
End-to-End Sentiment Analysis of Twitter Data Apoor v Agarwal 1 Jasneet Singh Sabharwal 2 (1) Columbia University, NY, U.S.A. (2) Guru Gobind Singh Indraprastha University, New Delhi, India [email protected],
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Evaluation & Validation: Credibility: Evaluating what has been learned
Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model
Network Intrusion Detection Systems
Network Intrusion Detection Systems False Positive Reduction Through Anomaly Detection Joint research by Emmanuele Zambon & Damiano Bolzoni 7/1/06 NIDS - False Positive reduction through Anomaly Detection
COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments
Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for
Data Mining and Visualization
Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research
Tutorial 5: Hypothesis Testing
Tutorial 5: Hypothesis Testing Rob Nicholls [email protected] MRC LMB Statistics Course 2014 Contents 1 Introduction................................ 1 2 Testing distributional assumptions....................
An Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia [email protected] Tata Institute, Pune,
Beating the MLB Moneyline
Beating the MLB Moneyline Leland Chen [email protected] Andrew He [email protected] 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Sample Size and Power in Clinical Trials
Sample Size and Power in Clinical Trials Version 1.0 May 011 1. Power of a Test. Factors affecting Power 3. Required Sample Size RELATED ISSUES 1. Effect Size. Test Statistics 3. Variation 4. Significance
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
How To Perform An Ensemble Analysis
Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier
SPSS Explore procedure
SPSS Explore procedure One useful function in SPSS is the Explore procedure, which will produce histograms, boxplots, stem-and-leaf plots and extensive descriptive statistics. To run the Explore procedure,
A Frequency-Based Approach to Intrusion Detection
A Frequency-Based Approach to Intrusion Detection Mian Zhou and Sheau-Dong Lang School of Electrical Engineering & Computer Science and National Center for Forensic Science, University of Central Florida,
Application of Data Mining Techniques in Intrusion Detection
Application of Data Mining Techniques in Intrusion Detection LI Min An Yang Institute of Technology [email protected] Abstract: The article introduced the importance of intrusion detection, as well as
Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)
Spring 204 Class 9: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.) Big Picture: More than Two Samples In Chapter 7: We looked at quantitative variables and compared the
Testing Research and Statistical Hypotheses
Testing Research and Statistical Hypotheses Introduction In the last lab we analyzed metric artifact attributes such as thickness or width/thickness ratio. Those were continuous variables, which as you
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor
1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
Introduction. A. Bellaachia Page: 1
Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.
Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
99.37, 99.38, 99.38, 99.39, 99.39, 99.39, 99.39, 99.40, 99.41, 99.42 cm
Error Analysis and the Gaussian Distribution In experimental science theory lives or dies based on the results of experimental evidence and thus the analysis of this evidence is a critical part of the
Principal components analysis
CS229 Lecture notes Andrew Ng Part XI Principal components analysis In our discussion of factor analysis, we gave a way to model data x R n as approximately lying in some k-dimension subspace, where k
Big Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
Linear Models in STATA and ANOVA
Session 4 Linear Models in STATA and ANOVA Page Strengths of Linear Relationships 4-2 A Note on Non-Linear Relationships 4-4 Multiple Linear Regression 4-5 Removal of Variables 4-8 Independent Samples
AUTONOMOUS NETWORK SECURITY FOR DETECTION OF NETWORK ATTACKS
AUTONOMOUS NETWORK SECURITY FOR DETECTION OF NETWORK ATTACKS Nita V. Jaiswal* Prof. D. M. Dakhne** Abstract: Current network monitoring systems rely strongly on signature-based and supervised-learning-based
An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus
An Evaluation of Machine Learning Method for Intrusion Detection System Using LOF on Jubatus Tadashi Ogino* Okinawa National College of Technology, Okinawa, Japan. * Corresponding author. Email: [email protected]
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
Association Between Variables
Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi
Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS
DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 [email protected] 1. Descriptive Statistics Statistics
Medical Information Management & Mining. You Chen Jan,15, 2013 [email protected]
Medical Information Management & Mining You Chen Jan,15, 2013 [email protected] 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
Simple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
Lecture 2: Descriptive Statistics and Exploratory Data Analysis
Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
Using News Articles to Predict Stock Price Movements
Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 [email protected] 21, June 15,
Introduction to nonparametric regression: Least squares vs. Nearest neighbors
Introduction to nonparametric regression: Least squares vs. Nearest neighbors Patrick Breheny October 30 Patrick Breheny STA 621: Nonparametric Statistics 1/16 Introduction For the remainder of the course,
Server Load Prediction
Server Load Prediction Suthee Chaidaroon ([email protected]) Joon Yeong Kim ([email protected]) Jonghan Seo ([email protected]) Abstract Estimating server load average is one of the methods that
Why is Internal Audit so Hard?
Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets
Random Forest Based Imbalanced Data Cleaning and Classification
Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem
A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique
A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique Aida Parbaleh 1, Dr. Heirsh Soltanpanah 2* 1 Department of Computer Engineering, Islamic Azad University, Sanandaj
Intrusion Detection System using Log Files and Reinforcement Learning
Intrusion Detection System using Log Files and Reinforcement Learning Bhagyashree Deokar, Ambarish Hazarnis Department of Computer Engineering K. J. Somaiya College of Engineering, Mumbai, India ABSTRACT
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
Unit 26 Estimation with Confidence Intervals
Unit 26 Estimation with Confidence Intervals Objectives: To see how confidence intervals are used to estimate a population proportion, a population mean, a difference in population proportions, or a difference
Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard
Academic Content Standards Grade Eight and Grade Nine Ohio Algebra 1 2008 Grade Eight STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS
QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS This booklet contains lecture notes for the nonparametric work in the QM course. This booklet may be online at http://users.ox.ac.uk/~grafen/qmnotes/index.html.
Simple linear regression
Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between
