Finding soontofail disks in a haystack


 Marcus Lyons
 2 years ago
 Views:
Transcription
1 Finding soontofail disks in a haystack Moises Goldszmidt Microsoft Research Abstract This paper presents a detector of soontofail disks based on a combination of statistical models. During operation the detector takes as input a performance signal from each disk and sends and alarm when there is enough evidence (according to the models) that the disk is not healthy. The parameters of these models are automatically trained using signals from healthy and failed disks. In an evaluation on a population of 1190 production disks from a popular customerfacing internet service, the detector was able to predict 15 out of the 17 failed disks (88.2% detection) with 30 false alarms (2.56% false positive rate). 1 Introduction The ability to provide advance warning of hard disk failures is extremely useful in internet and cloud services. For cloud services this ability enables completely uninterrupted service as preventive reallocation may be executed prior to imminent disk failures. For internet services such as , content (news, search, movies), and ecommerce, advance warning may trigger immediate replication or replacement which will increase availability and reliability statistics, and may even reduce replications and prevent data loss. In addition to preventive actions, one can imagine more invasive ones involving perhaps intrusive diagnostics and repairs. IT departments in regular businesses and individual users would obtain similar benefits from having a detector producing these alarms. In this paper we describe, characterize, and evaluate a software based detector of soontofail disks. This detector which we call DFAD (for Disk Failure Advance Detector), takes as input a performance signal from each disk and produces an alarm according to a combination of statistical models. The parameters in these models are automatically trained from a population of healthy and failed disks, using machine learning techniques. Predicting the future is notoriously difficult. There is a tradeoff between the ability to accurately predict failure (indicated by the detection rate), and the number of false alarms (indicated by false positive rate). This tradeoff is reflected in the tension between the cost of ignoring an alarm for a failing disk and the cost of unnecessary action in reaction to a false one. Ignoring an alarm for an impending failure will result in a degradation of the quality, availability, and reliability of the service. Acting on a false alarm will result on unnecessary costs related to, for example, migrating users in a cloud service. The decision about how to act on this tradeoff is entirely dependent on economic and business considerations of the service. Thus, we report on the detection and false positive rates of DFAD providing the necessary information to act on this tradeoff. When applied to a population of 1190 production disks from a popular 24x7 customerfacing internet service D FAD predicts 15 out of 17 failed disks (88.2% detection rate), with 30 false alarms (2.56% false positive rate). In the case of the preventive action being moving users in a cloud service, this translates to 30 additional moves replicas for guaranteeing uninterrupted service in spite of 15 disks failures. The parameters of the statistical models for DFAD were fitted using 200 disks, containing a mixture of 17 failed disks and 183 healthy disks. In addition, we used 700 disks to estimate the false positive rate. The disks used for training and those used for testing (evaluating the performance of DFAD) came from the same population of disks. We divided this total population in two sets, maintaining the same proportion of failed and healthy disks in each set. Signals from the disks in the testing set were only used during the evaluation of DFAD. Also, for the purposes of this paper, as long as DFAD sends an alarm between 18 days and 4 hours of the disk failing we will regard the impending failure as detected. By training DFAD on the specific workload that disks experience 1
2 in production, the models are not subject to the inherent uncertainty in the failure rates specified by the manufacturer which can present a lot of variability (see [5]), and which were estimated from proprietary stress workloads. Another way to look at the benefits of DFAD is that it reduces the uncertainty of identifying failing disks in the whole population of disks (using the manufacturer failure rate), to a smaller population composed of the of detected failures (15 in our case) plus the false alarms (30 in our case). In summary, this paper proposes and evaluates DFAD a software based detector of soontofail disks, with the following benefits: STF signal Healthy signals The input to the detector is a single performance signal, average maximum latency; consequently it is efficient at scale. It is based on models fitted directly from production workloads; thus, the detector will customize to the particular usage and stress in production. The results on real data from 1190 production disks of a 24x7 customerfacing internet service are a 88.2% detection rate and a 2.56% false positive rate. These benefits are encouraging for further research into building robust detectors based on performance signals from the disks. The rest of the paper is organized as follows: Section 2 briefly compares this work against the most relevant published literature. Section 3 introduces a set of preliminary definitions. Section 4 describe the models and the methods for training. Section 5 discusses the results of the evaluation and we conclude in Section 6. 2 Related Work This section is by no means comprehensive. It touches on the published work that is closer to ours and that help to put our work in the right perspective. The seminal paper by Pinheiro, Weber, and Barroso [5] contains an in depth study on a much larger scale than ours. We use the same definition of failure and like them we study disks in operation under stress from real workloads. Our results also confirm their finding that...we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy... Similarly as further reported in that paper we found that a small percentage of failed disks show no SMART error signals (at least amongst the ones we had access to). The important difference between the current paper and [5] is that we focus on prediction based on performance (and not on error signals) with positive results. In [2] and [3] the authors look at machine learning Figure 1: Examples of the AML time series for healthy and STF disks. Each graph contains 8 time series from neighboring disks. Time series for healthy disks are at the bottom of the Y axis. The difference with the values for STF disks is sometimes over 7 standard deviations. The top graphs are taken 8 hours before failure. The bottom two are taken 4 days before failure. The X axis is in intervals of 20 minutes. Units in the Y axis are obscured by request from the operations team at the site. methods for predicting disk failures using also SMART signals from the disk. Again, this contrasts with us using performance signals. Even though the population of disks in our study is of similar size, our study differs in two important dimensions: (1) the disks in our study are stressed by a real world workload under production conditions and (2) our statistical models are parametric (theirs are nonparametric). This is important as in our models the number of parameters do not increase with the data. Finally, the study in [7] provides understanding on the statistical properties of disk failures but is not focused on detection or prediction. 3 Problem Statement The problem we are solving is the detection of soontofail (STF) disks. We define disk failure as in in [5]: a drive is considered to have failed if it was replaced as part of a repairs procedure. There are several considerations that go into this definition such as the actual time of failure may not be as accurate as it depends on when the disk was actually exchanged etc. Still, we agree with the authors of [5] that this is the ultimate ground truth for evaluating any predictor/detector. Our proposed solution is a software based detector, DFAD, that uses statistical models to find abnormal behavior. The input to DFAD is a single signal from every disk, a time series containing in each sample the Average Maximum Latency (AML) over each sample period, and the output is an alarm if the models in DFAD decide that the signal presents abnormal behavior. 2
3 The data collection was done via a software interface provided by the disk vendor. In this interface, all signals are exposed every 20 minutes. The AML is computed as follows: the maximum latency of each disk is recorded every minute, then they are averaged over the 20 minutes of the reporting period, and finally exposed by the interface. These are nonoverlapping averages. Examples of AML for both healthy and STF disks are depicted in Figure 1. A list of all the signals exposed by the interface is presented in Table 3. We couldn t find any predictor of STF disks other than the AML (see Section 5). Besides being restricted by the parameters set by the vendor, we were constrained by several other factors outside of our control. We only had data for a month for about 2380 disks. We reserved 1190 disks for training and 1190 for testing the models. All the parameters of the models described in Section 4, model selection, and initial estimation of the various rates, were fitted and computed using the training data. The testing data was only seen once during the evaluation of DFAD. The detection and false positive rates reported come from the evaluation (although we remark that they coincide with the ones computed during training). We actual detected disk failures (ground truth) through changes in the serial number of the disk in the logs. The particular disks we study come from a cluster of 144 servers. Each server has 34 drives in an array and drives are 750GB SATA drives. We had data for 70 of those servers. The data was collected during December of 2007 from production disks of a popular 24x7 customerfacing internet service. With respect to the lead time to failure, that is how soon will the disk fail, we consider that an imminent failure is predicted if the alarm is sent between 18 days and 4 hours before the disk is changed. Note that the definition of false positives is very stringent. If the detector emits an alarm and the disk has not failed within 18 days, (or by the end of the month of data available) then this alarm will be considered as a false positive (even if it is changed within the period established above). 4 The DFAD Models The schematics for DFAD are shown in Figure 2. The raw AML goes through two filters in parallel. The filter on the left compares the (log) odds of the probability that the data is produced by a STF disk vs. a healthy disk. It is based on Hidden Markov Models (HMMs) [6] and is described in Subsection 4.1. The other filter counts the number of peaks in AML in a 24 hour period (see Subsection 4.2). The output from these two units is taken by a final unit that uses a logistic regression model [8] to decide whether the disk will fail. These are well understood models, and techniques for fitting the parameters from data and performing infer Model Comparison HMM models Fusion Alarm AML signal Peak Counter Logistic Regression Threshold Figure 2: Schematic diagram for DFAD. The AML signal goes through a model comparison filter (using HMMs) and a peak counter. The output of these two filters is fused by a logistic regression model. ence on new data are well known. The specific combination and application to this particular problem is new. 4.1 The Hidden Markov Model Unit From inspection of the difference between the AML signals of STF and healthy disks (see Figure 1), it is clear that at least part of the detector must consist of comparing the statistics of healthy and STF disks. We initially tried comparing running averages and using change point detection algorithm with very little success. Both detection and the false positive rate were abysmal. Our next step was to try Hidden Markov Models (HMMs). These models capture several modes of operation for the disks, and also take into consideration the time dependence inherent in the signal. HMMs are statistical state space models [6] used in applications ranging from speech processing to time series prediction. These models assume that the signal is sampled at regular intervals (our case) and therefore time is discrete. An HMM consists of two sets of variables. The first set is the hidden part of the model representing a disk state. In our case the different states emit either low, medium, or high values of the AML signal. We will use S i to denote these variables where the index i represents a particular instance in time in the time series. The second set of variables, denoted by Y i, is the observable part which is continuous and is used to model the actual values of the AML signal. We will use upper case letters for the variables, and lowercase letters for their values. Given a sequence of values y 1,y 2,...,y n the model assumes that the disk is in one state at each time s 1,s 2,...,s n and the value y observed at time i is a (probabilistic) function of the state s of the disk. In order to have a complete specification of the model, we need the 3
4 probability functions relating the hidden states to the observables, and the transition between the states. Thus we have: (1) P(S i+1 S i ), the probability that the disk is in a particular state S i+1 = s given that it was in state S i = s in the previous time instance; (2) P(Y i S i ) a set of conditional distributions modeling the probability of the values observed given the state of the disk; and (3) a probability for the initial state namely P(S 1 ). The parameterization of these probability functions in this paper is as follows: P(Y i S i ) is considered to be a Gaussian distribution with mean µ s and variance σ s. P(S i S i 1 ) is considered to be multinomial (as S is a discrete random variable). The final parameter is an integer denoting the number of states S. These parameters are automatically fitted from samples of the AML signal. In this paper we used maximum likelihood estimators using well known algorithms described in [6] (rather than Bayesian techniques [8]). The problem of fitting the number of states S is what is known a model selection problem. For this paper we relied on the Bayesian Information Criteria (BIC) [8] which scores the a model based on the goodness of fit to the training data, and the number of parameters (complexity) of the model. Given that we expect to have no more than a handful of states, the problem of model selection proceeds as a linear search: we start with S = 2 and incrementing by 1 until BIC decreases. In our case S = 3. Once we fit these parameters the model is completely specified and we can compute the probability of any sequence of observations y 1,y 2,...,y n, using P(y 1,...,y n ) = S1,...,S n P(y 1 S 1 )...P(y n S n )P(S n S n 1 )...P(S 1 ). We are now ready to explain the inner workings of the model comparison box in Figure 2. We first train (fit parameters offline) an HMM model using data coming from healthy disks. Let s denote that probability model by P H. We do the same using data from an STF disk and obtain an HMM model for STF disks. Let P ST F denote that model. Given data D coming from a disk in operation, we compute P H (D), P ST F (D) (using the equation above), and then the ratio log(p ST F (D)/P H (D)) (the logodds between the models). This ratio, which we denote by LO is the output of the model comparison box. If this ratio is bigger than zero then it is more likely that D comes from a STF disk. This is one of the two pieces of evidence that the logistic regression unit considers in making a final decision regarding an alarm. 4.2 The Peak Counter Unit In our initial experiments with the training data, the output LO (i.e. the comparison between HMM s) was not enough to decrease the false positive rate while maintaining the detection rate. Thus we introduced an additional filter to provide another piece of evidence. We found that AML signals from ST F disks have peaks that are over 7 standard deviations away from AML signals coming from healthy disks (see Figure 1). Running simple statistics we found that a threshold of 2 seconds was over four standard deviations from the signal of healthy disks. This unit returns the number of times that the AML signal peaks over 2 seconds in a period of 24 hours. We will use PC to denote this count. 4.3 The Logistic Regression Unit Now we have two signals, the LO and the PC, and we need to make a decision on whether the disk is STF or healthy. We use a logistic regression model for this task. A logistic regression model is the simplest model with the least assumptions about the probabilistic models of these two signals which still helps in making a decision in a principled way. Let P(A) = P(ST F LO,PC) denote the probability that the disk is an STF disk. Then the logistic regression computes log( P(A) 1 P(A) ) = β 0 +β 1 LO+ β 2 PC. Fitting the parameters, β 0,β 1 and β 2 from data is a standard statistical problem [1]. The final output of DFAD comes from this equation: the more positive log( P(A) 1 P(A) ) is, the more evidence according to the models for sending an alarm. The threshold T for determining when to send an alarm will be fitted by balancing the tradeoff between detection and false positives. 4.4 Parameter Fitting To fit the parameters of the HMM s, we used 24 hours of data from 60 disks. We selected 12 of (17) failed disks, and 48 healthy disks from the same arrays as the failed disks. This selection from the same array guarantees that we control for common factors (i.e. workload) that may affect the whole array. The period of 24 hours was selected from the point of disk change for the failed disks to avoid daily cyclic confounder factors in the data (the disks do not fail at the same time). Thus P ST F was fitted from data coming from 24 hours prior to the failure of 12 disks and P H was fitted from data coming from 60 healthy disks (in the same period). Given that we will be substituting the parameters of maximum likelihood (and not integrating over them as in the Bayesian methodology) we need to take care that the models are comparable in the number of parameters. This is determined entirely by the number of states S which must be equal in both models. In our case the BIC score decreased for S > 3 in the model for STF disks. Thus we set S = 3. We then used a set of 200 disks (which includes the 60 disks above) to fit the parameters (β i ) of the logistic regression model. Finally, we used an additional 700 disks to continue testing the models and for setting the the threshold T for the decision on an alarm. We increased T until we 4
5 Life Ref Time Reset Ref Time Perf Collection Interval Life Sectors Read Reset Sectors Read Perf Read Requests Life Hard Reads Reset Hard Reads Perf Write Requests Life Retry Reads Reset Retry Reads Perf Max Queue Depth Life ECC Reads Reset ECC Reads Perf Average Latency Life Sectors Written Reset Sectors Written Perf Maximum Latency Life Hard Writes Reset Hard Writes Perf Maximum Wait Time Life Retry Writes Reset Retry Writes Error Log Total Errors Life Used Reallocs Reset Used Reallocs SMART Reallocated Sectors Life Timeouts Reset Timeouts SMART Pending Reallocs Life Pred Fail Reset Pred Fail SMART ECC Errors predicted signal and all the other signals as regressors, using state of the art feature finding techniques such as L1 regularization with no success [1, 4]. This lack of correlation may be a result of the level at which the signals were collected, the architecture of the service, or just that more sophisticated analyses are required. Another hypothesis is there are many ways in which the disk may fail and we need a larger population of failed disks to find reliable correlations between the AML signal and the different (error) signals behind the different types of failures. Further research is needed but we are comforted by the fact that [5] also reports on the lack of correlation for prediction with SMART error signals. Figure 3: List of signals available to us from each disk. decreased the rate of false positives without impacting the detection rate. We performed all these operations only on the training data. The test data was not touched or inspected, so there is no risk of overfitting. We were satisfied with a false positive rate of 2.54% which is similar to the one obtained on the test data later. 5 Results As specified above, we used half of the population of 2380 disks for training and the other half for testing. The 1190 disks in the testing data were unseen by the algorithms until the evaluation. The evaluation proceeded as follows: DFAD went over the month of data for each disk, taking 24 hours of AML data at a time with a sliding window of 4 hours. Thus, every 4 hours, DFAD makes a decision on the disk of whether to classify it as STF or healthy. If during the month a disk is classified once as STF and it is changed within 4 hours to 18 days from the alarm, then it counts as a predicted failure. We found that 8 disks were predicted with over 10 days of lead time and the rest between 5 days and 8 hours. Otherwise it is counted as a false positive. The results were 88.2% detection (15 out of the 17 changed disks during the month) with 2.56% false positives (30 disks). It is reasuring that these numbers are almost identical to the ones in the training set. It is interesting to note that we couldn t find in the data anything particularly salient on the two disks that were not detected. Also, we couldn t find any correlation between the AML and any of the other signals that were collected from the disks (see Table 3). The workloads (both read and write) were not affected, nor were any other error related indicators. We used several techniques to search for those correlations. First we fitted a logistic regression standard regressions to AML as the 6 Conclusions We have described a detector capable of predicting disks that are soon to fail, and we have evaluated the detector on real data from disks used in a 24x7 customerfacing internet service. The results, 88.2% detection rate and 2.56 false positive rate, are encouraging and provide a significant reduction in the initial state of uncertainty about disks failure. The decision to deploy DFAD is a tradeoff between the cost of false positives and the loss of service quality for not performing preventive actions. Nevertheless the reported results and the advantages of the methodology provide evidence that this is a fruitful avenue of research to pursue. Acknowledgments Many thanks to F. McSherry, D. Woodard, N. Reddy, and Q. Ke for comments on previous versions of this paper. References [1] HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. The Elements of Statistical Learning. Springer, [2] HUGHES, G., MURRAY, J., KREUTZDELGADO, K., AND ELKAN, C. Hard drive failure prediction using nonparametric statistical methods. IEEE Transactions on Reliability (2002). [3] MURRAY, J., HUGHES, G., AND KREUTZDELGADO, K. Machine learning methods for predicting failures in hard drives. Journal of Machine Learning Research (2005). [4] PARK, M. Y., AND HASTIE, T. L1 regularization path algorithm for generalized linear models. Royal Statistical Society (2007). [5] PINHEIRO, E., WEBER, W.D., AND BARROSO, L. Failure trends in a large disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (2007). [6] RABINER, L. B., AND HUANG, H. L. An introduction to hidden Markov models. In IEEE ASSP Magazine [7] SCHROEDER, B., AND GIBSON, G. Disk failures in the real world. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (2007). [8] WASSERMAN, L. All of statistics. Springer,
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation:  Feature vector X,  qualitative response Y, taking values in C
More informationData Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a
More informationCell Phone based Activity Detection using Markov Logic Network
Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel sxs104721@utdallas.edu 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationChris Slaughter, DrPH. GI Research Conference June 19, 2008
Chris Slaughter, DrPH Assistant Professor, Department of Biostatistics Vanderbilt University School of Medicine GI Research Conference June 19, 2008 Outline 1 2 3 Factors that Impact Power 4 5 6 Conclusions
More information203.4770: Introduction to Machine Learning Dr. Rita Osadchy
203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:
More informationThroughput Capacity Planning and Application Saturation
Throughput Capacity Planning and Application Saturation Alfred J. Barchi ajb@ajbinc.net http://www.ajbinc.net/ Introduction Applications have a tendency to be used more heavily by users over time, as the
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 SigmaRestricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationThe Proportional Odds Model for Assessing Rater Agreement with Multiple Modalities
The Proportional Odds Model for Assessing Rater Agreement with Multiple Modalities Elizabeth GarrettMayer, PhD Assistant Professor Sidney Kimmel Comprehensive Cancer Center Johns Hopkins University 1
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationQuestion 2 Naïve Bayes (16 points)
Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the
More informationBroadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture  29.
Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture  29 Voice over IP So, today we will discuss about voice over IP and internet
More informationRegularized Logistic Regression for Mind Reading with Parallel Validation
Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, JukkaPekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland
More informationWebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
More informationServer Load Prediction
Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that
More informationRAID Storage Systems with Earlywarning and Data Migration
National Conference on Information Technology and Computer Science (CITCS 2012) RAID Storage Systems with Earlywarning and Data Migration Yin Yang 12 1 School of Computer. Huazhong University of yy16036551@smail.hust.edu.cn
More informationSureSense Software Suite Overview
SureSense Software Overview Eliminate Failures, Increase Reliability and Safety, Reduce Costs and Predict Remaining Useful Life for Critical Assets Using SureSense and Health Monitoring Software What SureSense
More information" Y. Notation and Equations for Regression Lecture 11/4. Notation:
Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through
More informationPredict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationSampling and Hypothesis Testing
Population and sample Sampling and Hypothesis Testing Allin Cottrell Population : an entire set of objects or units of observation of one sort or another. Sample : subset of a population. Parameter versus
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationData Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
More informationDetection of changes in variance using binary segmentation and optimal partitioning
Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the
More informationPrinciples of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
More informationHigh Availability with Postgres Plus Advanced Server. An EnterpriseDB White Paper
High Availability with Postgres Plus Advanced Server An EnterpriseDB White Paper For DBAs, Database Architects & IT Directors December 2013 Table of Contents Introduction 3 Active/Passive Clustering 4
More informationStatistics in Retail Finance. Chapter 6: Behavioural models
Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics: Behavioural
More informationENTERPRISE INFRASTRUCTURE CONFIGURATION GUIDE
ENTERPRISE INFRASTRUCTURE CONFIGURATION GUIDE MailEnable Pty. Ltd. 59 Murrumbeena Road, Murrumbeena. VIC 3163. Australia t: +61 3 9569 0772 f: +61 3 9568 4270 www.mailenable.com Document last modified:
More informationUsing Simulation Modeling to Predict Scalability of an Ecommerce Website
Using Simulation Modeling to Predict Scalability of an Ecommerce Website Rebeca Sandino Ronald Giachetti Department of Industrial and Systems Engineering Florida International University Miami, FL 33174
More informationSystem Aware Cyber Security
System Aware Cyber Security Application of Dynamic System Models and State Estimation Technology to the Cyber Security of Physical Systems Barry M. Horowitz, Kate Pierce University of Virginia April, 2012
More informationPrediction of DDoS Attack Scheme
Chapter 5 Prediction of DDoS Attack Scheme Distributed denial of service attack can be launched by malicious nodes participating in the attack, exploit the lack of entry point in a wireless network, and
More informationCourse: Model, Learning, and Inference: Lecture 5
Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.
More informationSTATISTICAL DATA ANALYSIS COURSE VIA THE MATLAB WEB SERVER
STATISTICAL DATA ANALYSIS COURSE VIA THE MATLAB WEB SERVER Ale š LINKA Dept. of Textile Materials, TU Liberec Hálkova 6, 461 17 Liberec, Czech Republic email: ales.linka@vslib.cz Petr VOLF Dept. of Applied
More informationHealth monitoring & predictive analytics To lower the TCO in a datacenter
Health monitoring & predictive analytics To lower the TCO in a datacenter PRESENTATION TITLE GOES HERE Christian B Madsen & Andrei Khurshudov Seagate Technology christian.b.madsen@seagate.com Outline 1.
More informationLecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions
SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:008:0
More informationTechnology StepbyStep Using StatCrunch
Technology StepbyStep Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate
More informationAzure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Daybyday Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
More informationUnivariate Regression
Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is
More informationMachine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? Web data (web logs, click histories) ecommerce applications (purchase histories) Retail purchase histories
More informationCCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York
BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal  the stuff biology is not
More informationAnalytical Test Method Validation Report Template
Analytical Test Method Validation Report Template 1. Purpose The purpose of this Validation Summary Report is to summarize the finding of the validation of test method Determination of, following Validation
More informationCloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
More informationCLASSIFICATION OF TRADING STRATEGIES IN ADAPTIVE MARKETS
M A C H I N E L E A R N I N G P R O J E C T F I N A L R E P O R T F A L L 2 7 C S 6 8 9 CLASSIFICATION OF TRADING STRATEGIES IN ADAPTIVE MARKETS MARK GRUMAN MANJUNATH NARAYANA Abstract In the CAT Tournament,
More informationPS 271B: Quantitative Methods II. Lecture Notes
PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.
More informationSome Essential Statistics The Lure of Statistics
Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived
More informationHardware Implementation of Probabilistic State Machine for Word Recognition
IJECT Vo l. 4, Is s u e Sp l  5, Ju l y  Se p t 2013 ISSN : 22307109 (Online) ISSN : 22309543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2
More informationVDI FIT and VDI UX: Composite Metrics Track Good, Fair, Poor Desktop Performance
VDI FIT and VDI UX: Composite Metrics Track Good, Fair, Poor Desktop Performance Key indicators and classification capabilities in Stratusphere FIT and Stratusphere UX Whitepaper INTRODUCTION This whitepaper
More informationProactive Drive Failure Prediction for Large Scale Storage Systems
Proactive Drive Failure Prediction for Large Scale Storage Systems Bingpeng Zhu, Gang Wang, Xiaoguang Liu 2, Dianming Hu 3, Sheng Lin, Jingwei Ma NankaiBaidu Joint Lab, College of Information Technical
More informationOverview Classes. 123 Logistic regression (5) 193 Building and applying logistic regression (6) 263 Generalizations of logistic regression (7)
Overview Classes 123 Logistic regression (5) 193 Building and applying logistic regression (6) 263 Generalizations of logistic regression (7) 24 Loglinear models (8) 54 1517 hrs; 5B02 Building and
More informationYiming Peng, Department of Statistics. February 12, 2013
Regression Analysis Using JMP Yiming Peng, Department of Statistics February 12, 2013 2 Presentation and Data http://www.lisa.stat.vt.edu Short Courses Regression Analysis Using JMP Download Data to Desktop
More informationData quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
More informationAuxiliary Variables in Mixture Modeling: 3Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
More informationQ & A From Hitachi Data Systems WebTech Presentation:
Q & A From Hitachi Data Systems WebTech Presentation: RAID Concepts 1. Is the chunk size the same for all Hitachi Data Systems storage systems, i.e., Adaptable Modular Systems, Network Storage Controller,
More informationMACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
More informationInsurance Analytics  analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics  analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
More informationStatistics in Retail Finance. Chapter 2: Statistical models of default
Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision
More informationStorage I/O Control: Proportional Allocation of Shared Storage Resources
Storage I/O Control: Proportional Allocation of Shared Storage Resources Chethan Kumar Sr. Member of Technical Staff, R&D VMware, Inc. Outline The Problem Storage IO Control (SIOC) overview Technical Details
More information10601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601f10/index.html
10601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601f10/index.html Course data All uptodate info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601f10/index.html
More informationWhite Paper. Making Sense of the DataOriented Tools Available to Facility Managers. Find What Matters. Version 1.1 Oct 2013
White Paper Making Sense of the DataOriented Tools Available to Facility Managers Version 1.1 Oct 2013 Find What Matters Making Sense the DataOriented Tools Available to Facility Managers INTRODUCTION
More informationAdditional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jintselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jintselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
More informationSegmentation and Classification of Online Chats
Segmentation and Classification of Online Chats Justin Weisz Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 jweisz@cs.cmu.edu Abstract One method for analyzing textual chat
More informationInterpretation of Somers D under four simple models
Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms
More informationKEITH LEHNERT AND ERIC FRIEDRICH
MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They
More informationImpact of Feature Selection on the Performance of Wireless Intrusion Detection Systems
2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Impact of Feature Selection on the Performance of ireless Intrusion Detection Systems
More informationNCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
More informationPractical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
More informationError Log Processing for Accurate Failure Prediction. HumboldtUniversität zu Berlin
Error Log Processing for Accurate Failure Prediction Felix Salfner ICSI Berkeley Steffen Tschirpke HumboldtUniversität zu Berlin Introduction Context of work: Errorbased online failure prediction: error
More informationCSCI599 DATA MINING AND STATISTICAL INFERENCE
CSCI599 DATA MINING AND STATISTICAL INFERENCE Course Information Course ID and title: CSCI599 Data Mining and Statistical Inference Semester and day/time/location: Spring 2013/ Mon/Wed 3:304:50pm Instructor:
More information8. Time Series and Prediction
8. Time Series and Prediction Definition: A time series is given by a sequence of the values of a variable observed at sequential points in time. e.g. daily maximum temperature, end of day share prices,
More informationGerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
More informationANALYTICS IN BIG DATA ERA
ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA MAURIZIO SALUSTI SAS Copyr i g ht 2012, SAS Ins titut
More informationData Mining  Evaluation of Classifiers
Data Mining  Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationLikelihood Approaches for Trial Designs in Early Phase Oncology
Likelihood Approaches for Trial Designs in Early Phase Oncology Clinical Trials Elizabeth GarrettMayer, PhD Cody Chiuzan, PhD Hollings Cancer Center Department of Public Health Sciences Medical University
More informationRegression 3: Logistic Regression
Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic regression Logistic regression in R Outline Logistic regression Introduction The model Looking at and comparing
More informationThe counterpart to a DAC is the ADC, which is generally a more complicated circuit. One of the most popular ADC circuit is the successive
The counterpart to a DAC is the ADC, which is generally a more complicated circuit. One of the most popular ADC circuit is the successive approximation converter. 1 2 The idea of sampling is fully covered
More informationA Hybrid Approach to Efficient Detection of Distributed DenialofService Attacks
Technical Report, June 2008 A Hybrid Approach to Efficient Detection of Distributed DenialofService Attacks Christos Papadopoulos Department of Computer Science Colorado State University 1873 Campus
More informationCharacterizing Task Usage Shapes in Google s Compute Clusters
Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key
More informationMultisensor Data Fusion and Applications
Multisensor Data Fusion and Applications Pramod K. Varshney Department of Electrical Engineering and Computer Science Syracuse University 121 Link Hall Syracuse, New York 13244 USA Email: varshney@syr.edu
More informationScalable Machine Learning to Exploit Big Data for Knowledge Discovery
Scalable Machine Learning to Exploit Big Data for Knowledge Discovery UnaMay O Reilly MIT MIT ILPEPOCH Taiwan Symposium Big Data: Technologies and Applications Lots of Data Everywhere Knowledge Mining
More informationPredictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
More informationDefault Thresholds. Performance Advisor. Adaikkappan Arumugam, Nagendra Krishnappa
Default Thresholds Performance Advisor Adaikkappan Arumugam, Nagendra Krishnappa Monitoring performance of storage subsystem and getting alerted at the right time before a complete performance breakdown
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationNew Features in PSP2 for SANsymphony V10 Softwaredefined Storage Platform and DataCore Virtual SAN
New Features in PSP2 for SANsymphony V10 Softwaredefined Storage Platform and DataCore Virtual SAN Updated: May 19, 2015 Contents Introduction... 1 Cloud Integration... 1 OpenStack Support... 1 Expanded
More informationSIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID
SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID Renewable Energy Laboratory Department of Mechanical and Industrial Engineering University of
More informationMachine Learning Capacity and Performance Analysis and R
Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10
More informationGamma Distribution Fitting
Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics
More informationCS/SWE 321 Sections 001 & 003. Software Project Management
CS/SWE 321 Sections 001 & 003 Software Project Management Copyright 2014 Hassan Gomaa All rights reserved. No part of this document may be reproduced in any form or by any means, without the prior written
More informatione = random error, assumed to be normally distributed with mean 0 and standard deviation σ
1 Linear Regression 1.1 Simple Linear Regression Model The linear regression model is applied if we want to model a numeric response variable and its dependency on at least one numeric factor variable.
More informationSENSITIVITY ANALYSIS AND INFERENCE. Lecture 12
This work is licensed under a Creative Commons AttributionNonCommercialShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
More informationTHIS PAGE WAS LEFT BLANK INTENTIONALLY
SAMPLE EXAMINATION The purpose of the following sample examination is to present an example of what is provided on exam day by ASQ, complete with the same instructions that are given on exam day. The test
More informationPart III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing
CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and
More informationAnother Look at Sensitivity of Bayesian Networks to Imprecise Probabilities
Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Oscar Kipersztok Mathematics and Computing Technology Phantom Works, The Boeing Company P.O.Box 3707, MC: 7L44 Seattle, WA 98124
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationSchedule Risk Analysis Simulator using Beta Distribution
Schedule Risk Analysis Simulator using Beta Distribution Isha Sharma Department of Computer Science and Applications, Kurukshetra University, Kurukshetra, Haryana (INDIA) ishasharma211@yahoo.com Dr. P.K.
More informationCHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES
CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,
More informationComparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
More informationHypothesis Testing  Relationships
 Relationships Session 3 AHX43 (28) 1 Lecture Outline Correlational Research. The Correlation Coefficient. An example. Considerations. One and Twotailed Tests. Errors. Power. for Relationships AHX43
More informationMultivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #47/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
More informationImputing Values to Missing Data
Imputing Values to Missing Data In federated data, between 30%70% of the data points will have at least one missing attribute  data wastage if we ignore all records with a missing value Remaining data
More information