Finding soon-to-fail disks in a haystack

Size: px
Start display at page:

Download "Finding soon-to-fail disks in a haystack"

Transcription

1 Finding soon-to-fail disks in a haystack Moises Goldszmidt Microsoft Research Abstract This paper presents a detector of soon-to-fail disks based on a combination of statistical models. During operation the detector takes as input a performance signal from each disk and sends and alarm when there is enough evidence (according to the models) that the disk is not healthy. The parameters of these models are automatically trained using signals from healthy and failed disks. In an evaluation on a population of 1190 production disks from a popular customer-facing internet service, the detector was able to predict 15 out of the 17 failed disks (88.2% detection) with 30 false alarms (2.56% false positive rate). 1 Introduction The ability to provide advance warning of hard disk failures is extremely useful in internet and cloud services. For cloud services this ability enables completely uninterrupted service as preventive reallocation may be executed prior to imminent disk failures. For internet services such as , content (news, search, movies), and ecommerce, advance warning may trigger immediate replication or replacement which will increase availability and reliability statistics, and may even reduce replications and prevent data loss. In addition to preventive actions, one can imagine more invasive ones involving perhaps intrusive diagnostics and repairs. IT departments in regular businesses and individual users would obtain similar benefits from having a detector producing these alarms. In this paper we describe, characterize, and evaluate a software based detector of soon-to-fail disks. This detector which we call D-FAD (for Disk Failure Advance Detector), takes as input a performance signal from each disk and produces an alarm according to a combination of statistical models. The parameters in these models are automatically trained from a population of healthy and failed disks, using machine learning techniques. Predicting the future is notoriously difficult. There is a trade-off between the ability to accurately predict failure (indicated by the detection rate), and the number of false alarms (indicated by false positive rate). This trade-off is reflected in the tension between the cost of ignoring an alarm for a failing disk and the cost of unnecessary action in reaction to a false one. Ignoring an alarm for an impending failure will result in a degradation of the quality, availability, and reliability of the service. Acting on a false alarm will result on unnecessary costs related to, for example, migrating users in a cloud service. The decision about how to act on this trade-off is entirely dependent on economic and business considerations of the service. Thus, we report on the detection and false positive rates of D-FAD providing the necessary information to act on this trade-off. When applied to a population of 1190 production disks from a popular 24x7 customer-facing internet service D- FAD predicts 15 out of 17 failed disks (88.2% detection rate), with 30 false alarms (2.56% false positive rate). In the case of the preventive action being moving users in a cloud service, this translates to 30 additional moves replicas for guaranteeing uninterrupted service in spite of 15 disks failures. The parameters of the statistical models for D-FAD were fitted using 200 disks, containing a mixture of 17 failed disks and 183 healthy disks. In addition, we used 700 disks to estimate the false positive rate. The disks used for training and those used for testing (evaluating the performance of D-FAD) came from the same population of disks. We divided this total population in two sets, maintaining the same proportion of failed and healthy disks in each set. Signals from the disks in the testing set were only used during the evaluation of D-FAD. Also, for the purposes of this paper, as long as D-FAD sends an alarm between 18 days and 4 hours of the disk failing we will regard the impending failure as detected. By training D-FAD on the specific workload that disks experience 1

2 in production, the models are not subject to the inherent uncertainty in the failure rates specified by the manufacturer which can present a lot of variability (see [5]), and which were estimated from proprietary stress workloads. Another way to look at the benefits of D-FAD is that it reduces the uncertainty of identifying failing disks in the whole population of disks (using the manufacturer failure rate), to a smaller population composed of the of detected failures (15 in our case) plus the false alarms (30 in our case). In summary, this paper proposes and evaluates D-FAD a software based detector of soon-to-fail disks, with the following benefits: STF signal Healthy signals The input to the detector is a single performance signal, average maximum latency; consequently it is efficient at scale. It is based on models fitted directly from production workloads; thus, the detector will customize to the particular usage and stress in production. The results on real data from 1190 production disks of a 24x7 customer-facing internet service are a 88.2% detection rate and a 2.56% false positive rate. These benefits are encouraging for further research into building robust detectors based on performance signals from the disks. The rest of the paper is organized as follows: Section 2 briefly compares this work against the most relevant published literature. Section 3 introduces a set of preliminary definitions. Section 4 describe the models and the methods for training. Section 5 discusses the results of the evaluation and we conclude in Section 6. 2 Related Work This section is by no means comprehensive. It touches on the published work that is closer to ours and that help to put our work in the right perspective. The seminal paper by Pinheiro, Weber, and Barroso [5] contains an in depth study on a much larger scale than ours. We use the same definition of failure and like them we study disks in operation under stress from real workloads. Our results also confirm their finding that...we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy... Similarly as further reported in that paper we found that a small percentage of failed disks show no SMART error signals (at least amongst the ones we had access to). The important difference between the current paper and [5] is that we focus on prediction based on performance (and not on error signals) with positive results. In [2] and [3] the authors look at machine learning Figure 1: Examples of the AML time series for healthy and STF disks. Each graph contains 8 time series from neighboring disks. Time series for healthy disks are at the bottom of the Y axis. The difference with the values for STF disks is sometimes over 7 standard deviations. The top graphs are taken 8 hours before failure. The bottom two are taken 4 days before failure. The X axis is in intervals of 20 minutes. Units in the Y axis are obscured by request from the operations team at the site. methods for predicting disk failures using also SMART signals from the disk. Again, this contrasts with us using performance signals. Even though the population of disks in our study is of similar size, our study differs in two important dimensions: (1) the disks in our study are stressed by a real world workload under production conditions and (2) our statistical models are parametric (theirs are non-parametric). This is important as in our models the number of parameters do not increase with the data. Finally, the study in [7] provides understanding on the statistical properties of disk failures but is not focused on detection or prediction. 3 Problem Statement The problem we are solving is the detection of soon-tofail (STF) disks. We define disk failure as in in [5]: a drive is considered to have failed if it was replaced as part of a repairs procedure. There are several considerations that go into this definition such as the actual time of failure may not be as accurate as it depends on when the disk was actually exchanged etc. Still, we agree with the authors of [5] that this is the ultimate ground truth for evaluating any predictor/detector. Our proposed solution is a software based detector, D-FAD, that uses statistical models to find abnormal behavior. The input to D-FAD is a single signal from every disk, a time series containing in each sample the Average Maximum Latency (AML) over each sample period, and the output is an alarm if the models in D-FAD decide that the signal presents abnormal behavior. 2

3 The data collection was done via a software interface provided by the disk vendor. In this interface, all signals are exposed every 20 minutes. The AML is computed as follows: the maximum latency of each disk is recorded every minute, then they are averaged over the 20 minutes of the reporting period, and finally exposed by the interface. These are non-overlapping averages. Examples of AML for both healthy and STF disks are depicted in Figure 1. A list of all the signals exposed by the interface is presented in Table 3. We couldn t find any predictor of STF disks other than the AML (see Section 5). Besides being restricted by the parameters set by the vendor, we were constrained by several other factors outside of our control. We only had data for a month for about 2380 disks. We reserved 1190 disks for training and 1190 for testing the models. All the parameters of the models described in Section 4, model selection, and initial estimation of the various rates, were fitted and computed using the training data. The testing data was only seen once during the evaluation of D-FAD. The detection and false positive rates reported come from the evaluation (although we remark that they coincide with the ones computed during training). We actual detected disk failures (ground truth) through changes in the serial number of the disk in the logs. The particular disks we study come from a cluster of 144 servers. Each server has 34 drives in an array and drives are 750GB SATA drives. We had data for 70 of those servers. The data was collected during December of 2007 from production disks of a popular 24x7 customer-facing internet service. With respect to the lead time to failure, that is how soon will the disk fail, we consider that an imminent failure is predicted if the alarm is sent between 18 days and 4 hours before the disk is changed. Note that the definition of false positives is very stringent. If the detector emits an alarm and the disk has not failed within 18 days, (or by the end of the month of data available) then this alarm will be considered as a false positive (even if it is changed within the period established above). 4 The D-FAD Models The schematics for D-FAD are shown in Figure 2. The raw AML goes through two filters in parallel. The filter on the left compares the (log) odds of the probability that the data is produced by a STF disk vs. a healthy disk. It is based on Hidden Markov Models (HMMs) [6] and is described in Subsection 4.1. The other filter counts the number of peaks in AML in a 24 hour period (see Subsection 4.2). The output from these two units is taken by a final unit that uses a logistic regression model [8] to decide whether the disk will fail. These are well understood models, and techniques for fitting the parameters from data and performing infer- Model Comparison HMM models Fusion Alarm AML signal Peak Counter Logistic Regression Threshold Figure 2: Schematic diagram for D-FAD. The AML signal goes through a model comparison filter (using HMMs) and a peak counter. The output of these two filters is fused by a logistic regression model. ence on new data are well known. The specific combination and application to this particular problem is new. 4.1 The Hidden Markov Model Unit From inspection of the difference between the AML signals of STF and healthy disks (see Figure 1), it is clear that at least part of the detector must consist of comparing the statistics of healthy and STF disks. We initially tried comparing running averages and using change point detection algorithm with very little success. Both detection and the false positive rate were abysmal. Our next step was to try Hidden Markov Models (HMMs). These models capture several modes of operation for the disks, and also take into consideration the time dependence inherent in the signal. HMMs are statistical state space models [6] used in applications ranging from speech processing to time series prediction. These models assume that the signal is sampled at regular intervals (our case) and therefore time is discrete. An HMM consists of two sets of variables. The first set is the hidden part of the model representing a disk state. In our case the different states emit either low, medium, or high values of the AML signal. We will use S i to denote these variables where the index i represents a particular instance in time in the time series. The second set of variables, denoted by Y i, is the observable part which is continuous and is used to model the actual values of the AML signal. We will use upper case letters for the variables, and lowercase letters for their values. Given a sequence of values y 1,y 2,...,y n the model assumes that the disk is in one state at each time s 1,s 2,...,s n and the value y observed at time i is a (probabilistic) function of the state s of the disk. In order to have a complete specification of the model, we need the 3

4 probability functions relating the hidden states to the observables, and the transition between the states. Thus we have: (1) P(S i+1 S i ), the probability that the disk is in a particular state S i+1 = s given that it was in state S i = s in the previous time instance; (2) P(Y i S i ) a set of conditional distributions modeling the probability of the values observed given the state of the disk; and (3) a probability for the initial state namely P(S 1 ). The parameterization of these probability functions in this paper is as follows: P(Y i S i ) is considered to be a Gaussian distribution with mean µ s and variance σ s. P(S i S i 1 ) is considered to be multinomial (as S is a discrete random variable). The final parameter is an integer denoting the number of states S. These parameters are automatically fitted from samples of the AML signal. In this paper we used maximum likelihood estimators using well known algorithms described in [6] (rather than Bayesian techniques [8]). The problem of fitting the number of states S is what is known a model selection problem. For this paper we relied on the Bayesian Information Criteria (BIC) [8] which scores the a model based on the goodness of fit to the training data, and the number of parameters (complexity) of the model. Given that we expect to have no more than a handful of states, the problem of model selection proceeds as a linear search: we start with S = 2 and incrementing by 1 until BIC decreases. In our case S = 3. Once we fit these parameters the model is completely specified and we can compute the probability of any sequence of observations y 1,y 2,...,y n, using P(y 1,...,y n ) = S1,...,S n P(y 1 S 1 )...P(y n S n )P(S n S n 1 )...P(S 1 ). We are now ready to explain the inner workings of the model comparison box in Figure 2. We first train (fit parameters offline) an HMM model using data coming from healthy disks. Let s denote that probability model by P H. We do the same using data from an STF disk and obtain an HMM model for STF disks. Let P ST F denote that model. Given data D coming from a disk in operation, we compute P H (D), P ST F (D) (using the equation above), and then the ratio log(p ST F (D)/P H (D)) (the log-odds between the models). This ratio, which we denote by LO is the output of the model comparison box. If this ratio is bigger than zero then it is more likely that D comes from a STF disk. This is one of the two pieces of evidence that the logistic regression unit considers in making a final decision regarding an alarm. 4.2 The Peak Counter Unit In our initial experiments with the training data, the output LO (i.e. the comparison between HMM s) was not enough to decrease the false positive rate while maintaining the detection rate. Thus we introduced an additional filter to provide another piece of evidence. We found that AML signals from ST F disks have peaks that are over 7 standard deviations away from AML signals coming from healthy disks (see Figure 1). Running simple statistics we found that a threshold of 2 seconds was over four standard deviations from the signal of healthy disks. This unit returns the number of times that the AML signal peaks over 2 seconds in a period of 24 hours. We will use PC to denote this count. 4.3 The Logistic Regression Unit Now we have two signals, the LO and the PC, and we need to make a decision on whether the disk is STF or healthy. We use a logistic regression model for this task. A logistic regression model is the simplest model with the least assumptions about the probabilistic models of these two signals which still helps in making a decision in a principled way. Let P(A) = P(ST F LO,PC) denote the probability that the disk is an STF disk. Then the logistic regression computes log( P(A) 1 P(A) ) = β 0 +β 1 LO+ β 2 PC. Fitting the parameters, β 0,β 1 and β 2 from data is a standard statistical problem [1]. The final output of D-FAD comes from this equation: the more positive log( P(A) 1 P(A) ) is, the more evidence according to the models for sending an alarm. The threshold T for determining when to send an alarm will be fitted by balancing the trade-off between detection and false positives. 4.4 Parameter Fitting To fit the parameters of the HMM s, we used 24 hours of data from 60 disks. We selected 12 of (17) failed disks, and 48 healthy disks from the same arrays as the failed disks. This selection from the same array guarantees that we control for common factors (i.e. workload) that may affect the whole array. The period of 24 hours was selected from the point of disk change for the failed disks to avoid daily cyclic confounder factors in the data (the disks do not fail at the same time). Thus P ST F was fitted from data coming from 24 hours prior to the failure of 12 disks and P H was fitted from data coming from 60 healthy disks (in the same period). Given that we will be substituting the parameters of maximum likelihood (and not integrating over them as in the Bayesian methodology) we need to take care that the models are comparable in the number of parameters. This is determined entirely by the number of states S which must be equal in both models. In our case the BIC score decreased for S > 3 in the model for STF disks. Thus we set S = 3. We then used a set of 200 disks (which includes the 60 disks above) to fit the parameters (β i ) of the logistic regression model. Finally, we used an additional 700 disks to continue testing the models and for setting the the threshold T for the decision on an alarm. We increased T until we 4

5 Life Ref Time Reset Ref Time Perf Collection Interval Life Sectors Read Reset Sectors Read Perf Read Requests Life Hard Reads Reset Hard Reads Perf Write Requests Life Retry Reads Reset Retry Reads Perf Max Queue Depth Life ECC Reads Reset ECC Reads Perf Average Latency Life Sectors Written Reset Sectors Written Perf Maximum Latency Life Hard Writes Reset Hard Writes Perf Maximum Wait Time Life Retry Writes Reset Retry Writes Error Log Total Errors Life Used Reallocs Reset Used Reallocs SMART Reallocated Sectors Life Timeouts Reset Timeouts SMART Pending Reallocs Life Pred Fail Reset Pred Fail SMART ECC Errors predicted signal and all the other signals as regressors, using state of the art feature finding techniques such as L1 regularization with no success [1, 4]. This lack of correlation may be a result of the level at which the signals were collected, the architecture of the service, or just that more sophisticated analyses are required. Another hypothesis is there are many ways in which the disk may fail and we need a larger population of failed disks to find reliable correlations between the AML signal and the different (error) signals behind the different types of failures. Further research is needed but we are comforted by the fact that [5] also reports on the lack of correlation for prediction with SMART error signals. Figure 3: List of signals available to us from each disk. decreased the rate of false positives without impacting the detection rate. We performed all these operations only on the training data. The test data was not touched or inspected, so there is no risk of overfitting. We were satisfied with a false positive rate of 2.54% which is similar to the one obtained on the test data later. 5 Results As specified above, we used half of the population of 2380 disks for training and the other half for testing. The 1190 disks in the testing data were unseen by the algorithms until the evaluation. The evaluation proceeded as follows: D-FAD went over the month of data for each disk, taking 24 hours of AML data at a time with a sliding window of 4 hours. Thus, every 4 hours, D-FAD makes a decision on the disk of whether to classify it as STF or healthy. If during the month a disk is classified once as STF and it is changed within 4 hours to 18 days from the alarm, then it counts as a predicted failure. We found that 8 disks were predicted with over 10 days of lead time and the rest between 5 days and 8 hours. Otherwise it is counted as a false positive. The results were 88.2% detection (15 out of the 17 changed disks during the month) with 2.56% false positives (30 disks). It is reasuring that these numbers are almost identical to the ones in the training set. It is interesting to note that we couldn t find in the data anything particularly salient on the two disks that were not detected. Also, we couldn t find any correlation between the AML and any of the other signals that were collected from the disks (see Table 3). The workloads (both read and write) were not affected, nor were any other error related indicators. We used several techniques to search for those correlations. First we fitted a logistic regression standard regressions to AML as the 6 Conclusions We have described a detector capable of predicting disks that are soon to fail, and we have evaluated the detector on real data from disks used in a 24x7 customerfacing internet service. The results, 88.2% detection rate and 2.56 false positive rate, are encouraging and provide a significant reduction in the initial state of uncertainty about disks failure. The decision to deploy D-FAD is a trade-off between the cost of false positives and the loss of service quality for not performing preventive actions. Nevertheless the reported results and the advantages of the methodology provide evidence that this is a fruitful avenue of research to pursue. Acknowledgments Many thanks to F. McSherry, D. Woodard, N. Reddy, and Q. Ke for comments on previous versions of this paper. References [1] HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. The Elements of Statistical Learning. Springer, [2] HUGHES, G., MURRAY, J., KREUTZ-DELGADO, K., AND ELKAN, C. Hard drive failure prediction using non-parametric statistical methods. IEEE Transactions on Reliability (2002). [3] MURRAY, J., HUGHES, G., AND KREUTZ-DELGADO, K. Machine learning methods for predicting failures in hard drives. Journal of Machine Learning Research (2005). [4] PARK, M. Y., AND HASTIE, T. L1 regularization path algorithm for generalized linear models. Royal Statistical Society (2007). [5] PINHEIRO, E., WEBER, W.-D., AND BARROSO, L. Failure trends in a large disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (2007). [6] RABINER, L. B., AND HUANG, H. L. An introduction to hidden Markov models. In IEEE ASSP Magazine [7] SCHROEDER, B., AND GIBSON, G. Disk failures in the real world. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (2007). [8] WASSERMAN, L. All of statistics. Springer,

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

More information

Cell Phone based Activity Detection using Markov Logic Network

Cell Phone based Activity Detection using Markov Logic Network Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel sxs104721@utdallas.edu 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Chris Slaughter, DrPH. GI Research Conference June 19, 2008

Chris Slaughter, DrPH. GI Research Conference June 19, 2008 Chris Slaughter, DrPH Assistant Professor, Department of Biostatistics Vanderbilt University School of Medicine GI Research Conference June 19, 2008 Outline 1 2 3 Factors that Impact Power 4 5 6 Conclusions

More information

203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

More information

Throughput Capacity Planning and Application Saturation

Throughput Capacity Planning and Application Saturation Throughput Capacity Planning and Application Saturation Alfred J. Barchi ajb@ajbinc.net http://www.ajbinc.net/ Introduction Applications have a tendency to be used more heavily by users over time, as the

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

The Proportional Odds Model for Assessing Rater Agreement with Multiple Modalities

The Proportional Odds Model for Assessing Rater Agreement with Multiple Modalities The Proportional Odds Model for Assessing Rater Agreement with Multiple Modalities Elizabeth Garrett-Mayer, PhD Assistant Professor Sidney Kimmel Comprehensive Cancer Center Johns Hopkins University 1

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

More information

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29. Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet

More information

Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

More information

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

More information

Server Load Prediction

Server Load Prediction Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

More information

RAID Storage Systems with Early-warning and Data Migration

RAID Storage Systems with Early-warning and Data Migration National Conference on Information Technology and Computer Science (CITCS 2012) RAID Storage Systems with Early-warning and Data Migration Yin Yang 12 1 School of Computer. Huazhong University of yy16036551@smail.hust.edu.cn

More information

SureSense Software Suite Overview

SureSense Software Suite Overview SureSense Software Overview Eliminate Failures, Increase Reliability and Safety, Reduce Costs and Predict Remaining Useful Life for Critical Assets Using SureSense and Health Monitoring Software What SureSense

More information

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

 Y. Notation and Equations for Regression Lecture 11/4. Notation: Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Sampling and Hypothesis Testing

Sampling and Hypothesis Testing Population and sample Sampling and Hypothesis Testing Allin Cottrell Population : an entire set of objects or units of observation of one sort or another. Sample : subset of a population. Parameter versus

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:

More information

Detection of changes in variance using binary segmentation and optimal partitioning

Detection of changes in variance using binary segmentation and optimal partitioning Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

High Availability with Postgres Plus Advanced Server. An EnterpriseDB White Paper

High Availability with Postgres Plus Advanced Server. An EnterpriseDB White Paper High Availability with Postgres Plus Advanced Server An EnterpriseDB White Paper For DBAs, Database Architects & IT Directors December 2013 Table of Contents Introduction 3 Active/Passive Clustering 4

More information

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance. Chapter 6: Behavioural models Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

More information

ENTERPRISE INFRASTRUCTURE CONFIGURATION GUIDE

ENTERPRISE INFRASTRUCTURE CONFIGURATION GUIDE ENTERPRISE INFRASTRUCTURE CONFIGURATION GUIDE MailEnable Pty. Ltd. 59 Murrumbeena Road, Murrumbeena. VIC 3163. Australia t: +61 3 9569 0772 f: +61 3 9568 4270 www.mailenable.com Document last modified:

More information

Using Simulation Modeling to Predict Scalability of an E-commerce Website

Using Simulation Modeling to Predict Scalability of an E-commerce Website Using Simulation Modeling to Predict Scalability of an E-commerce Website Rebeca Sandino Ronald Giachetti Department of Industrial and Systems Engineering Florida International University Miami, FL 33174

More information

System Aware Cyber Security

System Aware Cyber Security System Aware Cyber Security Application of Dynamic System Models and State Estimation Technology to the Cyber Security of Physical Systems Barry M. Horowitz, Kate Pierce University of Virginia April, 2012

More information

Prediction of DDoS Attack Scheme

Prediction of DDoS Attack Scheme Chapter 5 Prediction of DDoS Attack Scheme Distributed denial of service attack can be launched by malicious nodes participating in the attack, exploit the lack of entry point in a wireless network, and

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

STATISTICAL DATA ANALYSIS COURSE VIA THE MATLAB WEB SERVER

STATISTICAL DATA ANALYSIS COURSE VIA THE MATLAB WEB SERVER STATISTICAL DATA ANALYSIS COURSE VIA THE MATLAB WEB SERVER Ale š LINKA Dept. of Textile Materials, TU Liberec Hálkova 6, 461 17 Liberec, Czech Republic e-mail: ales.linka@vslib.cz Petr VOLF Dept. of Applied

More information

Health monitoring & predictive analytics To lower the TCO in a datacenter

Health monitoring & predictive analytics To lower the TCO in a datacenter Health monitoring & predictive analytics To lower the TCO in a datacenter PRESENTATION TITLE GOES HERE Christian B Madsen & Andrei Khurshudov Seagate Technology christian.b.madsen@seagate.com Outline 1.

More information

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:00-8:0

More information

Technology Step-by-Step Using StatCrunch

Technology Step-by-Step Using StatCrunch Technology Step-by-Step Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Univariate Regression

Univariate Regression Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

Analytical Test Method Validation Report Template

Analytical Test Method Validation Report Template Analytical Test Method Validation Report Template 1. Purpose The purpose of this Validation Summary Report is to summarize the finding of the validation of test method Determination of, following Validation

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

CLASSIFICATION OF TRADING STRATEGIES IN ADAPTIVE MARKETS

CLASSIFICATION OF TRADING STRATEGIES IN ADAPTIVE MARKETS M A C H I N E L E A R N I N G P R O J E C T F I N A L R E P O R T F A L L 2 7 C S 6 8 9 CLASSIFICATION OF TRADING STRATEGIES IN ADAPTIVE MARKETS MARK GRUMAN MANJUNATH NARAYANA Abstract In the CAT Tournament,

More information

PS 271B: Quantitative Methods II. Lecture Notes

PS 271B: Quantitative Methods II. Lecture Notes PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.

More information

Some Essential Statistics The Lure of Statistics

Some Essential Statistics The Lure of Statistics Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived

More information

Hardware Implementation of Probabilistic State Machine for Word Recognition

Hardware Implementation of Probabilistic State Machine for Word Recognition IJECT Vo l. 4, Is s u e Sp l - 5, Ju l y - Se p t 2013 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2

More information

VDI FIT and VDI UX: Composite Metrics Track Good, Fair, Poor Desktop Performance

VDI FIT and VDI UX: Composite Metrics Track Good, Fair, Poor Desktop Performance VDI FIT and VDI UX: Composite Metrics Track Good, Fair, Poor Desktop Performance Key indicators and classification capabilities in Stratusphere FIT and Stratusphere UX Whitepaper INTRODUCTION This whitepaper

More information

Proactive Drive Failure Prediction for Large Scale Storage Systems

Proactive Drive Failure Prediction for Large Scale Storage Systems Proactive Drive Failure Prediction for Large Scale Storage Systems Bingpeng Zhu, Gang Wang, Xiaoguang Liu 2, Dianming Hu 3, Sheng Lin, Jingwei Ma Nankai-Baidu Joint Lab, College of Information Technical

More information

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

More information

Yiming Peng, Department of Statistics. February 12, 2013

Yiming Peng, Department of Statistics. February 12, 2013 Regression Analysis Using JMP Yiming Peng, Department of Statistics February 12, 2013 2 Presentation and Data http://www.lisa.stat.vt.edu Short Courses Regression Analysis Using JMP Download Data to Desktop

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Q & A From Hitachi Data Systems WebTech Presentation:

Q & A From Hitachi Data Systems WebTech Presentation: Q & A From Hitachi Data Systems WebTech Presentation: RAID Concepts 1. Is the chunk size the same for all Hitachi Data Systems storage systems, i.e., Adaptable Modular Systems, Network Storage Controller,

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4. Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

More information

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance. Chapter 2: Statistical models of default Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

More information

Storage I/O Control: Proportional Allocation of Shared Storage Resources

Storage I/O Control: Proportional Allocation of Shared Storage Resources Storage I/O Control: Proportional Allocation of Shared Storage Resources Chethan Kumar Sr. Member of Technical Staff, R&D VMware, Inc. Outline The Problem Storage IO Control (SIOC) overview Technical Details

More information

10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html 10-601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html Course data All up-to-date info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

More information

White Paper. Making Sense of the Data-Oriented Tools Available to Facility Managers. Find What Matters. Version 1.1 Oct 2013

White Paper. Making Sense of the Data-Oriented Tools Available to Facility Managers. Find What Matters. Version 1.1 Oct 2013 White Paper Making Sense of the Data-Oriented Tools Available to Facility Managers Version 1.1 Oct 2013 Find What Matters Making Sense the Data-Oriented Tools Available to Facility Managers INTRODUCTION

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

Segmentation and Classification of Online Chats

Segmentation and Classification of Online Chats Segmentation and Classification of Online Chats Justin Weisz Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 jweisz@cs.cmu.edu Abstract One method for analyzing textual chat

More information

Interpretation of Somers D under four simple models

Interpretation of Somers D under four simple models Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms

More information

KEITH LEHNERT AND ERIC FRIEDRICH

KEITH LEHNERT AND ERIC FRIEDRICH MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They

More information

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Impact of Feature Selection on the Performance of ireless Intrusion Detection Systems

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin

Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin Error Log Processing for Accurate Failure Prediction Felix Salfner ICSI Berkeley Steffen Tschirpke Humboldt-Universität zu Berlin Introduction Context of work: Error-based online failure prediction: error

More information

CSCI-599 DATA MINING AND STATISTICAL INFERENCE

CSCI-599 DATA MINING AND STATISTICAL INFERENCE CSCI-599 DATA MINING AND STATISTICAL INFERENCE Course Information Course ID and title: CSCI-599 Data Mining and Statistical Inference Semester and day/time/location: Spring 2013/ Mon/Wed 3:30-4:50pm Instructor:

More information

8. Time Series and Prediction

8. Time Series and Prediction 8. Time Series and Prediction Definition: A time series is given by a sequence of the values of a variable observed at sequential points in time. e.g. daily maximum temperature, end of day share prices,

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

ANALYTICS IN BIG DATA ERA

ANALYTICS IN BIG DATA ERA ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA MAURIZIO SALUSTI SAS Copyr i g ht 2012, SAS Ins titut

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Likelihood Approaches for Trial Designs in Early Phase Oncology

Likelihood Approaches for Trial Designs in Early Phase Oncology Likelihood Approaches for Trial Designs in Early Phase Oncology Clinical Trials Elizabeth Garrett-Mayer, PhD Cody Chiuzan, PhD Hollings Cancer Center Department of Public Health Sciences Medical University

More information

Regression 3: Logistic Regression

Regression 3: Logistic Regression Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic regression Logistic regression in R Outline Logistic regression Introduction The model Looking at and comparing

More information

The counterpart to a DAC is the ADC, which is generally a more complicated circuit. One of the most popular ADC circuit is the successive

The counterpart to a DAC is the ADC, which is generally a more complicated circuit. One of the most popular ADC circuit is the successive The counterpart to a DAC is the ADC, which is generally a more complicated circuit. One of the most popular ADC circuit is the successive approximation converter. 1 2 The idea of sampling is fully covered

More information

A Hybrid Approach to Efficient Detection of Distributed Denial-of-Service Attacks

A Hybrid Approach to Efficient Detection of Distributed Denial-of-Service Attacks Technical Report, June 2008 A Hybrid Approach to Efficient Detection of Distributed Denial-of-Service Attacks Christos Papadopoulos Department of Computer Science Colorado State University 1873 Campus

More information

Characterizing Task Usage Shapes in Google s Compute Clusters

Characterizing Task Usage Shapes in Google s Compute Clusters Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key

More information

Multisensor Data Fusion and Applications

Multisensor Data Fusion and Applications Multisensor Data Fusion and Applications Pramod K. Varshney Department of Electrical Engineering and Computer Science Syracuse University 121 Link Hall Syracuse, New York 13244 USA E-mail: varshney@syr.edu

More information

Scalable Machine Learning to Exploit Big Data for Knowledge Discovery

Scalable Machine Learning to Exploit Big Data for Knowledge Discovery Scalable Machine Learning to Exploit Big Data for Knowledge Discovery Una-May O Reilly MIT MIT ILP-EPOCH Taiwan Symposium Big Data: Technologies and Applications Lots of Data Everywhere Knowledge Mining

More information

Predictive Modeling Techniques in Insurance

Predictive Modeling Techniques in Insurance Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics

More information

Default Thresholds. Performance Advisor. Adaikkappan Arumugam, Nagendra Krishnappa

Default Thresholds. Performance Advisor. Adaikkappan Arumugam, Nagendra Krishnappa Default Thresholds Performance Advisor Adaikkappan Arumugam, Nagendra Krishnappa Monitoring performance of storage subsystem and getting alerted at the right time before a complete performance breakdown

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

New Features in PSP2 for SANsymphony -V10 Software-defined Storage Platform and DataCore Virtual SAN

New Features in PSP2 for SANsymphony -V10 Software-defined Storage Platform and DataCore Virtual SAN New Features in PSP2 for SANsymphony -V10 Software-defined Storage Platform and DataCore Virtual SAN Updated: May 19, 2015 Contents Introduction... 1 Cloud Integration... 1 OpenStack Support... 1 Expanded

More information

SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID

SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID Renewable Energy Laboratory Department of Mechanical and Industrial Engineering University of

More information

Machine Learning Capacity and Performance Analysis and R

Machine Learning Capacity and Performance Analysis and R Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

CS/SWE 321 Sections -001 & -003. Software Project Management

CS/SWE 321 Sections -001 & -003. Software Project Management CS/SWE 321 Sections -001 & -003 Software Project Management Copyright 2014 Hassan Gomaa All rights reserved. No part of this document may be reproduced in any form or by any means, without the prior written

More information

e = random error, assumed to be normally distributed with mean 0 and standard deviation σ

e = random error, assumed to be normally distributed with mean 0 and standard deviation σ 1 Linear Regression 1.1 Simple Linear Regression Model The linear regression model is applied if we want to model a numeric response variable and its dependency on at least one numeric factor variable.

More information

SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12

SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

THIS PAGE WAS LEFT BLANK INTENTIONALLY

THIS PAGE WAS LEFT BLANK INTENTIONALLY SAMPLE EXAMINATION The purpose of the following sample examination is to present an example of what is provided on exam day by ASQ, complete with the same instructions that are given on exam day. The test

More information

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and

More information

Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities

Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Oscar Kipersztok Mathematics and Computing Technology Phantom Works, The Boeing Company P.O.Box 3707, MC: 7L-44 Seattle, WA 98124

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Schedule Risk Analysis Simulator using Beta Distribution

Schedule Risk Analysis Simulator using Beta Distribution Schedule Risk Analysis Simulator using Beta Distribution Isha Sharma Department of Computer Science and Applications, Kurukshetra University, Kurukshetra, Haryana (INDIA) ishasharma211@yahoo.com Dr. P.K.

More information

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

Hypothesis Testing - Relationships

Hypothesis Testing - Relationships - Relationships Session 3 AHX43 (28) 1 Lecture Outline Correlational Research. The Correlation Coefficient. An example. Considerations. One and Two-tailed Tests. Errors. Power. for Relationships AHX43

More information

Multivariate Normal Distribution

Multivariate Normal Distribution Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

More information

Imputing Values to Missing Data

Imputing Values to Missing Data Imputing Values to Missing Data In federated data, between 30%-70% of the data points will have at least one missing attribute - data wastage if we ignore all records with a missing value Remaining data

More information