Evaluation of Neural Network Performance by Receiver Operating Characteristic Analysis: Examples from the Biotechnology Domain

Michael L. Meistrell, M.D. Advisor: Kent A. Spackman, M.D., Ph.D. Program in Medical Information Science, Dartmouth Medical School, Hanover, NH 03756. Request reprints from Dr. Meistrell: New England Medical Center, Clinical Decision Making, 75 Washington Street, Boston, MA 02111.

Abstract. A need exists for an unbiased measure of the accuracy of feed-forward neural networks used for classification. Receiver Operating Characteristic (ROC) analysis is suited for this measure, and was used to assess the performance of several different sets of network weights. The area under an ROC curve and its standard error were used to compare different network weight sets, and to follow the performance of a network during the course of training. The ROC is sensitive neither to the prior probabilities of examples in the testing set nor to decision bias. The area under an ROC curve is a readily understood measure, and should be used to evaluate neural networks and to report the results of learning experiments. Examples are provided from experiments with data from the biotechnology domain.

KEY WORDS: NEURAL NETWORKS; ROC ANALYSIS; PROTEIN STRUCTURE PREDICTION.

Introduction & background. Machine learning is concerned with computational theories of learning and with building learning systems, defined as computer systems which can improve their performance with experience. This paper focuses on the accuracy of one type of machine learning, namely neural networks (also known as connectionist models), in predicting the three-dimensional structure of globular proteins. More specifically, the focus is on the analysis of feed-forward neural networks trained by error back-propagation to predict the secondary structure of a protein from its primary amino acid sequence.
The topics of study herein are motivated by the long-range goal of developing computational tools for prediction of a protein's three-dimensional conformation from its primary amino acid sequence and from properties which can be derived from it. Such tools would be important in elucidating the structure/function relationships of biologically active macromolecules, and thus would contribute significantly to understanding the molecular basis of physiology and disease. Problems in the domain of protein structure prediction are often computationally intractable with conventional techniques; therefore, the massively parallel nature of neural network computing has appeal for the solution of certain types of problems encountered in this area. Interest in neural network computing has undergone impressive growth during the past five years, with increasing resources devoted to research and development [1,2]. Lippmann [3] has discussed the attributes of several network models, particularly as they relate to pattern recognition and classification. A popular model is the feed-forward network trained with the error back-propagation learning method [4]. Interesting applications of this network paradigm have been reported, including two recent manuscripts pertaining to prediction of protein secondary structure [5,6]. In spite of its recent widespread use in many branches of science and engineering, a reliable and robust means of measuring the accuracy of performance of this kind of neural network has not been apparent in the literature. Previous reports in the protein structure field [5,6,7,8,9,10] have reported accuracy in ways which fail to capture all of the relevant information for an unbiased assessment, and in some cases appear to have perpetuated misconceptions, as will be discussed below.
The present work was motivated by this need, and the method has proven its utility in our laboratory, both for comparing the performance of networks which we have developed against others' published results, and for determining the best machine learning technique for a given class of problem or for a particular set of input data. Receiver operating characteristic (ROC) analysis has its roots in the signal detection field, and the foundations of the method were laid by investigators in the period beginning roughly in 1945.

© 1989 SCAMC, Inc.

An ROC curve can be constructed by varying the "cutoff value" of the output from a given class detector, and then plotting the true positive rate (customarily on the ordinate) and the corresponding false positive rate (abscissa) for that detector at each cutoff value of its output. The cutoff value is the value above which the detector output is interpreted as representing a positive example, and below which the output is considered to represent a negative example (counter-example) of the particular class of interest. Swets [11] has recently stressed the fact that the ROC uniquely possesses the important attributes of being uninfluenced by either decision biases or prior probabilities. As stated by Swets, ROC analysis "places the performances of diverse systems on a common, easily understood scale". ROC analysis is well known within the field of medical informatics, and several investigators in medicine, epidemiology, and biostatistics have contributed to the ROC literature. This report addresses the application of ROC methods to the neural network model of computing and advances the proposal that the ROC should receive consideration for widespread use in the evaluation and reporting of neural network performance.

ROC theory applied to neural networks. Although more will be said below (see Methods and procedures) about the design of the networks utilized in these experiments, the nature of the neural network as a detection device will now be discussed in order to clarify the reasoning for applying ROC analysis to such devices.
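As a concrete sketch of this construction, the following Python fragment sweeps the cutoff value over a detector's outputs, records the (FPR, TPR) pair at each cutoff, and integrates the resulting curve by the trapezoidal rule. The outputs and labels are illustrative values, not data from these experiments.

```python
def roc_points(outputs, labels):
    """Return (fpr, tpr) pairs obtained by sweeping the cutoff over all outputs."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort cutoffs from high to low so the curve runs from (0,0) toward (1,1).
    cutoffs = sorted(set(outputs), reverse=True)
    points = [(0.0, 0.0)]
    for c in cutoffs:
        tp = sum(1 for x, y in zip(outputs, labels) if x >= c and y == 1)
        fp = sum(1 for x, y in zip(outputs, labels) if x >= c and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def trapezoid_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical detector outputs for 4 positive and 4 negative examples.
outputs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels  = [1,   1,   0,   1,   0,    1,   0,   0  ]
pts = roc_points(outputs, labels)
print(trapezoid_area(pts))   # 0.8125
```

The same data yield 13 correctly ranked positive/negative pairs out of 16, so the trapezoidal area agrees with the pairwise interpretation discussed below.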
The networks utilized here comprise two functionally independent (but not statistically independent) detectors: one, with output Xα, for detecting the presence of an amino acid residue within the primary protein sequence which is a positive example of the class of secondary structure known as α-helix, and the second, with output denoted Xβ, which is to detect the presence or absence of an amino acid residue which is a component of the β-strand class of substructure. Xα and Xβ are continuous random variables with values 0 ≤ X ≤ 1. That is, when the ith input is presented, the α-helix output node gives value xαi and the β-strand output node simultaneously produces an output xβi. The value, X, of an output node, while ranging between zero and one, does not necessarily represent the probability of a positive example (see below). On the other hand, if the network output nodes are effective as detectors, then on average, for each output node, X will be greater when a positive example is presented than when a negative example is presented. Thus, the outputs, xαi and xβi, along with their corresponding correct class designations (α-helix, β-strand, or non-α-non-β), constitute categorical data and are readily analyzed by ROC's constructed for each class detector. The decision criteria applied to the network in order to make a choice between classes for a given example are not the subject of this report, and will not be evaluated. It is the accuracy of each class detector, functioning independently of the other, which is being analyzed. The area under the ROC curve is important [12,13]. If the network (say, for the α-helix predictor) is exposed to a pair of residues, one of which is a positive example (α-helix) and the other a negative example (non-α-helix), the area under the ROC curve for that predictor is equal to the probability that the network output (Xα) will be greater for the positive example than for the negative example [14].
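This pairwise interpretation [14] can be checked directly with a Wilcoxon-style count over all (positive, negative) output pairs, with ties credited one half. The outputs below are illustrative values, not network outputs from these experiments.

```python
def wilcoxon_auc(outputs, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [x for x, y in zip(outputs, labels) if y == 1]
    neg = [x for x, y in zip(outputs, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# A detector whose positives always outrank its negatives has area 1.0;
# one that misranks a pair loses 1/(n_pos * n_neg) of area per error.
print(wilcoxon_auc([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0]))   # 1.0
print(wilcoxon_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))   # 0.75
```

A detector indistinguishable from chance hovers near 0.5 on this scale, which is why 0.5 rather than 0 is the floor for a useful class detector.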
The area under the ROC curve and its standard error can be accurately calculated for these neural networks, and these constitute the statistics which are used to compare the accuracy of different networks.

Methods and procedures. Neural networks were constructed with two output nodes (processors), one for prediction of α-helix, the other for prediction of β-strand. Each output node utilized a logistic function [4] and received weighted activation from each input node, as shown in Figure 1. The activation of each input node was clamped to a value of zero or one for each input. For one network, each input pattern consisted of the binary bit-map of each amino acid in a "window" 13 residues wide; in the other network the "window" was 17 residues wide. Prediction is made for the center residue in the window. Each of the twenty amino acids (aa) which occur in natural globular proteins was represented by one bit on (= 1) in a vector of length 21 (see Fig. 2). The twenty-first bit was on for a sequence position with no aa present. Consequently, there were 273 input nodes (21 × 13) in one network, and 357 (21 × 17) in the other.

Figure 1. Schematic diagram of the neural networks utilized. The input processors encode a "window" of the primary amino acid sequence of the protein; the "window" is moved along the entire primary sequence. Secondary structure assignment is made for the central aa residue in the context of its 12 or 16 neighbors. Xα and Xβ are the values of the network output processors. Input = central aa residue and 12 or 16 surrounding primary structure neighbors.

Figure 2. Vector bit-map representation of the twenty amino acids plus one null value for the positions beyond the carboxyl and amino ends of each protein primary sequence. Each position is encoded in 21 bits, e.g. Alanine = *oooooooooooooooooooo, Arginine = o*ooooooooooooooooooo, Null = oooooooooooooooooooo*. (* denotes bit on = 1; o denotes bit off = 0.)

Both Qian and Sejnowski [5] and Holley and Karplus [6] found that single-layer networks performed essentially as well as those with "hidden" nodes for the training and testing sets used in their experiments; consequently, the networks utilized here did not contain hidden nodes. Training of our networks ("Spackman & Meistrell") was done on either a Sun3 workstation or a NeXT workstation by the method described by Rumelhart, et al. [4,15]. Three training sets of proteins were utilized for our networks. Training was concluded when the decremental change in total sum squared error for the network reached 5×10⁻⁵ for at least three successive epochs (one epoch consists of a feed-forward pass and a back-propagation pass for each of the patterns in the training set). All training and testing sets utilized data from the Protein Data Bank [16], with secondary structural classification assignments according to the DSSP algorithm of Kabsch and Sander [17].
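The input encoding described above can be sketched as follows: each residue in a 13-residue window becomes a 21-bit one-hot vector (20 amino acids plus a null bit for positions beyond the ends of the chain), giving 21 × 13 = 273 input values. The amino acid ordering here is alphabetical by one-letter code purely for illustration; the paper does not specify the ordering used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 natural amino acids
NULL_INDEX = 20                         # 21st bit: no residue present

def encode_window(sequence, center, width=13):
    """One-hot encode `width` residues centered on position `center`."""
    half = width // 2
    bits = []
    for pos in range(center - half, center + half + 1):
        vec = [0] * 21
        if 0 <= pos < len(sequence):
            vec[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            vec[NULL_INDEX] = 1         # off the end of the protein
        bits.extend(vec)
    return bits

# Hypothetical 10-residue sequence; a window near the amino end picks up
# null bits for the positions that fall off the chain.
bits = encode_window("MKVLAAGIVQ", center=1)
print(len(bits))   # 273 input values (21 x 13)
print(sum(bits))   # exactly one bit on per window position, so 13
```

Passing `width=17` would instead produce the 357-input (21 × 17) variant described in the text.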
Training set #1 comprised proteins specified by Qian and Sejnowski [5] and consisted of 18,111 examples, with α-helix and β-strand prevalence as summarized in Table 1. Training sets #2 and #3 were enriched with duplicates of either α-helix or β-strand positive examples (see Table 1), thus appreciably increasing the prior probability of the network's being trained with a positive example of each of these structures. The essential product of network training is a set of weights which contains the "knowledge" of the network. Each of the three network weight sets was evaluated with two testing sets of examples and counter-examples of α-helix and β-strand from proteins which had not been seen previously by the networks during training. One network was trained in our laboratory, and this was compared with those reported by Qian and Sejnowski [5] and those reported by Holley and Karplus [6]. Thus, the material for analysis of accuracy consisted of three different sets of weights, each of which could independently predict the occurrence of α-helix and β-strand. Each of the three competing sets of weights was evaluated with two test sets. Testing set #1 (comprising 3,520 aa residues) was that specified by Qian and Sejnowski [5], while testing set #2 (comprising 2,441 aa residues) was that utilized by Holley and Karplus [6], the details of which are specified by Kabsch and Sander [7].

Table 1. Composition of training and testing sets of proteins. For each set of proteins, the number of aa residues in a given secondary structural class appears immediately above the proportion of that class.

                       α-helix     non-α    β-strand     non-β
Training set #1:         4,648    13,463       3,826    14,285
                        25.67%    74.33%      21.12%    78.88%
Training set #2:        13,944    13,463
                        50.88%    49.12%
Training set #3:                              11,478    14,285
                                              44.55%    55.45%
Testing set #1 (3,520 aa) [5]:
                         24.1%     75.9%       21.3%     78.7%
Testing set #2 (2,441 aa) [6]:
                         26.0%     74.0%       20.0%     80.0%

Calculation of ROC area. The data obtained from each network with each test set were evaluated by calculating ROC statistics for both α-helix and β-strand. The area (θ) under the ROC curve and its standard error (SE) were calculated using the Wilcoxon statistic computational method described by Hanley and McNeil [18]. Our implementation was written in the C language, and was very efficient when run on the Sun3 workstation. For comparing two networks without a common testing set, the standard z score is an appropriate statistic [19,20]:

    z = (θI - θII) / (SEI² + SEII²)^(1/2)

where subscripts I and II denote the two curves being compared. When two networks were compared using the same testing set, the covariance of the ROC areas was utilized to substantially increase the power of the analysis [21]. In this case:

    z = (θI - θII) / (SEI² + SEII² - 2·r·SEI·SEII)^(1/2)

The correlation coefficient (r) was calculated by first taking the mean of the Pearson product-moment coefficients for curves I and II, and then using the table given in Hanley and McNeil [21, page 841]. As a check on the Wilcoxon computations, the true-positive and false-positive rates for several experiments were computed at each cutoff value on the ROC, and the areas under the curves were obtained by integration by the trapezoidal rule. The areas obtained in this way were then compared to the areas by the Wilcoxon statistic; there was no significant difference.

Results.
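The comparison machinery just described can be sketched in Python. The standard-error formula below is the Hanley-McNeil variance of the Wilcoxon area [18]; the two areas are those reported for Figure 4 (θ = .737 and .717), while the sample sizes (750 positives, 2,750 negatives) and the correlation r = .5 are hypothetical values chosen for illustration.

```python
import math

def hanley_mcneil_se(theta, n_pos, n_neg):
    """Standard error of the Wilcoxon ROC area (Hanley & McNeil formula)."""
    q1 = theta / (2.0 - theta)              # P(two positives outrank one negative)
    q2 = 2.0 * theta ** 2 / (1.0 + theta)   # P(one positive outranks two negatives)
    var = (theta * (1.0 - theta)
           + (n_pos - 1) * (q1 - theta ** 2)
           + (n_neg - 1) * (q2 - theta ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

def z_score(theta1, se1, theta2, se2, r=0.0):
    """z for a difference of two ROC areas; r > 0 when the areas share a test set."""
    return (theta1 - theta2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2)

se1 = hanley_mcneil_se(0.737, 750, 2750)
se2 = hanley_mcneil_se(0.717, 750, 2750)
z_independent = z_score(0.737, se1, 0.717, se2)          # separate test sets
z_correlated = z_score(0.737, se1, 0.717, se2, r=0.5)    # same test set
print(z_independent, z_correlated)
```

As the second formula in the text implies, accounting for a positive correlation between areas measured on the same cases shrinks the denominator and increases the power of the comparison: here the correlated z is noticeably larger than the independent one for the same difference in areas.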
The area under the ROC curve (θ) and its standard error (SE) were used to test for significance of differences in performance by the various networks. Three types of overall analyses were performed: 1) comparison of different authors' weights utilizing the same test set; 2) for a given author's weights, comparison of the ability to predict α-helix versus β-strand; and 3) study of the effect of the prior probability of positive examples during the training phase on a network's predictive accuracy. The ROC area measures, θ, of the predictive accuracy of different authors' weights are summarized in Table 2. The reader should note that the standard errors for different networks did in fact differ, but not when only three significant figures were used. Comparisons of type 1 (above). In this case the appropriate statistic for determining whether a performance difference exists is the standard z score (see Discussion) exploiting the covariance (see above). There was no significant difference in performance for these networks on either Testing set #1 or #2 at a level of significance of .05. For these comparisons, z scores ranged from a minimum of .497 to a maximum below the critical value of 1.96. Comparisons of type 2 (above). In this case the comparison is between θ for α-helix versus θ for β-strand, for either the same or different authors. In this case, however, there is no covariance, because the α-helix test examples do not constitute a common set with those of β-strand. The z score for these comparisons ranged from a minimum of .073 to a maximum of .688. Therefore, there was no significant difference for prediction of α-helix versus β-strand for any author with either test set. Figure 3 shows an example of a plot of two ROC's, one of α-helix and one of β-strand. Analyses of type 3 (above).
After training a network with Training set #1, ROC analysis revealed that performance was inferior to that of the weights reported by Qian and Sejnowski [5], in the face of apparent convergence of the network (see Figure 4).

Small differences in the protein structural data exist between that used by Qian and Sejnowski and that utilized in our laboratory, because changes are made to the Brookhaven database with time, and ours was undoubtedly more recent (August 1988 version) than that available to the other authors at the time of their network training. It is unlikely that these small differences could account for the performance difference, especially since more recent data would be expected to have the same or better fidelity than earlier data. The mathematically precise role of the prior probabilities of positive examples (prevalence) during neural network training is not known. However, it seemed intuitively clear that materially changing the likelihood of positive examples would alter the network's operating point for learning. Consequently, the training set was modified by replicating positive examples of both α-helix and β-strand, such that the probability of each was near .5. After re-training both the α-helix and β-strand weights, the network was re-tested, and there was an increase in prediction accuracy, such that θ was indistinguishable from other reported results (see Figures 4 and 5 and Table 3). The increase in ROC area resulted in a z score of 1.58, which falls just short of the .05 significance level for a two-tailed test.

Figure 3. Plot of the ROC for α-helix (dark graph) and β-strand. Weights tested were those of Holley & Karplus [6] using Testing set #2. TPR is true positive rate; FPR is false positive rate. The difference in area under the curves (θα versus θβ) is not significant at the .05 level.

Table 2. Comparative accuracy of three networks for detection of both α-helix and β-strand. Accuracy is measured by θ = ROC area (see text); SE = standard error of θ.

Test set #1 [5]    Network weights* ->   Q&S    H&K    S&M
  θ α-helix
  SE α-helix
  θ β-strand
  SE β-strand
Test set #2 [6]    Network weights* ->   Q&S    H&K    S&M
  θ α-helix
  SE α-helix
  θ β-strand
  SE β-strand

* Q&S: Qian & Sejnowski [5]; H&K: Holley & Karplus [6]; S&M: Spackman & Meistrell.

Figure 4. Two ROC's for α-helix using Testing set #1. Upper curve is from the weights of Qian and Sejnowski [5] (θ = .737, SE = .025); lower curve is from weights after training with Training set #1 (θ = .717, SE = .025).
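The prevalence-enrichment step used to build Training sets #2 and #3 (replicating positive examples until their prior probability is near .5) can be sketched as follows. The counts mirror Table 1's α-helix prevalence (4,648 positives of 18,111 examples); the records themselves are placeholders, not real training patterns.

```python
import random

def enrich(examples, target_prevalence=0.5, seed=0):
    """Duplicate randomly chosen positive examples until their share
    of the set reaches target_prevalence."""
    rng = random.Random(seed)
    positives = [e for e in examples if e["label"] == 1]
    enriched = list(examples)
    n_pos = len(positives)
    while n_pos / len(enriched) < target_prevalence:
        enriched.append(rng.choice(positives))
        n_pos += 1
    return enriched

# Placeholder records with Table 1's alpha-helix class balance.
examples = [{"label": 1}] * 4648 + [{"label": 0}] * 13463
enriched = enrich(examples)
print(sum(e["label"] for e in enriched), len(enriched))   # 13463 26926
```

Duplication stops as soon as positives make up half the enlarged set; a real implementation would of course carry the full input pattern in each record rather than a bare label.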

Figure 5. Plot of α-helix ROC's after retraining with Training sets #2 and #3 (see text). Upper curve is from the retrained weights; lower curve is from Training set #1 (identical to the lower curve, Fig. 4).

Table 3. Improved training with increased prevalence of positive examples of both α-helix and β-strand. The same test set was used for evaluating both sets of weights. Training sets #2 & #3 have increased prevalence of positive examples compared with Training set #1.

                        Train set #1    Train sets #2 & #3
θ α-helix (set 2)
SE α-helix                  .025               .025
θ β-strand (set 3)
SE β-strand

Discussion. A robust, unbiased, easily understood means of assessing the accuracy of the type of neural network described here is needed. The receiver operating characteristic rests on solid theoretical foundations, and there has been considerable experience with application of the method in diverse areas [11]. The advantages of a generally understood and readily interpreted scale for measuring accuracy warrant use of the ROC for neural networks. The area under the ROC and its standard error provide such a means of measuring the accuracy of prediction. The output of a neural network is a continuous variable with a value between zero and one when a logistic function determines the activation of the output node. In these networks, the ability of each output node to properly classify an amino acid residue in the context of its 12 or 16 nearest primary sequence neighbors was assessed by calculating the ROC for each output node. No assumptions were made about the shape of the underlying probability distributions (other than the continuous nature of the data). The available data have such a "fine grain" as to make maximum-likelihood curve-fitting unnecessary (see Hanley and McNeil [18] and Hanley [19]).
The straightforward computational method of utilizing the Wilcoxon statistic to compute the area under the ROC curve, while generally somewhat conservative in comparison to the maximum-likelihood method of Dorfman and Alf [22], can be expected to agree almost perfectly for the data reported here. In fact, the assumption of normal distributions for positive and negative examples in these data is probably fully justified [19], but the computational burden of the nonparametric methods utilized here is small. Use of the z score is readily justified for these data by the central limit theorem; that is, the Wilcoxon statistic is "asymptotically normally distributed" [20]. This report sheds light on an interesting question from the biotechnology literature. It has frequently been stated, for a variety of methods, that the prediction accuracy of α-helix is superior to that of β-strand, and this was implied in both recent reports utilizing neural networks to perform the predictions. Authors have speculated that the reason for this is a greater influence of tertiary interactions (compared with local amino acid sequence) for β-strand than for α-helix. However, this impression is a misconception fostered by test statistics such as "percent correct α-helix" or "percent correct β-strand". For an example of the calculation of such a statistic, see Qian and Sejnowski [5]. The "percent correct" statistic can be misleading because it is sensitive to the prior probability of a positive example. Implicit in the calculation of "percent correct" as seen in the literature is the choice of a cutoff value for the output variable. Implicitly, then, the statistic is operating at a single point on the ROC curve. The prior probability of α-helix is greater in most protein sets than that of β-strand (see Table 1).
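This effect can be illustrated with a toy simulation: two detectors with identical ranking ability (the same separation between positive and negative output distributions) show very different "percent correct positive" at a fixed cutoff when their outputs sit at different operating points, as happens when training prevalence differs. All distributions and numbers here are hypothetical.

```python
import random

def wilcoxon_auc(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(7)

def detector_outputs(shift, n=800):
    """Positives and negatives separated by 0.2, both displaced by `shift`."""
    pos = [rng.gauss(0.55 + shift, 0.15) for _ in range(n)]
    neg = [rng.gauss(0.35 + shift, 0.15) for _ in range(n)]
    return pos, neg

results = {}
# "alpha-like": outputs pushed upward (favorable operating point at cutoff .5);
# "beta-like": same separation, unshifted (unfavorable operating point).
for name, shift in [("alpha-like", 0.10), ("beta-like", 0.0)]:
    pos, neg = detector_outputs(shift)
    auc = wilcoxon_auc(pos, neg)
    pct_correct_pos = sum(1 for p in pos if p >= 0.5) / len(pos)
    results[name] = (auc, pct_correct_pos)
    print(name, round(auc, 2), round(pct_correct_pos, 2))
```

The two ROC areas come out essentially equal, while "percent correct positive" at the fixed cutoff is substantially higher for the shifted detector: exactly the pattern that makes α-helix look easier to predict than β-strand when only percent-correct statistics are reported.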

Consequently, the operating point on the α-helix ROC will tend to be placed such that the posterior probability of correct prediction for α-helix is greater than that for β-strand. In contrast, the ROC is insensitive to prior probabilities [11,13] and permits a clear assessment of the capability of a given measure. Our results indicate that it really is no easier to predict α-helix than β-strand from local aa sequence. This illustrates the value of the ROC for evaluation of a system which is intended to "diagnose" the presence or absence of a particular class in its input. Analysis of these neural networks by the ROC method has provided insight into the role played by the prevalence of examples in the training of the network weights. ROC analysis can thus be expected to play a role in improving the understanding of neural networks, and possibly other types of machine learning, as classification systems. Although hidden nodes were not used in the networks reported here, the ROC technique is equally applicable to multi-layer networks.

Conclusions. ROC analysis has been successfully applied to measurement of the accuracy of feed-forward neural networks. It is proposed that this method be adopted by others for assessment of network performance and for reporting of results.

BIBLIOGRAPHY
1. Touretzky, D.; Hinton, G.; Sejnowski, T., eds. Proceedings of the 1988 Connectionist Models Summer School; June 17-26, 1988; Carnegie Mellon University. San Mateo, CA: Morgan Kaufmann; 1988.
2. Neural Networks; 1988; 1(Supplement 1).
3. Lippmann, R.P. An introduction to computing with neural nets. IEEE ASSP Magazine; April 1987: 4-22.
4. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning internal representations by error propagation. In: Rumelhart, D.E.; McClelland, J.L., eds. Parallel Distributed Processing; 1986; 1: 318-362.
5. Qian, N.; Sejnowski, T.J. Predicting the secondary structure of globular proteins using neural network models. J Mol Biol; 1988; 202: 865-884.
6. Holley, L.H.; Karplus, M. Protein secondary structure prediction with a neural network. Proc Natl Acad Sci USA; 1989; 86: 152-156.
7. Kabsch, W.; Sander, C. How good are predictions of protein secondary structure? FEBS Letters; 1983; 155(2).
8. Lim, V.I. Algorithms for prediction of alpha-helical and beta-structural regions in globular proteins. J Mol Biol; 1974; 88.
9. Chou, P.Y.; Fasman, G.D. Empirical prediction of protein conformation. Annu Rev Biochem; 1978; 47.
10. Robson, A. The prediction of peptide and protein structure. In: Darbre, A., ed. Practical Protein Chemistry - A Handbook. London: John Wiley & Sons; 1986.
11. Swets, J.A. Measuring the accuracy of diagnostic systems. Science; 3 June 1988; 240: 1285-1293.
12. Green, D.M.; Swets, J.A. Signal Detection Theory and Psychophysics. New York: Wiley; 1966.
13. Egan, J.P. Signal Detection Theory and ROC Analysis. New York: Academic Press; 1975.
14. Bamber, D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology; 1975; 12: 387-415.
15. McClelland, J.L.; Rumelhart, D.E. Training hidden units: the generalized delta rule. In: Explorations in Parallel Distributed Processing. Cambridge, MA: The MIT Press; 1988.
16. Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.B.; Meyer, E.F., Jr.; Brice, M.D.; Rodgers, J.R.; Kennard, O.; Shimanouchi, T.; Tasumi, M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol; 1977; 112: 535-542.
17. Kabsch, W.; Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers; 1983; 22: 2577-2637.
18. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology; 1982; 143: 29-36.
19. Hanley, J.A. The robustness of the "binormal" assumptions used in fitting ROC curves. Med Decision Making; 1988; 8.
20. McClish, D.K. Comparing the areas under more than two independent ROC curves. Med Decision Making; 1987; 7.
21. Hanley, J.A.; McNeil, B.J. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology; 1983; 148: 839-843.
22. Dorfman, D.D.; Alf, E., Jr. Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence intervals - rating-method data. Journal of Mathematical Psychology; 1969; 6.


More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing. Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative

More information

SELECTING NEURAL NETWORK ARCHITECTURE FOR INVESTMENT PROFITABILITY PREDICTIONS

SELECTING NEURAL NETWORK ARCHITECTURE FOR INVESTMENT PROFITABILITY PREDICTIONS UDC: 004.8 Original scientific paper SELECTING NEURAL NETWORK ARCHITECTURE FOR INVESTMENT PROFITABILITY PREDICTIONS Tonimir Kišasondi, Alen Lovren i University of Zagreb, Faculty of Organization and Informatics,

More information

Binary Diagnostic Tests Two Independent Samples

Binary Diagnostic Tests Two Independent Samples Chapter 537 Binary Diagnostic Tests Two Independent Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary

More information

NTC Project: S01-PH10 (formerly I01-P10) 1 Forecasting Women s Apparel Sales Using Mathematical Modeling

NTC Project: S01-PH10 (formerly I01-P10) 1 Forecasting Women s Apparel Sales Using Mathematical Modeling 1 Forecasting Women s Apparel Sales Using Mathematical Modeling Celia Frank* 1, Balaji Vemulapalli 1, Les M. Sztandera 2, Amar Raheja 3 1 School of Textiles and Materials Technology 2 Computer Information

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities

Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Oscar Kipersztok Mathematics and Computing Technology Phantom Works, The Boeing Company P.O.Box 3707, MC: 7L-44 Seattle, WA 98124

More information

U-shaped curves in development: A PDP approach. Carnegie Mellon University, Pittsburgh PA. Fax: international +44 1223 359062 Fax: 412 268-2798

U-shaped curves in development: A PDP approach. Carnegie Mellon University, Pittsburgh PA. Fax: international +44 1223 359062 Fax: 412 268-2798 U-shaped curves in development: A PDP approach Timothy T. Rogers 1, David H. Rakison 2, and James L. McClelland 2,3 1 MRC Cognition and Brain Sciences Unit, Cambridge, UK 2 Department of Psychology, and

More information

Artificial Neural Network and Non-Linear Regression: A Comparative Study

Artificial Neural Network and Non-Linear Regression: A Comparative Study International Journal of Scientific and Research Publications, Volume 2, Issue 12, December 2012 1 Artificial Neural Network and Non-Linear Regression: A Comparative Study Shraddha Srivastava 1, *, K.C.

More information

Optimization of technical trading strategies and the profitability in security markets

Optimization of technical trading strategies and the profitability in security markets Economics Letters 59 (1998) 249 254 Optimization of technical trading strategies and the profitability in security markets Ramazan Gençay 1, * University of Windsor, Department of Economics, 401 Sunset,

More information

Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM

Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Paper AA-08-2015 Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Delali Agbenyegah, Alliance Data Systems, Columbus, Ohio 0.0 ABSTRACT Traditional

More information

Price Prediction of Share Market using Artificial Neural Network (ANN)

Price Prediction of Share Market using Artificial Neural Network (ANN) Prediction of Share Market using Artificial Neural Network (ANN) Zabir Haider Khan Department of CSE, SUST, Sylhet, Bangladesh Tasnim Sharmin Alin Department of CSE, SUST, Sylhet, Bangladesh Md. Akter

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev 86 ITHEA APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING Anatoli Nachev Abstract: This paper presents a case study of data mining modeling techniques for direct marketing. It focuses to three

More information

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

Performance Based Evaluation of New Software Testing Using Artificial Neural Network

Performance Based Evaluation of New Software Testing Using Artificial Neural Network Performance Based Evaluation of New Software Testing Using Artificial Neural Network Jogi John 1, Mangesh Wanjari 2 1 Priyadarshini College of Engineering, Nagpur, Maharashtra, India 2 Shri Ramdeobaba

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

Mass Spectrometry Signal Calibration for Protein Quantitation

Mass Spectrometry Signal Calibration for Protein Quantitation Cambridge Isotope Laboratories, Inc. www.isotope.com Proteomics Mass Spectrometry Signal Calibration for Protein Quantitation Michael J. MacCoss, PhD Associate Professor of Genome Sciences University of

More information

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance. Chapter 6: Behavioural models Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

More information

LOGISTIC REGRESSION ANALYSIS

LOGISTIC REGRESSION ANALYSIS LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Applying Statistics Recommended by Regulatory Documents

Applying Statistics Recommended by Regulatory Documents Applying Statistics Recommended by Regulatory Documents Steven Walfish President, Statistical Outsourcing Services steven@statisticaloutsourcingservices.com 301-325 325-31293129 About the Speaker Mr. Steven

More information

Abstract. The DNA promoter sequences domain theory and database have become popular for

Abstract. The DNA promoter sequences domain theory and database have become popular for Journal of Artiæcial Intelligence Research 2 è1995è 361í367 Submitted 8è94; published 3è95 Research Note On the Informativeness of the DNA Promoter Sequences Domain Theory Julio Ortega Computer Science

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Measurement and Metrics Fundamentals. SE 350 Software Process & Product Quality

Measurement and Metrics Fundamentals. SE 350 Software Process & Product Quality Measurement and Metrics Fundamentals Lecture Objectives Provide some basic concepts of metrics Quality attribute metrics and measurements Reliability, validity, error Correlation and causation Discuss

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

Handwritten Digit Recognition with a Back-Propagation Network

Handwritten Digit Recognition with a Back-Propagation Network 396 Le Cun, Boser, Denker, Henderson, Howard, Hubbard and Jackel Handwritten Digit Recognition with a Back-Propagation Network Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard,

More information

CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13

CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13 COMMON DESCRIPTIVE STATISTICS / 13 CHAPTER THREE COMMON DESCRIPTIVE STATISTICS The analysis of data begins with descriptive statistics such as the mean, median, mode, range, standard deviation, variance,

More information

CSC 2427: Algorithms for Molecular Biology Spring 2006. Lecture 16 March 10

CSC 2427: Algorithms for Molecular Biology Spring 2006. Lecture 16 March 10 CSC 2427: Algorithms for Molecular Biology Spring 2006 Lecture 16 March 10 Lecturer: Michael Brudno Scribe: Jim Huang 16.1 Overview of proteins Proteins are long chains of amino acids (AA) which are produced

More information

NEURAL NETWORKS FOR SENSOR VALIDATION AND PLANT MONITORING. B.R. Upadhyaya, E. Eryurek and G. Mathai

NEURAL NETWORKS FOR SENSOR VALIDATION AND PLANT MONITORING. B.R. Upadhyaya, E. Eryurek and G. Mathai NEURAL NETWORKS FOR SENSOR VALIDATION AND PLANT MONITORING B.R. Upadhyaya, E. Eryurek and G. Mathai The University of Tennessee Knoxville, Tennessee 37996-2300, U.S.A. OONF-900804 35 DE93 002127 ABSTRACT

More information

Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Indiana State Core Curriculum Standards updated 2009 Algebra I

Indiana State Core Curriculum Standards updated 2009 Algebra I Indiana State Core Curriculum Standards updated 2009 Algebra I Strand Description Boardworks High School Algebra presentations Operations With Real Numbers Linear Equations and A1.1 Students simplify and

More information

Neural Networks in Data Mining

Neural Networks in Data Mining IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V6 PP 01-06 www.iosrjen.org Neural Networks in Data Mining Ripundeep Singh Gill, Ashima Department

More information

Mikko Rönkkö, Antero Järvi, Markus M. Mäkelä. Measuring and Comparing the Adoption of Software Process Practices in the Software Product Industry

Mikko Rönkkö, Antero Järvi, Markus M. Mäkelä. Measuring and Comparing the Adoption of Software Process Practices in the Software Product Industry Mikko Rönkkö, Antero Järvi, Markus M. Mäkelä Measuring and Comparing the Adoption of Software Process Practices in the Software Product Industry 1) Inroduction 2) Research approach 3) Results 4) Discussion

More information

Evaluation of Diagnostic Tests

Evaluation of Diagnostic Tests Biostatistics for Health Care Researchers: A Short Course Evaluation of Diagnostic Tests Presented ed by: Siu L. Hui, Ph.D. Department of Medicine, Division of Biostatistics Indiana University School of

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Neural network models: Foundations and applications to an audit decision problem

Neural network models: Foundations and applications to an audit decision problem Annals of Operations Research 75(1997)291 301 291 Neural network models: Foundations and applications to an audit decision problem Rebecca C. Wu Department of Accounting, College of Management, National

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Charles Secolsky County College of Morris. Sathasivam 'Kris' Krishnan The Richard Stockton College of New Jersey

Charles Secolsky County College of Morris. Sathasivam 'Kris' Krishnan The Richard Stockton College of New Jersey Using logistic regression for validating or invalidating initial statewide cut-off scores on basic skills placement tests at the community college level Abstract Charles Secolsky County College of Morris

More information

New Ensemble Combination Scheme

New Ensemble Combination Scheme New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

More information

Weather forecast prediction: a Data Mining application

Weather forecast prediction: a Data Mining application Weather forecast prediction: a Data Mining application Ms. Ashwini Mandale, Mrs. Jadhawar B.A. Assistant professor, Dr.Daulatrao Aher College of engg,karad,ashwini.mandale@gmail.com,8407974457 Abstract

More information

Parametric and Nonparametric: Demystifying the Terms

Parametric and Nonparametric: Demystifying the Terms Parametric and Nonparametric: Demystifying the Terms By Tanya Hoskin, a statistician in the Mayo Clinic Department of Health Sciences Research who provides consultations through the Mayo Clinic CTSA BERD

More information

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996)

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996) MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL by Michael L. Orlov Chemistry Department, Oregon State University (1996) INTRODUCTION In modern science, regression analysis is a necessary part

More information

SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID

SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID Renewable Energy Laboratory Department of Mechanical and Industrial Engineering University of

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

AN APPLICATION OF TIME SERIES ANALYSIS FOR WEATHER FORECASTING

AN APPLICATION OF TIME SERIES ANALYSIS FOR WEATHER FORECASTING AN APPLICATION OF TIME SERIES ANALYSIS FOR WEATHER FORECASTING Abhishek Agrawal*, Vikas Kumar** 1,Ashish Pandey** 2,Imran Khan** 3 *(M. Tech Scholar, Department of Computer Science, Bhagwant University,

More information

An Introduction to Neural Networks

An Introduction to Neural Networks An Introduction to Vincent Cheung Kevin Cannons Signal & Data Compression Laboratory Electrical & Computer Engineering University of Manitoba Winnipeg, Manitoba, Canada Advisor: Dr. W. Kinsner May 27,

More information

Neural Networks and Support Vector Machines

Neural Networks and Support Vector Machines INF5390 - Kunstig intelligens Neural Networks and Support Vector Machines Roar Fjellheim INF5390-13 Neural Networks and SVM 1 Outline Neural networks Perceptrons Neural networks Support vector machines

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

The CHIR Algorithm for Feed Forward. Networks with Binary Weights

The CHIR Algorithm for Feed Forward. Networks with Binary Weights 516 Grossman The CHIR Algorithm for Feed Forward Networks with Binary Weights Tal Grossman Department of Electronics Weizmann Institute of Science Rehovot 76100 Israel ABSTRACT A new learning algorithm,

More information

How To Use Neural Networks In Data Mining

How To Use Neural Networks In Data Mining International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

DATA MINING PROCESS FOR UNBIASED PERSONNEL ASSESSMENT IN POWER ENGINEERING COMPANY

DATA MINING PROCESS FOR UNBIASED PERSONNEL ASSESSMENT IN POWER ENGINEERING COMPANY Medunarodna naucna konferencija MENADŽMENT 2012 International Scientific Conference MANAGEMENT 2012 Mladenovac, Srbija, 20-21. april 2012 Mladenovac, Serbia, 20-21 April, 2012 DATA MINING PROCESS FOR UNBIASED

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Handling attrition and non-response in longitudinal data

Handling attrition and non-response in longitudinal data Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

More information

Speech and Network Marketing Model - A Review

Speech and Network Marketing Model - A Review Jastrzȩbia Góra, 16 th 20 th September 2013 APPLYING DATA MINING CLASSIFICATION TECHNIQUES TO SPEAKER IDENTIFICATION Kinga Sałapa 1,, Agata Trawińska 2 and Irena Roterman-Konieczna 1, 1 Department of Bioinformatics

More information

A New Approach For Estimating Software Effort Using RBFN Network

A New Approach For Estimating Software Effort Using RBFN Network IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.7, July 008 37 A New Approach For Estimating Software Using RBFN Network Ch. Satyananda Reddy, P. Sankara Rao, KVSVN Raju,

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs

Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs Andrew Gelman Guido Imbens 2 Aug 2014 Abstract It is common in regression discontinuity analysis to control for high order

More information

Introduction to Principal Components and FactorAnalysis

Introduction to Principal Components and FactorAnalysis Introduction to Principal Components and FactorAnalysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

Role of Neural network in data mining

Role of Neural network in data mining Role of Neural network in data mining Chitranjanjit kaur Associate Prof Guru Nanak College, Sukhchainana Phagwara,(GNDU) Punjab, India Pooja kapoor Associate Prof Swami Sarvanand Group Of Institutes Dinanagar(PTU)

More information

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the

More information

The Mathematics of Alcoholics Anonymous

The Mathematics of Alcoholics Anonymous The Mathematics of Alcoholics Anonymous "As a celebrated American statesman put it, 'Let's look at the record. Bill Wilson, Alcoholics Anonymous, page 50, A.A.W.S. Inc., 2001. Part 2: A.A. membership surveys

More information

Healthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw

Healthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Healthcare data analytics Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

A Neural Network based Approach for Predicting Customer Churn in Cellular Network Services

A Neural Network based Approach for Predicting Customer Churn in Cellular Network Services A Neural Network based Approach for Predicting Customer Churn in Cellular Network Services Anuj Sharma Information Systems Area Indian Institute of Management, Indore, India Dr. Prabin Kumar Panigrahi

More information

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

More information

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks H. T. Kung Dario Vlah {htk, dario}@eecs.harvard.edu Harvard School of Engineering and Applied Sciences

More information

A C T R esearcli R e p o rt S eries 2 0 0 5. Using ACT Assessment Scores to Set Benchmarks for College Readiness. IJeff Allen.

A C T R esearcli R e p o rt S eries 2 0 0 5. Using ACT Assessment Scores to Set Benchmarks for College Readiness. IJeff Allen. A C T R esearcli R e p o rt S eries 2 0 0 5 Using ACT Assessment Scores to Set Benchmarks for College Readiness IJeff Allen Jim Sconing ACT August 2005 For additional copies write: ACT Research Report

More information

Application-Specific Biometric Templates

Application-Specific Biometric Templates Application-Specific Biometric s Michael Braithwaite, Ulf Cahn von Seelen, James Cambier, John Daugman, Randy Glass, Russ Moore, Ian Scott, Iridian Technologies Inc. Introduction Biometric technologies

More information