Evaluation of Neural Network Performance by Receiver Operating Characteristic Analysis: Examples from the Biotechnology Domain

Michael L. Meistrell, M.D. Advisor: Kent A. Spackman, M.D., Ph.D. Program in Medical Information Science, Dartmouth Medical School, Hanover, NH 03756. Request reprints from Dr. Meistrell: New England Medical Center, Clinical Decision Making, 75 Washington Street, Boston, MA 02111.

Abstract. A need exists for an unbiased measure of the accuracy of feed-forward neural networks used for classification. Receiver Operating Characteristic (ROC) analysis is suited for this measure, and was used to assess the performance of several different sets of network weights. The area under an ROC curve and its standard error were used to compare different network weight sets, and to follow the performance of a network during the course of training. The ROC is sensitive neither to the prior probabilities of examples in the testing set nor to decision bias. The area under an ROC curve is a readily understood measure, and should be used to evaluate neural networks and to report the results of learning experiments. Examples are provided from experiments with data from the biotechnology domain.

KEY WORDS: NEURAL NETWORKS; ROC ANALYSIS; PROTEIN STRUCTURE PREDICTION.

Introduction & background. Machine learning is concerned with computational theories of learning and with building learning systems, defined as computer systems which can improve their performance with experience. This paper focuses on the accuracy of one type of machine learning, namely neural networks (also known as connectionist models), in predicting the three-dimensional structure of globular proteins. More specifically, the focus is on the analysis of feed-forward neural networks trained by error back-propagation to predict the secondary structure of a protein from its primary amino acid sequence.
The topics of study herein are motivated by the long-range goal of developing computational tools for prediction of a protein's three-dimensional conformation from its primary amino acid sequence and from properties which can be derived from it. Such tools would be important in elucidating the structure/function relationships of biologically active macromolecules, and thus would contribute significantly to understanding the molecular basis of physiology and disease. Problems in the domain of protein structure prediction are often computationally intractable with conventional techniques; therefore, the massively parallel nature of neural network computing has appeal for the solution of certain types of problems encountered in this area. Interest in neural network computing has undergone impressive growth during the past five years, with increasing resources devoted to research and development [1,2]. Lippmann [3] has discussed the attributes of several network models, particularly as they relate to pattern recognition and classification. A popular model is the feed-forward network trained with the error back-propagation learning method [4]. Interesting applications of this network paradigm have been reported, including two recent manuscripts pertaining to prediction of protein secondary structure [5,6]. In spite of its recent widespread use in many branches of science and engineering, a reliable and robust means of measuring the accuracy of performance of this kind of neural network has not been apparent in the literature. Previous reports in the protein structure field [5,6,7,8,9,10] have reported accuracy in ways which fail to capture all of the relevant information for an unbiased assessment, and in some cases appear to have perpetuated misconceptions, as will be discussed below.
The present work was motivated by this need, and the method has proven its utility in our laboratory, both for comparing the performance of networks which we have developed against others' published results, and for determining the best machine learning technique for a given class of problem or for a particular set of input data. Receiver operating characteristic (ROC) analysis has its roots in the signal detection field, and the foundations of the method were laid by investigators in the period beginning roughly in 1945.

© 1989 SCAMC, Inc.

An ROC curve can be constructed by varying the "cutoff value" of the output from a given class detector, and then plotting the true positive rate (customarily on the ordinate) and the corresponding false positive rate (abscissa) for that detector at each cutoff value of its output. The cutoff value is the value above which the detector output is interpreted as representing a positive example, and below which the output is considered to represent a negative example (counter-example) of the particular class of interest. Swets [11] has recently stressed the fact that the ROC uniquely possesses the important attributes of being uninfluenced by either decision biases or prior probabilities. As stated by Swets, ROC analysis "places the performances of diverse systems on a common, easily understood scale". ROC analysis is well known within the field of medical informatics, and several investigators in medicine, epidemiology, and biostatistics have contributed to the ROC literature. This report addresses the application of ROC methods to the neural network model of computing and advances the proposal that the ROC should receive consideration for widespread use in the evaluation and reporting of neural network performance.

ROC theory applied to neural networks. Although more will be said below (see Methods and procedures) about the design of the networks utilized in these experiments, the nature of the neural network as a detection device will now be discussed in order to clarify the reasoning for applying ROC analysis to such devices.
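As a concrete sketch of this construction, the following Python fragment sweeps the cutoff value over a detector's outputs, records the (FPR, TPR) pair at each cutoff, and integrates the resulting curve by the trapezoidal rule. The outputs and labels are illustrative values, not data from these experiments.

```python
def roc_points(outputs, labels):
    """Return (fpr, tpr) pairs obtained by sweeping the cutoff over all outputs."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort cutoffs from high to low so the curve runs from (0,0) toward (1,1).
    cutoffs = sorted(set(outputs), reverse=True)
    points = [(0.0, 0.0)]
    for c in cutoffs:
        tp = sum(1 for x, y in zip(outputs, labels) if x >= c and y == 1)
        fp = sum(1 for x, y in zip(outputs, labels) if x >= c and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def trapezoid_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical detector outputs for 4 positive and 4 negative examples.
outputs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels  = [1,   1,   0,   1,   0,    1,   0,   0  ]
pts = roc_points(outputs, labels)
print(trapezoid_area(pts))   # 0.8125
```

The same data yield 13 correctly ranked positive/negative pairs out of 16, so the trapezoidal area agrees with the pairwise interpretation discussed below.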
The networks utilized here comprise two functionally independent (but not statistically independent) detectors: one, with output Xα, for detecting the presence of an amino acid residue within the primary protein sequence which is a positive example of the class of secondary structure known as α-helix, and the second, with output denoted Xβ, which is to detect the presence or absence of an amino acid residue which is a component of the β-strand class of substructure. Xα and Xβ are continuous random variables with values 0 ≤ X ≤ 1. That is, when the ith input is presented, the α-helix output node gives value xαi and the β-strand output node simultaneously produces an output xβi. The value, X, of an output node, while ranging between zero and one, does not necessarily represent the probability of a positive example (see below). On the other hand, if the network output nodes are effective as detectors, then on average, for each output node, X will be greater when a positive example is presented than when a negative example is presented. Thus, the outputs, xαi and xβi, along with their corresponding correct class designations (α-helix, β-strand, or non-α-non-β), constitute categorical data and are readily analyzed by ROC's constructed for each class detector. The decision criteria applied to the network in order to make a choice between classes for a given example are not the subject of this report, and will not be evaluated. It is the accuracy of each class detector, functioning independently of the other, which is being analyzed. The area under the ROC curve is important [12,13]. If the network (say, for the α-helix predictor) is exposed to a pair of residues, one of which is a positive example (α-helix) and the other a negative example (non-α-helix), the area under the ROC curve for that predictor is equal to the probability that the network output (Xα) will be greater for the positive example than for the negative example [14].
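This pairwise interpretation [14] can be checked directly with a Wilcoxon-style count over all (positive, negative) output pairs, with ties credited one half. The outputs below are illustrative values, not network outputs from these experiments.

```python
def wilcoxon_auc(outputs, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [x for x, y in zip(outputs, labels) if y == 1]
    neg = [x for x, y in zip(outputs, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# A detector whose positives always outrank its negatives has area 1.0;
# one that misranks a pair loses 1/(n_pos * n_neg) of area per error.
print(wilcoxon_auc([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0]))   # 1.0
print(wilcoxon_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))   # 0.75
```

A detector indistinguishable from chance hovers near 0.5 on this scale, which is why 0.5 rather than 0 is the floor for a useful class detector.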
The area under the ROC curve and its standard error can be accurately calculated for these neural networks, and these constitute the statistics which are used to compare the accuracy of different networks.

Methods and procedures. Neural networks were constructed with two output nodes (processors), one for prediction of α-helix, the other for prediction of β-strand. Each output node utilized a logistic function [4] and received weighted activation from each input node, as shown in Figure 1. The activation of each input node was clamped to a value of zero or one for each input. For one network, each input pattern consisted of the binary bit-map of each amino acid in a "window" 13 residues wide; in the other network the "window" was 17 residues wide. Prediction is made for the center residue in the window. Each of the twenty amino acids (aa) which occur in natural globular proteins was represented by one bit on (= 1) in a vector of length 21 (see Fig. 2). The twenty-first bit was on for a sequence position with no aa present. Consequently, there were 273 input nodes (21 × 13) in one network, and 357 (21 × 17) in the other.

Figure 1. Schematic diagram of the neural networks utilized. The input processors encode a "window" of the primary amino acid sequence of the protein; the "window" is moved along the entire primary sequence. Secondary structure assignment is made for the central aa residue in the context of its 12 or 16 neighbors. Xα and Xβ are the values of the network output processors. Input = central aa residue and 12 or 16 surrounding primary structure neighbors.

Figure 2. Vector bit-map representation of the twenty amino acids plus one null value for the positions beyond the carboxyl and amino ends of each protein primary sequence. Each position is encoded in 21 bits, e.g. Alanine = *oooooooooooooooooooo, Arginine = o*ooooooooooooooooooo, Null = oooooooooooooooooooo*. (* denotes bit on = 1; o denotes bit off = 0.)

Both Qian and Sejnowski [5] and Holley and Karplus [6] found that single-layer networks performed essentially as well as those with "hidden" nodes for the training and testing sets used in their experiments; consequently, the networks utilized here did not contain hidden nodes. Training of our networks ("Spackman & Meistrell") was done on either a Sun3 workstation or a NeXT workstation by the method described by Rumelhart, et al. [4,15]. Three training sets of proteins were utilized for our networks. Training was concluded when the decremental change in total sum squared error for the network reached 5×10⁻⁵ for at least three successive epochs (one epoch consists of a feed-forward pass and a back-propagation pass for each of the patterns in the training set). All training and testing sets utilized data from the Protein Data Bank [16], with secondary structural classification assignments according to the DSSP algorithm of Kabsch and Sander [17].
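The input encoding described above can be sketched as follows: each residue in a 13-residue window becomes a 21-bit one-hot vector (20 amino acids plus a null bit for positions beyond the ends of the chain), giving 21 × 13 = 273 input values. The amino acid ordering here is alphabetical by one-letter code purely for illustration; the paper does not specify the ordering used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 natural amino acids
NULL_INDEX = 20                         # 21st bit: no residue present

def encode_window(sequence, center, width=13):
    """One-hot encode `width` residues centered on position `center`."""
    half = width // 2
    bits = []
    for pos in range(center - half, center + half + 1):
        vec = [0] * 21
        if 0 <= pos < len(sequence):
            vec[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            vec[NULL_INDEX] = 1         # off the end of the protein
        bits.extend(vec)
    return bits

# Hypothetical 10-residue sequence; a window near the amino end picks up
# null bits for the positions that fall off the chain.
bits = encode_window("MKVLAAGIVQ", center=1)
print(len(bits))   # 273 input values (21 x 13)
print(sum(bits))   # exactly one bit on per window position, so 13
```

Passing `width=17` would instead produce the 357-input (21 × 17) variant described in the text.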
Training set #1 comprised proteins specified by Qian and Sejnowski [5] and consisted of 18,111 examples, with α-helix and β-strand prevalence as summarized in Table 1. Training sets #2 and #3 were enriched with duplicates of either α-helix or β-strand positive examples (see Table 1), thus appreciably increasing the prior probability of the network's being trained with a positive example of each of these structures. The essential product of network training is a set of weights which contains the "knowledge" of the network. Each of the three network weight sets was evaluated with two testing sets of examples and counter-examples of α-helix and β-strand from proteins which had not been seen previously by the networks during training. One network was trained in our laboratory, and this was compared with those reported by Qian and Sejnowski [5] and those reported by Holley and Karplus [6]. Thus, the material for analysis of accuracy consisted of three different sets of weights, each of which could independently predict the occurrence of α-helix and β-strand. Each of the three competing sets of weights was evaluated with two test sets. Testing set #1 (comprising 3,520 aa residues) was that specified by Qian and Sejnowski [5], while testing set #2 (comprising 2,441 aa residues) was that utilized by Holley and Karplus [6], the details of which are specified by Kabsch and Sander [7].

Table 1. Composition of training and testing sets of proteins. For each set of proteins, the number of aa residues in a given secondary structural class appears immediately above the proportion of that class.

                       α-helix     non-α    β-strand     non-β
Training set #1:         4,648    13,463       3,826    14,285
                        25.67%    74.33%      21.12%    78.88%
Training set #2:        13,944    13,463
                        50.88%    49.12%
Training set #3:                              11,478    14,285
                                              44.55%    55.45%
Testing set #1 (3,520 aa) [5]:
                         24.1%     75.9%       21.3%     78.7%
Testing set #2 (2,441 aa) [6]:
                         26.0%     74.0%       20.0%     80.0%

Calculation of ROC area. The data obtained from each network with each test set were evaluated by calculating ROC statistics for both α-helix and β-strand. The area (θ) under the ROC curve and its standard error (SE) were calculated using the Wilcoxon statistic computational method described by Hanley and McNeil [18]. Our implementation was written in the C language, and was very efficient when run on the Sun3 workstation. For comparing two networks without a common testing set, the standard z score is an appropriate statistic [19,20]:

    z = (θI - θII) / (SEI² + SEII²)^(1/2)

where subscripts I and II denote the two curves being compared. When two networks were compared using the same testing set, the covariance of the ROC areas was utilized to substantially increase the power of the analysis [21]. In this case:

    z = (θI - θII) / (SEI² + SEII² - 2·r·SEI·SEII)^(1/2)

The correlation coefficient (r) was calculated by first taking the mean of the Pearson product-moment coefficients for curves I and II, and then using the table given in Hanley and McNeil [21, page 841]. As a check on the Wilcoxon computations, the true-positive and false-positive rates for several experiments were computed at each cutoff value on the ROC, and the areas under the curves were obtained by integration by the trapezoidal rule. The areas obtained in this way were then compared to the areas by the Wilcoxon statistic; there was no significant difference.

Results.
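The comparison machinery just described can be sketched in Python. The standard-error formula below is the Hanley-McNeil variance of the Wilcoxon area [18]; the two areas are those reported for Figure 4 (θ = .737 and .717), while the sample sizes (750 positives, 2,750 negatives) and the correlation r = .5 are hypothetical values chosen for illustration.

```python
import math

def hanley_mcneil_se(theta, n_pos, n_neg):
    """Standard error of the Wilcoxon ROC area (Hanley & McNeil formula)."""
    q1 = theta / (2.0 - theta)              # P(two positives outrank one negative)
    q2 = 2.0 * theta ** 2 / (1.0 + theta)   # P(one positive outranks two negatives)
    var = (theta * (1.0 - theta)
           + (n_pos - 1) * (q1 - theta ** 2)
           + (n_neg - 1) * (q2 - theta ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

def z_score(theta1, se1, theta2, se2, r=0.0):
    """z for a difference of two ROC areas; r > 0 when the areas share a test set."""
    return (theta1 - theta2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2)

se1 = hanley_mcneil_se(0.737, 750, 2750)
se2 = hanley_mcneil_se(0.717, 750, 2750)
z_independent = z_score(0.737, se1, 0.717, se2)          # separate test sets
z_correlated = z_score(0.737, se1, 0.717, se2, r=0.5)    # same test set
print(z_independent, z_correlated)
```

As the second formula in the text implies, accounting for a positive correlation between areas measured on the same cases shrinks the denominator and increases the power of the comparison: here the correlated z is noticeably larger than the independent one for the same difference in areas.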
The area under the ROC curve (θ) and its standard error (SE) were used to test for significance of differences in performance by the various networks. Three types of overall analyses were performed: 1) comparison of different authors' weights utilizing the same test set; 2) for a given author's weights, comparison of the ability to predict α-helix versus β-strand; and 3) study of the effect of the prior probability of positive examples during the training phase on a network's predictive accuracy. The ROC area measures, θ, of the predictive accuracy of different authors' weights are summarized in Table 2. The reader should note that the standard errors for different networks did in fact differ, but not when only three significant figures were used. Comparisons of type 1 (above). In this case the appropriate statistic for determining whether a performance difference exists is the standard z score (see Discussion) exploiting the covariance (see above). There was no significant difference in performance for these networks on either Testing set #1 or #2 at a level of significance of .05. For these comparisons, z scores ranged from a minimum of .497 to a maximum below the critical value of 1.96. Comparisons of type 2 (above). In this case the comparison is between θ for α-helix versus θ for β-strand, for either the same or different authors. In this case, however, there is no covariance, because the α-helix test examples do not constitute a common set with those of β-strand. The z score for these comparisons ranged from a minimum of .073 to a maximum of .688. Therefore, there was no significant difference for prediction of α-helix versus β-strand for any author with either test set. Figure 3 shows an example of a plot of two ROC's, one of α-helix and one of β-strand. Analyses of type 3 (above).
After training a network with Training set #1, ROC analysis revealed that performance was inferior to that of the weights reported by Qian and Sejnowski [5], in the face of apparent convergence of the network (see Figure 4).

Small differences in the protein structural data exist between that used by Qian and Sejnowski and that utilized in our laboratory, because changes are made to the Brookhaven database with time, and ours was undoubtedly more recent (August 1988 version) than that available to the other authors at the time of their network training. It is unlikely that these small differences could account for the performance difference, especially since more recent data would be expected to have the same or better fidelity than earlier data. The mathematically precise role of the prior probabilities of positive examples (prevalence) during neural network training is not known. However, it seemed intuitively clear that materially changing the likelihood of positive examples would alter the network's operating point for learning. Consequently, the training set was modified by replicating positive examples of both α-helix and β-strand, such that the probability of each was near .5. After re-training both the α-helix and β-strand weights, the network was re-tested, and there was an increase in prediction accuracy, such that θ was indistinguishable from other reported results (see Figures 4 and 5 and Table 3). The increase in ROC area resulted in a z score of 1.58, which falls just short of the .05 significance level for a two-tailed test.

Figure 3. Plot of the ROC for α-helix (dark graph) and β-strand. Weights tested were those of Holley & Karplus [6] using Testing set #2. TPR is true positive rate; FPR is false positive rate. The difference in area under the curves (θα versus θβ) is not significant at the .05 level.

Table 2. Comparative accuracy of three networks for detection of both α-helix and β-strand. Accuracy is measured by θ = ROC area (see text); SE = standard error of θ.

Test set #1 [5]    Network weights* ->   Q&S    H&K    S&M
  θ α-helix
  SE α-helix
  θ β-strand
  SE β-strand
Test set #2 [6]    Network weights* ->   Q&S    H&K    S&M
  θ α-helix
  SE α-helix
  θ β-strand
  SE β-strand

* Q&S: Qian & Sejnowski [5]; H&K: Holley & Karplus [6]; S&M: Spackman & Meistrell.

Figure 4. Two ROC's for α-helix using Testing set #1. Upper curve is from the weights of Qian and Sejnowski [5] (θ = .737, SE = .025); lower curve is from weights after training with Training set #1 (θ = .717, SE = .025).
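The prevalence-enrichment step used to build Training sets #2 and #3 (replicating positive examples until their prior probability is near .5) can be sketched as follows. The counts mirror Table 1's α-helix prevalence (4,648 positives of 18,111 examples); the records themselves are placeholders, not real training patterns.

```python
import random

def enrich(examples, target_prevalence=0.5, seed=0):
    """Duplicate randomly chosen positive examples until their share
    of the set reaches target_prevalence."""
    rng = random.Random(seed)
    positives = [e for e in examples if e["label"] == 1]
    enriched = list(examples)
    n_pos = len(positives)
    while n_pos / len(enriched) < target_prevalence:
        enriched.append(rng.choice(positives))
        n_pos += 1
    return enriched

# Placeholder records with Table 1's alpha-helix class balance.
examples = [{"label": 1}] * 4648 + [{"label": 0}] * 13463
enriched = enrich(examples)
print(sum(e["label"] for e in enriched), len(enriched))   # 13463 26926
```

Duplication stops as soon as positives make up half the enlarged set; a real implementation would of course carry the full input pattern in each record rather than a bare label.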

Figure 5. Plot of α-helix ROC's after retraining with Training sets #2 and #3 (see text). Upper curve is from the retrained weights; lower curve is from Training set #1 (identical to the lower curve, Fig. 4).

Table 3. Improved training with increased prevalence of positive examples of both α-helix and β-strand. The same test set was used for evaluating both sets of weights. Training sets #2 & #3 have increased prevalence of positive examples compared with Training set #1.

                        Train set #1    Train sets #2 & #3
θ α-helix (set 2)
SE α-helix                  .025               .025
θ β-strand (set 3)
SE β-strand

Discussion. A robust, unbiased, easily understood means of assessing the accuracy of the type of neural network described here is needed. The receiver operating characteristic rests on solid theoretical foundations, and there has been considerable experience with application of the method in diverse areas [11]. The advantages of a generally understood and readily interpreted scale for measuring accuracy warrant use of the ROC for neural networks. The area under the ROC and its standard error provide such a means of measuring the accuracy of prediction. The output of a neural network is a continuous variable with a value between zero and one when a logistic function determines the activation of the output node. In these networks, the ability of each output node to properly classify an amino acid residue in the context of its 12 or 16 nearest primary sequence neighbors was assessed by calculating the ROC for each output node. No assumptions were made about the shape of the underlying probability distributions (other than the continuous nature of the data). The available data have such a "fine grain" as to make maximum-likelihood curve-fitting unnecessary (see Hanley and McNeil [18] and Hanley [19]).
The straightforward computational method of utilizing the Wilcoxon statistic to compute the area under the ROC curve, while generally somewhat conservative in comparison to the maximum-likelihood method of Dorfman and Alf [22], can be expected to agree almost perfectly for the data reported here. In fact, the assumption of normal distributions for positive and negative examples in these data is probably fully justified [19], but the computational burden of the nonparametric methods utilized here is small. Use of the z score is readily justified for these data by the central limit theorem; that is, the Wilcoxon statistic is "asymptotically normally distributed" [20]. This report sheds light on an interesting question from the biotechnology literature. It has frequently been stated, for a variety of methods, that the prediction accuracy of α-helix is superior to that of β-strand, and this was implied in both recent reports utilizing neural networks to perform the predictions. Authors have speculated that the reason for this is a greater influence of tertiary interactions (compared with local amino acid sequence) for β-strand than for α-helix. However, this impression is a misconception fostered by test statistics such as "percent correct α-helix" or "percent correct β-strand". For an example of the calculation of such a statistic, see Qian and Sejnowski [5]. The "percent correct" statistic can be misleading because it is sensitive to the prior probability of a positive example. Implicit in the calculation of "percent correct" as seen in the literature is the choice of a cutoff value for the output variable. Implicitly, then, the statistic is operating at a single point on the ROC curve. The prior probability of α-helix is greater in most protein sets than that of β-strand (see Table 1).
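This effect can be illustrated with a toy simulation: two detectors with identical ranking ability (the same separation between positive and negative output distributions) show very different "percent correct positive" at a fixed cutoff when their outputs sit at different operating points, as happens when training prevalence differs. All distributions and numbers here are hypothetical.

```python
import random

def wilcoxon_auc(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(7)

def detector_outputs(shift, n=800):
    """Positives and negatives separated by 0.2, both displaced by `shift`."""
    pos = [rng.gauss(0.55 + shift, 0.15) for _ in range(n)]
    neg = [rng.gauss(0.35 + shift, 0.15) for _ in range(n)]
    return pos, neg

results = {}
# "alpha-like": outputs pushed upward (favorable operating point at cutoff .5);
# "beta-like": same separation, unshifted (unfavorable operating point).
for name, shift in [("alpha-like", 0.10), ("beta-like", 0.0)]:
    pos, neg = detector_outputs(shift)
    auc = wilcoxon_auc(pos, neg)
    pct_correct_pos = sum(1 for p in pos if p >= 0.5) / len(pos)
    results[name] = (auc, pct_correct_pos)
    print(name, round(auc, 2), round(pct_correct_pos, 2))
```

The two ROC areas come out essentially equal, while "percent correct positive" at the fixed cutoff is substantially higher for the shifted detector: exactly the pattern that makes α-helix look easier to predict than β-strand when only percent-correct statistics are reported.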

Consequently, the operating point on the α-helix ROC will tend to be placed such that the posterior probability of correct prediction for α-helix is greater than that for β-strand. In contrast, the ROC is insensitive to prior probabilities [11,13] and permits a clear assessment of the capability of a given measure. Our results indicate that it really is no easier to predict α-helix than β-strand from local aa sequence. This illustrates the value of the ROC for evaluation of a system which is intended to "diagnose" the presence or absence of a particular class in its input. Analysis of these neural networks by the ROC method has provided insight into the role played by the prevalence of examples in the training of the network weights. ROC analysis can thus be expected to play a role in improving the understanding of neural networks, and possibly other types of machine learning, as classification systems. Although hidden nodes were not used in the networks reported here, the ROC technique is equally applicable to multi-layer networks.

Conclusions. ROC analysis has been successfully applied to measurement of the accuracy of feed-forward neural networks. It is proposed that this method be adopted by others for assessment of network performance and for reporting of results.

BIBLIOGRAPHY
1. Touretzky, D.; Hinton, G.; Sejnowski, T., eds. Proceedings of the 1988 Connectionist Models Summer School; June 17-26, 1988; Carnegie Mellon University. San Mateo, CA: Morgan Kaufmann; 1988.
2. Neural Networks; 1988; 1(Supplement 1).
3. Lippmann, R.P. An introduction to computing with neural nets. IEEE ASSP Magazine; April 1987: 4-22.
4. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning internal representations by error propagation. In: Rumelhart, D.E.; McClelland, J.L., eds. Parallel Distributed Processing; 1986; 1: 318-362.
5. Qian, N.; Sejnowski, T.J. Predicting the secondary structure of globular proteins using neural network models. J Mol Biol; 1988; 202: 865-884.
6. Holley, L.H.; Karplus, M. Protein secondary structure prediction with a neural network. Proc Natl Acad Sci USA; 1989; 86: 152-156.
7. Kabsch, W.; Sander, C. How good are predictions of protein secondary structure? FEBS Letters; 1983; 155(2).
8. Lim, V.I. Algorithms for prediction of alpha-helical and beta-structural regions in globular proteins. J Mol Biol; 1974; 88.
9. Chou, P.Y.; Fasman, G.D. Empirical prediction of protein conformation. Annu Rev Biochem; 1978; 47.
10. Robson, A. The prediction of peptide and protein structure. In: Darbre, A., ed. Practical Protein Chemistry - A Handbook. London: John Wiley & Sons; 1986.
11. Swets, J.A. Measuring the accuracy of diagnostic systems. Science; 3 June 1988; 240: 1285-1293.
12. Green, D.M.; Swets, J.A. Signal Detection Theory and Psychophysics. New York: Wiley; 1966.
13. Egan, J.P. Signal Detection Theory and ROC Analysis. New York: Academic Press; 1975.
14. Bamber, D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology; 1975; 12: 387-415.
15. McClelland, J.L.; Rumelhart, D.E. Training hidden units: the generalized delta rule. In: Explorations in Parallel Distributed Processing. Cambridge, MA: The MIT Press; 1988.
16. Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.B.; Meyer, E.F., Jr.; Brice, M.D.; Rodgers, J.R.; Kennard, O.; Shimanouchi, T.; Tasumi, M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol; 1977; 112: 535-542.
17. Kabsch, W.; Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers; 1983; 22: 2577-2637.
18. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology; 1982; 143: 29-36.
19. Hanley, J.A. The robustness of the "binormal" assumptions used in fitting ROC curves. Med Decision Making; 1988; 8.
20. McClish, D.K. Comparing the areas under more than two independent ROC curves. Med Decision Making; 1987; 7.
21. Hanley, J.A.; McNeil, B.J. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology; 1983; 148: 839-843.
22. Dorfman, D.D.; Alf, E., Jr. Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence intervals - rating-method data. Journal of Mathematical Psychology; 1969; 6.


More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing. Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative

More information

SELECTING NEURAL NETWORK ARCHITECTURE FOR INVESTMENT PROFITABILITY PREDICTIONS

SELECTING NEURAL NETWORK ARCHITECTURE FOR INVESTMENT PROFITABILITY PREDICTIONS UDC: 004.8 Original scientific paper SELECTING NEURAL NETWORK ARCHITECTURE FOR INVESTMENT PROFITABILITY PREDICTIONS Tonimir Kišasondi, Alen Lovren i University of Zagreb, Faculty of Organization and Informatics,

More information

Binary Diagnostic Tests Two Independent Samples

Binary Diagnostic Tests Two Independent Samples Chapter 537 Binary Diagnostic Tests Two Independent Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary

More information

NTC Project: S01-PH10 (formerly I01-P10) 1 Forecasting Women s Apparel Sales Using Mathematical Modeling

NTC Project: S01-PH10 (formerly I01-P10) 1 Forecasting Women s Apparel Sales Using Mathematical Modeling 1 Forecasting Women s Apparel Sales Using Mathematical Modeling Celia Frank* 1, Balaji Vemulapalli 1, Les M. Sztandera 2, Amar Raheja 3 1 School of Textiles and Materials Technology 2 Computer Information

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities

Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Oscar Kipersztok Mathematics and Computing Technology Phantom Works, The Boeing Company P.O.Box 3707, MC: 7L-44 Seattle, WA 98124

More information

U-shaped curves in development: A PDP approach. Carnegie Mellon University, Pittsburgh PA. Fax: international +44 1223 359062 Fax: 412 268-2798

U-shaped curves in development: A PDP approach. Carnegie Mellon University, Pittsburgh PA. Fax: international +44 1223 359062 Fax: 412 268-2798 U-shaped curves in development: A PDP approach Timothy T. Rogers 1, David H. Rakison 2, and James L. McClelland 2,3 1 MRC Cognition and Brain Sciences Unit, Cambridge, UK 2 Department of Psychology, and

More information

Artificial Neural Network and Non-Linear Regression: A Comparative Study

Artificial Neural Network and Non-Linear Regression: A Comparative Study International Journal of Scientific and Research Publications, Volume 2, Issue 12, December 2012 1 Artificial Neural Network and Non-Linear Regression: A Comparative Study Shraddha Srivastava 1, *, K.C.

More information

Optimization of technical trading strategies and the profitability in security markets

Optimization of technical trading strategies and the profitability in security markets Economics Letters 59 (1998) 249 254 Optimization of technical trading strategies and the profitability in security markets Ramazan Gençay 1, * University of Windsor, Department of Economics, 401 Sunset,

More information

Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM

Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Paper AA-08-2015 Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Delali Agbenyegah, Alliance Data Systems, Columbus, Ohio 0.0 ABSTRACT Traditional

More information

Price Prediction of Share Market using Artificial Neural Network (ANN)

Price Prediction of Share Market using Artificial Neural Network (ANN) Prediction of Share Market using Artificial Neural Network (ANN) Zabir Haider Khan Department of CSE, SUST, Sylhet, Bangladesh Tasnim Sharmin Alin Department of CSE, SUST, Sylhet, Bangladesh Md. Akter

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev 86 ITHEA APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING Anatoli Nachev Abstract: This paper presents a case study of data mining modeling techniques for direct marketing. It focuses to three

More information

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

Performance Based Evaluation of New Software Testing Using Artificial Neural Network

Performance Based Evaluation of New Software Testing Using Artificial Neural Network Performance Based Evaluation of New Software Testing Using Artificial Neural Network Jogi John 1, Mangesh Wanjari 2 1 Priyadarshini College of Engineering, Nagpur, Maharashtra, India 2 Shri Ramdeobaba

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

Mass Spectrometry Signal Calibration for Protein Quantitation

Mass Spectrometry Signal Calibration for Protein Quantitation Cambridge Isotope Laboratories, Inc. www.isotope.com Proteomics Mass Spectrometry Signal Calibration for Protein Quantitation Michael J. MacCoss, PhD Associate Professor of Genome Sciences University of

More information

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance. Chapter 6: Behavioural models Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

More information

LOGISTIC REGRESSION ANALYSIS

LOGISTIC REGRESSION ANALYSIS LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Applying Statistics Recommended by Regulatory Documents

Applying Statistics Recommended by Regulatory Documents Applying Statistics Recommended by Regulatory Documents Steven Walfish President, Statistical Outsourcing Services steven@statisticaloutsourcingservices.com 301-325 325-31293129 About the Speaker Mr. Steven

More information

Abstract. The DNA promoter sequences domain theory and database have become popular for

Abstract. The DNA promoter sequences domain theory and database have become popular for Journal of Artiæcial Intelligence Research 2 è1995è 361í367 Submitted 8è94; published 3è95 Research Note On the Informativeness of the DNA Promoter Sequences Domain Theory Julio Ortega Computer Science

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Measurement and Metrics Fundamentals. SE 350 Software Process & Product Quality

Measurement and Metrics Fundamentals. SE 350 Software Process & Product Quality Measurement and Metrics Fundamentals Lecture Objectives Provide some basic concepts of metrics Quality attribute metrics and measurements Reliability, validity, error Correlation and causation Discuss

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

Handwritten Digit Recognition with a Back-Propagation Network

Handwritten Digit Recognition with a Back-Propagation Network 396 Le Cun, Boser, Denker, Henderson, Howard, Hubbard and Jackel Handwritten Digit Recognition with a Back-Propagation Network Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard,

More information

CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13

CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13 COMMON DESCRIPTIVE STATISTICS / 13 CHAPTER THREE COMMON DESCRIPTIVE STATISTICS The analysis of data begins with descriptive statistics such as the mean, median, mode, range, standard deviation, variance,

More information

CSC 2427: Algorithms for Molecular Biology Spring 2006. Lecture 16 March 10

CSC 2427: Algorithms for Molecular Biology Spring 2006. Lecture 16 March 10 CSC 2427: Algorithms for Molecular Biology Spring 2006 Lecture 16 March 10 Lecturer: Michael Brudno Scribe: Jim Huang 16.1 Overview of proteins Proteins are long chains of amino acids (AA) which are produced

More information

NEURAL NETWORKS FOR SENSOR VALIDATION AND PLANT MONITORING. B.R. Upadhyaya, E. Eryurek and G. Mathai

NEURAL NETWORKS FOR SENSOR VALIDATION AND PLANT MONITORING. B.R. Upadhyaya, E. Eryurek and G. Mathai NEURAL NETWORKS FOR SENSOR VALIDATION AND PLANT MONITORING B.R. Upadhyaya, E. Eryurek and G. Mathai The University of Tennessee Knoxville, Tennessee 37996-2300, U.S.A. OONF-900804 35 DE93 002127 ABSTRACT

More information

Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Indiana State Core Curriculum Standards updated 2009 Algebra I

Indiana State Core Curriculum Standards updated 2009 Algebra I Indiana State Core Curriculum Standards updated 2009 Algebra I Strand Description Boardworks High School Algebra presentations Operations With Real Numbers Linear Equations and A1.1 Students simplify and

More information

Neural Networks in Data Mining

Neural Networks in Data Mining IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V6 PP 01-06 www.iosrjen.org Neural Networks in Data Mining Ripundeep Singh Gill, Ashima Department

More information

Mikko Rönkkö, Antero Järvi, Markus M. Mäkelä. Measuring and Comparing the Adoption of Software Process Practices in the Software Product Industry

Mikko Rönkkö, Antero Järvi, Markus M. Mäkelä. Measuring and Comparing the Adoption of Software Process Practices in the Software Product Industry Mikko Rönkkö, Antero Järvi, Markus M. Mäkelä Measuring and Comparing the Adoption of Software Process Practices in the Software Product Industry 1) Inroduction 2) Research approach 3) Results 4) Discussion

More information

Evaluation of Diagnostic Tests

Evaluation of Diagnostic Tests Biostatistics for Health Care Researchers: A Short Course Evaluation of Diagnostic Tests Presented ed by: Siu L. Hui, Ph.D. Department of Medicine, Division of Biostatistics Indiana University School of

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Neural network models: Foundations and applications to an audit decision problem

Neural network models: Foundations and applications to an audit decision problem Annals of Operations Research 75(1997)291 301 291 Neural network models: Foundations and applications to an audit decision problem Rebecca C. Wu Department of Accounting, College of Management, National

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Charles Secolsky County College of Morris. Sathasivam 'Kris' Krishnan The Richard Stockton College of New Jersey

Charles Secolsky County College of Morris. Sathasivam 'Kris' Krishnan The Richard Stockton College of New Jersey Using logistic regression for validating or invalidating initial statewide cut-off scores on basic skills placement tests at the community college level Abstract Charles Secolsky County College of Morris

More information

New Ensemble Combination Scheme

New Ensemble Combination Scheme New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

More information

Weather forecast prediction: a Data Mining application

Weather forecast prediction: a Data Mining application Weather forecast prediction: a Data Mining application Ms. Ashwini Mandale, Mrs. Jadhawar B.A. Assistant professor, Dr.Daulatrao Aher College of engg,karad,ashwini.mandale@gmail.com,8407974457 Abstract

More information

Parametric and Nonparametric: Demystifying the Terms

Parametric and Nonparametric: Demystifying the Terms Parametric and Nonparametric: Demystifying the Terms By Tanya Hoskin, a statistician in the Mayo Clinic Department of Health Sciences Research who provides consultations through the Mayo Clinic CTSA BERD

More information

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996)

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996) MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL by Michael L. Orlov Chemistry Department, Oregon State University (1996) INTRODUCTION In modern science, regression analysis is a necessary part

More information

SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID

SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID Renewable Energy Laboratory Department of Mechanical and Industrial Engineering University of

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

AN APPLICATION OF TIME SERIES ANALYSIS FOR WEATHER FORECASTING

AN APPLICATION OF TIME SERIES ANALYSIS FOR WEATHER FORECASTING AN APPLICATION OF TIME SERIES ANALYSIS FOR WEATHER FORECASTING Abhishek Agrawal*, Vikas Kumar** 1,Ashish Pandey** 2,Imran Khan** 3 *(M. Tech Scholar, Department of Computer Science, Bhagwant University,

More information

An Introduction to Neural Networks

An Introduction to Neural Networks An Introduction to Vincent Cheung Kevin Cannons Signal & Data Compression Laboratory Electrical & Computer Engineering University of Manitoba Winnipeg, Manitoba, Canada Advisor: Dr. W. Kinsner May 27,

More information

Neural Networks and Support Vector Machines

Neural Networks and Support Vector Machines INF5390 - Kunstig intelligens Neural Networks and Support Vector Machines Roar Fjellheim INF5390-13 Neural Networks and SVM 1 Outline Neural networks Perceptrons Neural networks Support vector machines

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

The CHIR Algorithm for Feed Forward. Networks with Binary Weights

The CHIR Algorithm for Feed Forward. Networks with Binary Weights 516 Grossman The CHIR Algorithm for Feed Forward Networks with Binary Weights Tal Grossman Department of Electronics Weizmann Institute of Science Rehovot 76100 Israel ABSTRACT A new learning algorithm,

More information

How To Use Neural Networks In Data Mining

How To Use Neural Networks In Data Mining International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

DATA MINING PROCESS FOR UNBIASED PERSONNEL ASSESSMENT IN POWER ENGINEERING COMPANY

DATA MINING PROCESS FOR UNBIASED PERSONNEL ASSESSMENT IN POWER ENGINEERING COMPANY Medunarodna naucna konferencija MENADŽMENT 2012 International Scientific Conference MANAGEMENT 2012 Mladenovac, Srbija, 20-21. april 2012 Mladenovac, Serbia, 20-21 April, 2012 DATA MINING PROCESS FOR UNBIASED

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Handling attrition and non-response in longitudinal data

Handling attrition and non-response in longitudinal data Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

More information

Speech and Network Marketing Model - A Review

Speech and Network Marketing Model - A Review Jastrzȩbia Góra, 16 th 20 th September 2013 APPLYING DATA MINING CLASSIFICATION TECHNIQUES TO SPEAKER IDENTIFICATION Kinga Sałapa 1,, Agata Trawińska 2 and Irena Roterman-Konieczna 1, 1 Department of Bioinformatics

More information

A New Approach For Estimating Software Effort Using RBFN Network

A New Approach For Estimating Software Effort Using RBFN Network IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.7, July 008 37 A New Approach For Estimating Software Using RBFN Network Ch. Satyananda Reddy, P. Sankara Rao, KVSVN Raju,

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs

Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs Andrew Gelman Guido Imbens 2 Aug 2014 Abstract It is common in regression discontinuity analysis to control for high order

More information

Introduction to Principal Components and FactorAnalysis

Introduction to Principal Components and FactorAnalysis Introduction to Principal Components and FactorAnalysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

Role of Neural network in data mining

Role of Neural network in data mining Role of Neural network in data mining Chitranjanjit kaur Associate Prof Guru Nanak College, Sukhchainana Phagwara,(GNDU) Punjab, India Pooja kapoor Associate Prof Swami Sarvanand Group Of Institutes Dinanagar(PTU)

More information

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the

More information

The Mathematics of Alcoholics Anonymous

The Mathematics of Alcoholics Anonymous The Mathematics of Alcoholics Anonymous "As a celebrated American statesman put it, 'Let's look at the record. Bill Wilson, Alcoholics Anonymous, page 50, A.A.W.S. Inc., 2001. Part 2: A.A. membership surveys

More information

Healthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw

Healthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Healthcare data analytics Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

A Neural Network based Approach for Predicting Customer Churn in Cellular Network Services

A Neural Network based Approach for Predicting Customer Churn in Cellular Network Services A Neural Network based Approach for Predicting Customer Churn in Cellular Network Services Anuj Sharma Information Systems Area Indian Institute of Management, Indore, India Dr. Prabin Kumar Panigrahi

More information

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

More information

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks H. T. Kung Dario Vlah {htk, dario}@eecs.harvard.edu Harvard School of Engineering and Applied Sciences

More information

A C T R esearcli R e p o rt S eries 2 0 0 5. Using ACT Assessment Scores to Set Benchmarks for College Readiness. IJeff Allen.

A C T R esearcli R e p o rt S eries 2 0 0 5. Using ACT Assessment Scores to Set Benchmarks for College Readiness. IJeff Allen. A C T R esearcli R e p o rt S eries 2 0 0 5 Using ACT Assessment Scores to Set Benchmarks for College Readiness IJeff Allen Jim Sconing ACT August 2005 For additional copies write: ACT Research Report

More information

Application-Specific Biometric Templates

Application-Specific Biometric Templates Application-Specific Biometric s Michael Braithwaite, Ulf Cahn von Seelen, James Cambier, John Daugman, Randy Glass, Russ Moore, Ian Scott, Iridian Technologies Inc. Introduction Biometric technologies

More information