Computer-assisted diagnosis of breast cancer using a data-driven Bayesian belief network


International Journal of Medical Informatics 54 (1999)
Xiao-Hui Wang, Bin Zheng, Walter F. Good *, Jill L. King, Yuan-Hsiang Chang
Imaging Research Division, Department of Radiology, University of Pittsburgh, A439 Scaife Hall, Pittsburgh, PA, USA
* Corresponding author.
Accepted 4 December 1998
Abstract
This study investigates a simple Bayesian belief network for the diagnosis of breast cancer, and specifically addresses the question of whether integrating image and non-image based features into a single network can yield better performance than hybrid combinations of independent networks. From a dataset of 419 cases, including 92 malignancies, 13 features relating to mammographic findings, physical examinations and patients' clinical histories were extracted to build three Bayesian belief networks. The scenarios tested included a network incorporating all features and two hybrids which combined the outputs of sub-networks corresponding to the image or non-image features. Average areas (A_z) under the corresponding ROC curves were used as measures of performance. The network incorporating only image based features performed better (A_z = 0.81) than that using non-image features (A_z = 0.71). Both hybrid classifiers yielded better performance (A_z = 0.85 for averaging and A_z = 0.87 for logistic regression), but neither hybrid was as accurate as the network incorporating all features (A_z = 0.89). This preliminary study suggests that, like human observers who concurrently consider different types of information, a single classifier that simultaneously evaluates both image and non-image information can achieve better diagnostic performance than the hybrid combinations considered here. © 1999 Elsevier Science Ireland Ltd. All rights reserved.
Keywords: Bayesian belief network; Breast cancer; Computer-assisted diagnosis; Classifier; Cross-validation; Machine learning
1. Introduction
Mammography is currently the most effective diagnostic tool for the early detection of breast cancer. However, because of the complexity of tissue patterns represented in mammograms, combined with the low prevalence of cancer in screening environments, the early diagnosis of breast cancer can be a difficult task. Clinical studies have shown that radiologists may initially miss 10–30% of breast cancers that are visible in mammograms [1], and that fewer than 30% of the patients who have undergone biopsy were found to have breast cancer [2]. One effective approach for improving diagnostic accuracy is the independent double-reading of mammograms [3]; however, this approach is both inefficient and costly. As an alternative, computerized decision aids that can assist physicians in the diagnosis of breast cancer have become a topic of extensive research during the past decade [4,5], and many of these approaches have demonstrated potential value for improving diagnosis.
A number of decision systems have been developed which use mammographic features described by radiologists and other relevant clinical information for predicting the risk or probability of having breast cancer. An early attempt at providing a decision aid to mammographers applied discriminant analysis to identify lists of perceptual features, which were weighted according to their relative importance and evaluated by a computer-based classifier [6–8]. More recently, a similar method was adopted to combine evidence from diaphanography and mammography [9]. Cook et al. have developed a rule based expert system which incorporates features from mammograms, as well as from clinical and patient history data [10]. Various efforts have also been undertaken to apply neural network technology to breast cancer diagnosis. These include the development of a system to classify features extracted from mammograms by radiologists [11] and the development of a classification scheme which incorporates patient age as well as mammographic features [12–14].
Much of the current interest in the development of decision systems centers on techniques employing Bayesian networks, and these networks have been applied to a number of problems similar to those for which neural networks have traditionally been used. These networks, which are often called belief networks, Bayesian belief networks, or probabilistic causal networks, are represented as directed acyclic graphs, where the nodes correspond to variables and the links between nodes relate to the independence assumptions which hold between the nodes. For each node there is a probability function which specifies, for each value of the variable and for each value of its parent variables, the conditional probability of the node given the values of its parents (i.e. P(Node | Parent_i)) [15–19]. Once input parameters to the network have been specified, the network generates a hypothesis about the values of the remaining parameters, which is optimal in the sense that no other hypothesis is more likely [15,16]. Bayesian networks are very attractive for medical diagnostic systems because they can be applied to make inferences in cases where the input data are incomplete. This is the situation in many clinical settings, where diagnostic decisions must be made on limited data but can be revised later as more data become available.
The most significant application of Bayesian networks to breast cancer detection reported to date is that of Kahn et al. [20,21], who used Bayesian networks as the basis of a system which combined 15 mammographic features, five patient-history features and two physical findings into a decision process for predicting the likelihood of breast cancer. The probabilities required by their network were derived largely from published statistics and the subjective estimates of expert mammographers. This system was evaluated on a set of 77 cases, which included 25 malignancies, using receiver operating characteristic (ROC) analysis, with performance measured by the average area under the ROC curve (A_z). The level of performance attained was actually higher than that achieved by radiologists reading a largely overlapping set of cases in an independent study [11].
Compared to artificial neural networks, Bayesian belief networks have certain unique advantages, in addition to their ability to work with incomplete information, mentioned above. One such advantage is that they can provide explanations of their decisions [15,22–24]. Because they provide a flexible and natural way of specifying the dependence and independence of variables through the network topology, their structure tends to reflect the logical structure inherent in a decision task. In contrast, neural networks can be viewed, to a large extent, as a black box whose machine-learned internal decision structure is generally incomprehensible to human observers. The probability values for links between nodes in Bayesian networks reflect degrees of dependence between variables. This makes it possible for human experts to examine the structure of these networks and uncover relationships between the variables, and hence to assess the reasonableness of the decision process. Confirmation by an expert can provide a level of confidence in Bayesian networks that is not attainable in neural network implementations, and this is likely to be an important factor in their gaining acceptance in the field of medical diagnosis.
An additional advantage of the Bayesian network paradigm is that, instead of relying on an iterative optimization approach, as is the case when training artificial neural networks with procedures such as back-propagation, the weights on the links between nodes can be derived from subjective estimates of the probabilities, taken from statistical reports in the medical literature, or determined from datasets by using probabilistic learning methods [16,25]. These processes can accommodate knowledge about the prior probabilities of alternative hypotheses and the probability of observing various data given some hypothesis.
Despite these very positive qualities, Bayesian networks have certain limitations as compared to neural networks. Current implementations of Bayesian networks require that nodes be assigned discrete values. This means that continuous variables must be quantized, though in practice, given the limited accuracy with which continuous variables are usually known, this does not necessarily reduce the precision of the input parameters significantly. Using a large number of possible states for a parameter increases its potential precision, but at the same time increases the size of the probability tables that must be derived and retained. A more fundamental difficulty in applying Bayesian networks relates to the computational complexity of evaluating these networks. Although the singly-connected structure of all the networks considered in this study permits them to be evaluated in a reasonable time [16,26], this is not true of Bayesian networks having a more general structure, for which probabilistic inference has been shown to be NP-hard [27]. The issue of computational complexity is currently an active topic of research, and improved approximation algorithms should become available in the future [27].
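To make the discretization trade-off mentioned above concrete, the following Python sketch maps a continuous value to a discrete state and counts the entries of a conditional probability table. It is illustrative only: the helper names `discretize` and `cpt_size`, the cut points, and the example values are assumptions and are not taken from this study.

```python
from bisect import bisect_right
from math import prod

def discretize(value, cut_points):
    """Return the index of the state that `value` falls into.

    `cut_points` must be sorted in ascending order; k cut points
    define k + 1 states.
    """
    return bisect_right(cut_points, value)

def cpt_size(child_states, parent_states):
    """Number of entries in a node's conditional probability table."""
    return child_states * prod(parent_states)

# Hypothetical example: a measurement split into three states.
state = discretize(4.2, [2.0, 5.0])   # falls in the middle state (index 1)

# A binary node with five parents: binary parents versus four-state parents.
small = cpt_size(2, [2] * 5)
large = cpt_size(2, [4] * 5)
print(state, small, large)
```

In this hypothetical case, doubling the number of states of each of the five parents increases the table from 64 to 2048 entries, which illustrates why finer quantization quickly demands more training data.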
In current applications of Bayesian networks to the diagnosis of breast cancer, the networks are relatively small and the problem of computational complexity has not been prohibitive. Nevertheless, as these systems evolve they will likely become more complex and the computational issue will have to be addressed. In this case, there may be a significant computational advantage to partitioning networks into multiple sub-networks and forming a hybrid combination of the outputs of these individual sub-networks. The question arises as to how to optimally decompose the network and whether such a partitioning will have a significantly adverse impact on overall performance.
As part of our ongoing endeavor to apply intelligent system techniques to diagnostic tasks in radiology, we have begun an investigation of certain questions that have arisen in our effort to develop a Bayesian network decision mechanism for breast cancer detection. The two issues addressed in this preliminary investigation are the impact of a seemingly natural decomposition of a Bayesian network for breast cancer diagnosis, and the feasibility of using machine learning to configure networks by directly analyzing large databases of cases.
Specifically, in this study we investigated the application of simple Bayesian networks to the diagnosis of breast cancer, based on features derived from mammographic findings, physical examination findings, and other relevant data from patients' clinical histories. The networks used in this study were automatically built by applying machine learning methods to a set of training cases. With these networks, we investigated a possible decomposition of the decision task into subtasks related to the image and non-image components of the feature set, as well as the question of how best to use the combined data in a single decision aid. A description of the approach, along with the preliminary test results of a fivefold cross-validation on a set of 419 clinical cases, is reported below.
2. Materials and methods
The cases used in this study were selected, in order, from the film library of Magee Womens Hospital's Breast Care Centers in Pittsburgh, PA, and correspond to mammographic examinations performed over a period beginning in 1987. Cases were used only if complete follow-up documentation was available. Of the total of 419 cases selected, 92 were positive for malignancy. Positive cases were verified by biopsy and/or surgical reports, while establishing a negative case required a negative follow-up for at least a 2-year period.
Our database contains both mammographic and non-mammographic features. The non-mammographic features, which are related to patient history and physical exam, were extracted from patient files. To obtain mammographic features we employed a specially designed computerized scoring form, into which experienced mammographers reported their findings as they read the films. A detailed description of the design and creation of this database has been reported elsewhere [28].
Because we consider this to be primarily a feasibility study, in anticipation of more elaborate investigations to optimize both the network topology and feature set, we employed a preliminary feature set and topology which were specified based on subjective evaluation and on general radiological experience. Although such a subjectively designed system is not optimal, it is sufficient to define a lower bound on the performance that can be expected from this type of system.
When applying a data-driven machine learning algorithm to medical diagnosis, the size of the input feature set must be limited by the training sample size if robust performance is to be achieved [29]. At present, only a relatively small number of cases are available for this study. Thus, we limited this study to 13 features from the database, which were used to train and test the networks. The selected features, and their possible states in our networks, are summarized in Table 1. The features include four from the mammographic findings, four from the general physical examination, and five from other patient clinical history data.
Based on consideration of the dependence and independence of the selected features, we adopted the singly connected structure shown in Fig. 1 for the topology of our network. The network was built by using commercially available software, Hugin Demo [15]. By definition, Bayesian belief networks [15] are acyclic and satisfy d-separation, so there are no feedback loops between nodes. The absence of a link (or path) between two nodes indicates that, although the variables are not necessarily statistically independent, whatever dependence exists is assumed to be unimportant in the particular decision process being modeled.
To build the network, we first determined a series of prior and conditional probabilities according to the network's topology. A common practice in applying Bayesian networks is to represent these probabilities in a conditional probability table [15]. Our network (as shown in Fig. 1) consists of three layers. In the first layer, there are five features derived from patients' clinical histories. Since each of these features has two possible states, yes and no, as shown in Table 1, the probability table for this layer will contain ten values, of which only five are needed to completely determine this layer.
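Table 1. Definition of features and their states in the Bayesian belief networks (a)

| Category | Node description | State description |
| --- | --- | --- |
| Diagnosis | Breast cancer | Yes, no |
| Clinical history | Habit of drinking alcoholic beverages and smoking | Yes, no |
| Clinical history | Taking female hormones | Yes, no |
| Clinical history | Have gone through menopause | Yes, no |
| Clinical history | Have ever been pregnant | Yes, no |
| Clinical history | Family member has breast cancer | Yes, no |
| Physical findings | Nipple discharge | Present, absent |
| Physical findings | Skin thickening | Present, absent |
| Physical findings | Breast pain | Present, absent |
| Physical findings | Have a lump(s) | Present, absent |
| Mammographic findings | Architectural distortion | Present, absent |
| Mammographic findings | Mass | Score from one to three, score from four to five, absent |
| Mammographic findings | Microcalcification cluster | Score from one to three, score from four to five, absent |
| Mammographic findings | Asymmetry | Present, absent |

(a) Scores for both masses and microcalcification clusters are based on a scale of one to five, where one is definitely benign and five is very suspicious for malignancy.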

Fig. 1. Topology of a Bayesian belief network to diagnose breast cancer. The first layer includes five features related to patients' clinical history, the second layer contains one diagnostic feature, breast cancer, and the third layer involves eight features from both mammographic findings and general physical examination findings.
Breast cancer is the only node in the second layer, and has five parent nodes (Y_i, i = 1, ..., 5) from the first layer. In this layer the conditional probabilities P(Cancer | Y_1, Y_2, Y_3, Y_4, Y_5) must be computed. Since each of the five parent nodes has two states in the configuration shown in Table 1, the probability table will contain 64 values, but because not all are independent (each value for Cancer = no is the complement of the corresponding value for Cancer = yes), the following 32 conditional probabilities are sufficient to specify this probability table:
P_1(Cancer = yes | Y_1 = yes, Y_2 = yes, Y_3 = yes, Y_4 = yes, Y_5 = yes),
P_2(Cancer = yes | Y_1 = yes, Y_2 = yes, Y_3 = yes, Y_4 = yes, Y_5 = no),
P_3(Cancer = yes | Y_1 = yes, Y_2 = yes, Y_3 = yes, Y_4 = no, Y_5 = yes),
...,
P_32(Cancer = yes | Y_1 = no, Y_2 = no, Y_3 = no, Y_4 = no, Y_5 = no).
The breast cancer node also has eight daughter nodes (X_i, i = 1, ..., 8) represented by the third layer, and in this preliminary experiment these nodes are assumed to be independent of each other (conditionally, given the state of the breast cancer node). Since six nodes in this layer have two possible states and the remaining two have three possible states (see Table 1), and each is conditioned on the two states of the breast cancer node, the probability table for this layer will contain 36 values, 20 of which are independent and sufficient to completely specify the layer. Thus, in order to fully specify the Bayesian network as shown in Fig. 1, a table containing 110 probability values must be determined, but only 57 of these values are independent. In this study, all of the necessary probabilities were automatically computed from the cases selected for network training.
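The table sizes quoted above follow directly from the number of states per node. The Python sketch below reproduces that arithmetic under the three-layer topology described in the text (five binary history nodes, one binary cancer node with those five parents, and eight children of the cancer node, six binary and two ternary). The variable and function names are illustrative and are not taken from the software used in the study.

```python
from math import prod

# State counts per node, following Table 1 (names are illustrative).
history_states = [2, 2, 2, 2, 2]           # first layer: five yes/no features
cancer_states = 2                          # second layer: breast cancer yes/no
finding_states = [2, 2, 2, 2, 2, 2, 3, 3]  # third layer: six binary, two ternary

def table_sizes():
    """Return (total, independent) probability counts for each layer."""
    # Layer 1: root nodes, one prior distribution per node.
    l1_total = sum(history_states)
    l1_indep = sum(s - 1 for s in history_states)
    # Layer 2: P(Cancer | Y1..Y5), one distribution per parent configuration.
    configs = prod(history_states)
    l2_total = configs * cancer_states
    l2_indep = configs * (cancer_states - 1)
    # Layer 3: P(Xi | Cancer), one distribution per cancer state per finding.
    l3_total = sum(s * cancer_states for s in finding_states)
    l3_indep = sum((s - 1) * cancer_states for s in finding_states)
    return [(l1_total, l1_indep), (l2_total, l2_indep), (l3_total, l3_indep)]

sizes = table_sizes()
total = sum(t for t, _ in sizes)
independent = sum(i for _, i in sizes)
print(sizes, total, independent)
```

Each distribution over s states contributes s values but only s - 1 independent ones, because the values must sum to one. Summing over the three layers gives the 110 total and 57 independent values stated in the text.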
In the event that a conditional probability needed in the second layer could not be calculated from our database because of its limited size, the default probability values P(Cancer = yes | Y_1, ..., Y_5) = 0.5 and P(Cancer = no | Y_1, ..., Y_5) = 0.5 were used.
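One way to picture this step is as a counting procedure with a fallback. The sketch below is a minimal illustration in Python, assuming that the probabilities are estimated as relative frequencies in the training cases. The text does not state the exact estimator, and the function name `learn_cancer_cpt` and the toy cases are hypothetical.

```python
from collections import Counter
from itertools import product

def learn_cancer_cpt(cases, n_parents=5, default=0.5):
    """Estimate P(Cancer=yes | Y1..Yn) by relative frequency.

    `cases` is an iterable of (parents, cancer) pairs, where `parents` is a
    tuple of booleans and `cancer` is a boolean. Parent configurations that
    never occur in the training cases receive `default`.
    """
    seen = Counter()
    positive = Counter()
    for parents, cancer in cases:
        seen[parents] += 1
        if cancer:
            positive[parents] += 1
    cpt = {}
    for config in product([True, False], repeat=n_parents):
        if seen[config]:
            cpt[config] = positive[config] / seen[config]
        else:
            cpt[config] = default  # no training case has this configuration
    return cpt

# Toy training cases, fabricated purely for illustration.
A = (True, False, True, True, False)
B = (False, False, False, False, False)
toy_cases = [(A, True), (A, False), (A, False), (A, False), (B, False)]
cpt = learn_cancer_cpt(toy_cases)
print(len(cpt), cpt[A], cpt[B], cpt[(True,) * 5])
```

Only P(Cancer = yes | ...) is stored; P(Cancer = no | ...) is its complement, so a default of 0.5 for one implies 0.5 for the other, matching the two default values given in the text.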

Different cross-validation methods have commonly been employed to evaluate the performance of statistical pattern recognition systems in general [30], and computer-assisted diagnosis schemes for mammography in particular [29,31]. Considering the size limitation of our database, a fivefold cross-validation (CV) framework, which has been used in our previous studies [32], was adopted for this experiment. The original database of 419 cases was divided randomly into five mutually exclusive partitions. Except for one partition that contained 63 negative and 16 positive cases, the partitions involved equal numbers of cases, each with 66 negative and 19 positive cases. A series of five experimental cycles was performed. In each cycle a different group of four partitions was used to train each network (deriving all 110 probability values), and the group of positive and negative cases in the remaining partition was subsequently used for testing. Thus, in these five experimental cycles, each partition was used for training in four cycles and for testing in one cycle. ROC curves were produced by combining the test results from the five experimental cycles, and the areas under these ROC curves (A_z values) were computed by using the program ROCFIT [33].
In this study we also investigated the relative contributions of image and non-image based features to the decision process. This involved comparing the performance changes attained when applying different methods to integrate all the features into a single decision outcome. Two methods for feature integration were compared in this study. First, we evaluated the performance of a network which incorporated both mammographic and non-mammographic based features into a single comprehensive Bayesian network.
Second, we produced a hybrid decision system in which separate Bayesian networks for mammographic and non-mammographic based features were created, and the outputs of these two networks were combined. Both a simple average of the outputs and a technique based on logistic regression for combining the outputs of the separate networks were tested. Specifically, we divided our original Bayesian network (as shown in Fig. 1) into two sub-networks. The first sub-network used only the non-mammographic based features and the breast cancer node, excluding the four mammographic features (i.e. architectural distortion, mass, microcalcification cluster, and asymmetry) from the network. In contrast, the second sub-network contained only the four mammographic based nodes and the breast cancer node. The fivefold cross-validation method was used to train and test these two sub-networks. The hybrid classifiers, which combined the results of the two sub-networks, were also tested. Areas under the ROC curves of the hybrid classifiers were compared to that from the comprehensive Bayesian network which utilized all features. This comparison was intended to indicate whether incorporating features in a single classifier yielded better performance than using a hybrid system of two, presumably independent, classifiers.
3. Results
Fig. 2 shows three ROC curves, computed from the detection results of a Bayesian belief network incorporating all 14 nodes (see Fig. 1) as well as from two sub-networks incorporating only image or non-image related nodes. Average areas under the three ROC curves were 0.89, 0.81, and 0.71, respectively. The 0.10 increase in the A_z value for the network utilizing the four image based features, as compared to the network utilizing non-image based features, suggests that the mammographic features made a greater contribution to the final diagnosis. Furthermore, using the combined feature set as the basis for a network yields a significant (P < 0.05) improvement (ΔA_z ≥ 0.08) in diagnostic performance over either of the individual sub-networks.
Fig. 2. The ROC curves of three Bayesian belief networks using a fivefold cross-validation testing method.
Fig. 3 compares the ROC curve for the results of a single Bayesian network using all features to two curves corresponding to the results from hybrid classifiers using either averaging or logistic regression to combine results from the sub-networks. The areas under these ROC curves, achieved by the hybrid classifiers, are A_z = 0.85 and A_z = 0.87, respectively. Although these values are higher than those yielded by either single sub-network, they are significantly (P < 0.05) lower than the performance of the single network that incorporates all features.
Fig. 3. ROC curve for the original Bayesian network (Fig. 1) compared to ROC curves for two hybrid classifiers, which were based on combining the outputs of sub-networks by averaging or through logistic regression.
4. Discussion
The higher performance achieved by the use of all features in our complete network, as compared to either of the two sub-networks, suggests that the classification potential of the image based features and that of the non-image based features are at least partially independent. Given this independence, the question arises as to whether there is a synergistic effect when both sets of features are used concurrently, as opposed to making separate decisions using each feature subset individually and then combining the two decisions. The degree of such an effect can be assessed quantitatively, as was done in this experiment, by comparing the areas under the ROC curves generated in the two scenarios. Because the combination of the two sub-networks is actually a special case of the more general complete network, to the extent that the complete network is optimal, it would not be possible for the combination of sub-networks to perform better than the complete network.
In the diagnosis of breast cancer, human observers (mammographers) consolidate information from different views of mammograms (i.e. mammograms from left and right breasts with cranio-caudal and mediolateral oblique views) and other sources of information such as the patient's clinical examination or history. In contrast, many current computer-assisted diagnosis schemes for breast cancer either deal with only a single type of information or process each individual type of information separately and then combine the individual results to form a final decision. Our study indicates a significantly better performance for the complete network, which coincides more closely with the diagnostic process of human observers. This suggests that a synergistic effect is indeed possible, although it is not proven by this study. Furthermore, the clinical importance of differences of the size found in this study depends on how the decision mechanism is ultimately incorporated into medical practice.
It is easy to appreciate the possibility of such an effect. Each sub-network has condensed all of its input parameters to a one-dimensional variable, and the combination of the two sub-networks represents the decision space as only a two-dimensional manifold. It is easy to contrive a decision process having only three input variables, each assuming two possible states, in which no two can be combined without completely neutralizing the decision process. Consider, for example, a process in which the input states (assuming each input variable takes a value of either zero or one) (1, 0, 0), (0, 1, 0), (0, 0, 1) and (1, 1, 1) correspond to a zero output and the remaining input combinations correspond to an output of one. Then combining any pair of input parameters into a sub-network (to produce a one-dimensional result) causes the overall decision process to be ineffective, since, if all input combinations are equally likely, the output is equally likely to be zero or one for every value of that pair. Thus, the existence of a synergistic effect is not surprising.
The A_z = 0.89 achieved by our combined network is close to the value reported by Kahn et al. [20]. A similar result (A_z = 0.89) has been previously reported for an artificial neural network which incorporated 14 input features [11]. Although a direct comparison of these results is dubious, since completely different methods and sets of cases were used in the studies, their consistency suggests that this is a level of performance that can reasonably be expected from these kinds of decision systems.
The Bayesian network in the previously reported study by Kahn et al. [20,21] was not automatically trained from sample cases; its probabilities were assigned either from statistical data available in published sources or directly by experts [20]. Our preliminary study has demonstrated that we can successfully apply machine learning methods to derive the required probabilities from a reasonably small training set. Although in this study we limited our network to relatively few nodes, this number can be increased as larger training sets become available.
Several encouraging results have been demonstrated in this investigation.
We must emphasize, however, that this was a very preliminary study, and it is unlikely that the simple network designs described here, which were based on a small set of features and small training sets, would be sufficient to yield any significant clinical utility. Further investigations of many of the issues discussed, including the selection and investigation of features as well as robustness of performance, are required. Nevertheless, we have demonstrated that it is feasible to use machine learning techniques to develop Bayesian networks for the diagnosis of breast cancer, and that even simple Bayesian networks, based on small training sets, can achieve performance levels which are comparable to those of more established paradigms [11,20,28]. We have also demonstrated that questions related to the possible partitioning of a network into sub-networks, for the purpose of alleviating the computational burden, merit further study.
Acknowledgements
The authors wish to thank the staff of Magee Womens Hospital for their extensive assistance in developing the dataset used in this study. This work is sponsored in part by grants CA77850 and CA62800 from the National Cancer Institute, National Institutes of Health.
References
[1] R.E. Bird, T.W. Wallace, B.C. Yankaskas, Analysis of cancers missed at screening mammography, Radiology 184 (1992).
[2] D.B. Kopans, The positive predictive value of mammography, Am. J. Radiol. 158 (1991).

[3] E.L. Thurfjell, K.A. Lernevall, A.S. Taube, Benefit of independent double reading in a population-based mammography screening program, Radiology 191 (1994).
[4] C.J. Vyborny, M.L. Giger, Computer vision and artificial intelligence in mammography, Am. J. Radiol. 162 (1994).
[5] C.J. Vyborny, Can computers help radiologists read mammograms?, Radiology 191 (1994).
[6] D.J. Getty, R.M. Pickett, C.J. D'Orsi, J.A. Swets, Enhanced interpretation of diagnostic images, Invest. Radiol. 23 (1988).
[7] J.A. Swets, D.J. Getty, R.M. Pickett, C.J. D'Orsi, S.E. Seltzer, B.J. McNeil, Enhancing and evaluating diagnostic accuracy, Med. Decis. Mak. 11 (1) (1991).
[8] C.J. D'Orsi, D.J. Getty, J.A. Swets, R.M. Pickett, S.E. Seltzer, B.J. McNeil, Reading and decision aids for improved accuracy and standardization of mammographic diagnosis, Radiology 184 (3) (1992).
[9] S.E. Seltzer, B.J. McNeil, C.J. D'Orsi, D.J. Getty, R.M. Pickett, J.A. Swets, Combining evidence from multiple imaging modalities: a feature-analysis method, Comput. Med. Imaging Graph. 16 (6) (1992).
[10] H.M. Cook, M.D. Fox, Application of expert systems to mammographic image analysis, Am. J. Physiol. Imag. 4 (1) (1989).
[11] Y. Wu, M.L. Giger, K. Doi, C.J. Vyborny, R.A. Schmidt, C.E. Metz, Artificial neural networks in mammography: application to decision making in the diagnosis of breast cancer, Radiology 187 (1993).
[12] C.E. Floyd Jr, J.Y. Lo, A.J. Yun, D.C. Sullivan, P.J. Kornguth, Prediction of breast cancer malignancy using an artificial neural network, Cancer 74 (11) (1994).
[13] J.Y. Lo, J.A. Baker, P.J. Kornguth, C.E. Floyd Jr, Computer-aided diagnosis of breast cancer: artificial neural network approach for optimized merging of mammographic features, Acad. Radiol. 2 (10) (1995).
[14] J.Y. Lo, J.A. Baker, P.J. Kornguth, J.D. Iglehart, C.E. Floyd Jr, Predicting breast cancer invasion with artificial neural networks on the basis of mammographic features, Radiology 203 (1) (1997).
[15] F.V. Jensen, An Introduction to Bayesian Networks, Springer Verlag, New York, NY.
[16] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo, CA.
[17] D.E. Heckerman, E.H. Shortliffe, From certainty factors to belief networks, Artif. Intell. Med. 4 (1992).
[18] D. Heckerman, Bayesian networks for data mining, Data Min. Knowl. Discov. 1 (1997).
[19] G.F. Cooper, Current research directions in the development of expert systems based on belief networks, Appl. Stoch. Models 5 (1989).
[20] C.E. Kahn, L.M. Roberts, K.A. Shaffer, P. Haddawy, Construction of a Bayesian network for mammographic diagnosis of breast cancer, Comput. Biol. Med. 27 (1997).
[21] C.E. Kahn, L.M. Roberts, K. Wang, D. Jenks, P. Haddawy, Preliminary investigation of a Bayesian network for mammographic diagnosis of breast cancer, Proc. Annu. Symp. Comput. Appl. Med. Care (1995).
[22] P. Haddawy, J. Jacobson, C.E. Kahn, Generating explanations and tutorial problems from Bayesian networks, Proc. Annu. Symp. Comput. Appl. Med. Care (1994).
[23] M. Henrion, M.J. Druzdzel, Qualitative propagation and scenario-based approaches to explanation of probabilistic reasoning, in: P.P. Bonissone, M. Henrion, L.N. Kanal, J.F. Lemmer (Eds.), Uncertainty in Artificial Intelligence 6, Elsevier, New York.
[24] H.J. Suermondt, G.F. Cooper, An evaluation of explanations of probabilistic inference, Comput. Biomed. Res. 26 (1993) 242.
[25] T.M. Mitchell, Machine Learning, WCB McGraw-Hill Company, Boston, MA, 1997 (Chapter 6) pp. 197.
[26] R.E. Neapolitan, Probabilistic Reasoning in Expert Systems, Wiley, New York, NY.
[27] G.F. Cooper, Probabilistic inference using belief networks is NP-hard, Technical Report 87-27, Medical Computer Science Group, Stanford University (1987).
[28] K.M. Harris, B.C. Good, J.L. Kong, D. Toma, D. Gur, Z.S. Ilkhanipour, M.J. Staiger, J.H. Oliver, P.W. Wintz, M.A. Ganott, C.A. Britton, W.H. Straub, Exploring computerized mammographic reporting with feedback, Proc. SPIE 1899 (1993).
[29] G.D. Tourassi, C.E. Floyd, The effect of data sampling on the performance evaluation of artificial neural networks in medical diagnosis, Med. Decis. Making 17 (1997).

[30] B. Efron, G. Gong, A leisurely look at the bootstrap, the jackknife, and cross-validation, Am. Stat. 37 (1983).
[31] W. Zhang, K. Doi, M.L. Giger, R.M. Nishikawa, R.A. Schmidt, Computerized detection of clustered microcalcifications in digital mammograms using a shift-invariant artificial neural network, Med. Phys. 21 (1994).
[32] R. Rymon, B. Zheng, Y.H. Chang, D. Gur, Incorporation of a set enumeration trees-based classifier into a hybrid computer-assisted diagnosis scheme for mass detection, Acad. Radiol. 5 (1998).
[33] C.E. Metz, H.B. Kronman, P.L. Wang, J.H. Shen, ROCFIT: A modified maximum likelihood algorithm for estimating a binormal ROC curve from confidence-rating data, University of Chicago, Chicago (1985).