Probabilistic Rough Set Approximations

Transcription

1 Probabilistic Rough Set Approximations Yiyu (Y.Y.) Yao Department of Computer Science University of Regina Regina, Saskatchewan, Canada S4S 0A2 Abstract Probabilistic approaches have been applied to the theory of rough set in several forms, including decision-theoretic analysis, variable precision analysis, and information-theoretic analysis. Based on rough membership functions and rough inclusion functions, we revisit probabilistic rough set approximation operators and present a critical review of existing studies. Intuitively, they are defined based on a pair of thresholds representing the desired levels of precision. Formally, the Bayesian decision-theoretic analysis is adopted to provide a systematic method for determining the precision parameters by using more familiar notions of costs and risks. Results from existing studies are reviewed, synthesized and critically analyzed, and new results on the decision-theoretic rough set model are reported. Key words: rough sets, approximation operators, decision-theoretic model, variable precision and parameterized models, probabilistic rough set models 1 Introduction In the standard rough set model proposed by Pawlak [26,27], the lower and upper approximations are defined based on the two extreme cases regarding the relationships between an equivalence class and a set. The lower approximation requires that the equivalence class is a subset of the set. For the upper approximation, the equivalence class must have a non-empty overlap with the set. A lack of consideration for the degree of their overlap unnecessarily limits the applications of rough sets and has motivated many researchers to investigate probabilistic generalizations of the theory [11,14,25,28 30,32,33,42 46,52,54,56 58,60,66,69]. Probabilistic approaches to rough sets have appeared in many forms, such as the decision-theoretic rough set model [52,54,56 58], the variable preci- Preprint submitted to Elsevier Science

2 sion rough set model [14,66,69], the Bayesian rough set model [11,34,35], information-theoretic analysis [3,41], probabilistic rule induction [6,8,13,19 21,31,37 39,64,65,67,68], and many related studies [44,45]. The extensive results increase our understanding of the theory. At the same time, it seems necessary to provide a unified and comprehensive framework so that those results can be put together into an integrated whole, rather than separated studies [54]. Most of the papers in this special issue aim at such a goal. The current paper focuses specifically on the issues of probabilistic approximations. The existing results are revisited and critically reviewed and new results are provided. Probabilistic rough set approximations can be formulated based on the notions of rough membership functions [28] and rough inclusion [30]. Both notions can be interpreted in terms of conditional probabilities or a posteriori probabilities. Threshold values, known as parameters, are applied to a rough membership function or a rough inclusion to obtain probabilistic or parameterized approximations. Three probabilistic models have been proposed and studied intensively. They are the decision-theoretic rough set model [52,54,56,58], the variable precision rough set model [14,66], and the Bayesian rough set model [11,34,35]. The main differences among those models are their different, but equivalent, formulations of probabilistic approximations and interpretations of the required parameters. The variable precision rough set model treats the required parameters as a primitive notion. The interpretation and the process of determining the parameters are based on rather intuitive arguments and left to empirical studies. There is a lack of theoretical and systematic studies and justifications on the choices of the threshold parameters. In fact, a solution to this problem was reported earlier in a decision-theoretic framework for probabilistic rough set approximations [52,56,58], based on the well established Bayesian decision procedure for classification [7]. Within the decision-theoretic framework, the required threshold values can be easily interpreted and calculated based on more concrete notions, such as costs and risks. Unfortunately, many researchers are still unaware of the decision-theoretic model and tend to estimate the parameters based on tedious trial-and-error approaches. By explicitly showing the connections of the two models in this paper, we hope to increase further understanding of the theoretical foundations of probabilistic approximations. The Bayesian rough set model [11,34,35] attempts to provide an alternative interpretation of the required parameters. The model is based on the Bayes rule that expresses the change from the a priori probability to the a posteriori probability, and a connection between classification and hypothesis verification. Under specific interpretations, the required parameters can be expressed in terms of various probabilities. It is not difficult to establish connections between the probabilities used in the Bayesian rough set model and the costs 2

3 used in the decision-theoretic model. Additional parameters are introduced in the Bayesian rough set model. There remains the question of how to interpret and determine the required parameters systematically. With the objective of bringing together existing studies on probabilistic rough set approximations in a unified and comprehensive framework, the rest of the paper is organized into four parts. In Section 2, we review the basic concepts of the standard rough set approximations. This establishes a basis and guidelines for various probabilistic generalizations. In Section 3, we examine the two fundamental notions of rough membership functions and rough inclusions. They serve as a foundation on which probabilistic rough set approximations can be developed. In Section 4, we critically review different formulations of probabilistic rough set approximations. In Section 5, we explicitly show the conditions on a loss function so that many specific classes of probabilistic rough set approximations introduced in Section 4 can be derived in the decisiontheoretic model. 2 Standard Rough Set Approximations Suppose U is a finite and nonempty set called the universe. Let E U U be an equivalence relation on U, i.e., E is reflexive, symmetric, and transitive. The basic building blocks of rough set theory are the equivalence classes of E. For an element x U, the equivalence class containing x is given by: [x] E = {y U xey}. (1) When no confusion arises, we also simply write [x]. The family of all equivalence classes is also known as the quotient set of U, and is denoted by U/E = {[x] x U}. It defines a partition of the universe, namely, a family of pairwise disjoint subsets whose union is the universe. For an equivalence relation, the pair apr = (U, E) is called an approximation space [26,27]. In the approximation space, we only have a coarsened view of the universe. Each equivalence class is considered as a whole granule instead of many individuals [51]. Equivalence classes are the elementary definable, measurable, or observable sets in the approximation space [26,27,50]. By taking unions of elementary definable sets, one can derive larger definable sets. The family of all definable sets contains the empty set, the whole set U, and is closed with respect to set complement, intersection, and union. It is an σ-algebra over U. Furthermore, σ(u/e) defines uniquely a topological space (U, σ(u/e)), in which σ(u/e) is the family of all open and closed sets [26]. 3

4 In general, the sigma-algebra σ(u/e) is only a subset of the power set 2 U. An interesting issue is therefore the representation of undefinable sets in 2 U σ(u/e) in terms of definable sets, in order to infer knowledge about undefinable sets. Similar to the interior and closure operators in topological spaces, one can define rough set approximation operators [26,27]. For a subset A U, its lower approximation is the greatest definable set contained in A, and its upper approximation is the least definable set containing A. That is, for A U, apr(a) = {X X σ(u/e), X A}, apr(a) = {X X σ(u/e), A X}. (2) In the study of rough set theory, one often uses the following equivalent definitions [49]: and apr(a) = {x U y U[xEy = y A]}, apr(a) = {x U y U[xEy, y A]}; (3) apr(a) = {x U [x] A}, apr(a) = {x U [x] A }. (4) An element is in the lower approximation of A if all of its equivalent elements are in A, and an element is in the upper approximation of A if at least one of its equivalent elements is in A. Definition given by equation (2) is referred to as the subsystem based definition. Definitions given by equations (3) and (4) are referred to as the element based definitions [53]. Equivalently, from equation (4), one can also have a granule based definition: apr(a) = {[x] U/E [x] A}, apr(a) = {[x] U/E [x] A }. (5) It provides a new interpretation of rough set approximations. The lower approximation is the union of equivalence classes that are subsets of A and the upper approximation is the union of equivalence classes that have a non-empty intersection with A. Let A c denote the complement of the set A. Some of the useful properties satisfied by the pair of approximation operators are summarized below [26,27,49]: for A, B U, 4

5 (L0) apr(a) = (apr(a c )) c, (U0) apr(a) = (apr(a c )) c ; (L1) apr(a B) = apr(a) apr(b), (U1) apr(a B) = apr(a) apr(b); (L2) apr(a B) apr(a) apr(b), (U2) apr(a B) apr(a) apr(b); (L3) A B = apr(a) apr(b), (U3) A B = apr(a) apr(b); (L4) apr(a) A, (U4) A apr(a); (L5) apr(a) = apr(apr(a)), (U5) apr(a) = apr(apr(a)); (L6) apr(a) = apr(apr(a)), (U6) apr(a) = apr(apr(a)); (L7) apr(a) = A A σ(u/e), (U7) apr(a) = A A σ(u/e). Properties (L0) and (U0) state that the lower and upper approximations are a pair of dual operators. Properties (L1) and (U1) show the distributivity of apr over set intersection, and apr over set union. Properties (L2) and (U2) state that the lower approximation operator is not necessarily distributive over set union, and the upper approximation operator is not necessarily distributive over set intersection. According to properties (L3)and (U3), both operators are monotonic with respect to set inclusion. By properties (L4) and (U4), a set lies within its lower and upper approximations. The next two pairs of properties state that the result of applying a consecutive approximation operators is the same as the result of the operator closest to A. Properties (L7) and (U7) state that a set and its approximations are the same if and only if the set is a definable set in σ(u/e). Given a subset A U, the universe can be divided into three disjoint regions, namely, the positive, the negative, and the boundary regions [26]: POS(A) = apr(a), NEG(A) = POS(A c ) = (apr(a)) c, BND(A) = apr(a) apr(a). (6) An element of the positive region POS(A) definitely belongs to A, an element of the negative region NEG(A) definitely does not belong to A, and an element of the boundary region BND(A) only possibly belongs to A. 5

6 The three regions and the approximation operators uniquely determine each other. One may therefore use any of the three pairs to represent a subset A U: (POS(A), POS(A) BND(A)) = (apr(a), apr(a)), (POS(A), BND(A)) = (apr(a), apr(a) apr(a)), (POS(A), NEG(A)) = (apr(a), (apr(a)) c ). Each of them makes explicit certain particular aspect of the approximations. The first pair is the most commonly used one, defining the lower and upper bounds within which lies the set A. It is related to the notions of the core and support of a fuzzy set. The second pair explicitly gives the boundary elements under the approximations. The third pair focuses on what is definitely in A, in contrast to what is definitely not in A. 3 Rough Membership Functions and Rough Inclusion Since the lower and upper approximations are dual operators, it is sufficient to consider one of them. According equations (3) and (4), generalized approximation operators can be introduced by relaxing the conditions: (LC) y U[xEy = y A], (SC) [x] A. For the logic condition (LC), one can use probabilistic versions. The results from graded modal logic [10,24], variable precision logic [22], and probabilistic logic [16] may be adopted. For the set-theoretic condition (SC), we may adopt the notion of the degree of set inclusion from many studies, such as approximate reasoning [46,61] and rough mereology [30]. 3.1 Rough membership functions The concept of rough membership functions is based on the generalization of the strict logic condition (LC) into a probabilistic version. More specifically, the rough membership value of an element x, with respect to a set A U, is typically defined in terms of a measure of the degree to which the logic condition (LC) is true. The notion of a rough membership function was explicitly introduced by Pawlak and Skowron [28], although it had been used and studied earlier by 6

7 Wong and Ziarko [43], Pawlak, Wong, and Ziarko [29], Yao, Wong and Lingras [58], Yao and Wong [56], and many authors. Let P : 2 U [0, 1] be a probability function defined on the power set 2 U, and E an equivalence relation on U. The triplet apr = (U, E, P) is called a probabilistic approximation space [29,43]. For a subset A U, its rough membership function is given by the conditional probability as follows: µ A (x) = P(A [x]), (7) Rough membership value of an element belonging to A is the probability of the element in A given that the element is in [x]. With the probabilistic interpretation of rough membership function, we will use µ A (x) and P(A [x]) interchangeably in subsequent discussions. For a finite universe, the rough membership function is typically computed by [28]: µ A (x) = A [x], (8) [x] where A denotes the cardinality of the set A. Rough membership functions satisfy the following properties [28,55]: (m1) µ U (x) = 1, (m2) µ (x) = 0, (m3) xey = µ A (x) = µ A (y), (m4) x A = µ A (x) 0, (m5) x A = µ A (x) 1, (m6) µ A (x) = 1 [x] A, (m7) µ A (x) > 0 [x] A, (m8) A B = µ A (x) µ B (x), (m9) µ A c(x) = 1 µ A (x), (m10) µ A B (x) = µ A (x) + µ B (x) µ A B (x), (m11) A B = = µ A B (x) = µ A (x) + µ B (x), (m12) max(0, µ A (x) + µ B (x) 1) µ A B (x) min(µ A (x), µ B (x)), (m13) max(µ A (x), µ B (x)) µ A B (x) min(1, µ A (x) + µ B (x)). Those properties easily follow from the properties of a probability function. While (m1)-(m8) show the properties of rough membership functions, (m9)- (m13) show the properties of set-theoretic operations with rough membership 7

8 functions. The property (m3) is particularly interesting, which shows that elements in the same equivalence class must have the same degree of membership. In other words, equivalent elements must have the same membership value. 3.2 Rough inclusion The concept of rough inclusion generalizes the set-theoretic condition (SC) in order to capture graded inclusion. The degree to which [x] is included in A depends on both the overlap and non-overlap parts of [x] and A. In the rough set theory literature, the notion of rough inclusion, introduced explicitly by Polkowski and Skowron [30], has been studied using other names, including relative degree of misclassification [66], majority inclusion relation [66], vague inclusion [33], inclusion degrees [46,60,61], and so on. Recall that the notion of rough membership functions is a generalization of the logic condition (LC). By the equivalence of the two conditions (LC) and (SC), we can extend the notion of a rough membership function to rough inclusion [30,33]. For the maximum membership value 1, we have [x] A, namely, [x] is a subset of A. For the minimum membership value 0, we have [x] A =, or equivalently [x] A c, namely, [x] is totally not a subset of A. For a value between 0 and 1, it may be interpreted as the degree to which [x] is a subset of A. Thus, one obtains a measure of graded inclusion of two sets [30,32,33,61]: v(b A) = A B. (9) A For the case where A =, we define v(b ) = 1, namely, the empty set is a subset of any set. Accordingly, the degree to which the equivalence class [x] is included in a set A is given by: v([a [x]) = [x] A. (10) [x] It follows that v(a [x]) = µ A (x). From the properties of a rough membership function, one can easily obtain the corresponding properties of rough inclusion. Similar to a rough membership function, the value of v(b A) can be interpreted as the conditional probability P(B A) that a randomly selected element from A belongs to B. There is a close connection between graded inclusion and fuzzy set inclusion [33,59]. In the development of the variable precision rough set model, Ziarko [66] used 8

9 an inverse measure of v called the relative degree of misclassification: c(b A) = 1 v(b A) = 1 A B. (11) A Bryniarski and Wybraniec-Skardowska [4] proposed to use a family of inclusion relations called context relations, indexed by a bounded and partially ordered set called rank set. The unit interval [0, 1] can be treated as a rank set. From a measure of graded inclusion, a context relation with respect to a value α [0, 1] can be defined by: α = {(A, B) v(b A) α}. (12) If one interprets v as a fuzzy relation on 2 U, the relation α may be interpreted as an α-cut of the fuzzy relation. The use of a complete lattice, or a rank set, corresponds to the study of L-fuzzy sets and L-fuzzy relations in the theory of fuzzy sets [15]. Many proposals have been made to generalize and characterize the notion of graded inclusion. Skowron and Stepaniuk [33] suggested that graded (vague) inclusion of sets may be measured by a function, v : 2 U 2 U [0, 1], with monotonicity regarding the first argument, namely, for A, B, C U, v(b A) (C A) for any B C. In this case, the function defined by equation (9) is an example of such a measure. Skowron and Polkowski [32] suggested new properties for rough inclusion, in addition to the monotonicity. The unit interval [0, 1] can also be generalized to a complete lattice in the definition of rough inclusion [30]. Rough inclusion is only an example for measuring degrees of inclusion in rough mereology. A more detailed discussion on rough mereology and related concepts can be found in [30,32]. Zhang and Leung [61], and Xu et al. [46] proposed a generalized notion of inclusion degree in the context of a partially ordered set. Let (L, ) be a partially ordered set. A function D : L L [0, 1] is called a measure of inclusion degree if it satisfies the following properties [46]: for a, b, c L, (i). 0 D(b a) 1, (ii). a b = D(b a) = 1, (iii). a b c = D(a b) D(a c), (iv). a b = D(a c) D(b c). Property (i) is the normalization condition. Property (ii) ensures that the degree of inclusion reaches the maximum value for the standard inclusion. Properties (iii) and (iv) state two types of monotonicity. When the concept of 9

10 inclusion degree is applied to the partially ordered set (2 U, ), we immediately obtain a rough inclusion [60]. 4 Probabilistic Rough Set Approximations The standard approximation operators ignore the detailed statistical information of the overlap of an equivalence class and a set [29,43]. By exploring such information, probabilistic approximation operators can be introduced. 4.1 Standard approximations as the core and the support of a fuzzy set According to properties (m6) and (m7) of rough membership functions, µ A (x) = 1 if and only if for all y U, xey implies y A, and µ A (x) > 0 if and only if there exists a y U such that xey and y A. A rough membership function µ A may be interpreted as a special kind of fuzzy membership function. Under this interpretation, it is possible to re-express the standard rough set approximations [28,29,43], and to establish their connection to the core and support of a fuzzy set [55], as follows: apr(a) = {x U µ A (x) = 1} =core(µ A ), apr(a) = {x U µ A (x) > 0} =support(µ A ). (13) That is, the lower and upper approximations of a set A are in fact the core and support of the fuzzy set µ A, respectively. In the theory of fuzzy sets, fuzzy set intersection and union are commonly defined in terms of a pair of t-norm and t-conorm [15]. Suppose fuzzy set complement is defined by 1 ( ). For a pair of t-norm and t-conorm, core and support of fuzzy sets satisfy the same properties of (L0)-(L6), and (U0)- (U6), except in which the lower approximation is replaced by the core, and the upper approximation by the support, respectively [55]. If the core and support are interpreted as a qualitative representation of a fuzzy set, one may conclude that the theories of fuzzy sets and rough sets share the same qualitative properties. 10

11 4.2 The 0.5 probabilistic approximations An attempt to use probabilistic information for approximations was suggested by Pawlak, Wong, Ziarko [29]. Their model is based essentially on the majority rule. An element x is put into the lower approximation of A if the majority of its equivalent elements [x] are in A. That is, apr 0.5 = {x U P(A [x]) > 0.5), apr 0.5 = {x U P(A [x]) 0.5). (14) The lower and upper 0.5 probabilistic approximation operators are dual to each other. The boundary region consists of those elements whose conditional probabilities are exactly 0.5, which represents maximal uncertainty. 4.3 Probabilistic approximations as the α-cuts of a fuzzy set The standard approximations and 0.5 probabilistic approximations use special points of the probability, namely, the two extreme points 0 and 1, and the middle point 0.5. By considering other values, Yao and Wong [56] introduced more general probabilistic approximations in the decision-theoretic model. In the theory of fuzzy sets, α-cut and strong α-cut are important notions [15]. For α [0, 1], the α-cut and strong α-cut are defined, respectively, by: (µ A ) α = {x U µ A (x) α}, (µ A ) α + = {x U µ A (x) > α}. (15) Using α-cuts, the standard rough set approximations can be expressed apr(a) = (µ A ) 1 and apr(a) = (µ A ) 0 +. The 0.5 probabilistic approximations can be expressed as apr 0.5 (A) = (µ A ) and apr 0.5 (A) = (µ A ) 0.5. For generalized probabilistic approximations, a pair of parameters α, β [0, 1] with α β are used. The condition α β ensures that the lower approximation is smaller than the upper approximation in order to be consistent with existing approximation operators. Yao and Wong [56] considered two separate cases, 0 β < α 1 and 0 β = α. For 0 β < α 1, the standard rough approximations are extended by the definition [56]: apr α = {x U P(A [x]) α), apr β = {x U P(A [x]) > β). (16) 11

12 For α = β 0, the 0.5 probabilistic approximations are extended by the definition [56]: apr α = {x U P(A [x]) > α), apr α = {x U P(A [x]) α). (17) For 0 < β α < 1, Wei and Zhang [42] suggested another version in which the lower approximation is defined by > and the upper approximation by instead. One advantage of their definition is that we do not need to have two separated cases. However, their definition cannot produce the standard rough set approximations. As will be shown in the next section, both versions are derivable from the decision-theoretic model if different tie-breaking criteria are used. With a pair of arbitrary α and β, the probabilistic approximation operators are not necessarily dual to each other. In order to obtain a pair of dual operators, we set β = 1 α. Then, the lower and upper probabilistic approximation operators are dual operators. Although the above formulation is motivated by the notion of α-cuts in fuzzy set theory, similar notions have in fact been considered in many fields, such as variable precision (probabilistic) logic [22], probabilistic modal logic [10], graded/fuzzy modal logic [24], and many others. The use of thresholds on probability values for making a practical decision is in fact a common method in many fields, such as pattern recognition and classification [7], machine learning [23], data mining [2], and information retrieval [40], to name just a few. Consider the condition α > β. Based on the definition of equation (16), the probabilistic rough set operators satisfy the following properties [56,66]: for 0 β < α 1, α (0, 1] and β [0, 1), (PL0) apr α (A) = (apr 1 α (A c )) c, (PU0) apr α (A) = (apr 1 α (A c )) c ; (PL1) apr α (A B) apr α (A) apr α (B), (PU1) apr β (A B) apr β (A) apr β (B); (PL2) apr α (A B) apr α (A) apr α (B), (PU2) apr β (A B) apr β (A) apr β (B); (PL3) A B = apr α (A) apr α (B), (PU3) A B = apr β (A) apr β (B); (PLU4) apr α (A) apr β (A); (PL5) apr α (A) = apr β (apr α (A)), (PU5) apr β (A) = apr α (apr β (A)); 12

13 (PL6) (PU6) (PL7) (PU7) apr α (A) = apr α (apr α (A)), apr β (A) = apr β (apr β (A)); apr α (A) = A A σ(u/e), apr β (A) = A A σ(u/e). They are counterparts of the properties (L0)-(L7) and (U0)-(U7) of the standard rough set approximation operators. As stated earlier, apr α and apr β are defined differently for the case when α = β, where similar properties can be obtained. For probabilistic approximation operators, one can have additional properties: for α, α (0, 1] and β, β [0, 1), (PL8) (PU8) (PL9) (PU9) α α = apr α (A) apr α (A), β β = apr β (A) apr β (A); apr(a) = apr 1 (A), apr(a) = apr 0 (A). Properties (PL8) and (PU8) show that both probabilistic approximation operators are monotonic decreasing with respect to the parameters α and β. Properties (PL9) and (PU9) establish the connection between probabilistic approximation operators and the standard approximation operators. 4.4 The variable precision and parameterized rough set models With the introduction of rough inclusion, the standard approximation space can be generalized to apr = (U, I, v), where I : U 2 U is an information function and v is a measure of rough inclusion [25,33]. The mapping [ ] that maps an element to its equivalence class is an instance of the information function I. By applying threshold values on a rough inclusion v, it is possible to derive variable precision or parameterized approximations [11,33,66] by generalizing equation (4) of the element based definition or equation (5) of the granule based definition. The variable precision rough set model is one of the well known such formulations [66]. In formulating the variable precision rough set model, Ziarko [66] used the relative degree of misclassification function c and the granule based definition of approximations. To be consistent with the previous and subsequent discussions, we present a slightly different, but equivalent, formulation based on the rough inclusion, v(a [x]) = A [x] [x] = P(A [x]), (18) 13

14 and the element based definition. As mentioned in the earlier discussion, based on the rough inclusion v, we can define different levels of set inclusion [4,66]: [x] α A v(a [x]) α P(A [x]) α, (19) where α (0, 1]. When defining the lower approximation, the majority requirement of the variable precision rough set model suggests that more than 50% of elements in an equivalence class [x] must be in A in order for x to be in the lower approximation. In other words, the set-theoretic condition (SC) must hold to a degree greater than 0.5. We need to choose the threshold value α in the range (0.5, 1]. By generalizing equation (4), the α-level lower approximation is given by: for α (0.5, 1], apr α (A) = {x U [x] α A} {x U P(A [x]) α}. (20) The corresponding upper approximation is defined based on the dual of the lower approximation: apr 1 α (A)=(apr α (A c )) c = {x U P(A [x]) > 1 α}. (21) The condition 0.5 < α 1 implies 0 1 α < 0.5. It follows that the lower approximation is a subset of the upper approximation. The pair of parameters (α, 1 α) is referred to as the symmetric bounds, as it produces a pair of dual approximation operators (apr α, apr 1 α ). Variable precision rough sets with asymmetric bounds were examined by Katzberg and Ziarko [14]. It is only required that 0 β < α 1, where β is used to define the upper approximation and α is used to define the lower approximation, as defined in equation (16). Although apr α and apr β are not necessarily dual to each other, Yao and Wong [56] showed that the two pairs of operators, (apr α, apr β ) and (apr 1 β, apr 1 α ) are complement to each other. For the special rough inclusion v(a [x]) = P(A [x]), the probabilistic approximations from the decision-theoretic model and the variable precision model are equivalent. The main differences are their formulations. The decision-theoretic model systematically derives many types of approximation operators and provides theoretical guidelines for the estimation of required parameters, while the variable precision model relies much on intuitive arguments. 14

15 Ślȩzak and Ziarko [35] and Ślȩzak [34] introduced the Bayesian rough set model in an attempt to provide an alternative interpretation of the required parameters in the variable precision rough set model. By setting the parameters as the a priori probabilities, a pair of probabilistic approximations is defined by [35]: for A U, apr P(A) (A) = {x U P(A [x]) > P(A)}, apr P(A) (A) = {x U P(A [x]) P(A)}. (22) They correspond to the case where α = β = P(A). Ślȩzak [34] examined a more complicated version of Bayesian rough set approximations by comparing probabilities P([x] A) and P([x] A c ): bapr δ1 (A) = {x U P([x] A) δ 1 P([x] A c )}, bapr δ2 (A) = {x U P([x] A) δ 2 P([x] A c )}, (23) where δ 1 and δ 2 are parameters. Based on the Bayes rule, one can easily find their corresponding variable precision approximations [34]. The corresponding parameters of the variable precision approximations are expressed in terms the probability P(A), δ 1 and δ 2. In comparison with the variable precision model, the new parameters of the Bayesian rough set model are less intuitive and their estimation becomes a challenge. Greco, Matarazzo and S lowiński [11,12] observed that rough membership functions and rough inclusions, as defined by the conditional probabilities P(A [x]), consider the overlap of A and [x] and do not explicitly consider the overlap of A and [x] c. By considering both overlaps, they introduced a relative rough membership function: A [x] ˆµ A (x) = A [x]c [x] [x] c =P(A [x]) P(A [x] c ). (24) The relative rough membership function is an instance of a class of measures known as the Bayesian confirmation measures [9]. By incorporating a confirmation measure to the existing parameterized models, they proposed two-parameterized approximations [11,12]: apr α,a = {x U P(A [x]) α and bc([x], A) a}, apr β,b = {x U P(A [x]) > β or bc([x], A) > b}, (25) where bc( ) is a Bayesian confirmation measure, and a and b are parameters in 15

16 the range of bc( ). They have shown that the variable precision and Bayesian rough set models are special cases. More details of the two-parameterized approximation models can be found in their paper in this issue [11]. The extra parameters may make the model more effective, which at the same time leads to more difficulties in estimating those parameters. 5 The Decision-Theoretic Rough Set Model A fundamental difficulty with the probabilistic, variable precision, and parameterized approximations introduced in the last section is the physical interpretation of the required threshold parameters, as well as systematic methods for setting the parameters. This difficulty has in fact been resolved in the decision-theoretic model of rough sets proposed earlier [52,56,58]. This section reviews and summarizes the main results of the decision-theoretic framework and its connections to other studies. It draws extensive results from two previous papers [52,56] on the one hand and re-interprets these results on the other. For clarity, we only consider the element based definition of probabilistic approximation operators. The same argument can be easily applied to other cases. 5.1 An overview of the Bayesian decision procedure Bayesian decision procedure deals mainly with making decision with minimum risk or cost under probabilistic uncertainty. We present an overview by following the discussion in the textbook by Duda and Hart [7], in which more detailed information can be found. Let Ω = {w 1,...,w s } be a finite set of s states, and let A = {a 1,...,a m } be a finite set of m possible actions. Let P(w j x) be the conditional probability of an object x being in state w j given that the object is described by x. Without loss of generality, we simply assume that these conditional probabilities P(w j x) are known. Let λ(a i w j ) denote the loss, or cost, for taking action a i when the state is w j. For an object with description x, suppose action a i is taken. Since P(w j x) is the probability that the true state is w j given x, the expected loss associated with taking action a i is given by: s R(a i x) = λ(a i w j )P(w j x). (26) j=1 16

17 The quantity R(a i x) is also called the conditional risk. Given description x, a decision rule is a function τ(x) that specifies which action to take. That is, for every x, τ(x) assumes one of the actions, a 1,...,a m. The overall risk R is the expected loss associated with a given decision rule. Since R(τ(x) x) is the conditional risk associated with action τ(x), the overall risk is defined by: R = x R(τ(x) x)p(x), (27) where the summation is over the set of all possible descriptions of objects, i.e., the knowledge representation space. If τ(x) is chosen so that R(τ(x) x) is as small as possible for every x, the overall risk R is minimized. The Bayesian decision procedure can be formally stated as follows. For every x, compute the conditional risk R(a i x) for i = 1,..., m defined by equation (26), and then select the action for which the conditional risk is minimum. If more than one action minimizes R(a i x), any tie-breaking rule can be used. 5.2 Probabilistic rough set approximations In an approximation space apr = (U, E), all elements in the equivalence class [x] share the same description [27,43]. For a given subset A U, the approximation operators partition the universe into three disjoint classes POS(A), NEG(A), and BND(A). Furthermore, one decides how to assign x into the three regions based on the conditional probability P(A [x]). It follows that the Bayesian decision procedure can be immediately applied to solve this problem [52,56,58]. For deriving the probabilistic approximation operators, we have the following problem. The set of states is given by Ω = {A, A c } indicating that an element is in A and not in A, respectively. We use the same symbol to denote both a subset A and the corresponding state. With respect to three regions, the set of actions is given by A = {a 1, a 2, a 3 }, where a 1, a 2, and a 3 represent the three actions in classifying an object, namely, deciding POS(A), deciding NEG(A), and deciding BND(A), respectively. Let λ(a i A) denote the loss incurred for taking action a i when an object in fact belongs to A, and let λ(a i A c ) denote the loss incurred for taking the same action when the object does not belong to A. The rough membership values µ A (x) = P(A [x]) and µ A c(x) = P(A c [x]) = 1 P(A [x]) are in fact the probabilities that an object in the equivalence class [x] belongs to A and A c, respectively. The expected loss R(a i [x]) associated with taking the individual actions can be expressed as: 17

18 R(a 1 [x]) =λ 11 P(A [x]) + λ 12 P(A c [x]), R(a 2 [x]) =λ 21 P(A [x]) + λ 22 P(A c [x]), R(a 3 [x]) =λ 31 P(A [x]) + λ 32 P(A c [x]), (28) where λ i1 = λ(a i A), λ i2 = λ(a i A c ), and i = 1, 2, 3. The Bayesian decision procedure leads to the following minimum-risk decision rules: (P) (N) (B) If R(a 1 [x]) R(a 2 [x]) and R(a 1 [x]) R(a 3 [x]), decide POS(A); If R(a 2 [x]) R(a 1 [x]) and R(a 2 [x]) R(a 3 [x]), decide NEG(A); If R(a 3 [x]) R(a 1 [x]) and R(a 3 [x]) R(a 2 [x]), decide BND(A). Tie-breaking rules should be added so that each element is classified into only one region. Since P(A [x]) + P(A c [x]) = 1, the above decision rules can be simplified so that only the probabilities P(A [x]) are involved. We can classify any object in the equivalence class [x] based only on the probabilities P(A [x]), i.e., the rough membership values, and the given loss function λ ij, i = 1, 2, 3 and j = 1, 2. Consider a special kind of loss functions with λ 11 λ 31 < λ 21 and λ 22 λ 32 < λ 12. That is, the loss of classifying an object x belonging to A into the positive region POS(A) is less than or equal to the loss of classifying x into the boundary region BND(A), and both of these losses are strictly less than the loss of classifying x into the negative region NEG(A). The reverse order of losses is used for classifying an object that does not belong to A. For this type of loss functions, the minimum-risk decision rules (P)-(B) can be written as: (P) (N) (B) If P(A [x]) γ and P(A [x]) α, decide POS(A); If P(A [x]) β and P(A [x]) γ, decide NEG(A); If β P(A [x]) α, decide BND(A); where λ 12 λ 32 α = (λ 31 λ 32 ) (λ 11 λ 12 ), λ 12 λ 22 γ = (λ 21 λ 22 ) (λ 11 λ 12 ), 18

19 λ 32 λ 22 β = (λ 21 λ 22 ) (λ 31 λ 32 ). (29) By the assumptions, λ 11 λ 31 < λ 21 and λ 22 λ 32 < λ 12, it follows that α (0, 1], γ (0, 1), and β [0, 1). If a loss function with λ 11 λ 31 < λ 21 and λ 22 λ 32 < λ 12 further satisfies the condition: (λ 12 λ 32 )(λ 21 λ 31 ) (λ 31 λ 11 )(λ 32 λ 22 ), (30) then α γ β. The condition ensures that probabilistic rough set approximations are consistent with the standard rough set approximations. In other words, the lower approximation is a subset of the upper approximation, and the boundary region may be non-empty. The physical meaning of condition (30) may be interpreted as follows. Let l = (λ 12 λ 32 )(λ 21 λ 31 ) and r = (λ 31 λ 11 )(λ 32 λ 22 ). While l is the product of the differences between the cost of making an incorrect classification and cost of classifying an element into the boundary region, r is the product of the differences between the cost of classifying an element into the boundary region and the cost of a correct classification. A larger value of l, or equivalently a smaller value of r, can be obtained if we move λ 32 away from λ 12, or move λ 31 away from λ 21. In fact, the condition can be intuitively interpreted as saying that cost of classifying an element into the boundary region is closer to the cost of a correct classification than to the cost of an incorrect classification. Such a condition seems to be reasonable. When α > β, we have α > γ > β. After tie-breaking, we obtain the decision rules: (P1) (N1) (B1) If P(A [x]) α, decide POS(A); If P(A [x]) β, decide NEG(A); If β < P(A [x]) < α, decide BND(A). Based on the relationship between approximations and the three regions, we obtain the probabilistic approximations: apr α (A) = {x U P(A [x]) α}, apr β (A) = {x U P(A [x]) > β}. When α = β, we have α = γ = β. In this case, we use the decision rules: (P2) If P(A [x]) > α, decide POS(A); 19

20 (N2) (B2) If P(A [x]) < α, decide NEG(A); If P(A [x]) = α, decide BND(A). For the second set of decision rules, we use a tie-breaking criterion so that the boundary region may be non-empty. Probabilistic approximations can be obtained, which is similar to the 0.5 probabilistic approximations introduced by Pawlak, Wong, Ziarko [29]. As an example to illustrate the probabilistic approximations, consider a loss function: λ 12 = λ 21 = 4, λ 31 = λ 32 = 1, λ 11 = λ 22 = 0. (31) It states that there is no cost for a correct classification, 4 units of cost for an incorrect classification, and 1 unit cost for classifying an object into the boundary region. From equation (29), we have α = 0.75, β = 0.25 and γ = 0.5. By decision rules (P1)-(B1), we have a pair of dual approximation operators apr 0.75 and apr In general, the relationships between a loss function λ and the pair of parameters (α, β) can be established. For a loss function with λ 11 λ 31 < λ 21 and λ 22 λ 32 < λ 12, we have [52]: α is monotonic non-decreasing with respect to λ 12 and monotonic nonincreasing with respect to λ 32. If λ 11 < λ 31, α is strictly monotonic increasing with respect to λ 12 and strictly monotonic decreasing with respect to λ 32. α is strictly monotonic decreasing with respect to λ 31 and strictly monotonic increasing with respect to λ 11. β is monotonic non-increasing with respect to λ 21 and monotonic nondecreasing with respect to λ 31. If λ 22 < λ 32, β is strictly monotonic decreasing respect to λ 21 and strictly monotonic increasing with respect to λ 31. β is strictly monotonic increasing with respect to λ 32 and strictly monotonic decreasing with respect to λ 22. Such connections between the required parameters of probabilistic rough set approximations and loss functions have significant implications in applying the decision-theoretic model of rough sets. For example, if we increase the cost of an incorrect classification λ 12 and keep other costs unchanged, the value α would not be decreased. Parameters α and β are determined from a loss function. One may argue that a loss function may be considered as a set of parameters. However, in contrast to the standard threshold values, they are not abstract notions, but have an intuitive interpretation. One can easily interpret and measure loss or cost in a real application. In fact, the results 20