ONDERZOEKSRAPPORT (RESEARCH REPORT) NR 9306

Factoring Decision Table Knowledge using Normalization Theory

by
Jan VANTHIENEN
Monique SNOECK

D/1993/2376/6
FACTORING DECISION TABLE KNOWLEDGE USING NORMALIZATION THEORY

J. VANTHIENEN
M. SNOECK

Katholieke Universiteit Leuven
Department of Applied Economic Sciences
Dekenstraat 2, 3000 Leuven (Belgium)

ABSTRACT

Large decision tables can be transformed into a structure of tables and subtables. Although it can be left to the designer to evaluate when and how a decision table should be transformed into a structure of related decision tables, it is preferable to rely on a uniform theoretical basis for structuring decision tables rather than on the preferences and experience of the decision table builder. A major advantage is that, if the structuring process is governed by clearly stated rules, automation of this process becomes feasible. In this paper a structuring process is proposed that is equivalent to the normalization rules of relational database design, based on the correspondence between functional dependencies in database design and logical implication.

Index Terms: Decision Modeling, Decision Tables, Decision Making, Knowledge Acquisition, Knowledge Representation, Normalization, Software Engineering, Systems Analysis and Design
KNOWLEDGE FACTORING

1. PROBLEM DESCRIPTION

Large decision tables can be transformed (factored) into a structure of dependent tables and subtables. Although it can be left to the user to evaluate when and how a decision table should be transformed into a structure of dependent decision tables, it is preferable to use a well-founded technique to investigate whether and how a decision table can be factored.

The main reason to factor a decision table is the same as for a database relation: to avoid redundancy and conflicting information (due to update anomalies), thereby enhancing the consistency, efficiency and maintainability of the knowledge stored in the tables. Even when it is possible to factor a decision table, doing so might not be desirable, as the specific advantages of the decision table representation may be lost in the factoring process.

In addition to the argument of redundancy and conflict avoidance, there are four other reasons to factor a decision table [9]. A decision table can/should be factored:

1) To avoid large, unmanageable and complicated decision tables
Decision tables tend to grow rapidly as the number of relevant conditions increases. For an expanded decision table, the number of columns is the product of the numbers of possible condition states of all conditions. For example, a table with 8 conditions of two states each will contain 2^8 = 256 columns in the expanded table.

2) To simplify complex decision situations and increase intelligibility
Sometimes actions or conditions can be contracted into a single action or condition at a higher level of abstraction, in which case an action subtable (respectively a condition subtable) is obtained.
3) To increase modularity and flexibility
Small changes in the decision logic can be localized in one or more small tables, which are easier to maintain.

4) Because some conditions require a preliminary action
Sometimes sequence restrictions exist between conditions and actions. Actions can, for example, initialize certain values that are needed in conditions. Representing these bound conditions and actions in a single table mixes up conditions and actions and results in poor readability of the decision table.

In this paper a structuring process is proposed, based on the equivalence between functional dependencies in database design and (a subset of) propositional logic [4, 7]. Although the analogy between decision table knowledge and database dependencies is striking, there are some major differences. In the case of a decision table, the functional dependencies themselves are stored in the database and may be the subject of updates. In contrast, the functional dependencies for database relations form an implicit set of rules which must be enforced when the database is updated, but they are not subject to change themselves. In spite of these important differences, the normalization rules of database design provide an excellent guideline for the factoring of decision tables. Because of the correspondence between logical implication and functional dependency, the rules for normalization of database relations can be translated into rules for factoring decision tables into subtables.

The following section gives a short introduction to decision table theory. Next, the analogy with database relations is established. Before the normalization rules for decision tables are formulated in the fifth section, a short review of database relation normalization is given.
2. KNOWLEDGE REPRESENTATION USING DECISION TABLES

To make meaningful use of decision tables possible, the decision table has to be defined clearly and must meet the important requirements of consistency and completeness. For these purposes, the decision table is defined as a function.

- $CS = \{CS_i\}$ $(i = 1..cnum)$ is the set of condition subjects;
- $CD = \{CD_i\}$ $(i = 1..cnum)$ is the set of condition domains, with $CD_i$ the domain of condition $i$, i.e. the set of all possible values of condition subject $CS_i$;
- $CT = \{CT_i\}$ $(i = 1..cnum)$ is the set of condition state sets, with $CT_i = \{S_{i,k}\}$ $(k = 1..n_i)$ an ordered set of $n_i$ condition states $S_{i,k}$. Each condition state $S_{i,k}$ is a logical expression concerning the elements of $CD_i$ that determines a subset of $CD_i$, such that the set of all these subsets constitutes a partition of $CD_i$ (completeness and exclusivity of the condition states);
- $AS = \{AS_j\}$ $(j = 1..anum)$ is the set of action subjects;
- $AV = \{AV_j\}$ $(j = 1..anum)$ is the set of action value sets, with $AV_j = \{\text{true } (x), \text{false } (-)\}$ (see footnote 1) the set of action values, which is, in the first instance, equal for every action subject, for reasons of consistency checking.

The (expanded) decision table is a relation between the Cartesian product of the condition states $CT_1 \times \ldots \times CT_{cnum}$ and the Cartesian product of the action values $AV_1 \times \ldots \times AV_{anum}$. More particularly, as each combination of condition values must be mapped to at most one action configuration, the decision table can be defined as a function from $CT_1 \times \ldots \times CT_{cnum}$ to $AV_1 \times \ldots \times AV_{anum}$.

1. Sometimes the value nil (.) is included in the set of action values, but in this paper this value is omitted to avoid confusion with null values in database theory, without however affecting the generality of the discussion.
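The definition above can be made concrete with a small sketch. The condition subjects, their states and the two rules below are invented purely for illustration; the only ideas taken from the text are that the columns of the expanded table are the Cartesian product of the condition state sets, that their number is the product of the state counts, and that the table maps each combination to one action configuration.

```python
from itertools import product
from math import prod

# Hypothetical condition subjects with their ordered condition states.
condition_states = {
    "C1": ["Y", "N"],
    "C2": ["Y", "N"],
    "C3": ["a", "b", "c"],
}


def expand(states):
    """Return every column of the expanded table: CT_1 x ... x CT_cnum."""
    return list(product(*states.values()))


def is_complete(table, states):
    """A dict has one entry per key, so columns are exclusive by construction;
    the table is complete when every state combination is present."""
    return set(table) == set(expand(states))


# Illustrative rules (assumptions): A1 is executed when C1 = Y, A2 when C3 = a.
table = {
    combo: ("x" if combo[0] == "Y" else "-", "x" if combo[2] == "a" else "-")
    for combo in expand(condition_states)
}

# Number of columns = product of the number of states of every condition.
n_columns = prod(len(s) for s in condition_states.values())
```

Representing the table as a dictionary keyed by state combinations is one possible design choice: it enforces the "at most one action configuration per combination" property for free, so only completeness needs an explicit check.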
In this definition, the decision table concept is deliberately restricted to the so-called single-hit table, where columns are mutually exclusive. Only this type of table allows easy checking for consistency and completeness. If each column contains only simple states (no contractions or irrelevant conditions), the table is called an expanded decision table (canonical form); otherwise the table is called a contracted decision table (consolidated form). The translation from one form to the other is defined as expansion (rule expansion) and contraction (consolidation) respectively [1].

3. THE DECISION TABLE AS DATABASE RELATION

The decision table can be seen as a set of ordered n-tuples $(ct_1, \ldots, ct_{cnum}, av_1, \ldots, av_{anum})$, with $ct_i \in CT_i$ and $av_j \in AV_j$, that can be represented as a relational table. The attributes of the relation are the condition subjects and action subjects, while the domains of the relation correspond with the condition state sets $CT_1, \ldots, CT_{cnum}$ and the action value sets $AV_1, \ldots, AV_{anum}$. The attribute values are the condition states and action values. The number of tuples in the relation equals the cardinality of the condition space CR, because of the demand of completeness and exclusivity. The degree of the relation equals the number of conditions plus the number of actions.

The relational table, as representation of a decision table, has the following characteristics [2]:
- each row represents a column of the decision table;
- the rows do not have any particular order (but some orderings are more useful);
- all rows are distinct (exclusivity);
- the order of the columns (conditions and actions) is not important to the description of the problem at the logical level (unless a certain order has to be respected at execution time because of side effects, for instance an ordering of the actions);
- the meaning of each column is explained through a named domain as heading (condition or action subject);
- at each row/column position of the table, an attribute value (a condition state or an action value, possibly "nil") is found, and not a set of values.

It is clear that such a relational table is identical to the transposed expanded decision table, so that the rows correspond with the columns of the decision table and vice versa. The identity is formal and does not refer to the utility of both representation methods. This becomes important when the dimensions increase and the table must be split.

Since every condition combination occurs precisely once in the relation, the condition attribute values uniquely identify the n-tuples in the relation (candidate key). It is, indeed, the intention of the decision table to indicate which actions should be executed for a given combination of conditions. So the set of condition attributes is defined as the primary key. The action attributes can then be designated as the non-key attributes. A combination of non-key attributes (actions) that is part of the primary key of another table, and so refers to that table (foreign key), corresponds with a condition assignment as action in a condition subtable (called a condition reference).

By analogy with dependencies in relational tables, the relationship between conditions and actions (or possibly the interrelationship between conditions or actions) in decision tables can be expressed as a cause-effect relation. Such a logical if...then... relation corresponds with the
implication in propositional logic, which is equivalent to the "dependency statement" for functional dependencies. The decision table, being a set of implications, can therefore be described in terms of functional dependencies. In the first instance, it is especially the dependency between conditions and actions that is important, so that functional dependency can be defined as follows:

Given a decision table DT with conditions $C_i$ $(i = 1..cnum)$ and actions $A_j$ $(j = 1..anum)$, and $X$, $Y$ subsets of respectively the condition set and the action set of DT. $Y$ is functionally dependent on $X$ (notation: $X \rightarrow Y$) if with every combination of $X$-values in DT corresponds one and only one configuration of $Y$-values.

Since every combination of condition states occurs at most once, each action is functionally dependent on the complete condition set (primary key).

The formal correspondence between the decision table and the relational table is given in figure 2.

Decision table          Relational table
---------------------   ----------------------
condition (row)         key attribute (column)
condition states        key domain
condition reference     foreign key
action                  non-key attribute
action value            non-key domain
stub                    heading
number of rows          degree
entry                   attribute value
column                  n-tuple
number of columns       cardinality

Figure 2: terminology of the decision table and the relational table

Since the decision table is equivalent to the relational table, the relational technique (in the form of a relational DBMS) can be used to construct the decision table. This means that both
the physical storage and the construction and manipulation of the decision table can be partly executed through the relational structure and operators (relational algebra) [8].

4. FUNCTIONAL DEPENDENCIES AND NORMALIZATION

4.1. Functional dependencies: definition

To define functional dependency [7], we make use of a projection of tuples on a subset of the set of attribute types (column selection). The selection of one attribute type is defined as:

$\pi_{D_i} : R(D_1, \ldots, D_n) \rightarrow D_i : (d_1, \ldots, d_n) \rightarrow (d_i)$

If more than one attribute type is selected and $X$ is the subset of $\{D_1, \ldots, D_n\}$ that indicates the attributes to select, then $\pi_X$ is defined as follows:

$\pi_X : R(D_1, \ldots, D_n) \rightarrow D_{i_1} \times \ldots \times D_{i_k} : (d_1, \ldots, d_n) \rightarrow (d_{i_1}, \ldots, d_{i_k})$ where $X = \{D_{i_1}, \ldots, D_{i_k}\}$

Given a relation $R(D_1, \ldots, D_n)$ and $X$, $Y$ subsets of $\{D_1, \ldots, D_n\}$, then $Y$ is functionally dependent on $X$ (denoted as $X \rightarrow Y$) if every two tuples in $R$ having the same $X$-values have the same $Y$-values at any one time. More formally:

$\forall t_1, t_2 \in R : \pi_X(t_1) = \pi_X(t_2) \Rightarrow \pi_Y(t_1) = \pi_Y(t_2)$

$Y$ is fully functionally dependent on $X$ if $X \rightarrow Y$ and there is no proper subset of $X$ on which $Y$ is functionally dependent:

$X \rightarrow Y$ and $\neg(\exists X' \subset X, X' \neq X : X' \rightarrow Y)$

$Y$ is non-transitively dependent on $X$ if $X \rightarrow Y$ and if there is no set of attributes $S \subset \{D_1, \ldots, D_n\}$ for which holds that $X \rightarrow S$ and $S \rightarrow Y$, unless also $S \rightarrow X$.
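The definitions of functional and full functional dependency above translate almost literally into a check over the tuples of a relation. The following is a minimal sketch, assuming a relation is stored as a list of dictionaries; the attribute names D1, D2, D3 and the sample tuples are hypothetical.

```python
from itertools import combinations


def project(row, attrs):
    """pi_X applied to a single tuple: keep only the selected attributes."""
    return tuple(row[a] for a in attrs)


def holds(relation, X, Y):
    """X -> Y: any two tuples that agree on X also agree on Y."""
    seen = {}
    for row in relation:
        x, y = project(row, X), project(row, Y)
        if seen.setdefault(x, y) != y:
            return False
    return True


def fully_dependent(relation, X, Y):
    """X -> Y holds and no proper subset of X (the empty set included)
    already determines Y."""
    if not holds(relation, X, Y):
        return False
    return not any(
        holds(relation, subset, Y)
        for size in range(len(X))
        for subset in combinations(X, size)
    )


# Hypothetical relation in which D3 depends on D1 alone.
R = [
    {"D1": 1, "D2": "a", "D3": "p"},
    {"D1": 1, "D2": "b", "D3": "p"},
    {"D1": 2, "D2": "a", "D3": "q"},
    {"D1": 2, "D2": "b", "D3": "q"},
]
```

In this sample, `holds(R, ["D1", "D2"], ["D3"])` is true but the dependency is not full, because `["D1"]` alone already determines `D3`. Note that the check only inspects the tuples currently present, whereas the definition speaks of the relation "at any one time".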
$Y$ is multi-valued functionally dependent on $X$ ($X \rightarrow\rightarrow Y$) if the set of $Y$-values that matches a given ($X$-value, $Z$-value) pair ($Z \subseteq \{D_1, \ldots, D_n\} \setminus X$) in $R$ depends only on the $X$-value and is independent of the $Z$-value.

4.2. Normalization

In order to avoid update anomalies, database relations should be normalized. The normalization procedure decomposes a single relation into a set of relations in such a way that the decomposition is reversible. This reversibility is important because it means that no information is lost during the normalization process.

A database relation is in [2][4][5]:

First Normal Form: if and only if the domain of every attribute is a set of atomic values. This means that an attribute value cannot be a set or a repeating group.

Second Normal Form: if and only if it is in First Normal Form and every non-key attribute is fully functionally dependent on the primary key.

Third Normal Form: if and only if it is in Second Normal Form and every non-key attribute is non-transitively dependent on the primary key.

We do not discuss the fourth and fifth normal forms, as they do not have a meaningful equivalent for decision tables.
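The reversibility requirement stated at the start of section 4.2 can be illustrated by projecting a relation onto two attribute subsets and joining the projections again. The sketch below is only an illustration: the relation (orders, products, prices) and the two decompositions are hypothetical, and projection and natural join are written out by hand rather than taken from a DBMS.

```python
def project(relation, attrs):
    """Projection with duplicate elimination; each row becomes a frozenset
    of (attribute, value) pairs so that rows can be stored in a set."""
    return {frozenset((a, row[a]) for a in attrs) for row in relation}


def natural_join(left, right):
    """Combine every pair of rows that agree on their common attributes."""
    joined = set()
    for l_row in left:
        for r_row in right:
            merged = dict(l_row)
            if all(merged.get(a, v) == v for a, v in r_row):
                merged.update(r_row)
                joined.add(frozenset(merged.items()))
    return joined


# Hypothetical relation with key (order, product) and product -> price,
# i.e. price is not fully dependent on the key (a 2NF violation).
R = [
    {"order": "o1", "product": "pen", "price": 2},
    {"order": "o1", "product": "ink", "price": 5},
    {"order": "o2", "product": "pen", "price": 2},
    {"order": "o2", "product": "cup", "price": 5},
]
original = project(R, ["order", "product", "price"])

# Decomposition along the dependency product -> price: reversible.
lossless = natural_join(project(R, ["order", "product"]),
                        project(R, ["product", "price"]))

# Arbitrary decomposition that ignores the dependency: spurious tuples appear.
lossy = natural_join(project(R, ["order", "price"]),
                     project(R, ["product", "price"]))
```

The first join reproduces the original relation exactly, while the second returns a strict superset of it, which is the loss of information that normalization along functional dependencies is meant to avoid.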
5. NORMALIZATION RULES FOR DECISION TABLES

In the case of actions and conditions, the functional dependency can be translated into a logical implication in a straightforward way: if action $A_i$ is functionally dependent on condition subject $CS_j$, then $CS_j$ logically implies $A_i$. In both cases the notation is the same: $CS_j \rightarrow A_i$. When $A_i$ is fully functionally dependent on $CS_j$, we say that $CS_j$ strictly implies $A_i$. The equivalence between functional dependency and logical implication has been thoroughly investigated by Sagiv et al. [6]. Because of this equivalence, the same normalization rules can be applied to both database relations and decision tables.

The primary goal of normalizing both relations and decision tables is to avoid redundancy and update anomalies. In addition, the normalization of decision tables simplifies the tables and increases their readability.

As the normalization procedure for decision tables is derived from the normalization process for database relations, we may assume that this decomposition process is also reversible. In fact it can be proved by means of propositional logic that the join of the decision tables resulting from the decomposition process gives rise to the same set of decision rules. More formally, if a set of original decision rules $D$ ($D$ is represented as a decision table) is decomposed into a set of new decision tables $\{D_1, \ldots, D_n\}$, then $D$ and $D_1 \cup D_2 \cup \ldots \cup D_n$ are equivalent.

5.1. First Normal Form

Definition: An expanded decision table is in first normal form (1NF) if every condition state and every action value is an atomic value and not a set.
Every decision table that conforms to the definition of the second section is automatically in first normal form. Indeed, the condition states are logical expressions (see footnote 2), and only limited entry actions (see footnote 3) are considered, whose values exclude each other. Not accepting sets of values for condition states significantly facilitates the checking of the consistency and completeness of the decision table.

[Figure 3: Conversion to First Normal Form. A table over conditions C1 and C2 with an action row "Premium" is converted into a table with the action rows PR. 1, PR. 2 and PR. 3; the table entries are not recoverable.]

5.2. Second Normal Form

5.2.1. Original Second Normal Form

Definition: An expanded decision table is in second normal form (2NF) if and only if it is in first normal form and every action is fully dependent on all the conditions. More formally:

$\neg(\exists CT' \subset CT, CT' \neq CT, \exists j : CT' \rightarrow AV_j)$

2. Note that although the logical expression can denote a set of values, it is the logical expression itself which is the condition value and which is indeed an atomic value.

3. Limited entry action values ('x' or '-') are always atomic, but of course other action values are also allowed by the first normal form, as long as these values are atomic (e.g. '1' is atomic but '1,2' is not).
[Figure 4: A table in Second Normal Form. Conditions C1 (Y, N) and C2 (a, b, c), actions A1 and A2; the table entries are not recoverable.]

The second normal form is a strong demand. Indeed, decision tables are a powerful way to clearly represent relations between conditions and actions, even if some actions are determined by only part of the conditions. The breakdown of decision tables into a structure of decision tables in second normal form is theoretically appealing but not always of practical use, because the overview can be lost by the breakdown. This is illustrated in the next figure:

[Figure 5: Conversion to Second Normal Form. A table over conditions C1 and C2 with actions A1 and A2 is split into sequential tables, one per action, in which conditions are repeated; the table entries are not recoverable.]

Except in a number of specific cases, the conversion to second normal form leads to a number of sequential decision tables in which certain conditions (or actions) are repeated. This repetition of conditions is inconvenient and causes a significant loss of readability, while at the same time it introduces inefficiency due to repeated testing of the same conditions. This matters all the more because a major goal of the decision table technique is precisely to visualize the combined effect of a set of conditions. The consequences of not meeting the
second normal form are not always a sufficient reason to split a decision table, as illustrated in figure 5.

However, in many cases the transformation to second normal form significantly simplifies the decision table because the resulting decision tables are smaller and easier to read. There are indeed some specific decision table constructs in which the transformed decision table is always a better representation of the decision logic. In this way a number of variants of the second normal form are derived which are weaker than the second normal form but generally applicable: every decision table is supposed to meet these weaker variants of the second normal form.

5.2.2. Weaker variants of the Second Normal Form

Elementary second normal form

Definition: A decision table is in Elementary 2NF (E2NF) if and only if it is in first normal form and the complete action set is fully dependent on the whole condition set. In other words: there does not exist a proper subset of the condition set on which the whole action set is dependent. More formally:

$\neg(\exists CT' \subset CT, CT' \neq CT : CT' \rightarrow AV)$

[Figure 6: Conversion to Elementary 2NF. A table over conditions C1, C2 and C3 with actions A1 and A2 is converted into a table over C1 and C3 only, with the same actions; the action entries are not recoverable.]
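The E2NF condition above can be tested mechanically: a condition is superfluous when the remaining conditions still determine the whole action set. The sketch below assumes the table is stored as a list of dictionaries, one per column of the expanded table. The rules that fill in A1 and A2 are invented for the example, since the entries of figure 6 are not available; only the outcome (C2 can be dropped) mirrors the figure.

```python
from itertools import product


def determines(rows, X, Y):
    """X -> Y over the given rows (columns of the expanded table)."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in X)
        y = tuple(row[a] for a in Y)
        if seen.setdefault(x, y) != y:
            return False
    return True


def superfluous_conditions(rows, conditions, actions):
    """Conditions that can each be removed without losing CT -> AV."""
    return [
        c for c in conditions
        if determines(rows, [d for d in conditions if d != c], actions)
    ]


def drop_condition(rows, condition):
    """Remove one condition and merge the columns that become identical."""
    reduced = []
    for row in rows:
        smaller = {k: v for k, v in row.items() if k != condition}
        if smaller not in reduced:
            reduced.append(smaller)
    return reduced


conditions, actions = ["C1", "C2", "C3"], ["A1", "A2"]
# Assumed rules: A1 when C1 = Y and C3 = Y; A2 when C1 = N. C2 never matters.
rows = [
    {"C1": c1, "C2": c2, "C3": c3,
     "A1": "x" if (c1 == "Y" and c3 == "Y") else "-",
     "A2": "x" if c1 == "N" else "-"}
    for c1, c2, c3 in product(["Y", "N"], repeat=3)
]

extra = superfluous_conditions(rows, conditions, actions)
reduced = drop_condition(rows, "C2")
```

Each condition is tested on its own here; when several conditions are reported, they would have to be removed one at a time and the test repeated, because dropping one may make another indispensable.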
In a decision table that is not in elementary second normal form, superfluous conditions can be found which can be deleted from the decision table without any loss of knowledge.

E2NF does not imply that every single action is fully dependent on the whole set of conditions. The possible dependencies that can occur in a decision table in elementary second normal form are listed below. Given the set of condition subjects $CS = \{CS_1, \ldots, CS_{cnum}\}$, then
- $A_j$ $(1 \leq j \leq anum)$ is implied by $CS$ (by definition);
- $\{A_1, \ldots, A_{anum}\}$ is implied by $CS$ (union);
- $A_j$ $(1 \leq j \leq anum)$ is strictly implied by $CS$ (2NF);
- $\{A_1, \ldots, A_{anum}\}$ is strictly implied by $CS$ (elementary 2NF).

Disjunctive second normal form

Definition: A decision table is in disjunctive 2NF (D2NF) if and only if it is in 1NF and it is not composed of unrelated sets of condition and action subjects. This means that there does not exist a subset of the action subjects that is fully dependent on a subset of the condition subjects while the remainder of the action subjects is dependent on the remainder of the condition subjects. More formally:

$\neg(\exists CT' \subset CT, A' \subset A : (CT' \rightarrow A' \wedge CT \setminus CT' \rightarrow A \setminus A'))$

This is illustrated in figure 7, where the global decision table is a composition of two completely independent decision tables. The split decision table obviously is a better representation than the original table (under the assumption that there are no sequence restrictions between the condition subjects). In this case the transformation always leads to a smaller number of columns, except in the case of limited entry conditions with two values, where the number of columns remains equal.
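The formal D2NF condition can be checked by brute force over all non-empty proper subsets of conditions and actions. The following sketch follows the formula literally (plain dependency on both sides); the table layout, the subject names and the rules are assumptions made for the example, and the exhaustive search is only practical for small tables.

```python
from itertools import combinations, product


def determines(rows, X, Y):
    """X -> Y over the given rows (columns of the expanded table)."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in X)
        y = tuple(row[a] for a in Y)
        if seen.setdefault(x, y) != y:
            return False
    return True


def find_unrelated_split(rows, conditions, actions):
    """Return (CT', A', CT minus CT', A minus A') with CT' -> A' and
    CT minus CT' -> A minus A', or None if the table is in D2NF."""
    for n_c in range(1, len(conditions)):
        for ct_sub in combinations(conditions, n_c):
            ct_rest = tuple(c for c in conditions if c not in ct_sub)
            for n_a in range(1, len(actions)):
                for a_sub in combinations(actions, n_a):
                    a_rest = tuple(a for a in actions if a not in a_sub)
                    if (determines(rows, ct_sub, a_sub)
                            and determines(rows, ct_rest, a_rest)):
                        return ct_sub, a_sub, ct_rest, a_rest
    return None


conditions, actions = ["C1", "C2", "C3"], ["A1", "A2"]
# Assumed rules: A1 depends on C1 and C3 only, A2 depends on C2 only.
rows = [
    {"C1": c1, "C2": c2, "C3": c3,
     "A1": "x" if (c1 == "Y" and c3 == "Y") else "-",
     "A2": "x" if c2 == "Y" else "-"}
    for c1, c2, c3 in product(["Y", "N"], repeat=3)
]

split = find_unrelated_split(rows, conditions, actions)
```

For this sample the global table has 8 columns, while the two tables found by the split have 2 and 4 columns respectively, in line with the remark above that the transformation reduces the number of columns.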
[Figure 7: Normalization of unrelated subsets. A global table over conditions C1, C2 and C3 with actions A1 and A2 is split into two independent tables, one over C1 and C3 with action A1 and one over C2 with action A2; the action entries are not recoverable.]

In one special situation there might be action subjects that must always be executed and are dependent on none of the condition subjects. Such actions can be put in a separate table, but are most of the time kept in the original table to keep the overview. The same is valid for action subjects that must never be executed and which are in fact superfluous.

Partially related second normal form

Definition: A decision table is in partially related 2NF (P2NF) (see footnote 4) if and only if it is in 1NF and there is no subset of action subjects
(i) that is fully dependent on a subset of the condition subjects while the remainder of the action subjects is dependent on a subset of the same condition subjects together with the remainder of the condition subjects, and
(ii) for which holds that the different configurations (see footnote 5) of the actions with relation to the common condition subjects do not have other common actions or condition subjects.

4. P2NF implies D2NF, which in turn implies E2NF.

5. A configuration is the configuration of values that a certain set of actions takes for a given condition.
More formally:

$\neg(\exists CC \subset CT, CT' \subset CT, CT' \cap CC = \emptyset, A' \subset AV : (CC \cup CT' \rightarrow A' \wedge CC \cup (CT \setminus (CC \cup CT')) \rightarrow AV \setminus A'))$

This is illustrated in figure 8. In this case subtables T1 and T2 are in parallel with each other: only one of the two will be executed, depending on the result of condition 1.

[Figure 8: Normalization of partially related subsets. A global table over conditions C1, C2 and C3 is split into a main table on C1 that invokes either subtable T1 or subtable T2; the table entries are not recoverable.]

One could argue that in this case the global decision table is shorter and clearer than the normalized structure and that, as a consequence, there is no advantage in splitting the global table. This, however, is not always the case. The partially related 2NF splits condition and action subjects from the common action and condition subjects, but only if the different action configurations do not have conditions or actions in common with the subset, to avoid repetition. Figure 9 illustrates this with a few examples.

The splitting of the table must in all respects be considered at construction and manipulation time. For validation purposes the global decision table might sometimes be preferred, which can then be considered as a view.
[Figure 9: Partially related 2NF. Three example tables over conditions C1 (states a, b, c) and C2 with actions A1, A2 and A3; the table entries are not recoverable. The captions of the examples read:
- two configurations of A2 and A3 with respect to C1 are equal, the other configurations have no actions or conditions in common (this table is split, with a subtable T1);
- no configurations are equal, different configurations have C2 in common;
- two configurations are equal; different configurations have A2 in common.]

Besides the maintenance and isolation advantages, there are two major reasons to perform this normalization:

1) The global decision table will not always be as clear as the normalized structure, as the ordering of conditions plays a central role in the readability of the decision table. While the ordering of the parallel conditions is irrelevant for the resulting width of the decision table, the ordering of the common conditions is not. Placing the common conditions at the end of the list makes the decision table hard to survey and obscures the existence of unrelated conditions (see figure 10).
[Figure 10: Hidden mutually unrelated conditions. The same table over conditions C1, C2 and C3 with actions A1 and A2 is shown with two different condition orderings (C3, C2, C1 and C1, C2, C3); the table entries are not recoverable.]

2) The don't care symbol as entry for an unrelated condition does not always indicate an irrelevance. Testing this condition is not always merely irrelevant; it might be undesirable because of possible side effects. This is the case when the condition is in fact a condition subtable or when the condition has a hidden bound action. Reordering of conditions is of no help; in fact the don't care symbol should be replaced by a "do not test" symbol. In figure 11 two illustrations of this case are given. In the left table the irrelevance is not correct, as there never was a jury decision (this is different from saying "no matter what they decided..."). The same is true for the table on the right-hand side, where the credit limit cannot be found as the client does not exist yet. Problems of this kind can be avoided by using a different notation, but this in turn is no solution for the condition ordering problem. Splitting the table solves both problems.

Sit for all exams          | Y | Y | N        Client exists    | Y | Y | N
Favourable decision jury   | Y | N | -        Credit limit OK  | Y | N | -
Actions                    |   |   |          Actions          |   |   |

Figure 11: improper use of the don't care symbol
5.3. Third Normal Form

Definition: A decision table is in third normal form if and only if it is in second normal form (possibly only elementary 2NF) and every action is non-transitively dependent on the conditions. More formally:

$\neg(\exists CT' \subset CT, C_i \in CT, C_i \notin CT' : CT' \rightarrow C_i)$

Third normal form is always a matter of related condition or action subjects. When action subjects are mutually related, it is most of the time sufficient to combine them into one action subject or into an action subtable without conditions. Dependencies between conditions (in fact impossibilities) indicate that a certain condition is dependent on other condition combinations and plays the role of an action. This can be handled by putting the condition in a condition subtable where the actions determine the value of the condition subject (see figure 12, where $C_3 \leftrightarrow ((C_1 \wedge C_2) \vee (\neg C_1 \wedge \neg C_2))$).

If splitting the condition table does not imply repetition of conditions and actions, the conversion to third normal form is always recommended. If repetitions are necessary, the overview might be lost, so that the global decision table is preferable.

[Figure 12: Conversion to third normal form. A global table over Condition 1, Condition 2 and Condition 3 with the rows "Action 1" and "Impossible" is converted into a main table that refers to "Condition 3?" and a condition subtable in which Condition 1 and Condition 2 determine the value of Condition 3; the table entries are not recoverable.]
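The dependency between conditions used in figure 12 can be reproduced in a few lines. The sketch below assumes that the impossible columns are exactly those violating the stated equivalence for C3; it then checks that C1 and C2 determine C3 over the remaining columns and derives the condition subtable as a mapping. The data layout and all names other than C1, C2 and C3 are illustrative.

```python
from itertools import product


def determines(rows, X, Y):
    """X -> Y over the given rows (columns of the expanded table)."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in X)
        y = tuple(row[a] for a in Y)
        if seen.setdefault(x, y) != y:
            return False
    return True


def is_possible(col):
    """C3 <-> ((C1 and C2) or (not C1 and not C2)), as stated for figure 12."""
    c1, c2, c3 = col["C1"], col["C2"], col["C3"]
    return c3 == ((c1 and c2) or (not c1 and not c2))


conditions = ["C1", "C2", "C3"]
all_columns = [
    dict(zip(conditions, combo))
    for combo in product([True, False], repeat=3)
]
possible_columns = [col for col in all_columns if is_possible(col)]

# Once the impossible columns are left out, C3 is determined by C1 and C2,
# so it can be moved to a condition subtable keyed by (C1, C2).
c3_is_derived = determines(possible_columns, ["C1", "C2"], ["C3"])
condition_subtable = {
    (col["C1"], col["C2"]): col["C3"] for col in possible_columns
}
```

Half of the eight columns of the global table are impossible here, which is the redundancy that the conversion to third normal form removes: the main table only needs to consult the subtable for the value of C3.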
6. PRACTICAL IMPLICATIONS

As has been illustrated in the various examples, normalization rules for decision tables are an excellent technique to investigate how and when a decision table can be factored. It was, however, also clear from the examples that a possible factoring is not always advisable from the point of view of readability or overview. As with database normalization, complete decomposition may be abandoned for well-defined reasons, as long as one is aware of the potential risk of redundancy or dependency.

7. CONCLUSION

In this paper it was demonstrated how normalization rules for database design can successfully be transposed to the decision table formalism by making use of the strict correspondence between functional dependency and (a subset of) propositional logic. The resulting rules can be used as a guideline for decision table factoring.
REFERENCES

[1] Codasyl, "A Modern Appraisal of Decision Tables", Report of the Decision Table Task Group, ACM, New York, 1982, pp. 230-232.
[2] Codd E., "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, 13(6), 1970, pp. 377-387.
[3] Date C. J., "An Introduction to Database Systems, Volume 1, Fifth Edition", Addison Wesley Publishing Company, 1990, 854 pp.
[4] Fagin R., "Functional Dependencies in a Relational Database and Propositional Logic", IBM Journal of Research & Development, 21(6), Nov. 1977, pp. 534-544.
[5] Kent W., "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM, 26(2), Feb. 1983, pp. 120-125.
[6] Sagiv J., Delobel C., Parker D. S., Fagin R., "An Equivalence Between Relational Database Dependencies and a Fragment of Propositional Logic", JACM, Vol. 28, No. 3, July 1981, pp. 435-453.
[7] Ullman J., "Principles of Database Systems", Computer Science Press, Inc., 1980, 379 pp.
[8] Vanthienen J., "Automatiserings-aspecten van de specificatie, constructie en manipulatie van beslissingstabellen", K.U.Leuven Dept. Applied Econ. Doctoral Dissertation, 1986, 378 pp.
[9] Verhelst M., "De Praktijk van Beslissingstabellen", Kluwer, Deventer/Antwerpen, 1980, 175 pp.