FINDING, MANAGING AND ENFORCING CFDS AND ARS VIA A SEMI-AUTOMATIC LEARNING STRATEGY


STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 2, 2014

KATALIN TÜNDE JÁNOSI-RANCZ

Received by the editors: August 15,
Mathematics Subject Classification. 68P
CR Categories and Descriptors. H.2.8 [Database Management]: Database Applications - Data mining.
Key words and phrases. Dependency Mining, Conditional Functional Dependency, Formal Concept Analysis, Association Rules.

Abstract. This paper describes our strategy, which finds Conditional Functional Dependencies (CFDs) and Association Rules (ARs). Instead of using them to clean dirty data, we use them to prevent dirty data from appearing in the database. We achieve this by differentiating strict CFDs/ARs from apparent CFDs/ARs. If we know that a CFD/AR will remain valid in the future, we can rely on it by creating constraints which guarantee that the rule will not be breached by insertions or modifications. Along with complete management of CFDs/ARs, our implemented application, called DependencyManager, also uses Formal Concept Analysis (FCA) methods to analyze the strict CFDs/ARs and draw useful conclusions. It helps its users to prevent inconsistencies, fix bugs and optimize their queries and applications by providing a lattice of CFDs/ARs, using usefulness as the relation.

1. Introduction

It is not enough to store data; the correctness of the stored data is also very important. Data cleanups lower the frequency of inconsistencies, but they also introduce new ones, because it is impossible to guarantee that these algorithms choose the correct pattern when several patterns are possible. Data quality rules can be defined in the form of FDs, CFDs [6] and ARs, among others. An FD $X \rightarrow Y$ asserts that any two tuples that agree on the values of all the attributes in $X$ must agree on the values of all the attributes in $Y$. However, many interesting constraints hold conditionally, that is, on only a subset of the relation. For example, if a dataset holds customer data from different countries, then for customers in the UK, Zip code determines

Street, which is not true for other countries. This is a CFD: [Country = UK, ZipCode] $\rightarrow$ [Street].

A CFD $\varphi$ on $R$ is a pair $(R : A \rightarrow B, T_p)$, where (1) $A$, $B$ are sets of attributes in $Attr(R)$; (2) $A \rightarrow B$ is a standard FD, referred to as the FD embedded in $\varphi$; and (3) $T_p$ is a tableau with attributes in $A$ and $B$, referred to as the pattern tableau of $\varphi$, where for each $X$ in $A \cup B$ and each tuple $t \in T_p$, $t[X]$ is either a constant $a$ in $Dom(X)$, or an unnamed variable that draws values from $Dom(X)$.

CFDs can be described as sets of ARs, which are dependencies holding on particular values of attributes. An AR is a dependency $c \Rightarrow A = a$, where $A = (A_1, \ldots, A_n)$ and $a = (a_1, \ldots, a_n)$; here $c$ is a condition, $A$ is a set of columns and $a$ is a set of constants. Our implementation supports only those conditions which can be represented in the form $C = ct$, where $C$ is a column and $ct$ is a constant. ARs can be defined as special CFDs with an additional property, namely that the determinant column set is the empty set.

If a formula for a dependency is almost met, because the number of records inconsistent with the rule is very small compared to the total number of affected records, then one should not consider it to be a CFD/AR just yet, because the exceptions to the rule might be either incorrect or correct data. This leads to the conclusion that the correctness of the exceptions must be checked. If there is at least one correct exception, then the rule is not correct. Otherwise the exceptions should be cleaned up and the rule is a CFD/AR. As a consequence, we believe that a dependency learner should consider a rule to be a dependency only if there is no valid exception from the rule. After dependencies have been learned, they should be validated. By validating dependencies we can determine whether they are Strict Dependencies (SDs) or Apparent Dependencies (ADs).
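To make the definitions above concrete, here is a minimal Python sketch of how the exceptions to a candidate CFD or AR could be listed for checking. It is not the DependencyManager implementation: it assumes that rows are plain dictionaries and that conditions have the supported form C = ct. The function names `cfd_exceptions` and `ar_exceptions` and the sample customer rows are hypothetical.

```python
from collections import defaultdict

def cfd_exceptions(rows, cond_col, cond_val, determinant, dependent):
    """Return the determinant groups that break [cond_col = cond_val, determinant] -> [dependent].

    Rows matching the condition are grouped by their determinant values;
    a group is an exception if it carries more than one dependent value.
    """
    groups = defaultdict(set)
    for row in rows:
        if row[cond_col] != cond_val:
            continue  # the CFD only constrains rows that satisfy the condition
        key = tuple(row[c] for c in determinant)
        groups[key].add(tuple(row[c] for c in dependent))
    return {key: values for key, values in groups.items() if len(values) > 1}

def ar_exceptions(rows, cond_col, cond_val, columns, constants):
    """Return the rows that satisfy the condition but break the AR c => columns = constants."""
    return [row for row in rows
            if row[cond_col] == cond_val
            and tuple(row[c] for c in columns) != tuple(constants)]

# Hypothetical sample data in the spirit of the Country/ZipCode/Street example.
customers = [
    {"Country": "UK", "ZipCode": "EH4 1DT", "Street": "Mayfield"},
    {"Country": "UK", "ZipCode": "EH4 1DT", "Street": "Mayfield"},
    {"Country": "UK", "ZipCode": "W1B 1JH", "Street": "Regent"},
    {"Country": "US", "ZipCode": "07974", "Street": "Tree Ave"},
    {"Country": "US", "ZipCode": "07974", "Street": "Elm Str"},
]

uk_exceptions = cfd_exceptions(customers, "Country", "UK", ["ZipCode"], ["Street"])
us_exceptions = cfd_exceptions(customers, "Country", "US", ["ZipCode"], ["Street"])
```

On this sample the UK rows produce no exception group, while the same rule conditioned on Country = US reports one conflicting Zip code. Whether such an exception is dirty data or a legitimate counterexample is the question the validation step has to answer.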
ADs are all dependencies which are applicable to the current dataset, but for which there is no guarantee that the dependency will still be applicable after correct inserts or updates. Most previous works omitted the validation step and considered all CFDs/ARs to be valid, but that is a false presumption.

Example 1. Let us consider a company where, at a given moment, the position determines the salary: all managers have a salary of X, all technicians have a salary of Y and all cleaners have a salary of Z. If this is not a company policy, just a coincidence, then the FD is invalid, as there is no guarantee that the position will determine the salary in the future too. The functional dependency of work position determining the salary is an AD.

There is no way to determine the validity of such dependencies automatically, because programs are not aware of the nature of the entities stored as records and cannot draw logical conclusions about them. This is the job of human specialists, who

understand the business logic of the system and the nature of the records stored in the table.

1.1. Our contributions. As part of our research we have implemented an application called DependencyManager, which learns FDs, CFDs and ARs from an arbitrary database. The learned dependencies can be validated by accepting or rejecting them, which changes their status from candidate to Strict Dependency (SD) or Apparent Dependency (AD). We simplify validation with the More Useful (MU) relation: only the most useful candidates need to be taken into account, which drastically reduces the number of dependencies to consider, and we only store the most useful SDs because all the other dependencies are implied by them. The result of our research is a system which uses constraints to prevent all inconsistencies resulting from inserts or updates that do not comply with the accepted SDs. We use FCA methods to analyze the strict CFDs/ARs and draw useful conclusions. Our approach is useful because:

1. It prevents inconsistencies from appearing in the database.
2. If an error occurs because of mishandling data where a strict CFD/AR is applicable, then the problem can be identified in the error logs, and bugs can potentially be found and fixed.
3. Knowing that a CFD/AR is strict, and knowing all the value sets for the determinant and dependent columns, we can determine the values of the dependent columns from a value set of the determinant columns. This way data-processing and backup-creating algorithms can be optimized.

2. Related work

CFDs have recently been proposed for pattern-finding purposes; CFD patterns are frequently used by data cleaners. In [6, 3, 11, 7], the authors employed exact and approximate CFDs to characterize the semantics of data, for example by discovering and verifying the confidence, support, and parsimony of CFDs on a given relation instance.
Recent research on CFDs has focused mainly on implication analysis, consistency and axiomatizability, see [7]. In [5], the authors studied how to efficiently estimate the confidence of a CFD with a small number of passes over the input using small space. Fan et al. [9] developed three algorithms for discovering minimal CFDs and a novel optimization technique via closed-item-set mining. A hierarchy of CFDs, FDs and ARs has been proposed in [16], with some theoretical results on pattern tableaux equivalence. Many integrity constraints have been studied for data cleaning in [4, 17, 2, 13]. Existing data repair techniques do not guarantee correct fixes in data monitoring, because they may deteriorate the data by introducing new errors when trying to repair it. [8] proposed editing rules that, compared to constraints used in data cleaning, are capable of finding precise

fixes by updating input tuples with master data. [17] proposed a framework called GDR (Guided Data Repair) to handle the problem of data cleaning. This framework involves the user in controlling the cleaning process, side by side with existing automatic cleaning techniques, through an interactive process.

In these previous works we have seen algorithms which find CFDs and then periodically check whether each CFD still holds. Previous works focused on finding dirty data and cleaning it. We believe it is easier to prevent the appearance of dirty data than to clean an already dirty database. Therefore our algorithm, instead of finding CFDs and applying them to clean data, finds CFDs and, if a CFD is validated to be an SD, generates constraints that prevent any insertion or modification that would break the rule.

It is well known that FDs admit interesting characterizations in terms of FCA [10, 12]. The authors of [1, 14] considered that the internal logic of data can also be displayed by means of the so-called implications, which proved to be the proper framework to describe functional dependencies with FCA tools. In [1], the authors studied the lattice characterization and its properties for Armstrong and symmetric dependencies. In [15], the authors present the relation between CFDs and FCA. They showed that the lattice of CFDs is a synthetic representation of the concept lattice. In contrast to these works, which focused mainly on the application of FCA in mining different types of dependencies, we analyze the properties of the CFDs and ARs discovered with our application by means of an implemented FCA. In other words, we exploit FCA here to derive an ontology containing concepts.

3. Learning, Presentation, Validation and Forgetting

In this section we share our strategy, which is based on automatic learning, semi-automatic validation and automatic forgetting.

3.1. Learning.
To gather CFDs in a database, we must iterate over all the conditions of all columns of all tables. In our case a condition has the form $R.C = ct$, where $R$ is a table, $C$ is a column and $ct$ is a constant. ARs can be found like CFDs; we just have to iterate over every condition of every column of every table. For each condition we must find the ARs. If $A$ is the set of determinant columns, $R.A = \{R.A_1, \ldots, R.A_m\}$; $B$ is the set of dependent columns, $R.B = \{R.B_1, \ldots, R.B_n\}$; $a = \{a_1, \ldots, a_m\}$ is an m-dimensional constant and the condition can be checked by a Boolean function $c$ for any record $r$, then the CFD is valid if and only if it is inconsistent to insert/update record $r_1$ in such a way that the following criterion holds:

$\exists r_2 \in R$ such that $((r_1.A_1 = r_2.A_1) \wedge \ldots \wedge (r_1.A_m = r_2.A_m) \wedge ((r_1.B_1 \neq r_2.B_1) \vee \ldots \vee (r_1.B_n \neq r_2.B_n))) \wedge c(r_2) \wedge c(r_1)$

and the AR is valid if and only if it is inconsistent to insert/update record $r_1$ in such a way that the following criterion holds:

$((r_1.A_1 \neq a_1) \vee \ldots \vee (r_1.A_m \neq a_m)) \wedge c(r_1)$.

Periodic checking for new ARs and CFDs is essential if one intends to handle them. Newly found dependencies are stored in the database. Because new rules are discovered and stored, this periodic action is called learning. If a dependency is already known by the system, then the rule will not be learned again. Essentially, all the rows of any table at any given moment can be described as a record which fulfills a dependency. This would generate many dependencies, slowing down the learning algorithm and storing a lot of dependencies, most of which are ADs. As a consequence, constraints are needed to increase the speed of the algorithm and to store only relevant ARs and CFDs. For now, we have created two constraints: MinOccurrenceFrequency and MinOccurrenceRate. With these constraints we can prevent over-learning. DependencyManager allows its users to set the values of these constraints.

The first step is to discover conditions. Condition Occurrence is the number of occurrences of a condition in a table. Condition Occurrence depends on a Relation, a Column and a Value, and gives the number of occurrences of the condition defined by $R$, $C$ and $V$. The formula $R.C = V$ is considered to be a condition if the constraints MinOccurrenceFrequency and MinOccurrenceRate are met. Let us introduce the notations

$CO(R, C, V) = card(r \in R : r.C = V)$ and $c : R \times C \times V \rightarrow \{0; 1\}$, $c(R, C, V) = (R.C = V)$.

Observe that

$CO(R, C, V) \geq MinOccurrenceFrequency$, $\frac{CO(R, C, V)}{card(r \in R)} \geq MinOccurrenceRate$.

The learner iterates in a cycle through all the columns of all tables and learns the potential conditions.
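The condition-discovery step can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the code of DependencyManager: rows are dictionaries, the helper name `discover_conditions` is hypothetical, and the sample rows and threshold values are invented for the example.

```python
from collections import Counter

def discover_conditions(rows, column, min_occurrence_frequency, min_occurrence_rate):
    """Return {V: CO(R, C, V)} for every value V such that `column = V` is a condition.

    A value qualifies when its occurrence count reaches MinOccurrenceFrequency
    and its share of all rows reaches MinOccurrenceRate.
    """
    total = len(rows)
    if total == 0:
        return {}
    occurrences = Counter(row[column] for row in rows)  # CO(R, C, V) for each V
    return {value: co for value, co in occurrences.items()
            if co >= min_occurrence_frequency and co / total >= min_occurrence_rate}

# Hypothetical sample: 10 exception records.
exceptions = (
    [{"Message": "File not found", "AppID": 3}] * 6
    + [{"Message": "Authentication needed", "AppID": 1}] * 3
    + [{"Message": "Thread was already closed", "AppID": 1}] * 1
)

conditions = discover_conditions(exceptions, "Message",
                                 min_occurrence_frequency=3,
                                 min_occurrence_rate=0.25)
```

With these thresholds the two frequent messages become conditions and the message that occurs only once is ignored, which is how the two constraints limit over-learning.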
If the set of columns of $R$ is $Cols$ and $c(R, C, V)$ is a condition, where $C$ is a column from $Cols$, then the possible AR and CFD scenarios can be described as:

$ARScenarios(R, C) = SubCols \subseteq (Cols \setminus \{C\})$ and $SubCols \neq \emptyset$.

$CFDScenarios(R, C) = Determinant \subseteq (Cols \setminus \{C\})$ and $Dependent \subseteq (Cols \setminus \{C\})$, such that $Determinant \cap Dependent = \emptyset$, $Determinant \neq \emptyset$, $Dependent \neq \emptyset$.

We have defined an algorithm (Algorithm 3.1) which is used by the application to learn new CFDs and ARs. Intuitively speaking, the algorithm iterates over all the interesting conditions of all columns of all tables in the database, except the system tables. In each step, all the possible AR and CFD scenarios are generated and if no equivalent is found as a CFD, SD or AD, then the

given scenario will be considered to be a new CFD.

Algorithm 3.1 SD-candidate learner
for all R in Tables do
    Conditions ← ∅
    for all C in R.Cols do
        Conditions.Clear()
        Conditions.AddRange(GetConditions(R, C))
        for all c in Conditions do
            ARScenarios ← GetARScenarios(R, C)
            for all ARScenario in ARScenarios do
                if IsAR(c, R, C, ARScenario) and not IsAlreadyLearnedAR(c, R, C, ARScenario) then
                    StoreAR(c, R, C, ARScenario)
                end if
            end for
            CFDScenarios ← GetCFDScenarios(R, C)
            for all CFDScenario in CFDScenarios do
                if IsCFD(c, R, C, CFDScenario) and not IsAlreadyLearnedCFD(c, R, C, CFDScenario) then
                    StoreCFD(c, R, C, CFDScenario)
                end if
            end for
        end for
    end for
end for

In Algorithm 3.1, Conditions is the set of conditions applicable on R.C that fulfill the constraints MinOccurrenceFrequency and MinOccurrenceRate. The time complexity of Algorithm 3.1 can be calculated knowing the number of tables, the average number of table columns and the average number of column conditions, while its space complexity depends on the number of dependencies, the average number of determinant columns and the average number of dependent columns.

Complexity of Algorithm 3.1 $= \Theta(T \cdot AFN \cdot ACN \cdot (card(ARScenarios(R, AFN)) + card(CFDScenarios(R, AFN)))) = \Theta(T \cdot AFN \cdot ACN \cdot (3^{AFN-1} - 2^{AFN-1} - 1))$,

where $T$ is the number of tables, $AFN$ is the average number of table columns and $ACN$ is the average number of column conditions.

Space complexity of Algorithm 3.1 $= \Theta(card(Dependency) \cdot (1 + avg(card(Determinant)) + avg(card(Dependent))))$
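The exponential growth of the scenario sets can be checked by brute force. The Python sketch below enumerates the AR and CFD scenarios for one condition column. It assumes the reading of the definitions used in this section (non-empty column subsets for ARs, ordered pairs of disjoint non-empty subsets for CFDs); the function names are hypothetical stand-ins for GetARScenarios and GetCFDScenarios, not the application's code. Under that reading a table with n columns yields $2^{n-1} - 1$ AR scenarios and $3^{n-1} - 2^{n} + 1$ CFD scenarios per condition column.

```python
from itertools import combinations

def nonempty_subsets(columns):
    """Yield every non-empty subset of `columns` as a frozenset."""
    cols = sorted(columns)
    for size in range(1, len(cols) + 1):
        for subset in combinations(cols, size):
            yield frozenset(subset)

def ar_scenarios(cols, c):
    """Non-empty subsets of the columns other than the condition column c."""
    return list(nonempty_subsets(set(cols) - {c}))

def cfd_scenarios(cols, c):
    """(Determinant, Dependent) pairs: disjoint, non-empty, avoiding column c."""
    rest = set(cols) - {c}
    return [(det, dep)
            for det in nonempty_subsets(rest)
            for dep in nonempty_subsets(rest - det)]

# Column names of the Exceptions example used in this paper.
cols = ["Message", "AppID", "CrossAppException", "OperatingSystemException"]
n = len(cols)

ars = ar_scenarios(cols, "Message")
cfds = cfd_scenarios(cols, "Message")
```

For the four-column Exceptions table this gives 7 AR scenarios and 12 CFD scenarios per condition column. The $3^{n-1}$ term dominates quickly as tables get wider, which is why the occurrence thresholds matter in practice.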

One of our experiments was conducted on a database containing exceptions, and we use this example throughout the paper to illustrate the steps of our method. For our experiments, we used the attributes Message, AppID, CrossAppException and OperatingSystemException, where Message represented an Exception message, AppID was the ID of the Application where the given Exception occurred, and CrossAppException and OperatingSystemException were Boolean values. Possible values for AppID were 1 (Forum), 2 (Wiki), 3 (Desktop), 4 (Website) and 5 (Engine). Table 1 shows some SDs, a part of the accepted dependencies.

Id  Condition                                Determinant                                       Dependent
3   e.message = 'Authentication needed'      e.crossappexception, e.opsystexception            CrossAppException = 1 and OpSystException = 0
6   e.message = 'Connection to DB failed'    e.crossappexception, e.opsystexception, e.appid   AppID = 3, CrossAppException = 0, OpSystException = 0
5   e.message = 'File not found'             e.crossappexception, e.opsystexception, e.appid   AppID = 3, CrossAppException = 0, OpSystException = 1
2   e.message = 'Thread was already closed'  e.crossappexception, e.appid, e.opsystexception   AppID = 1, CrossAppException = 0, OpSystException = 0
4   e.opsystexception = 1                    e.crossappexception, e.appid, e.message           Message = 'File not found', AppID = 3, CrossAppException = 0

Table 1. SDs, accepted CFDs

3.2. Usefulness. In this section we define the More Useful relation. Let $D_1$ and $D_2$ be two CFDs, and let $D_1.c$, $D_1.A$, $D_1.B$ and $D_2.c$, $D_2.A$, $D_2.B$ be the condition, determinant column set and dependent column set of $D_1$ and $D_2$ respectively. We define the More Useful (MU) relation as:

$D_1 \; MU \; D_2 \Leftrightarrow (D_1.c = D_2.c) \wedge (D_1.A \subseteq D_2.A) \wedge (D_1.B \supseteq D_2.B)$.

It is impossible to compare the usefulness of $D_1$ and $D_2$ if their conditions are not the same, because they are not applicable on the same record set.
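The MU relation defined above translates directly into set comparisons. The following Python sketch is an illustration only: the `Dependency` class and the helpers `more_useful` and `most_useful` are hypothetical names, and conditions are compared as plain strings. The data are the five CFDs $D_1, \ldots, D_5$ of Example 2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dependency:
    name: str
    condition: str
    determinant: frozenset
    dependent: frozenset

def more_useful(d1, d2):
    """D1 MU D2: same condition, D1.A is a subset of D2.A, D1.B is a superset of D2.B."""
    return (d1.condition == d2.condition
            and d1.determinant <= d2.determinant
            and d1.dependent >= d2.dependent)

def most_useful(deps):
    """Names of the dependencies for which no other dependency in `deps` is more useful."""
    return {d.name for d in deps
            if not any(o != d and more_useful(o, d) for o in deps)}

c = "Exception.CrossAppException = '0'"
D1 = Dependency("D1", c, frozenset({"Message"}), frozenset({"AppID"}))
D2 = Dependency("D2", c, frozenset({"Message"}),
                frozenset({"OperatingSystemException", "AppID"}))
D3 = Dependency("D3", c, frozenset({"OperatingSystemException", "Message"}),
                frozenset({"AppID"}))
D4 = Dependency("D4", c, frozenset({"Message"}),
                frozenset({"OperatingSystemException"}))
D5 = Dependency("D5", c, frozenset({"AppID", "Message"}),
                frozenset({"OperatingSystemException"}))

shown_first = most_useful([D1, D2, D3, D4, D5])
shown_after_rejecting_d2 = most_useful([D1, D3, D4, D5])
```

With all five candidates only $D_2$ is most useful; once $D_2$ is removed from the set, $D_1$ and $D_4$ take its place, which is the behaviour described for the validation view in Example 2.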
If $D_1$ is more useful than $D_2$, then $D_1$ is also more descriptive, because $D_1.A$ cannot contain any column outside of $D_2.A$ and $D_2.B$ cannot contain any column outside of $D_1.B$. MU is not a strict relation; $D_1 \; MU \; D_1$ is true, to guarantee reflexivity and anti-symmetry. If $D_1$ is a CFD and there is no $D_2 \neq D_1$ which is more useful than $D_1$, then $D_1$ is a most useful CFD. Note that a non-AR cannot be more useful than an AR, because the determinant column set of an AR is the empty set, the smallest set in the relation. MU is reflexive,

transitive and anti-symmetric. On a set of CFDs defined for a database, MU is an order.

3.3. Validation. Validation is not always easy for users, because many dependencies can be learned, but the More Useful relation significantly reduces the difficulty of validation if used properly. As mentioned earlier, the user sees a list of the most useful CFDs/ARs and decides whether they are SDs or ADs. If the user accepts a dependency $D_1$, then every $D_2$ which meets the criterion $D_1 \; MU \; D_2$ instantly becomes invalid, because any $D_2$ less useful than $D_1$ is redundant with $D_1$. If the user rejects a CFD/AR $D_1$, then it becomes invalid and all CFDs/ARs $D_2$ which meet the criterion $(D_1 \; MU \; D_2) \wedge (\nexists D_3$ such that $D_1 \; MU \; D_3 \; MU \; D_2)$ are shown to the user, because they become most useful ARs/CFDs. DependencyManager lets the user decide whether a dependency is an SD or an AD. The following example illustrates the relation of usefulness between some CFDs.

Example 2.
$D_1$ is a CFD, $D_1.c$ = (Exception.CrossAppException = '0'), $D_1.A$ = (Exception.Message), $D_1.B$ = (Exception.AppID).
$D_2$ is a CFD, $D_2.c = D_1.c$, $D_2.A$ = (Exception.Message), $D_2.B$ = (Exception.OperatingSystemException, Exception.AppID).
$D_3$ is a CFD, $D_3.c = D_1.c$, $D_3.B$ = (Exception.AppID), $D_3.A$ = (Exception.OperatingSystemException, Exception.Message).
$D_4$ is a CFD, $D_4.c = D_1.c$, $D_4.A$ = (Exception.Message), $D_4.B$ = (Exception.OperatingSystemException).
$D_5$ is a CFD, $D_5.c = D_1.c$, $D_5.A$ = (Exception.AppID, Exception.Message), $D_5.B$ = (Exception.OperatingSystemException).
In our experiment $D_1$ was an SD and $D_2$, $D_3$, $D_4$, $D_5$ were ADs.
$(D_1.c = D_2.c) \wedge (D_1.A = D_2.A) \wedge (D_2.B \supset D_1.B) \Rightarrow D_2 \; MU \; D_1$.
$(D_1.c = D_3.c) \wedge (D_1.A \subset D_3.A) \wedge (D_1.B = D_3.B) \Rightarrow D_1 \; MU \; D_3$.
$D_2 \; MU \; D_1 \; MU \; D_3 \Rightarrow D_2 \; MU \; D_3$.
$(D_2.c = D_4.c) \wedge (D_2.A = D_4.A) \wedge (D_2.B \supset D_4.B) \Rightarrow D_2 \; MU \; D_4$.
$(D_2.c = D_5.c) \wedge (D_2.A \subset D_5.A) \wedge (D_2.B \supset D_5.B) \Rightarrow D_2 \; MU \; D_5$.
$(D_4.c = D_5.c) \wedge (D_4.A \subset D_5.A) \wedge (D_4.B = D_5.B) \Rightarrow D_4 \; MU \; D_5$.
$D_2$ is a most useful element. The user did not see $D_1$, $D_3$, $D_4$ and $D_5$, because $D_2$ was more useful. When we invalidated $D_2$, $D_1$ and $D_4$ appeared on the view, because after we excluded $D_2$ from the set, $D_1$ and $D_4$ became most useful. $D_5$ was still hidden, because $D_4$ is more useful than $D_5$, and $D_3$ was hidden, because $D_1$ is more useful than $D_3$.

The size of the needed space depends on the number of dependencies. If there are $n$ dependencies, where $k$ columns appear in a dependency on average,

then the system uses $n(k + 1)$ records to store the dependencies. If we accept $m$ CFDs/ARs, then $m$ constraints are generated, and those constraints need to be stored as well. We conclude that the space complexity of DependencyManager is relatively low.

3.4. Forgetting. After CFDs/ARs are learned, they are stored as SD-candidates. When they are accepted, they are stored as SDs and removed from the SD-candidates. When SD-candidates are rejected, they are stored as ADs and no longer as SD-candidates. When we completely delete an SD, an AD or an SD-candidate, we forget it and delete it along with its constraint. Table 2 describes the possible cases when an SD-candidate, an AD or an SD should be forgotten. Forgetting is useful to simplify dependency maintenance, and it is included in DependencyManager.

Case  Type          Description                 Automatically detectable
1     SD-candidate  It is no longer a CFD/AR    Yes
2     AD            It is no longer a CFD/AR    Yes
3     AD            It was mistakenly rejected  No
4     SD            It was mistakenly accepted  No
5     SD            A more useful SD appeared   Yes
6     SD            Schema change               Yes

Table 2. Cases when forgetting is useful

4. Experimental results

We have evaluated the efficiency and effectiveness of our algorithm on four datasets. In the datasets named Numbers1 and Numbers2 the rows were generated numbers, using formulas to guarantee the occurrence of CFDs and ARs. We also conducted experiments on a real dataset from the UCI machine learning repository, namely the Wisconsin breast cancer (WBC) dataset. The Exceptions dataset was also generated by us using real data.

Dataset     Arity  No. Rows  MOF  MOR  SD  AD
Numbers1
Numbers2
WBC
Exceptions

Table 3. Experimental results (MOF = MinOccurrenceFrequency, MOR = MinOccurrenceRate)

Our algorithm does not represent a new approach for finding CFDs, therefore there is no point in comparing it to the CFD finders of previous works. Our algorithm represents a new approach to CFD handling. Table 3 describes the parameters of the datasets used, the MinOccurrenceFrequency and MinOccurrenceRate used in our experiments, and the number of accepted and rejected dependencies (SDs and ADs) for each dataset. We use MOF and MOR to set the sensitivity of the system. If we choose a low occurrence frequency, then the system is sensitive even to dependencies with few occurrences, and we will store a lot of dependency candidates, many of which might be invalid. If we set the threshold too high, then we might not discover all the valid dependencies. The occurrence rate is the minimum ratio between the number of records where the dependencies are applicable and the total number of records. We have to choose a sensitivity which potentially enables us to find all the valid dependencies. Table 3 shows the settings chosen in our cases, but these numbers can vary with the database or with preferences.

If an SD pattern is accepted (validated), then a constraint prevents inserts and updates inconsistent with the accepted pattern. We quantified the amount of data protected by our system against inconsistency violation. The number of records complying with at least one SD is Numbers1: 8620, Numbers2: , WBC: 384, Exceptions: 7083, and an arbitrary insert or update in the future has a probability of 43.1%, %, % and 63.28%, respectively, of being protected by our system against inconsistencies violating our SDs. If an error occurs because of mishandling data where an SD is applicable, then the problem can be identified in the error logs, and bugs can potentially be found and fixed.
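As an illustration of how an accepted SD can be turned into a constraint, the Python sketch below encodes SD 5 of Table 1 as a CHECK constraint in an in-memory SQLite database. This is an assumption made for the example: the paper does not state which database system or constraint mechanism DependencyManager generates, and the table name `ExceptionLog`, the constraint name `sd_5` and the helper `try_insert` are hypothetical.

```python
import sqlite3

# SD 5 of Table 1: Message = 'File not found' implies
# AppID = 3, CrossAppException = 0 and OperatingSystemException = 1.
SCHEMA = """
CREATE TABLE ExceptionLog (
    Message TEXT NOT NULL,
    AppID INTEGER NOT NULL,
    CrossAppException INTEGER NOT NULL,
    OperatingSystemException INTEGER NOT NULL,
    CONSTRAINT sd_5 CHECK (
        Message <> 'File not found'
        OR (AppID = 3 AND CrossAppException = 0 AND OperatingSystemException = 1)
    )
)
"""

def try_insert(conn, row):
    """Insert a row; return False if a constraint of an accepted SD rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO ExceptionLog VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

consistent = try_insert(conn, ("File not found", 3, 0, 1))     # complies with SD 5
inconsistent = try_insert(conn, ("File not found", 2, 0, 1))   # wrong AppID
unaffected = try_insert(conn, ("Authentication needed", 1, 1, 0))  # condition not met
row_count = conn.execute("SELECT COUNT(*) FROM ExceptionLog").fetchone()[0]
```

The row with the wrong AppID is refused at insert time, so the dirty record never reaches the table, while rows that do not match the condition of the SD are not affected. The same CHECK also guards updates.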
SD management is simplified by the More Useful relation, which shows only the most useful SDs to the system and its users. We therefore drastically reduce the number of dependencies to be taken into account, and we only store the most useful SDs, because all the other dependencies are implied by them. We estimated the benefit of using the MU relation. If a table has $n$ columns and a dependency has one column for the condition, $s$ columns in the determinant column set and $d$ columns in the dependent column set, then there are $2^{(n-s-d-1)} + 2^{d} - 1$ possible dependencies which are less useful. If we accept the given dependency, we automatically refuse all the less useful dependencies, which reduces the number of dependencies the system or user has to take into account by a maximum of $2^{(n-s-d-1)} + 2^{d} - 1$. This helps a lot in the validation and the usage of the dependencies; it is an exponential optimization.

As a result of our experiments we have seen that the constraints were successfully generated and prevented any attempt to breach their CFDs, we

have easily identified the SDs, which were then ready for further analysis by our FCA module.

5. Studying the set of SDs with FCA

We have used FCA (Formal Concept Analysis) methods to analyze SDs and draw conclusions from them. Our approach is agnostic towards the nature of the content of the database; we have implemented an FCA applicable to the accepted dependencies of any database using our system. FCA [10] is a useful method for drawing conceptual conclusions from a set of data; its steps are as follows:

1. Creating a context which contains objects and attributes.
2. Creating concepts by grouping objects which have the same set of attributes; objects with different attribute sets are grouped into different concepts.
3. Transforming each formal context into a concept lattice, which is the basis for further data analysis.

The lattice has a smallest and a biggest element, and not every pair of concepts $C_1$, $C_2$ from $C$ is comparable. We use the Join and the Meet operations between concepts, knowing that

$(Meet(C_1, C_2) = C_3) \Leftrightarrow (((C_1.A \cap C_2.A = \emptyset) \wedge (C_3 = C_{Min})) \vee ((C_1.A \cap C_2.A \neq \emptyset) \wedge (C_1.O \cup C_2.O = C_3.O) \wedge (C_1.A \cap C_2.A = C_3.A)))$,

and

$(Join(C_1, C_2) = C_3) \Leftrightarrow (((C_1.O \cap C_2.O = \emptyset) \wedge (C_3 = C_{Max})) \vee ((C_1.O \cap C_2.O \neq \emptyset) \wedge (C_1.O \cap C_2.O = C_3.O) \wedge (C_1.A \cup C_2.A = C_3.A)))$,

the result of which is a concept from $C$, because $C_1.O \subseteq All(O, C)$ and $C_1.A \subseteq All(A, C)$ for any $C_1 \in C$.

Using FCA we can draw interesting conclusions. In DependencyManager we have implemented an FCA module which studies the SDs. In this module the objects are the SDs and the attributes are possible properties of the objects. The attributes are Weakness, Determinant Column Cluster's Size, Column Cluster's Size, Frequency of Occurrence and Almost Symmetrical.

Weakness measures how small a dependency is in the More Useful relation. The less useful a dependency is, the weaker it is. We consider $D_1$ to be weaker than $D_2$ if $D_2 \; MU \; D_1$.
If an SD is fairly weak, then it is normal to assign it lower priority in our researches than the priority assigned to its more useful counterparts. Definition 1. (Weakness) The formula of W eakness(c 1, C 2, CT ) = C 1 C 2 2CT defines the attribute weakness of concepts C 1 and C 2, where C 1 is the number of determinant columns, C 2 is the number of dependent columns and CT is the total number of columns in the table.

Proposition 1. The numerical value of Weakness is between 0 and 1.

We have defined the fuzzy sets None, Very Low, Low, Medium, High, Very High and Total for this attribute. Based on this formula, a dependency with a weakness of zero is considered to have no weakness, a dependency with a value between 0 and 0.2 to have a very low weakness, and so on.

Determinant Column Cluster's Size is the number of SDs having the same set of determinant columns. We have defined the fuzzy sets Tiny, Small, Medium, Big and Huge for this attribute. If many dependencies have the same determinant column set, then this determinant column set is the source of multiple potential implication schemas, which are the starting point of many learning algorithms.

Column Cluster's Size is the number of SDs having the same dependent columns and the same determinant columns. This attribute shows concepts which have the same schema and differ only in their condition. If we have many such schemas, then we have to think about the schema of the database, because we might be learning FDs obscured by dirty data. We have defined the fuzzy sets Tiny, Small, Medium, Big and Huge for this attribute. We consider a cluster size of 1 to be tiny and a cluster size of 6 to be big.

Frequency of Occurrence is the percentage of occurrence of an FD pattern. This number is the sum of the number of all records matching any SD with the same FD pattern. For instance, if a CFD fulfills $R.A \rightarrow R.B$, where $R.A$ and $R.B$ are column sets from the table $R$, with the condition T.C = somevalue, then the CFD matches the FD pattern $R.A \rightarrow R.B$. Of course it is not an FD because of the condition, but it still matches an FD pattern. We have defined the fuzzy sets Nonexistent, Very Rare, Rare, Medium, Frequent, Very Frequent and FD.
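The mapping from numeric attribute values to fuzzy set names can be sketched in a few lines of Python. The thresholds are the ones listed for Weakness and for the two cluster sizes in Table 4. The function names and the handling of out-of-range input are assumptions of this sketch, not part of DependencyManager.

```python
def weakness_label(w):
    """Map a weakness value in [0, 1] to the fuzzy set names of Table 4."""
    if not 0 <= w <= 1:
        raise ValueError("weakness must lie in [0, 1]")
    if w == 0:
        return "None"
    if w == 1:
        return "Total"
    # Strict upper bounds, as in Table 4: <0.2, <0.4, <0.6, <0.8, <1.
    for bound, label in ((0.2, "Very Low"), (0.4, "Low"), (0.6, "Medium"), (0.8, "High")):
        if w < bound:
            return label
    return "Very High"

def cluster_size_label(size):
    """Map a cluster size (number of SDs, at least 1) to Tiny/Small/Medium/Big/Huge."""
    if size < 1:
        raise ValueError("a cluster contains at least one SD")
    if size == 1:
        return "Tiny"
    # Inclusive upper bounds, as in Table 4: <=3, <=5, <=7, >7.
    for bound, label in ((3, "Small"), (5, "Medium"), (7, "Big")):
        if size <= bound:
            return label
    return "Huge"
```

For example, `cluster_size_label(1)` gives "Tiny" and `cluster_size_label(6)` gives "Big", in line with the cluster sizes mentioned above. Because the Weakness bounds are strict, a weakness of exactly 0.2 falls into "Low".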
If $D_1$ and $D_2$ are SDs such that $DependentCols_{D_1} = DeterminantCols_{D_2}$ and $DependentCols_{D_2} = DeterminantCols_{D_1}$, then $D_1$ is Almost Symmetrical with $D_2$.

We have used fuzzy values for these attributes; Table 4 shows the quantitative metrics that express our interpretation of these values. Example 3 illustrates the power of the combination of SDs and FCA.

Example 3. Let us consider a table $R$ which holds many records, with a lot of white noise due to dirty data. Also, let us assume that there is a pattern $R.A \rightarrow R.B$, where $R.A$ and $R.B$ are sets of columns. Because of the white noise, no tool is able to find the FD $R.A \rightarrow R.B$. However, the Column Cluster's Size of the CFDs matching the pattern $R.A \rightarrow R.B$ is Big, none of the CFDs matching the pattern is almost symmetrical, their weakness is the same, the Determinant Column Cluster's Size of the pattern is Big, and the Frequency of Occurrence of all CFDs matching the pattern is either frequent or very

frequent (because the table has many records and much dirty data). Because of these facts, all the SDs meeting the FD pattern $R.A \rightarrow R.B$ will be objects of Concept 1. If a user studies the results of the FCA algorithm, he will find many objects with the FD pattern $R.A \rightarrow R.B$ in Concept 1. He will study the records of the table and discover that the FD pattern generates so many CFDs because it is in fact an FD obfuscated by dirty data. He will then know this useful information for certain, which is a great starting point for fixing inconsistency problems in this case. Thanks to the analysis provided by FCA, the user got a clue which led him to the important conclusion that the FD pattern $R.A \rightarrow R.B$ is essentially an FD instead of many CFDs.

W  None         Very Low   Low     Medium  High      Very High      Total
   0            <0.2       <0.4    <0.6    <0.8      <1             1
D  Tiny         Small      Medium  Big     Huge
   1            <=3        <=5     <=7     >7
C  Tiny         Small      Medium  Big     Huge
   1            <=3        <=5     <=7     >7
O  Nonexistent  Very Rare  Rare    Medium  Frequent  Very Frequent  FD
   0%           <0.15%     <0.4%   <0.6%   <0.75%    <1%            1%
A  Yes          No

Table 4. Quantitative metrics used for Fuzzy Attribute sets (W = Weakness, D = Determinant Col. Cluster's Size, C = Col. Cluster's Size, O = Occurrence, A = Almost Symmetrical)

      Determinant  Weakness  Fd     Occurrence  Symmetrical  Universal
ID:7  small        low       true   rare        false        tiny
ID:8  small        medium    false  medium      false        tiny
ID:3  medium       very low  false  rare        false        tiny
ID:6  medium       none      false  very rare   false        small
ID:5  medium       none      false  very rare   false        small
ID:2  medium       none      false  very rare   false        small
ID:4  medium       none      false  very rare   false        tiny

Table 5. FCA Context of SDs

Example 4. We have created the FCA context shown in Table 5 by pairing the fuzzy and Boolean attributes with objects which are SDs in our database about Exceptions. The IDs in the table are the IDs of the SDs, see Table 1. From the context we built the concepts and generate the lattice from

14 DEPENDENCY MANAGEMENT AND INCONSYSTENCY PREVENTION 47 Level 3: C 68 = ({6, 5, 2, 4}, {W:None, D:Medium, O:Very rare}); C 69 = ({4}, {W:None, D:Medium, U:Tiny}); C 70 =( {6, 5, 2}, {W:None, D:Medium, U:Small}); C 91 = ({4}, {W:None, O:Very rare, U:Tiny}); C 154 = ({3}, {W:Very low, D:Medium, O:Rare}); C 155 = ({3}, {W:Very low, D:Medium, U:Tiny}); C 173 = ({3}, {W:Very low, O:Rare, U:Tiny});... Level 4: C 344 =( {4}, {W:None, D:Medium, O:Very rare, U:Tiny}); C 345 = ({6, 5, 2}, {W:None, D:Medium, O:Very rare, U:Small}); C 441 = ({3}, {W:Very low, D:Medium, O:Rare, U:Tiny}); C 484 = ({7}, {W:Low, D:Small, O:Rare, U:Tiny}); C 515 = ({8}, {W:Medium, D:Small, O:Medium, U:Tiny}) Table 6. Part of the generated concepts which we can read useful conclusions. Some of the generated concepts can be found in Table 6. For example at level 3 we have the concept (C 70, Objects = {6, 5, 2}, Attributes = {Weakness:None, Determinant:Medium, Universal:Small}); and from this concept we can read that maybe we should think more about this dependencies because the determinant cluster is medium and is paired with virtually no weakness and this might lead to useful conclusion about our exception. At level 4 we can see another concept containing the same object (C 345, Objects = {6, 5, 2}, Attributes = {Weakness:None, Determinant:Medium, Occurrence:Very rare, Universal:Small}); and by the extra information of the occurrence being very rare we can be assured that this might not be the highest priority for further analysis because the occurrence rate discouraged us from focusing on this three objects otherwise if we were not using FCA then we might have been allocating a lot of time to analyze this FDs. The human mind can comprehend object-attribute pairs but in the lattice generated with our FCA the higher the level is the more difficult is for the human mind to comprehend it without FCA analysis. 
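The reading of Table 6 can be checked mechanically against the context of Table 5. The following is a minimal sketch, not the DependencyManager implementation: it transcribes Table 5 and computes, for a given attribute set, the SDs that share it, and for a given set of SDs, the attributes they have in common. The function names `extent` and `intent` and the labels `F` (Fd) and `A` (Symmetrical) are assumptions made here for illustration; `W`, `D`, `O` and `U` follow the abbreviations of Table 6.

```python
# Context transcribed from Table 5: SD id -> one value per column.
# Column labels: D=Determinant, W=Weakness, F=Fd, O=Occurrence,
# A=Symmetrical, U=Universal (F and A are labels chosen for this sketch).
COLUMNS = ("D", "W", "F", "O", "A", "U")
ROWS = {
    7: ("small", "low", "true", "rare", "false", "tiny"),
    8: ("small", "medium", "false", "medium", "false", "tiny"),
    3: ("medium", "very low", "false", "rare", "false", "tiny"),
    6: ("medium", "none", "false", "very rare", "false", "small"),
    5: ("medium", "none", "false", "very rare", "false", "small"),
    2: ("medium", "none", "false", "very rare", "false", "small"),
    4: ("medium", "none", "false", "very rare", "false", "tiny"),
}

# Each SD is an object whose attributes are "column:value" strings.
CONTEXT = {
    sd: frozenset(f"{col}:{val}" for col, val in zip(COLUMNS, row))
    for sd, row in ROWS.items()
}


def extent(attributes):
    """Return the SD ids that have every attribute in `attributes`."""
    wanted = set(attributes)
    return {sd for sd, attrs in CONTEXT.items() if wanted <= attrs}


def intent(objects):
    """Return the attributes shared by every SD id in `objects`."""
    objects = list(objects)
    if not objects:
        # By convention the empty object set shares all attributes.
        return set().union(*CONTEXT.values())
    return set.intersection(*(set(CONTEXT[o]) for o in objects))


if __name__ == "__main__":
    # Attribute set of C_70 in Table 6.
    print(sorted(extent({"W:none", "D:medium", "U:small"})))  # [2, 5, 6]
    # Everything those three SDs have in common, including O:very rare.
    print(sorted(intent({2, 5, 6})))
```

Under this transcription, the three SDs of C_70 also share Occurrence: very rare, which is exactly the extra attribute that appears in C_345 and lowers their priority.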
Suppose a human observes that the determinant cluster is big and that sometimes the dependent column cluster is large as well. If this is paired with virtually no weakness, we might allocate a lot of time to learning more about the object set {6, 5, 2}; at the fourth level, however, we can see that these dependencies are very rare, which lowers the priority of further investigation. As we can see, FCA is relevant in this study, as it provides conceptual information as the result of SD-analysis with the attributes described in Section 5. To our knowledge, no previous work contains an FCA implementation where the input objects were SDs and the input attributes were properties of these dependencies.

6. Conclusions and Future Works

In this paper we have focused on preventing inconsistencies instead of fixing them. Our method prevents inconsistencies by learning CFD and AR patterns. If such a pattern is accepted (validated), then a constraint prevents inserts and updates that are inconsistent with the accepted pattern. DependencyManager implements a novel strategy for consistency protection: it prevents inconsistencies before they appear in the database, and it is scalable and easy to use. We have used FCA to analyze the SDs and draw useful conclusions; in this way the users of the database can gain a deeper understanding of the database schema. We have also introduced a novel use of FCA, namely analyzing dependencies, which are abstract database patterns.

In the future we intend to make this strategy even more useful by finding and analyzing cross-table SD-candidates. We also want to generalize the set of conditions by using more than one column in the boolean functions and by supporting more operator types (our current implementation considers equality as the only conditional operator). Automating the validation process would be useful; this feature will probably be based on user-defined rules which help the learner module determine automatically whether an SD-candidate is an AD or an SD. The system stores the errors originating from unsuccessful inserts and updates which were prevented by the constraints created for SDs; this generates knowledge which can be used to determine the cause of failure of inserts and updates. If the cause of failure is an incorrectly accepted pattern, the pattern should be dismissed; otherwise the incorrect user action or incorrect functionality can be detected. Known SDs can be used to increase application performance.
The reason is that these rules are enforced by constraints, so it is enough to load all possible combinations of values for the determinant column set, along with their corresponding values in the dependent column set, for records where the condition of the SD is met. From this knowledge base it is possible to determine the values of the dependent columns of other records. If there are too many possible combinations, the knowledge base can still be filled with only the most frequent or most recent combinations. This method of caching SDs can enhance performance, especially in applications where data is periodically loaded, such as applications responsible for the creation of backups.
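The caching idea above can be sketched as a small lookup structure. This is an illustrative sketch under stated assumptions, not the paper's implementation: the class name `SDCache`, its parameters, and the column names `country`, `zip` and `city` are hypothetical. The condition is a single equality test on one column, matching the restriction described for the current implementation, and `max_entries` keeps only the most frequent determinant combinations.

```python
from collections import Counter


class SDCache:
    """Cache determinant -> dependent values for one accepted SD (hypothetical sketch)."""

    def __init__(self, condition, determinant, dependent, max_entries=None):
        self.condition = condition            # (column, value): equality condition of the SD
        self.determinant = list(determinant)  # determinant column names
        self.dependent = list(dependent)      # dependent column names
        self.max_entries = max_entries        # None keeps every combination
        self._table = {}

    def _applies(self, record):
        column, value = self.condition
        return record.get(column) == value

    def load(self, records):
        """Fill the cache from records that meet the SD's condition."""
        counts = Counter()
        values = {}
        for record in records:
            if not self._applies(record):
                continue
            key = tuple(record[c] for c in self.determinant)
            counts[key] += 1
            # The SD is enforced by a constraint, so one key maps to one value tuple.
            values[key] = tuple(record[c] for c in self.dependent)
        kept = counts.most_common(self.max_entries)
        self._table = {key: values[key] for key, _ in kept}

    def lookup(self, record):
        """Return the dependent values for a record, or None on a cache miss."""
        if not self._applies(record):
            return None
        key = tuple(record[c] for c in self.determinant)
        hit = self._table.get(key)
        return None if hit is None else dict(zip(self.dependent, hit))


if __name__ == "__main__":
    rows = [
        {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
        {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
        {"country": "UK", "zip": "W1B", "city": "London"},
        {"country": "US", "zip": "10001", "city": "New York"},
    ]
    cache = SDCache(("country", "UK"), ["zip"], ["city"])
    cache.load(rows)
    print(cache.lookup({"country": "UK", "zip": "W1B"}))  # {'city': 'London'}
```

A miss returns `None` rather than raising, so a caller can fall back to querying the database when a combination was not loaded or was dropped by the size limit.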

Acknowledgment

We would like to thank Arpad Lajos for his advice and help during this research.

References

[1] J. Baixeries. A formal context for symmetric dependencies. In R. Medina and S. A. Obiedkov, editors, ICFCA, volume 4933 of LNCS, pages Springer,
[2] G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1): ,
[3] F. Chiang and R. J. Miller. Discovering data quality rules. PVLDB, 1(1): ,
[4] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. VLDB, pages ACM,
[5] G. Cormode, L. Golab, F. Korn, A. McGregor, D. Srivastava, and X. Zhang. Estimating the confidence of conditional functional dependencies. SIGMOD Conference, pages ACM,
[6] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst., 33(2), 2008a.
[7] W. Fan, F. Geerts, L. V. S. Lakshmanan, and M. Xiong. Discovering conditional functional dependencies. In ICDE, pages IEEE,
[8] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1): ,
[9] W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering conditional functional dependencies. IEEE Trans. Knowl. Data Eng., 23(5): ,
[10] B. Ganter and R. Wille. Formal concept analysis - mathematical foundations. Springer,
[11] L. Golab, H. J. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. PVLDB, 1(1): ,
[12] K. T. Janosi-Rancz, V. Varga, and T. Nagy. Detecting XML Functional Dependencies through Formal Concept Analysis. 14th ADBIS Conference, LNCS 6295, pp , 2010.
[13] S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. ICDT, vol. 361 of ACM ICPS, pages 53 62,
[14] S. Lopes, J.-M. Petit, and L. Lakhal. Functional and approximate dependency mining: database and fca points of view. J. Exp. Theor. Artif. Intell., 14(2-3):93 114,
[15] R. Medina and L. Nourine. Conditional functional dependencies: An fca point of view. In L. Kwuida and B. Sertkaya, editors, ICFCA, volume 5986 of LNCS, pages Springer,
[16] R. Medina and L. Nourine. A unified hierarchy for functional dependencies, conditional functional dependencies and association rules. In ICFCA, pages ,
[17] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5): ,

Sapientia Hungarian University of Transylvania, Tirgu Mures, Romania
E-mail address: tsuto@ms.sapientia.ro


More information

Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010

Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010 Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010 Student Name: ID: Part 1: Multiple-Choice Questions (17 questions, 1

More information

RELATIONAL DATABASE DESIGN

RELATIONAL DATABASE DESIGN RELATIONAL DATABASE DESIGN g Good database design - Avoiding anomalies g Functional Dependencies g Normalization & Decomposition Using Functional Dependencies g 1NF - Atomic Domains and First Normal Form

More information

Discovering (frequent) constant conditional functional dependencies

Discovering (frequent) constant conditional functional dependencies Int. J. Data Mining, Modelling and Management, Vol. 4, No. 3, 2012 205 Discovering (frequent) constant conditional functional dependencies Thierno Diallo Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205,

More information

Chapter 5: FUNCTIONAL DEPENDENCIES AND NORMALIZATION FOR RELATIONAL DATABASES

Chapter 5: FUNCTIONAL DEPENDENCIES AND NORMALIZATION FOR RELATIONAL DATABASES 1 Chapter 5: FUNCTIONAL DEPENDENCIES AND NORMALIZATION FOR RELATIONAL DATABASES INFORMAL DESIGN GUIDELINES FOR RELATION SCHEMAS We discuss four informal measures of quality for relation schema design in

More information

Efficient and Effective Duplicate Detection Evaluating Multiple Data using Genetic Algorithm

Efficient and Effective Duplicate Detection Evaluating Multiple Data using Genetic Algorithm Efficient and Effective Duplicate Detection Evaluating Multiple Data using Genetic Algorithm Dr.M.Mayilvaganan, M.Saipriyanka Associate Professor, Dept. of Computer Science, PSG College of Arts and Science,

More information

Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

More information

Mining various patterns in sequential data in an SQL-like manner *

Mining various patterns in sequential data in an SQL-like manner * Mining various patterns in sequential data in an SQL-like manner * Marek Wojciechowski Poznan University of Technology, Institute of Computing Science, ul. Piotrowo 3a, 60-965 Poznan, Poland Marek.Wojciechowski@cs.put.poznan.pl

More information

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database.

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database. Physical Design Physical Database Design (Defined): Process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

DYNAMIC QUERY FORMS WITH NoSQL

DYNAMIC QUERY FORMS WITH NoSQL IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 7, Jul 2014, 157-162 Impact Journals DYNAMIC QUERY FORMS WITH

More information

XML Data Integration

XML Data Integration XML Data Integration Lucja Kot Cornell University 11 November 2010 Lucja Kot (Cornell University) XML Data Integration 11 November 2010 1 / 42 Introduction Data Integration and Query Answering A data integration

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

The Entity-Relationship Model

The Entity-Relationship Model The Entity-Relationship Model 221 After completing this chapter, you should be able to explain the three phases of database design, Why are multiple phases useful? evaluate the significance of the Entity-Relationship

More information

Schema Refinement, Functional Dependencies, Normalization

Schema Refinement, Functional Dependencies, Normalization Schema Refinement, Functional Dependencies, Normalization MSCI 346: Database Systems Güneş Aluç, University of Waterloo Spring 2015 MSCI 346: Database Systems Chapter 19 1 / 42 Outline 1 Introduction Design

More information

Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction

Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction Regression Testing of Database Applications Bassel Daou, Ramzi A. Haraty, Nash at Mansour Lebanese American University P.O. Box 13-5053 Beirut, Lebanon Email: rharaty, nmansour@lau.edu.lb Keywords: Regression

More information

Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

More information

On Development of Fuzzy Relational Database Applications

On Development of Fuzzy Relational Database Applications On Development of Fuzzy Relational Database Applications Srdjan Skrbic Faculty of Science Trg Dositeja Obradovica 3 21000 Novi Sad Serbia shkrba@uns.ns.ac.yu Aleksandar Takači Faculty of Technology Bulevar

More information

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6 International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: editor@ijermt.org November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Chapter 15 Basics of Functional Dependencies and Normalization for Relational Databases

Chapter 15 Basics of Functional Dependencies and Normalization for Relational Databases Chapter 15 Basics of Functional Dependencies and Normalization for Relational Databases Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 15 Outline Informal Design Guidelines

More information

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive

More information

KEYWORD SEARCH IN RELATIONAL DATABASES

KEYWORD SEARCH IN RELATIONAL DATABASES KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to

More information

INTEROPERABILITY IN DATA WAREHOUSES

INTEROPERABILITY IN DATA WAREHOUSES INTEROPERABILITY IN DATA WAREHOUSES Riccardo Torlone Roma Tre University http://torlone.dia.uniroma3.it/ SYNONYMS Data warehouse integration DEFINITION The term refers to the ability of combining the content

More information

Data Quality Problems beyond Consistency and Deduplication

Data Quality Problems beyond Consistency and Deduplication Data Quality Problems beyond Consistency and Deduplication Wenfei Fan 1, Floris Geerts 2, Shuai Ma 3, Nan Tang 4, and Wenyuan Yu 1 1 University of Edinburgh, UK, wenfei@inf.ed.ac.uk,wenyuan.yu@ed.ac.uk

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume, Issue, March 201 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com An Efficient Approach

More information