AN EFFICIENT EXTENSION OF CONDITIONAL FUNCTIONAL DEPENDENCIES IN DISTRIBUTED DATABASES



Similar documents
A Fast and Efficient Method to Find the Conditional Functional Dependencies in Databases

M C. Foundations of Data Quality Management. Wenfei Fan University of Edinburgh. Floris Geerts University of Antwerp

CS 377 Database Systems. Database Design Theory and Normalization. Li Xiong Department of Mathematics and Computer Science Emory University

Improving Data Excellence Reliability and Precision of Functional Dependency

Conditional Functional Dependencies for Data Cleaning

Mining Constant Conditional Functional Dependencies for Improving Data Quality

DYNAMIC QUERY FORMS WITH NoSQL

Edinburgh Research Explorer

Addressing internal consistency with multidimensional conditional functional dependencies

Relational Database Design

Database Design and Normalization

Conditional-Functional-Dependencies Discovery

normalisation Goals: Suppose we have a db scheme: is it good? define precise notions of the qualities of a relational database scheme

Week 11: Normal Forms. Logical Database Design. Normal Forms and Normalization. Examples of Redundancy

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

Functional Dependencies and Finding a Minimal Cover

Database System Concepts

Chapter 1: Introduction

Big Data: Study in Structured and Unstructured Data

Schema Design and Normal Forms Sid Name Level Rating Wage Hours

Data Integration and Exchange. L. Libkin 1 Data Integration and Exchange

COSC344 Database Theory and Applications. Lecture 9 Normalisation. COSC344 Lecture 9 1

Chapter 7: Relational Database Design

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis

Chapter 15 Basics of Functional Dependencies and Normalization for Relational Databases

Databases -Normalization III. (N Spadaccini 2010 and W Liu 2012) Databases - Normalization III 1 / 31

SQL Tables, Keys, Views, Indexes

Big Data Cleaning. Nan Tang. Qatar Computing Research Institute, Doha, Qatar

Customer Classification And Prediction Based On Data Mining Technique

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

A Survey on Web Mining From Web Server Log

Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

How To Improve Performance In A Database

CS143 Notes: Normalization Theory

DATA SECURITY IN CLOUD USING ADVANCED SECURE DE-DUPLICATION

Enhancing A Software Testing Tool to Validate the Web Services

Data Quality Problems beyond Consistency and Deduplication

Relational Database Design: FD s & BCNF

Design of Relational Database Schemas

Database Migration over Network

Evaluation of Sub Query Performance in SQL Server

A Simplified Framework for Data Cleaning and Information Retrieval in Multiple Data Source Problems

Lesson 8: Introduction to Databases E-R Data Modeling

Determination of the normalization level of database schemas through equivalence classes of attributes

Foundations of Information Management

CMS Query Suite. CS4440 Project Proposal. Chris Baker Michael Cook Soumo Gorai

Data Quality in Information Integration and Business Intelligence

Theory of Relational Database Design and Normalization

Introduction to Databases, Fall 2005 IT University of Copenhagen. Lecture 5: Normalization II; Database design case studies. September 26, 2005

Limitations of E-R Designs. Relational Normalization Theory. Redundancy and Other Problems. Redundancy. Anomalies. Example

Database Design and Normal Forms

Dependencies Revisited for Improving Data Quality

A SURVEY ON WEB MINING TOOLS

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Exploitation of Server Log Files of User Behavior in Order to Inform Administrator

æ A collection of interrelated and persistent data èusually referred to as the database èdbèè.

Android based Secured Vehicle Key Finder System

Minimize Response Time Using Distance Based Load Balancer Selection Scheme

Normal Form vs. Non-First Normal Form

International Journal of Advanced Computer Technology (IJACT) ISSN: PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS

Methods for Firewall Policy Detection and Prevention

Incremental Violation Detection of Inconsistencies in Distributed Cloud Data

Bitmap Index an Efficient Approach to Improve Performance of Data Warehouse Queries

Chapter 10. Functional Dependencies and Normalization for Relational Databases

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Active database systems. Triggers. Triggers. Active database systems.

FINANCING OF WORKING CAPITAL IN SELECT CEMENT COMPANIES OF ANDHRA PRADESH

Introduction Decomposition Simple Synthesis Bernstein Synthesis and Beyond. 6. Normalization. Stéphane Bressan. January 28, 2015

CREATING MINIMIZED DATA SETS BY USING HORIZONTAL AGGREGATIONS IN SQL FOR DATA MINING ANALYSIS

RELATIONAL DATABASE DESIGN

Development of Software Dispatcher Based. for Heterogeneous. Cluster Based Web Systems

Operating Stoop for Efficient Parallel Data Processing In Cloud

Schema Refinement and Normalization

Real-Time Analysis of CDN in an Academic Institute: A Simulation Study

Physical Database Design and Tuning

Distributed Framework for Data Mining As a Service on Private Cloud

IT2305 Database Systems I (Compulsory)

CS377: Database Systems Data Security and Privacy. Li Xiong Department of Mathematics and Computer Science Emory University

Optimization of ETL Work Flow in Data Warehouse

Clean Answers over Dirty Databases: A Probabilistic Approach

1. Physical Database Design in Relational Databases (1)

Database Constraints and Design

Evaluation of Open Source Data Cleaning Tools: Open Refine and Data Wrangler

FINDING, MANAGING AND ENFORCING CFDS AND ARS VIA A SEMI-AUTOMATIC LEARNING STRATEGY

Design and Implementation of Firewall Policy Advisor Tools

Chapter 10 Functional Dependencies and Normalization for Relational Databases

Transcription:

IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 3, Mar 2014, 17-24 Impact Journals AN EFFICIENT EXTENSION OF CONDITIONAL FUNCTIONAL DEPENDENCIES IN DISTRIBUTED DATABASES SUCHITRA REYYA 1, M. PRAMEELA 2, G. VASAVI YADAV 3, K. SANDHYA RANI 4 & A. V. BHARGAVI 5 1 Assistant Professor, Department of Computer Science & Engineering, Lendi Institute of Engineering & Technology, Andhra Pradesh, India 2,3,4,5 Department of Computer Science & Engineering, Lendi Institute of Engineering & Technology, Andhra Pradesh, India ABSTRACT This paper propose an efficient data cleaning by using extended conditional functional dependencies (ecfd s) which is an extension of conditional functional dependencies (CFD s) to reduce the inconsistency of data, efficient conditional functional dependencies intend to solve the multi valued inconsistencies to trounce drawbacks of CFD s which use pattern tableau to hold individual tuples in a table for cleaning relational data by supporting only single value attributes. One of the central problems for data quality is inconsistency detection. Given a database D and a set of dependencies as data quality rules, we want to identify tuples in D that violate some rules in. When D is a centralized database, there have been effective ecfd s techniques for finding violations. It is however; far more challenging when data in D is distributed in which inconsistency detection often necessarily requires shipping data from one system to another. KEYWORDS: CFD s, ecfd s, Nested Relational Database, Distributed Database INTRODUCTION Data cleaning is also called as data scrubbing, which deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. Thus data cleaning plays a major role of preprocessing. One way of achieving this is by using conditional functional dependencies (CFD s). CFD s were developed which aim at capturing violations for individual tuples by using the concept of pattern table 1. When these pattern tables are implied on the entire database the tuples which violate the pattern tables are detected and hence can be easily corrected. CFD s can hold only single value attributes but not for multi valued attributes. To overcome this problem we are using ecfd s for improving data quality which are extended from CFD s to work over the problem of multi valued attributes by using nested relational database[5][6]. Figure 1 Impact Factor(JCC): 1.3268 - This article can be downloaded from www.impactjournals.us

18 Suchitra Reyya, M. Prameela, G. Vasavi Yadav, K. Sandhya Rani & A. V. Bhargavi This paper develops techniques for detecting violations of CFD s in relations that are fragmented and distributed across different systems. Distributed database is nothing but a collection of several different databases that look like a single database to the user. Data in distributed database system is stored across several systems, and each system is typically managed by DBMS that can run independent of each other systems, and ecfd s are also used for detecting violations in multi valued inconsistencies. Here we take the centralized system where we enter the ecfd s data which eliminates the redundant data, where it stores that data in the centralized system and it distributes the data to Centralized Database (D1) and Centralized Database (D2) here if the D1 is cleaned and updated D2 and centralized system also will be cleaned and updated. And if D2 is cleaned and updated both the D1 and the centralized system will also be cleaned and updated automatically. RELATED WORK Our project is worked on the Effective Conditional Functional Dependencies (ECFD s) based on the Conditional functional dependencies (CFD s). Relations that contain redundant information may potentially endure from update anomalies. Functional dependencies are used to detect the redundancies by enforcing integrity constraints at schema level [1]. The constraints hold for all the tupelos in the table. The main idea behind FD is that the value of a particular attribute uniquely determines the value of some other attribute [8] A relation A->B if for every valid instance of A, that value of A uniquely determines the value of B. Data quality is recognized as one of the most important problems for data management [1]. A central technical problem for data quality concerns inconsistency detection, to identify errors in data. More specifically, given a database D and a set of dependencies serving as data quality rules, the detection problem is to find all the violations of in D, i.e., all the tuples in D that violate some rules in. For a data quality tool to be effective in practice, it is Definition of Conditional Functional Dependencies Conditional Functional D were recently introduced for data cleaning. They extend standard FD s by enforcing patterns of semantically related constants. CFD s have been more effective than FD s in detecting and repairing inconsistent data. A CFD ϕ over R is a pair (X->A, tp), where x is a set of attributes in attribute(r), and A is a single attribute in attribute(r), X -> A is a standard functional dependency, referred to as the functional dependency embedded in ϕ and tp is a pattern tuple with attributes in X and A, where for each B in X U{A}, tuple [B] is either a constant a in domain(b), or an unnamed variable - that draws values from domain(b). By using this CFD we can remove the unwanted data, it cannot support the multi valued inconsistencies so we can use the ecfd s. Definition of Efficient Conditional Functional Dependency Let R be some relational schema given as (R:X Y, Tp, Tm) where (1) X, Y, Tp, Tm attr(r); (2) X Y is a Embedded FD; (3) Tp is a pattern tableau consisting of finite number of pattern tuples; (4) Tm is a multi tableau holding multiple values of Y for each X attribute. Conditional functional dependencies were recently introduced for data cleaning they extend standard functional dependencies by enforcing patterns of semantically related constants CFD s is used to detecting inconsistencies of data. Index Copernicus Value: 3.0 - Articles can be sent to editor@impactjournals.us

An Efficient Extension of Conditional Functional Dependencies in Distributed Databases 19 ELIMINATE THE UNWANTED DATA IN DISTRIBUTED DATABASE USING NESTED RELATIONAL DATABASE This paper proposes ecfd s which hold for multi value data. This is achieved using nested Database. The data after corrected will be distributed to different systems. Then overall design of the system is given below Figure 2: A Model for Achieving ecfd s in Distributed Database The model architecture describes the step by step flow of process carried to implement ecfd s. In the first step the data in the database is compared with table Tm to check for multi value inconsistencies. In the second step if the inconsistencies are found then it compares each tuple with matching fields from the database and table Tm with pattern table Tp. In third step the incorrect tuple is updated with matching value from multi tableau and saved in database. In step 4 corrected the data is stored in the Centralized Database and it can be send to the D1 &D2. In step 5 the data is cleaned and updated in D1 and automatically cleaned and updated in D2&Centralized system and vice versa. Example Let us consider a schema cust (id_num, Country code, Area code, Phone number, customer name, Area, city, pin). In the below table we can observe that the city NYC and Vizag has different pin Φ1: City {NYC} ϕ with pin {07974, 01202, 01204}. Φ2: City {VIZAG} ϕ with AC {530013, 530014}. Table 1: Customer Database (CD) Id_Num Country Area Phone Customer Code Code Number Name Area City Pin 1 1 908 111111 Mike Tree ave Nyc 07974 2 1 908 111111 Nick Tree ave Nyc 07974 3 1 212 222222 Joe Elm str Nyc 01202 4 1 212 222222 Jim 5th ave Nyc 01205 5 5 0891 2535015 Ben Eenadu Vizag 530013 6 91 0891 2535015 Teena Arilova Vizag 530014 7 1 215 9885430122 Martin Siripuri Vzm 02349 8 1 216 9185430122 Meena Oka ave Hyd 832108 9 91 0891 2535015 Davidson Troy Vizag 530015 Impact Factor(JCC): 1.3268 - This article can be downloaded from www.impactjournals.us

20 Suchitra Reyya, M. Prameela, G. Vasavi Yadav, K. Sandhya Rani & A. V. Bhargavi Hence in ϕ1 if the city is NYC then the pin must be one of the following values. Similarly in Φ2 if the is VIZAG the pin must be one of the above given values. In the above example tuple t4 contains City as NYC but pin is 01205 which is not one of the pin associated with NYC and tuple t9 has City as Vizag but pin is 530015. This problem is not overcome in CFD s as it does not hold multi-valued attributes. This multi-valued attribute problem can be solved with by less complexity by taking two tables Tm & Tp. Table Tm: Tm is a multi tableau holding all the tuples consisting of multi- value of Y for a single attribute X. The Tm table for the given customer (cust) instance is shown below. Table 2: Multi Tabelau-Tm City Pin Nyc {07974,01202,01204} Vizag {530013,530014} Now to correct the pin in tuple t4 appropriate pin must be selected from multiple values in table Tm. So we use the pattern table Tp which consists of fields, Area and pin to select the pin that match the fields in the tuple. The pattern tableau for the cust relation is shown below From cust relation we observe that in the tuple t4, area 5 th ave and city is NYC which uniquely determines pin as 07974. Thus now the violation showed in tuple t4 can be eliminated by correcting it with pin as 07974 using these two tables. METHODOLOGY TO REDUCE REDUNDANCY Table 3: Pattern Tableau-Tp Area Pin Tree ave 07974 Elm str 01202 5 th ave 07974 Eanadu 530013 Arilova 530014 Troy 530013 Oka ave 832108 We introduce a query for solving multi value inconsistencies in a simplified way with less complexity. This query shows the violate tuples by considering customer relation table(c) and multi tableau(tm) where the city s area code (AC) does not match with the disjunction of options present in the multi tableau Tm. Select All From cust C, table T Where (Tm.City @ ) And Tm.pin [i.in] <> @ ) And (C.City=Tm.City And C.pin<>Tm.pin [i.in]); Once the violations are displayed the pin must be corrected with the appropriate one from the multiple pins. Index Copernicus Value: 3.0 - Articles can be sent to editor@impactjournals.us

An Efficient Extension of Conditional Functional Dependencies in Distributed Databases 21 This below query is used to update the pin by considering customer relation table (C), multi tableau(tm) and pattern tableau (Tp) where the pin number of Tm matches with that of Tp and the street in customer relation matches in that of Tp. Update Cust set C.pin=(select Tp.pin From table2 Tp Algorithm: Eliminating inconsistent tuples in multi-valued attribute in distributed database using nested relational database Input: Inconsistent multi-valued database (CD) Output: Clean the database (CCD) and it is distributed to the different systems. Method: Solving multi-valued inconsistencies in distributed database using Nested Relational database. Begin Consider a table CD with different fields. Take a new table multi tableau(tm) having multituple values for attribute X i.e. X[i.in]. For each tuple in table CD having attr X, compare it with X[i.in] of table Tm. If comparison successful then go to step15. Else Display tuples not satisfying step 4 End for Consider a pattern tableau Tp having prefix constants and unnamed variables For updating the inconsistent values shown in step 7, for each tuple in table CD compare Tm.X[i..in] with Tp.X and CD.Y with Tp.Y respectively. If comparision successful then Update attr X with tp.x End if End for Original data base(cd) is updated as clean Database(CCD) X attr stored in centralized system (CS). This attr are distributed to the CD1and CD2. Impact Factor(JCC): 1.3268 - This article can be downloaded from www.impactjournals.us

22 Suchitra Reyya, M. Prameela, G. Vasavi Yadav, K. Sandhya Rani & A. V. Bhargavi The data will be cleaned and updated in CS automatically cleaned and updated in CD1 and CD2. END Now after applying the above algorithm on Table 1, the consistencies in the table will be replaced by the accurate data, thereby making the database clean and error free. The below is the clean database (Table 4) obtained as a result of the algorithm. The pin in tuple t4 is replaced with 07974 and similarly tuple t9 with erroneous pin is replaced with appropriate pin as 530013. and vice versa vice versa. Table 4: Centralized Database (CD) Id_Num Country Area Phone Customer Code Code Number Name Area City Pin 1 1 908 111111 Mike Tree ave Nyc 07974 2 1 908 111111 Nick Tree ave Nyc 07974 3 1 212 222222 Joe Elm str Nyc 01202 4 1 212 222222 Jim 5th ave Nyc 07974 5 5 0891 2535015 Ben Eanadu Vizag 530013 6 91 0891 2535015 Teena Arilova Vizag 530014 7 1 215 9885430122 Martin Siripuri Vzm 02349 8 1 216 9185430122 Meena Oka ave Hyd 832108 9 91 0891 2535015 Davidson Troy Vizag 530013 Here the D1 is cleaned and update the data and automatically cleaned and update the data on centralized database Table 5: Centralized Database (D1) Id_Num Country Area Phone Customer Code Code Number Name Area City Pin 1 1 908 111111 Mike Tree ave Nyc 07974 2 1 908 111111 Nick Tree ave Nyc 07974 3 1 212 222222 Joe Elm str Nyc 01202 4 1 212 222222 Jim 5th ave Nyc 07974 Table 6: Centralized Database (D2) Id_Num Country Area Phone Customer Code Code Number Name Area City Pin 1 5 0891 2535015 Ben Eanadu Vizag 530013 2 91 0891 2535015 Teena Arilova Vizag 530014 3 1 215 9885430122 Martin Siripuri Vzm 02349 4 1 216 9185430122 Meena Oka ave Hyd 832108 5 91 0891 2535015 Davidson Troy Vizag 530013 Here D2 will be cleaned and updated then automatically cleaned and update the data on centralized system and EXPERIMENTAL ANALYSIS In this section, we present our findings about the performance of our models for reducing redundancies over multi-value databases. of RAM. Setup: For the experiments, we used postgre SQL on Windows XP with 1.86 GHz Power PC dual CPU with 3GB Index Copernicus Value: 3.0 - Articles can be sent to editor@impactjournals.us

An Efficient Extension of Conditional Functional Dependencies in Distributed Databases 23 Efficiency of ecfd s: The efficiency of the ecfd s is graphically displayed by considering the multi-value database. When CFD s are posted on this database all tuples having multiple values for each attribute are changed to only one value and obtains duplicate data. But in ecfd s each attribute can hold multiple data thereby preserving the data. Figure 3: Graph Displaying the Efficiency of ecfd s In this experiment we have analyzed how the redundancy is reduced by considering the space occupied by the pattern table (TP) and multi tableau (Tm) in CFD s and ecfd s respectively. It is observed Tp can hold only one value for each fixed pattern where as Tm can efficiently hold multiple values for each attribute in the pattern thus reducing tuple size. CONCLUSIONS Table 7: Comparison of Size of Tuples in Tp (CFD s) & Tm (ecfd s) No. of Multiple Values Tp Tm Pattern 1 1 2 Pattern 2 1 3 Pattern 3 1 4 We presented that nesting based ecfd s proved to be better than CFD s is eliminating redundancy in nested relational database. ECFDs acquire no extra complexity in inconsistency detection. We have developed ecfd s based technique for avoiding violations. Our experimental results for efficiency and size of the relation proved ecfd s to be more effective compared to CFD s. Finally we conclude that in which we can remove unwanted data by using ecfd s and Distributed to different sites by using Distributed Database. Updates are performed on the centralized database D1, D2 In the centralized database D1 the data will be cleaned and updated automatically the centralized database D2 will be cleaned and vice versa. ACKNOWLEDGEMENTS We express our sincere and profound gratitude to LENDI institute management members of our Chairman Sri P. Madhusudhana Rao, Vice-Chairman Sri P. Srinivasa Rao and Secretary Sri K. Siva Rama Krishna for their valuable support and providing good infrastructure. Our special thanks to Principal Dr. V. V. Rama Reddy for his motivation and encouragement. We also thank our Head of the Department Professor A. Rama Rao for giving guidance and continuous support. Impact Factor(JCC): 1.3268 - This article can be downloaded from www.impactjournals.us

24 Suchitra Reyya, M. Prameela, G. Vasavi Yadav, K. Sandhya Rani & A. V. Bhargavi REFERENCES 1. Philip Bohannon, Wenfei Fan, Floris Geerts and Xibei Jia3 Anastasios Kementsietsidis3: Conditional Functional Dependencies for Data Cleaning. 2. Wenfei Fan, Floris Geerts, Laks V.S. Lakshmanan and Ming Xiong: Discovering Conditional Functional Dependencies, IEEE conference on data engineering. 3. Suchitra Reyya, Sangeeta Viswanadham: Eliminating Data Redundancy through Weak Multi valued Dependencies using Nesting. 4. Suchitra Reyya, M. Sundara Babu, Ratnam Dodda: An Efficient Storage of Spatial Database using Nested Relational Database, International Journal of Engineering Research & Technology (IJERT) Vol. 1 Issue 7, September 2012. 5. Erhard Rahm, Hong Hai Do: Data Cleaning Problems and Current Approaches, IEEE Techn. Bulletetin on Data Engineering, Dec. (2000). 6. Korth, H., Roth, M.: Query languages for Nested Relational Databases, Nested Relations and Complex Objects in Databases, Springer, (1991). Index Copernicus Value: 3.0 - Articles can be sent to editor@impactjournals.us