Implementing evolution: Database migration



Similar documents
IT2305 Database Systems I (Compulsory)

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

The Relational Model. Why Study the Relational Model? Relational Database: Definitions. Chapter 3

DATABASE MANAGEMENT SYSTEMS. Question Bank:

IT2304: Database Systems 1 (DBS 1)

Database Design. Marta Jakubowska-Sobczak IT/ADC based on slides prepared by Paula Figueiredo, IT/DB

The Relational Model. Why Study the Relational Model? Relational Database: Definitions

Evolving Hybrid Distributed Databases: Architecture and Methodology

Maintaining Stored Procedures in Database Application

HP Quality Center. Upgrade Preparation Guide

The Relational Model. Ramakrishnan&Gehrke, Chapter 3 CS4320 1

Fundamentals of Database System

Fundamentals of Database Design

Bridge from Entity Relationship modeling to creating SQL databases, tables, & relations

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. The Relational Model. The relational model

DBMS Questions. 3.) For which two constraints are indexes created when the constraint is added?

CS 377 Database Systems. Database Design Theory and Normalization. Li Xiong Department of Mathematics and Computer Science Emory University

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

Databases What the Specification Says

Migration of Legacy Information Systems

Chapter 1: Introduction

Database Concepts. Database & Database Management System. Application examples. Application examples

THE ENTITY- RELATIONSHIP (ER) MODEL CHAPTER 7 (6/E) CHAPTER 3 (5/E)

Review: Participation Constraints


Requirement Analysis & Conceptual Database Design. Problem analysis Entity Relationship notation Integrity constraints Generalization

LiTH, Tekniska högskolan vid Linköpings universitet 1(7) IDA, Institutionen för datavetenskap Juha Takkinen

Database Management. Chapter Objectives

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. Introduction to Databases. Why databases? Why not use XML?

2. Conceptual Modeling using the Entity-Relationship Model

Lecture 6. SQL, Logical DB Design

æ A collection of interrelated and persistent data èusually referred to as the database èdbèè.

Conceptual Design Using the Entity-Relationship (ER) Model

Database Design Overview. Conceptual Design ER Model. Entities and Entity Sets. Entity Set Representation. Keys

Outline. Data Modeling. Conceptual Design. ER Model Basics: Entities. ER Model Basics: Relationships. Ternary Relationships. Yanlei Diao UMass Amherst

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file?

Databases Model the Real World. The Entity- Relationship Model. Conceptual Design. Steps in Database Design. ER Model Basics. ER Model Basics (Contd.

COMP 378 Database Systems Notes for Chapter 7 of Database System Concepts Database Design and the Entity-Relationship Model

Relational Database Basics Review

Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms

ECS 165A: Introduction to Database Systems

Lesson 8: Introduction to Databases E-R Data Modeling

The Entity-Relationship Model

Overview RDBMS-ORDBMS- OODBMS

Normal Form vs. Non-First Normal Form

Data Quality in Information Integration and Business Intelligence

not necessarily strictly sequential feedback loops exist, i.e. may need to revisit earlier stages during a later stage

Introduction to Triggers using SQL

1 File Processing Systems

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals

SQL Server. 1. What is RDBMS?

OBJECTS AND DATABASES. CS121: Introduction to Relational Database Systems Fall 2015 Lecture 21

Oracle Database: SQL and PL/SQL Fundamentals

Chapter 5. SQL: Queries, Constraints, Triggers

Extending Data Processing Capabilities of Relational Database Management Systems.

DATABASE INTRODUCTION

- Eliminating redundant data - Ensuring data dependencies makes sense. ie:- data is stored logically

Logical Schema Design: The Relational Data Model

Week 1 Part 1: An Introduction to Database Systems. Databases and DBMSs. Why Use a DBMS? Why Study Databases??

Chapter 7 Data Modeling Using the Entity- Relationship (ER) Model

Core Syllabus. Version 2.6 B BUILD KNOWLEDGE AREA: DEVELOPMENT AND IMPLEMENTATION OF INFORMATION SYSTEMS. June 2006

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Relational model. Relational model - practice. Relational Database Definitions 9/27/11. Relational model. Relational Database: Terminology

Duration Vendor Audience 5 Days Oracle End Users, Developers, Technical Consultants and Support Staff

SQL DATA DEFINITION: KEY CONSTRAINTS. CS121: Introduction to Relational Database Systems Fall 2015 Lecture 7

Course: CSC 222 Database Design and Management I (3 credits Compulsory)

Data Modeling Basics

What is a database? COSC 304 Introduction to Database Systems. Database Introduction. Example Problem. Databases in the Real-World

SEMI AUTOMATIC DATA CLEANING FROM MULTISOURCES BASED ON SEMANTIC HETEROGENOUS

2. Basic Relational Data Model

Foreign and Primary Keys in RDM Embedded SQL: Efficiently Implemented Using the Network Model

ISM 318: Database Systems. Objectives. Database. Dr. Hamid R. Nemati

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

C HAPTER 4 INTRODUCTION. Relational Databases FILE VS. DATABASES FILE VS. DATABASES

The Relational Model. Why Study the Relational Model?

Principles of Database. Management: Summary

Database Design and Normalization

Chapter 5: Logical Database Design and the Relational Model Part 2: Normalization. Introduction to Normalization. Normal Forms.

Designing a Database Schema

Database Systems. National Chiao Tung University Chun-Jen Tsai 05/30/2012

Part A: Data Definition Language (DDL) Schema and Catalog CREAT TABLE. Referential Triggered Actions. CSC 742 Database Management Systems

Database IST400/600. Jian Qin. A collection of data? A computer system? Everything you collected for your group project?

Data Modeling. Database Systems: The Complete Book Ch ,

The Import & Export of Data from a Database

BCA. Database Management System

Foundations of Information Management

DBMS / Business Intelligence, SQL Server

U III 5. networks & operating system o Several competing DOC standards OMG s CORBA, OpenDoc & Microsoft s ActiveX / DCOM. Object request broker (ORB)

Chapter 9: Object-Based Databases

Some Methodological Clues for Defining a Unified Enterprise Modelling Language

DATABASE SYSTEMS. Chapter 7 Normalisation

Normalization in OODB Design

COMP5138 Relational Database Management Systems. Databases are Everywhere!

Data Integration and ETL Process

CIS 631 Database Management Systems Sample Final Exam

MS SQL Performance (Tuning) Best Practices:

3. Relational Model and Relational Algebra

Information Systems Modelling Information Systems I: Databases

COMP 5138 Relational Database Management Systems. Week 5 : Basic SQL. Today s Agenda. Overview. Basic SQL Queries. Joins Queries

Transcription:

2IS55 Software Evolution Sources Implementing evolution: Database migration Alexander Serebrenik / SET / W&I 7-6-2010 PAGE 1 Last week Recapitulation Assignment 8 How is it going? Questions to Marcel: m.f.v.amstel@tue.nl Last week: introduction to DB migration DB migration Data is important, technology is outdated S DB schema D DB data P data manipulation programs Physical conversion / conversion Wrap / Statement rewriting / Logical rewriting / SET / W&I 7-6-2010 PAGE 2 / SET / W&I 7-6-2010 PAGE 3 Schema conversion: Physical vs Schema conversion: Physical Physical Easy to automate Existing work: COBOL relational, hierarchical relational, relational OO Migration as translation vs migration as improvement Semantics is ignored Limitations of COBOL Design decisions in the legacy system Automatic conversion the same design decisions in the new system Risk: compromised flexibility / SET / W&I 7-6-2010 PAGE 4 / SET / W&I 7-6-2010 PAGE 5 1

Schema conversion: Physical vs Schema conceptualization Refinement: Data and code may contain implicit constraints (field refinement, foreign key, cardinality) on the schema ization: Remove implementation details / SET / W&I 7-6-2010 PAGE 6 / SET / W&I 7-6-2010 PAGE 7 ization Untranslation: Foreign keys Preparation: clean up to understand e.g., rename attributes, drop one-element compounds Untranslation: separate logic from limitations of technology De-optimization: separate logic from performance normalization: Entities vs. relations and attributes Explicit IS-A relations COBOL allows direct access via foreign keys ER requires a relationship set to connect two entities What would be the appropriate cardinality? One customer can place multiple orders Every order can be placed only by one customer / SET / W&I 7-6-2010 PAGE 8 / SET / W&I 7-6-2010 PAGE 9 De-optimization normalization Recall: ORD-DETAIL is a complex multi-valued attribute Highly efficient COBOL trick ORD-DETAIL cannot exist without an order What would you like to improve in this schema? Are the cardinality constraints meaningful? Which entities are, in fact, relations? Are there unneeded structures? How would you model this in ER? Weak entity set One-to-many relationship / SET / W&I 7-6-2010 PAGE 10 / SET / W&I 7-6-2010 PAGE 11 2

normalization Logical design: schema concepts DB tables Physical design: e.g., naming conventions / SET / W&I 7-6-2010 PAGE 12 / SET / W&I 7-6-2010 PAGE 13 Hainaut 2009: Before and After Another case study (Ch. 6) Refined schema: decomposed attributes Address = Street, Number, City, ZIP, State Schema refinement: 89 foreign keys, 37 computer foreign keys, 60 redundancies Relational DB2 entities: decomposition of arrays / SET / W&I 7-6-2010 PAGE 14 / SET / W&I 7-6-2010 PAGE 15 Recall So far we have considered DB schemas only Next step: data migration Data migration Strategy depends on the schema migration strategy Physical conversion: straightforward Data format conversion conversion Data may violate implicit constraints Hence, data cleaning is required as preprocessing Once the data has been cleaned up: akin to physical conversion / SET / W&I 7-6-2010 PAGE 16 / SET / W&I 7-6-2010 PAGE 17 3

What should be cleaned? 1 source [Rahm, Do] Schema-level Can be solved with appropriate integrity constraints What should be cleaned? Multiple sources Which DB tuples refer to the same real-world entity? Instance-level Scheme: name and structure conflicts Instance: data representation, duplication, identifiers / SET / W&I 7-6-2010 PAGE 18 / SET / W&I 7-6-2010 PAGE 19 How to clean up data? Data cleaning: Analysis Analyse: Define inconsistencies and detect them Define individual transformations and the workflow Verify correctness and effectiveness Sample/copy of the data Transform Backflow if needed If the old data still will be used, it can benefit from the improvements. Data profiling Instance analysis of individual attributes Min, max, distribution, cardinality, uniqueness, null values max(age) > 150? count(gender) > 2? Data mining Instance analysis of relations between the attributes E.g., detect association rules Confidence(A B) = 99% 1% of the cases might require cleaning / SET / W&I 7-6-2010 PAGE 20 / SET / W&I 7-6-2010 PAGE 21 Data cleaning: Analysis (continued) Define data transformations Record matching problem: Smith Kris L., Smith Kristen L., Smith Christian, Matching based on Simplest case: unique identifiers (primary keys) Approximate matching Different weights for different attributes Strings: Edit distance Keyboard distance Phonetic similarity Very expensive for large data sets / SET / W&I 7-6-2010 PAGE 22 Use transformation languages Proprietary (e.g., DataTransformationService of Microsoft) SQL extended with user-defined functions (UDF): CREATE VIEW Customer2(LName, FName, Street, CID) AS SELECT LastNameExtract(Name), FirstNameExtract(Name), Street, CID) FROM Customer CREATE FUNCTION LastNameExtract(Name VARCHAR(255)) RETURNS VARCHAR(255) RETURN SUBSTRING(Name FROM 28 FOR 15) / SET / W&I 7-6-2010 PAGE 23 4

UDF: advantages and disadvantages Inconsistency resolution Advantages Does not require learning a separate language Disadvantages Suited only for information already in a DB What about COBOL files? Ease of programming depends on availability of specific functions in the chosen SQL dialect Splitting/merging are supported but have to be reimplemented for every separate field Folding/unfolding of complex attributes not supported at all. If inconsistency has been detected, the offending instances Are removed Are modified so the offending data becomes NULL Are modified by following user-defined preferences One table might be more reliable than the other One attribute may be more reliable than the other Are modified to reduce the (total) number of modifications required to restore consistency / SET / W&I 7-6-2010 PAGE 24 / SET / W&I 7-6-2010 PAGE 25 Inconsistency resolution From data to programs If multiple inconsistencies are detected Which one has to be selected to be resolved? Usually resolutions are not independent [Wijsen 2006]: Single database instead of multiple uprepairs Special kind of updates Which one is more important? So far: schemas and data Next : programs Wrapping Statement rewriting Program rewriting / SET / W&I 7-6-2010 PAGE 26 / SET / W&I 7-6-2010 PAGE 27 Wrappers Wrappers with wrapped operations Legacy code? Legacy code Wrapper Actual implementation of READ Legacy data representation New data representation Start wrapping action READ / SET / W&I 7-6-2010 PAGE 28 / SET / W&I 7-6-2010 PAGE 29 5

Wrappers [Thiran, Hainaut]: wrapper code can be reused Cannot be expressed in the DB itself Upper wrapper Model wrapper Common to all DMS in the family: cursor, transaction Instance wrapper Specific to the given DB: query translation, access optimization Manually written Automatically generated Wrapping: Pro and Contra Wrapping Preserves logic of the legacy system Can be (partially) automated Physical + wrapper: Almost automatic (cheap and fast) Quality is poor, unless the legacy DB is well-structured + wrapper: More complex/expensive Quality is reasonable: First schema, then code Possible performance penalty due to complexity of wrappers Mismatch: DB-like schema and COBOL like code / SET / W&I 7-6-2010 PAGE 30 / SET / W&I 7-6-2010 PAGE 31 Wrapping in practice Cursor?.. Control structure for the successive traversal of records in a query result Cursor declaration Wrappers 159 wrappers 450 KLOC What will this cursor return? O_CUST = J12 CUS_CODE J11 12 J12 11 J13 14 K01 15 CODE Why would you like to use such a cursor? COBOL READ: Sequential reading starting from the first tuple with the given key / SET / W&I 7-6-2010 PAGE 32 / SET / W&I 7-6-2010 PAGE 33 Cursor?.. Control structure for the successive traversal of records in a query result Cursor declaration Legacy code? Legacy code Using cursors Opening a cursor Retrieving data Closing cursor Legacy data representation New data representation / SET / W&I 7-6-2010 PAGE 34 / SET / W&I 7-6-2010 PAGE 35 6

O-CUST does not appear in ORDERS / SET / W&I 7-6-2010 PAGE 36 / SET / W&I 7-6-2010 PAGE 37 Files can have multiple keys and multiple READ commands We need to remember which key/read is used! Prepare the cursor for READing READ the data / SET / W&I 7-6-2010 PAGE 38 / SET / W&I 7-6-2010 PAGE 39 We need additional cursor and procedure to read the order details: Legacy DB New DB : Pro and Contra Preserves logic of the legacy system Intertwines legacy code with new access techniques Detrimental for maintainability Physical + statement Inexpensive and popular Blows up the program: from 390 to ~1000 LOC Worst strategy possible + statement Good quality DB, unreadable code: First schema, then code Meaningful if the application will be rewritten on the short term / SET / W&I 7-6-2010 PAGE 40 / SET / W&I 7-6-2010 PAGE 41 7

Alternative 3: Logic Rewriting Alternative 3: Logic Rewriting Akin to conceptual conversion e.g., COBOL loop SQL join And meaningful only in combination with it Otherwise: high effort with poor results Manual transformation with automatic support Identify file access statements Identify and understand data and statements that depend on these statements Rewrite these statements and redefine the objects Legacy DB New DB / SET / W&I 7-6-2010 PAGE 42 / SET / W&I 7-6-2010 PAGE 43 Logic rewriting: Pro and Contra Logic rewriting + physical Low quality DB High costs due to logic rewriting Unfeasible Logic rewriting + conceptual High quality Highest costs Conc. Putting it all together Scheme All combinations are possible Not all are desirable, Wrapping, Statement, Logic Phys. Physical, Wrapping Physical, Statement Physical, Logic Wrapping Statement Logic Code / SET / W&I 7-6-2010 PAGE 44 / SET / W&I 7-6-2010 PAGE 45 Conc. Phys. Putting it all together Scheme All combinations are possible Not all are desirable Better DB, performance penalty Zero time Wrapping Better DB, the application is rewritten later Popular, $, Bad Statement Best but also $$$ Very bad, $$ Logic Code Tools DB-MAIN CASE tool (University of Namur, ReVeR) DDL extraction Schema storage, analysis and manipulation Implicit constraint validation Schema mapping management Data analysis & migration Wrapper generation (COBOL-to-SQL, CODASYL-to- SQL) Transformations Eclipse Modelling Framework: ATL ASF+SDF Meta-Environment (CWI, Amsterdam) / SET / W&I 7-6-2010 PAGE 46 / SET / W&I 7-6-2010 PAGE 47 8

People ((ex)-fundp) Anthony Cleve Conclusions 3 levels of DB migration: schema, data, code Jean-Luc Hainaut Schema: physical/conceptual Data: determined by schema Code: wrapper/statement rewriting/logical rewriting Philippe Thiran Popular but bad: physical + statement Expensive but good: conceptual + logic Alternatives to consider: conceptual + wrapping/statement physical + wrapping (zero time) / SET / W&I 7-6-2010 PAGE 48 / SET / W&I 7-6-2010 PAGE 49 9