Spring Semester 2010 Data Integration and Data Cleaning in DWH Dr. Diego Milano
Organization Motivation: Data Integration and DWH Data Integration Schema (intensional) Level Instance (extensional) Level: Data Cleaning Building a DWH: Data Integration & Cleaning in DWH Design (Introduction to Data Quality)
Background Knowledge & Tools If you don't master some of these tools, let me know immediately: Database basics: RDBMS concepts Relational model Entity-Relationship Model (and possibly UML) Database design: from a conceptual model to the logical model
What is a DW? A collection of data from different sources Integrated Persistent Dynamically Evolving Focused Used for Decision Support
DWH (Architecture diagram: operational data from production/sales OLTP environments and external data (e.g. exchange rates, prices from other sales chains) flow into the DWH, which feeds OLAP, Data Mining, and Reporting. We focus on what happens between the sources and the DWH.)
Data Integration Given a set of data sources, data integration is the task of presenting them to the user as a single data source. (Diagram: sources S1, S2, S3, ... with their local schemas are mapped to an integrated DB G with a global schema.)
Two approaches: virtual/materialized Virtual integration: Data stays at the sources; the extension of the global schema is not materialized. Queries on the global schema are answered using data at the sources. Pros/cons: + Updates on the local sources are immediately reflected in the (virtual) integrated DB + No redundancy, and no inconsistencies arising from lack of synchronization - Enforcing constraints on the global schema is not always possible - Depending on the relationships between the global and the local schemas, answering queries may be hard (and inefficient) - Propagating updates from the global schema to the local sources is hard - Solving inconsistencies at the extensional level is hard
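A minimal sketch of the virtual approach in SQL, assuming two source tables s1_customer and s2_customer (illustrative names) are reachable from one engine, e.g. via federated/foreign tables; the global relation is just a view, so queries always see the current source data:
CREATE VIEW global_customer (cust_id, name, city) AS
  SELECT cust_id, name, city FROM s1_customer
  UNION
  SELECT id, cust_name, town FROM s2_customer;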
Two approaches: virtual/materialized Materialized integration: Data is copied to a single integrated database. Pros/cons: + Queries on the integrated repository are more efficient + It is possible/easier to apply complex transformations to the original data: the integrated schema can be very different from the source schemas, and instance-level transformations are made easier - The integrated DB goes out of sync with the sources and needs periodic refreshing - Less storage-efficient, potential inconsistencies due to redundancy A Data Warehouse is first of all a data integration system adopting the materialized approach
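The materialized counterpart of the same sketch, again with illustrative names: the global extension is copied into a table and refreshed periodically (a full reload here; real DWH loads are usually incremental):
CREATE TABLE dwh_customer (cust_id INTEGER, name VARCHAR(255), city VARCHAR(255));
-- periodic refresh: full reload of the integrated copy
DELETE FROM dwh_customer;
INSERT INTO dwh_customer (cust_id, name, city)
  SELECT cust_id, name, city FROM s1_customer
  UNION
  SELECT id, cust_name, town FROM s2_customer;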
Heterogeneity The main issue in data integration tasks is heterogeneity. Data residing at different sources present differences on a number of aspects, and these differences make it more complex to reduce the data to a single, integrated view. It is not easy to classify heterogeneity in a crisp way. Some differences relate to syntactic aspects (the specific language/technology used to represent reality), others relate to semantic aspects (how a certain representation captures reality, its meaning), but these differences coexist and it is not always easy or possible to draw a line between what is syntax and what is semantics.
Heterogeneity (Systems/Technology/Syntax) Legacy systems (ad-hoc interfaces) Flat files Web sources XML files/databases Different DBMSs (e.g. RDBMS, OODBMS...) DBMSs of the same flavour (e.g. RDBMS) but with differences in proprietary syntax
Heterogeneity (Data Representation) Intensional Level (schema): Data Model (modeling language): Relational, object-oriented, reticular, semi-structured etc. Structure (representation choices): Different designers have different views of the world (and different application needs), and may use different constructs/data types to represent the same concepts/reality: e.g. Date represented as attribute/standalone concept e.g. Attribute 'sex' encoded as String / Acronym / Integer (0,1) Different views of the world include/exclude portions of information: e.g. Record marital status of employees. Linguistics/terminology: Different designers may use different terms to denote the same concept or use the same term to mean different concepts, at various levels: e.g. attribute 'price': $
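A small sketch of how such representation conflicts are typically reconciled when loading, assuming (purely for illustration) that one source encodes sex as 'M'/'F' strings and another as the integers 0/1:
SELECT emp_id,
       CASE sex_code WHEN 'M' THEN 'male' WHEN 'F' THEN 'female' END AS sex
FROM s1_employee
UNION ALL
SELECT emp_id,
       CASE sex_flag WHEN 0 THEN 'male' WHEN 1 THEN 'female' END AS sex
FROM s2_employee;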
Heterogeneity (Data Representation) Extensional level (instances) Unmappable or partially mappable domains Non-overlapping domains e.g. all students in Basel vs. only students enrolled after 2000. Domains with different granularity: e.g. sales per day/per month Application-specific domains: e.g. custom identifiers (like employee_code, color_code) meaningful only within a certain application domain. Inconsistencies between semantically equivalent instances Due to errors or other Data Quality problems
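A sketch of reconciling domains with different granularity: rolling a daily-sales source up to the monthly grain of the target (table and column names are illustrative; EXTRACT is standard SQL):
SELECT EXTRACT(YEAR FROM sale_date) AS sale_year,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(amount) AS monthly_amount
FROM daily_sales
GROUP BY EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date);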
Solving heterogeneity issues: Systems/Model level: Wrapper-based architectures Intensional Level: Schema Integration Extensional Level: Instance Identification Instance Reconciliation
Wrapper-based Architectures A wrapper is a piece of software that encapsulates another software system and acts as an interpreter for it. It allows one to: hide technological differences; hide (to a certain extent) model differences, presenting all sources in a single canonical language. (Diagram: wrappers over a legacy system, an RDBMS, and XML data, each exposing its source in the canonical model/language.)
Schema Integration Given n data source schemas L1,..,Ln, integrating them means: Identifying correspondences among them Designing a new, integrated schema G that abstracts over all of them and is possibly tailored to some specific application (e.g. for Data Warehousing) Formally specifying mappings between the integrated schema and the source schemas. There are tools to semi-automatically perform some of the activities in schema integration, but these are mostly research-level prototypes. Schema integration is still a (complex) design task for humans: it requires expertise in database modeling and a deep knowledge of the application domains of the schemas to integrate.
Wrapper-Mediator A mediator interacts with the wrappers and presents to the users a unified global view over the local schemas. (Diagram: mediator on top, connected by mappings to wrappers over a legacy system, an RDBMS, and XML data.)
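In relational terms, the mediator's unified view can be sketched as SQL views over the relations exposed by the wrappers; purely as an illustration, assuming the wrappers publish src_legacy_stock, src_rdbms_product and src_xml_price as tables:
CREATE VIEW mediated_product (prod_code, prod_name, stock_qty, price) AS
  SELECT p.code, p.name, s.qty, x.price
  FROM src_rdbms_product p
  JOIN src_legacy_stock s ON s.prod_code = p.code
  JOIN src_xml_price x ON x.prod_code = p.code;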
Schema Integration Steps 1. Analysis, Normalization, Abstraction to a common conceptual modeling language 2. Choice of integration strategy 3. Schema Matching: identify relationships among local schemas 4. Schema Alignment: solve conflicts 5. Schema Fusion: create the global schema The result of this process is a mapping between the source schemas and the integrated schema
1. Analysis For each data source in isolation, the designer must acquire a deep understanding of the application domain: in-depth analysis of the schema(s), interaction with domain experts. The result of this phase is a conceptual schema in the canonical language of choice, which: reflects in the most accurate and complete way possible the domain of interest; is well understood; is well documented.
Analysis: Know Your Enemy Gathering knowledge about complex application domains is difficult: business rules may be kept secret or poorly documented; (cooperative) domain experts are key elements. Understanding the IS of an enterprise is difficult: legacy systems require ad-hoc knowledge (e.g. no database schema, but data in flat files with a custom format). Even if the DB is relational: software/system documentation is often poor; the domain conceptualization steps that led to a certain database design, and many design choices, may be lost. Reverse-engineering of the logical schemas and associated applications is sometimes required. This might involve: Normalization: for efficiency reasons, or because of bad design, logical schemas are sometimes denormalized; Inferring constraints: not all constraints of the domain are always enforced at the level of the logical schema (e.g. not enforced at all, or enforced only at the application level), as in the check sketched below. Systems are not always well designed / schemas become old. Sometimes corrections to the schema are required
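One concrete reverse-engineering check, sketched in plain SQL: before assuming an undeclared foreign key in the reconstructed conceptual schema, count the rows that would violate it (orders.cust_id referencing customer.cust_id is an illustrative example, not a schema from the lecture):
SELECT COUNT(*) AS violations
FROM orders o
WHERE o.cust_id IS NOT NULL
  AND NOT EXISTS (SELECT 1 FROM customer c WHERE c.cust_id = o.cust_id);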
Analysis, Normalization, Abstraction Example of normalization/correction: the original logical schema is unnormalized AND does not enforce all constraints holding in the application domain.
Original schema:
CREATE TABLE product(
  cat_desc VARCHAR(255),
  cat_name VARCHAR(255),
  cat_code INTEGER,
  prod_desc VARCHAR(255),
  prod_name VARCHAR(255),
  prod_code INTEGER PRIMARY KEY
);
Normalized/corrected schema:
CREATE TABLE category(
  cat_desc VARCHAR(255),
  cat_name VARCHAR(255),
  cat_code INTEGER PRIMARY KEY
);
CREATE TABLE product(
  prod_desc VARCHAR(255),
  prod_name VARCHAR(255),
  prod_code INTEGER PRIMARY KEY,
  cat_code INTEGER REFERENCES category(cat_code)
);
(ER diagram: Product (1,1) belongs_to (0,n) Category, each entity with attributes description, name, and code.)
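A possible follow-up step, sketched here under the assumption that the original table has first been renamed to product_old, is migrating the existing rows into the corrected schema:
INSERT INTO category (cat_code, cat_name, cat_desc)
  SELECT DISTINCT cat_code, cat_name, cat_desc FROM product_old;
INSERT INTO product (prod_code, prod_name, prod_desc, cat_code)
  SELECT prod_code, prod_name, prod_desc, cat_code FROM product_old;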
2. Choice of Integration Strategy Comparing too many schemas at the same time is not always easy/feasible. (Tree diagram: integration process, either binary, with ladder or balanced strategies, or n-ary, with single-step or iterative strategies.)
3. Schema Matching Schemas are comparatively analyzed to identify: common concepts and relationships among them differences and structural/semantic conflicts interschema properties
Structural Conflicts on Concepts Book is a common concept. Publisher and its relationship to Book have a structural conflict: the designers used different language constructs to model the same reality, an entity set + relationship in one schema, attributes in the other one. (Diagrams: in one schema Book has attributes title, ISBN, Publisher, Publisher_address; in the other, Book(title, ISBN) is linked via published_by to a Publisher entity with Name and Address.)
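A sketch of how the conflict could be resolved at the relational level by promoting the publisher attributes of the flat schema to their own table, so that both sources end up with the entity + relationship structure (book_flat and the generated pub_id key are illustrative assumptions; identity columns are SQL:2003 but dialects differ):
CREATE TABLE publisher (
  pub_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name VARCHAR(255),
  address VARCHAR(255)
);
INSERT INTO publisher (name, address)
  SELECT DISTINCT publisher, publisher_address FROM book_flat;
-- books linked to the new Publisher entity
SELECT b.title, b.isbn, p.pub_id
FROM book_flat b
JOIN publisher p ON p.name = b.publisher AND p.address = b.publisher_address;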
Semantic Conflicts on Concepts The attributes Age and Birthdate clearly model two semantically different concepts. However, it is rather easy to solve this conflict because there is an obvious dependency between them. Solving the conflict means being able to restructure one of the schemas (and thus apply some transformation to the data) to make the two concepts identical. (Diagrams: Citizen with attributes SSN and Birthdate in one schema, Citizen with SSN and Age in the other.)
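For example, the Birthdate-based schema can be restructured to expose Age (the opposite direction cannot be reconstructed exactly); a sketch using PostgreSQL's age() function, since date arithmetic is dialect-specific:
SELECT ssn,
       EXTRACT(YEAR FROM age(CURRENT_DATE, birthdate)) AS age
FROM citizen;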
Pitfalls in language: stat rosa pristina nomen... Homonymy: two concepts have the same name but different semantics. Synonymy: two concepts have the same semantics but different names. (Diagrams: three schemas, each with an entity, Employee, Worker, or Teacher, linked (1,1) assigned_to (1,n) to Department; annotations: 'Equivalent, with linguistic conflicts: synonyms', 'Identical', 'Non-equivalent, homonyms!')
Schema Comparison Identity: the concept is modeled in the same way both from the point of view of structure and that of semantics. Equivalence: the concepts have the same semantics (same view of the world) but there are structural conflicts. Comparability: concepts are modelled with different structure/semantics but the views of the world do not conflict. Incomparability: the views of the world differ, producing a conflict that is not (easily) solvable.
Different, but comparable views (Diagrams: in one schema Employee (1,1) participates_in (1,n) Project and Project (1,1) belongs_to (1,n) Department; in the other, Employee (1,1) assigned_to (1,n) Department.)
Incomparable views The semantics of the two schemas look the same. However, there is a conflict in the integrity constraints which makes the schemas incompatible. (Diagrams: Professor(Name) (0,1) teaches (1,1) Course(Course_ID) in one schema; Professor(Name) (2,n) teaches (1,1) Course(Course_ID) in the other.)
Inter-schema properties (Diagrams: Schema 1: Book(title, ISBN) published_by Publisher(Name, Address); Schema 2: Book(title, ISBN) written_by Author(Name, Address); an inter-schema property relates the two: Author works_for Publisher.)
4. Schema Alignment The goal of this phase is to solve the differences/conflicts identified in the previous step. This is obtained by applying transformations to the local schemas: names and types of attributes, functional dependencies, integrity constraints (see the sketch after this slide). Issues: Not all conflicts can be solved, e.g. when they derive from substantial differences in how different information systems are designed (how they model the application domain); in this case, users/domain experts must give hints on which interpretation of the world they prefer. In case of uncertainty, priority is given to those schemas which are more important in the system (e.g., for DWH, schemas with central concepts in the data mart)
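Transformations on attribute names and types can often be expressed directly over a relational source; a minimal sketch with illustrative names, aligning a source that calls the concept 'worker' and stores the salary as a string:
CREATE VIEW s2_employee_aligned (emp_id, salary) AS
  SELECT worker_id, CAST(salary AS DECIMAL(10,2))
  FROM s2_worker;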
5. Schema Fusion Aligned schemas are merged to obtain a single integrated schema. Overlap common concepts Add all other concepts, connecting them to the common concepts
Alignment and Fusion Alignment and fusion are applied in an iterative way: solve some conflicts, produce a temporary integrated schema; to solve new conflicts, apply transformations either to the schemas or to the temporary integrated schema
Mappings A mapping is a set of assertions about correspondences that hold between two schemas. For very different schemas, mappings are hardly formalizable. As the integration process proceeds, it becomes possible to express relationships between the extensions of the schemas: at the conceptual level, as set relationships; at the logical level, as queries (in the simplest case) or as transformations. The goal is to link every concept in the integrated schema to some concept in the initial schemas through a chain of transformations
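At the logical level, the simplest form of mapping is a query that defines a concept of the integrated schema in terms of source concepts (a global-as-view style definition); a sketch with illustrative names:
CREATE VIEW global_order (order_id, customer_name, total) AS
  SELECT o.order_id, c.name, o.total
  FROM src1_order o
  JOIN src1_customer c ON c.cust_id = o.cust_id;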
Questions & Answers