CONCEPTUAL DATA VAULT MODEL



Similar documents
6NF CONCEPTUAL MODELS AND DATA WAREHOUSING 2.0

Trends in Data Warehouse Data Modeling: Data Vault and Anchor Modeling

Data Warehousing Systems: Foundations and Architectures

Dimensional Modeling for Data Warehouse

Multidimensional Modeling with UML Package Diagrams

Information Management Metamodel

Optimization of ETL Work Flow in Data Warehouse

Data Vault and The Truth about the Enterprise Data Warehouse

Dimensional Modeling and E-R Modeling In. Joseph M. Firestone, Ph.D. White Paper No. Eight. June 22, 1998

Metadata Management for Data Warehouse Projects

Deductive Data Warehouses and Aggregate (Derived) Tables

Lecture 12: Entity Relationship Modelling

DEVELOPING REQUIREMENTS FOR DATA WAREHOUSE SYSTEMS WITH USE CASES

Application Of Business Intelligence In Agriculture 2020 System to Improve Efficiency And Support Decision Making in Investments.

Data warehouse life-cycle and design

Subject Description Form

Automating Data Warehouse Conceptual Schema Design and Evaluation

A Direct Approach to Physical Data Vault Design

Trivadis White Paper. Comparison of Data Modeling Methods for a Core Data Warehouse. Dani Schnider Adriano Martino Maren Eschermann

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems

Methodology Framework for Analysis and Design of Business Intelligence Systems

Key organizational factors in data warehouse architecture selection

TOWARDS A FRAMEWORK INCORPORATING FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTS FOR DATAWAREHOUSE CONCEPTUAL DESIGN

An Introduction to Data Warehousing. An organization manages information in two dominant forms: operational systems of

Goal-Driven Design of a Data Warehouse-Based Business Process Analysis System

Master Data Management and Data Warehousing. Zahra Mansoori

SENG 520, Experience with a high-level programming language. (304) , Jeff.Edgell@comcast.net

Overview. DW Source Integration, Tools, and Architecture. End User Applications (EUA) EUA Concepts. DW Front End Tools. Source Integration

Encyclopedia of Data Warehousing and Mining

Data Warehouse: Introduction

Data warehouse Architectures and processes

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Data Warehouse Design

Sales and Operations Planning in Company Supply Chain Based on Heuristics and Data Warehousing Technology

A Survey on Data Warehouse Architecture

Data warehouses. Data Mining. Abraham Otero. Data Mining. Agenda

Common Warehouse Metamodel (CWM): Extending UML for Data Warehousing and Business Intelligence

Curriculum of the research and teaching activities. Matteo Golfarelli

Improving your Data Warehouse s IQ

BUILDING OLAP TOOLS OVER LARGE DATABASES

A Design and implementation of a data warehouse for research administration universities

Lection 3-4 WAREHOUSING

Sizing Logical Data in a Data Warehouse A Consistent and Auditable Approach

A Methodology for the Conceptual Modeling of ETL Processes

SDWM: An Enhanced Spatial Data Warehouse Metamodel

Data Vault at work. Does Data Vault fulfill its promise? GDF SUEZ Energie Nederland

An Enterprise Data Hub, the Next Gen Operational Data Store

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

Chapter 7 Data Modeling Using the Entity- Relationship (ER) Model

Conceptual Workflow for Complex Data Integration using AXML

Formal Methods for Preserving Privacy for Big Data Extraction Software

Data-Warehouse-, Data-Mining- und OLAP-Technologien

Aligning Data Warehouse Requirements with Business Goals

CDC UNIFIED PROCESS PRACTICES GUIDE

A Data Warehouse Design for A Typical University Information System

How To Write A Diagram

Enabling Better Business Intelligence and Information Architecture With SAP PowerDesigner Software

Reflections on Agile DW by a Business Analytics Practitioner. Werner Engelen Principal Business Analytics Architect

Data Vault Modeling in a Day

Adapting an Enterprise Architecture for Business Intelligence

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Data Warehousing. Jens Teubner, TU Dortmund Winter 2015/16. Jens Teubner Data Warehousing Winter 2015/16 1

Course Design Document. IS417: Data Warehousing and Business Analytics

Appendix B Data Quality Dimensions

Upon successful completion of this course, a student will meet the following outcomes:

Applying MDA and universal data models for data warehouse modeling

INTEGRATION OF HETEROGENEOUS DATABASES IN ACADEMIC ENVIRONMENT USING OPEN SOURCE ETL TOOLS

Automating the process of building. with BPM Systems

DATA WAREHOUSE DESIGN

Evaluating Data Warehousing Methodologies: Objectives and Criteria

Transcription:

CONCEPTUAL DATA VAULT MODEL Vladan Jovanovic Georgia Southern University vladan@georgiasouthern.edu Ivan Bojicic Belgrade University bojicici@fon.bg.ac.rs ABSTRACT In this paper a novel conceptual data warehouse (DW) data model, the Conceptual Data Vault (C-DV) Model is presented. The C-DV extends logical and physical Data Vault (DV) Models, to enable conceptual modeling of source data worthy of preserving in a DW staging area as a System of Records. The advantages of DV model, namely flexibility of evolving DW structure using only additions to the model, and high performance of prompt loading, are now extended with a model appropriate during conceptual data discovery and/or analysis phase. Specifically we provided formalization for the C-DV Model as a platform independent conceptual DW model using meta-modeling, as well as defined a patterns based mappings from source data (conceptual entity relationship) models into the C-DV model. Keywords Conceptual data warehouse model, data vault model, source data mapping patterns INTRODUCTION Nowadays database developers would not attempt DB design without first using some conceptual data model, but the same practice does not generally extends to the DW design. Prominent researchers such as (Jarke, 200, Goldfarelli and Rizzi 2009, Malinowski and Zimanyi, 2009) claim that to be one of the major weaknesses of the first generation DW design methods used in industry and thought in academia (Imhoff, Galemmo and Geiger 2003; Kimball and Ross 2002; Thareja 2009; Watson 2002). The practice evolved a Dimensional Model (DM), visualized with a star like schema, into a de facto industry standard for logical design DW representation (Adamson 200, Kimball, 20, Laberge, 20). On the research/academic side, a number of conceptual models had been proposed. The one with a growing industrial adoption, the Dimensional Fact Model (DFM) (Golfarelli, Maio and Rizzi, 998), readily translates into the DM see the internationally acclaimed textbook (Golfarelli and Rizzi 2009). Other conceptual DW models, such as StarER (Tryfona, 999), DWCMD (Franconi, 999), EventERM (Baekgard, 999), ME/R (Husemann, 2000), MAC (Tsois, 200), ODWC (Dori, 2008), GOOMD (Sarkar, 2009) etc. (see: Phipps, 2002; Rizzi, 2008; and Niedrite, 2009 for detail analysis, are also generally compatible with the DM.) Note that in all of the conceptual models listed above there is little or no attention paid to the DW data staging area modeling. The numerous advances in information technology, notably improved price performance of storage, multi-core processors etc. lead to envisioning the next generation of DW. The one clear vision is the new DW 2.0 architecture (Inmon, 2008). In addition, the significance of a permanent data staging area for future enterprise DW is recognized in (Damhof. 2008) and particularly well by (Linstedt, 20) since the 90. The state of the art DW design models and methods are now expected to address requirements for the next generation of DW. The two recent temporal models, grounded in sixth Normal Form (6NF), are the Anchor Model (Romback, 200), still researching physical implementation consequences, and the DV model (Linstedt, 20) represent promising ideas within the DW 2.0 and beyond. The DV model is particularly significant for practice due to a proven implementation record and evolved patterns of transformations see for example (Casters 200). The DV model is now perfecting its logical model, see specification of the.0.9 version (Linstedt, 20). The DV modeling practice continue utilizing DM as the logical DW model of the publishing area- reporting to users (Linstedt, 20) but does not have a defined conceptual model (for the staging area). If we are to sum up the state of the art in conceptual and logical DW modeling, we may be justified in recommending the DFM as an adequate conceptual model for reporting side of DW, the DV Model as preferred logical and physical model for the DW staging area, and the Dimensional Model as the standard logical model for DW reporting area. This leaves research into conceptual model for DW staging area fairly open (the only contender being the Anchor Model). The research in the area of physical models and performance improvements for DW reporting area is outside the scope of this paper but remains wide open within the DW 2.0 (Inmon, 2008) compliant DW design technology space. Before explaining the specific innovation presented in this paper, namely a formalized C-DV, it is beneficial to review the advantages of the DV model itself (more information on logical data vault modeling can be found in Wikipedia). A number of problems with the first generation of DW including the narrowing of DW update windows are consequences of required complex transformations of inputs going into an enterprise scale DW and Data marts for presentation. Traditional DW Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA March 23 rd -24 th, 202 3

architectures, presented in (Goldfareli, 2008), share most of the following problems: a) complex updates, b) difficulties in source data reconciliation and integration (of substantive viewpoints), c) growing significance of delays, caused by data extracts not ready for loading and upfront reconciliation, d) slow loading, e) brittle transformations (due to integration close to volatile sources and even more so among reporting viewpoints), f) lack of an integrated system of record (data staging is not persistent), g) lack of traceability from sources to archives, and h) complex reconstructions of schemas (evolving content/semantics not really addressed at all). Those challenges are certainly some of the important requirements for the next generation data warehouse modeling technology. The requirements a, b, c, d and h represent real architectural problems for DW designs based on traditional Entity Relationship Models and/or Dimensional Models. The f and the g are requirements that were not fully appreciated in most of the early DW systems, nor emphasized by design leaders i.e. Kimbal s or Inmon s schools of DW architecture and design (Kimball 2002, Kimball 200), (Imhoff, 2003), etc., and even in the contemporary state of the practice see (Mundy 20), (Adamson 200), and (Laberge, 20). The DV model enables structural changes by addition only, effectively resolving the c, e, and f requirements. It allows loading from the source into a persistent staging area (the data vault) without delays and by this successfully addresses a, e, and h while relegating b to downstream oriented transformations related to the presentation of data for various users. The DV (Linstedt, 20) include empirically defined set of metadata such as time, data source, and even input error references (rarely seen in traditional logical DW designs). Because of the all aforementioned, the DV model is recognized as the optimal choice for DW2.0 architecture. Given the significant benefits of the DV approach for flexible DW, a development of a formalized conceptual data model compatible with DV captured our attention. The main reasons for explicating C-DV model are: a) to represent and analyze data requirements early, b) to elaborate domain ontology regardless of readily available data sources, and c) to establish formal transformations from source data models to a C-DV and later on from C-DV to logical DV Models. In case of a DV model work is done on transformations of logical and physical DW (staging area) DV models into the logical Dimensional Model and variations of physical models, for example open source Pentaho Kettle (Casters, 200) and what is left open is a front end elaboration of an appropriate conceptual model. Among the uses of a conceptual model for the DW staging area i.e. a data vault is analysis of true identities (identifiers) and essential compatibility of source data (establishing new links between source systems) before DW design and during major expansions in terms of scope. CONCEPTUAL DATA VALUT MODEL (C-DV) The C-DV Model is defined here in a tradition of model driven development using an appropriate platform independent metamodel and a few metaconcepts supporting clear explanation. The metamodel is targeted to express semantics of the complex DW systems and to mediate various transformations. C-DV Metamodel is depicted in Figure using UML notation. Time Subject-System Time, Subject, Convention, Place illustrate metaconcept context for an CDV instance concept i.e. CDV-Concept * CDV-Concept 0.. P-Concept 2..* * Generalization Convention Satellite Hub * Link Association Place Attribute Key ReifiedAssociatiion 0.. Dependency Composite Surogate Figure. Conceptual Data Vault Metamodel Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA March 23 rd -24 th, 202 32

All CDV-Concepts are situated in an implicit real context of Time (typically bi-temporal), a Place (local) for a System- Subject, according to a Convention (standard, law, contract etc.)- reference meta entities to be exposed in C-DV instances. The primary CDV-Concepts, P-Concepts, are defined as follows. The P-Concept is a notion of interest whether it is real or abstract (as always only observable concepts make sense for real DW) and it may exist either as a Hub or as a Link. The Hub is intended to represent major identifiable concepts-entities of interest from the real world. It is required that every Hub entity can be denoted by a unique identifier. For a C-DV we represent this identity by a Key, and during logical DW modeling those natural keys are typically complemented (functionally substituted) with surrogate keys. This Key property differentiates concrete entity instance from all other instances and may serve as the only (permanent) truth about the entity. All other properties including connections to other entities are changeable over time, and according to the essential DV requirement, the only allowable updates will be additions. Link represents relationship among Concepts. Both, Hubs and Links may be involved in such relationships. Dependency is a special form of a Link which may express has a relationship in UML (or weak entity in various Entity Relationship notations). Another type of Link is the Generalization which represents is a relationship. At last, Association is type of Link used to represent all other relationships among Hubs. Furthermore, in the context of C-DV modeling n-ary (i.e. n>2) relationships are allowed, enabling mapping of semantic models not restricted to binary associations (n=2). A Hub or Link context is provided by Satellites. Essentially in a Satellite is used to associate a Hub (or a Link) with (a data model) attribute. A Hub instance has to be related to at least one Satellite instance, which is not mandatory for a Link. On the other hand, every Link is associated with min two Hub instances. In a C-DV Satellite contains one Attribute. This restriction satisfies the 6NF (Date), but is relaxed in Data Vault models (Linstedt, 20). Concepts may be related to more than one Satellite. The Key may be Composite Key if it is composed of several Values. Any (business) Key instance identifies one and only one Hub. The term Value stands for a datum (singular data value) in a traditional sense. The set of pattern transformations from data model to the C-DV concepts is shown using platform independent UML model (to describe source concepts). The Figure 2 is sufficiently (structurally) rich while simple and recognizable example depicting EU rental car operation, presented using the UML Class Diagram notation. Country -mechconrequirements -emissionsrequirements -cartax Type Car -registrationnumber TransferAgreement -distance -expectedtime pickup dropoff AssignedCar -expectedpreparedtime RentalAgreement -basicprice -bestprice -lastmodification -onrentinterval EU_RentPerson -id -birthdate -address -telephone OpenedRental -pickuptime Reservation -reservationdate DrivingLicence -number -issue -expiration Figure 2. EU Rental Car Model Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA March 23 rd -24 th, 202 33

Entity and Attribute(s) Pattern: Every entity is converted to a Hub related to Satellite(s) which contain(s) the source entity s attribute(s): Figure 3. Entity and Attribute (s) Pattern To save space only single Satellites are going to be shown, with an understanding that to maintain the 6NF expectation, each attribute would require a separate satellite; this will assure principle of independent updates by additions only. A logical and physical DV models with inseparable multi-attributes will use Satellite for such sets of attributes instead of one for each. Association Pattern: Associations are transformed to Link entity (many-to-many cardinalities), regardless of the association cardinalities in the source model. TYPE LNK_BRANCH _BRANCHTYPE Type TYPE Figure 4. Association Pattern Aggregation (Composition) Pattern: Both Classes involved becomes Hubs, and Aggregation (Composition) is transformed to Link concept. All attributes are assigned Satellites (total in this example would be 6 if id and number are respective natural keys of Hubs). EU_RentPerson -id -birthdate -address -telephone DrivingLicence -number -issue -expiration HUB_EURENT PERSON SAT_EURENT PERSON LNK_PERSON _LICENCE HUB_DRIVING LICENCE SAT_DRIVING LICENCE Figure 5. Aggregation (Composition) Pattern Association Class Pattern: Classes involved in association are transformed to Hubs while the Link which relates these Hubs, gets Satellite(s) with additional attributes (of the Association Class). Note that a reified Association Class (with a proper identifier) that participates in associations of its own would be handled as a Hub connected to the Link representing the original Association Class. Car -registrationnumber RentalAgreement -basicprice -bestprice -lastmodification -onrentinterval HUB_CAR LNK_ ASSIGNEDCAR HUB_RENTAL AssignedCar -expectedpreparedtime SAT_CAR SAT_ ASSIGNEDCAR SAT_RENTAL Figure 6. Association Class Pattern Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA March 23 rd -24 th, 202 34

Generalization Pattern: Every specialization relation is converted to a corresponding Link: RentalAgreement -basicprice -bestprice -lastmodification -onrentinterval HUB_ RESERVATION LNK_RENTAGR RESERVATION HUB_RENTAL Reservation -reservationdate SAT_ RESERVATION SAT_RENTAL Figure 7. Generalization Pattern Recursive Association Pattern: Link related to Hub two times, once for each role. LNK_TRANSFER Figure 8. Recursive Association Pattern Note that we recommend using IdefX standard notation (instead of the UML) for efficient reengineering of physical Relational DB models prior to C-DV modeling, due to the IdefX s smallest cognitive distance from the Relational Model. CONCLUSION What is achieved by explicit definition of a DV compatible conceptual model? First, a platform independent representation appropriate for conceptual modeling of a DW, seen as a system of record i.e. persistent staging area, is now made available at least to DV practitioners. Second, C-DV is a 6NF compliant model effectively eliminating updating dependencies among attributes. The third accomplishment is conceptual recognition of Place and Convention context metadata categories (in addition to the DV implicit Time and Source-System) providing a lot more sophisticated analysis and traceability (compared with traditional DW models). We intend to expand experimental work and collect evidence for that claim. Furthermore, a systematic algorithm for transforming conceptual source data models into a C- DV is provided in a form of patterns that may serve as clear guidance for DW modelers and DW design tool builders. The original Data Vault resolved major (flexibility and performance related) DW problems by elevating staging area into a persistent system of records. Providing an explicit C-DV model to facilitate developers working on DW design will allow them to start with a platform independent conceptualization appropriate for requirements and/or architectural analysis as well as for design automation. How C-DV compares with the Anchor model (we intend to elaborate on this elsewhere)? In a nutshell, C-DV uses only three concepts i.e. (Hub, Link and Satellite) vs. Anchor model s four (Romback, 200). Furthermore the C-DV (see Fig. ) provides three additional explicit categories for the context i.e. metadata reference types, Anchor model recognize (implicitly) only Time, thus severely limiting its usefulness for desirable system of record DW in industrial multi-source enterprise scale systems which is typically needed in complex environments. Our work is progressing in extending DV practices to C-DV design integrated into a complete DW design automation and will include transformations of: a) C-DV model into a Logical Data Vault model (L-DV), b) DV constraints and pragmatic workarounds into a relational data model, c) L-DV into logical DW model for presentation (that is the Dimensional Model), d) C-DV to DFM, and f) C-DV to Dimensional Model. Furthermore we are working on defining extensions/revision for standard DW interchange model (Poole 2002) appropriate for the second generation DW architecture (i.e. DW 2.0). The last question for some may be: is the data vault a DW? A persistent staging area is a (portion of) on enterprise DW. Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA March 23 rd -24 th, 202 35

REFERENCES. Adamson, C. (200) Star schema the complete reference, McGraw Hill. 2. Baekgard, L. (999) Event-Entity-Relationship modeling in data warehouse environments, Proceedings of the ACM DOLAP99 Workshop, Missouri. 3. Casters, M. (200) Pentaho Kettle solutions, John Wiley. 4. Damhof, R. (2008) The next generation EDW, posted August 2008 at: http://www.prudenza.nl, 5. Dori, D., Feldman, R., Sturm, A. (2008) From conceptual models to schemata: An object-process-based data warehouse construction method, Information Systems, V33, Issue 6, Elsevier. 6. Franconi, E. and Sattler, U. (999) A data warehouse conceptual data model for multidimensional aggregation, Intl. Workshop on Design and Management of Data Warehouses, Heidelberg. 7. Golfarelli M. and Rizzi S. (2009) Data warehouse design, McGraw Hill 2009, 8. Golfarelli M., Maio, D. and Rizzi S. (998) The dimensional fact model: a conceptual model for data warehouses, International Journal of Cooperative Information Systems, Vol.7, Nos. 2 and 3 pp. 25-247. 9. Husemann, B., Lechtenborger, J. and Vossen, G (2000) Conceptual data warehouse design, Proceedings of the International Workshop on Design and management of Data Warehouses DMDW 2000, Stockholm, Sweden, June. 0. Inmon, W., Strauss, D. and Neushloss, G. (2008) DW 2.0-The Architecture for the next generation of data warehousing, Morgan Kaufman.. Imhoff, C., Galemmo, N. and Geiger, J. (2003) Mastering data warehouse design, John Wiley. 2. Jarke, M. (200) Fundamentals of data warehousing, 2ed, Springer. 3. Kimball, R. and Ross, M. (2002) The data warehouse toolkit, 2ed. John Wiley. 4. Kimball, R., and Ross, M (200) The Kimball group reader, John Wiley. 5. Laberge, R. (20) The data warehousing mentor, McGraw Hill. 6. Linstedt, D. (20) DV specification.09 at danlinstedt.com/datavaultcat/standards/dv-modeling-specification-v-0-8/ 7. Linstedt, D. (20) Super charge your data warehouse, on Amazon.com. 8. Malinowski E. and Zimanyi, E. (2009) advanced data warehouse design, Springer. 9. Mundy J, Thorthwaite W., and Kimball R. (20) The Microsoft data warehouse toolkit, 2ed. John Wiley. 20. Neil, C, and Ale, J. (2002) Conceptual design of a temporal data warehouse, ASSE Argentine Symposium on Software Engineering. 2. Niedrite, L. (2009) Development of data warehouse conceptual models: method engineering approach, in Progressive Methods in Data Warehousing and Business Intelligence, IGI Global. 22. Phipps, C. Davis, K. (2002) Automating data warehouse conceptual schema design and evaluation, design and management of data warehouses- DMDW Conference. 23. Poole, J, Chang, D., Tolbert, D. and Mellor, D. (2002) Common warehouse metamodel, John Wiley. 24. Rizzi, S. (2008) Conceptual modeling solutions for the data warehouse, in Wang, J. editor, Data Warehousing and Mining, IGI Global, pp. 208-227. 25. Romback, L. L, Regardt O., Bergholtz M. (200) Anchor modeling- agile information modeling in evolving data environments, Data and Knowledge Engineering Vol. 69, Issue 2. 26. Sarkar A., (2009) Conceptual design of object oriented data warehouse: graph semantic based model, INFOCOMP Journal of Computer Science. 27. Thareja, R. (2009) Data warehousing, Oxford University Press. 28. Tsois, A, Karayanis, N, and Sellis T, (200) MAC: conceptual data modeling for OLAP, Proceedings of the International Workshop on DMDW, Interlaken, Sweden. 29. Tryfona, N. et all (999) StarER: a conceptual model for data warehouse design, Proceedings of the ACM DOLAP99 Workshop, Missouri. 30. Watson H. (2002) Recent development in data warehousing, Communication of AIS V8, A. Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA March 23 rd -24 th, 202 36