A Methodology for the Conceptual Modeling of ETL Processes
Alkis Simitsis (1), Panos Vassiliadis (2)

(1) National Technical University of Athens, Dept. of Electrical and Computer Eng., Computer Science Division, Iroon Polytechniou 9, Athens, Greece
(2) University of Ioannina, Dept. of Computer Science, 45110, Ioannina, Greece

Abstract. Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we propose a methodology for the earliest stages of the data warehouse design, with the goal of tracing the analysis of the structure and content of the existing data sources and their intentional mapping to the common conceptual data warehouse model. The methodology comprises a set of steps that can be summarized as follows: (a) identification of the proper data stores; (b) candidates and active candidates for the involved data stores; (c) attribute mapping between the providers and the consumers, and (d) annotation of the diagram with runtime constraints.

1 Introduction

In order to facilitate and manage the data warehouse operational processes, specialized tools are already available in the market, under the general title Extraction-Transformation-Loading (ETL) tools. To give a general idea of the functionality of these tools we mention their most prominent tasks, which include: (a) the identification of relevant information at the source side; (b) the extraction of this information; (c) the customization and integration of the information coming from multiple sources into a common format; (d) the cleaning of the resulting data set, on the basis of database and business rules, and (e) the propagation of the data to the data warehouse and/or data marts.

In Fig. 1 we abstractly describe the general framework for ETL processes.
In the bottom layer we depict the data stores that are involved in the overall process. On the left side, we can observe the original data providers. Typically, data providers are relational databases and files. The data from these sources are extracted (as shown in the upper left part of Fig.1) by extraction routines, which provide either complete snapshots or differentials of the data sources. Then, these data are propagated to the Data Staging Area (DSA) where they are transformed and cleaned before being loaded to the data warehouse. The data warehouse is depicted in the right part of the
data store layer and comprises the target data stores, i.e., fact tables for the storage of information and dimension tables with the description and the multidimensional, rollup hierarchies of the stored facts. The loading of the central warehouse is performed by the loading activities depicted in the upper right part of the figure.

[Figure: Extract, Transform & Clean, and Load activities over the data stores Sources, DSA, and DW]
Fig. 1. The environment of Extract-Transform-Load processes

In this paper, we are dealing with the earliest stages of the data warehouse design. During this period, the data warehouse designer is concerned with two tasks that are practically executed in parallel. The first of these tasks involves the collection of requirements from the users. The second task, which is of equal importance for the success of the data warehousing project, involves the analysis of the structure and content of the existing data sources and their intentional mapping to the common data warehouse model. Related literature [14, 27] and personal experience suggest that the design of an ETL process aims towards the production of a crucial deliverable: the mapping of the attributes of the data sources to the attributes of the data warehouse tables.

The production of this deliverable involves several interviews that result in the revision and redefinition of original assumptions and mappings; thus it is imperative that a simple conceptual model is employed in order to (a) facilitate the smooth redefinition and revision efforts and (b) serve as the means of communication with the rest of the involved parties.

In a previous line of work [29], we have proposed a conceptual model for ETL processes. In this paper, we complement this model with a set of design steps, which lead to the basic target, i.e., the attribute interrelationships.
These steps constitute the methodology for the design of the conceptual part of the overall ETL process and can be summarized as follows: (a) identification of the proper data stores; (b) candidates and active candidates for the involved data stores; (c) attribute mapping between the providers and the consumers, and (d) annotating the diagram with runtime constraints (e.g., time/event based scheduling, monitoring, logging, exception handling, and error handling).

This paper is organized as follows. In Section 2, we give a motivating example, on which our discussion will be based. Section 3 briefly presents an overview of the conceptual model for ETL processes. In Section 4, we demonstrate the methodology for the usage of the conceptual model. Finally, in Section 5 we present related work and in Section 6 we conclude.

2 Motivating example

To motivate our discussion we will introduce an example involving two source databases S1 and S2 as well as a central data warehouse DW. The scenario involves the propagation of data from the concept PARTSUPP(PKEY, SUPPKEY, QTY, COST) of source S1 as well as from the concept PARTSUPP(PKEY, DEPARTMENT, SUPPKEY, DATE, QTY, COST) of source S2 to the data warehouse. In the data warehouse, DW.PARTSUPP(PKEY, SUPPKEY, DATE, QTY, COST) stores daily (DATE) information for the available quantity (QTY) and cost (COST) of parts (PKEY) per supplier (SUPPKEY). We assume that the first supplier is European and the second is American, thus the data coming from the second source need to be converted to European values and formats. For the first supplier, we need to combine information from two different tables in the source database, which is achieved through an outer join of the concepts PS1 and PS2 respectively. In Fig. 2 we depict the full-fledged diagram of our motivating example. Throughout the paper, we will clarify the introduced concepts through their application to our motivating example.

[Figure: full conceptual diagram of the example, including the candidates AnnualPartSupp's and RecentPartSupp's ({XOR}), the transformations SK, f and NN, and the notes "Necessary providers: S1 and S2" and "{Duration<4h}"]
Fig. 2. The diagram of the conceptual model for our motivating example

3 Conceptual Model for ETL processes

In this section, we focus on the conceptual part of the definition of the ETL process. For a detailed presentation of our conceptual model and formal foundations for the representation of ETL processes, we refer the interested reader to [29]. This model has a particular focus on (a) the interrelationships of attributes and concepts, and (b) the
necessary transformations that need to take place during the loading of the warehouse. The latter part is directly captured in the proposed metamodel as a first class citizen; we employ transformations as a generic term for the restructuring of schema and values or for the selection and even transformation of data. Attribute interrelationships are captured through provider relationships that map data provider attributes at the sources to data consumers in the warehouse. Apart from these fundamental relationships, the proposed model is able to capture constraints and transformation composition, too.

[Figure: graphical notation for concept, attribute, transformation, note, ETL constraint, part-of, provider relationships (including N:M), serial composition, candidate, active candidate ({XOR}), and target]
Fig. 3. Notation for the conceptual modeling for ETL activities

In Fig. 3 we graphically depict the different entities of the proposed model. We do not employ standard UML notation for concepts and attributes, for the simple reason that we need to treat attributes as first class citizens of our model. We try to be orthogonal to the conceptual models which are available for the modeling of data warehouse star schemata; in fact, any of the proposals for the data warehouse front end can be combined with our approach, which is specifically tailored for the back end of the warehouse.

Attributes. A granular module of information. The role of attributes is the same as in the standard ER/dimensional models (e.g., PKEY, DATE, SUPPKEY, etc.).

Concepts. A concept represents an entity in a source database or in the data warehouse (e.g., S1.PARTSUPP, S2.PARTSUPP, DW.PARTSUPP). Concept instances are the files in the source databases, the data warehouse fact and dimension tables and so on. A concept is formally defined by a name and a finite set of attributes.

Transformations.
Transformations are abstractions that represent parts, or full modules of code, executing a single task and include two large categories: (a) filtering or data cleaning operations, and (b) transformation operations, during which the schema of the incoming data is transformed (a surrogate key assignment transformation (SK), a function application (f), a not null (NN) check, etc.). Formally, a transformation is defined by (a) a finite set of input attributes; (b) a finite set of output attributes, and (c) a symbol that graphically characterizes the nature of the transformation. A transformation is graphically depicted as a hexagon tagged with its corresponding symbol.

ETL Constraints. ETL constraints are used on several occasions when the designer wants to express the fact that the data of a certain concept fulfill several requirements (e.g., to impose a PK constraint on DW.PARTSUPP for the attributes PKEY, SUPPKEY, DATE). This is achieved through the application of ETL constraints, which are formally defined as follows: (a) a finite set of attributes, over which the constraint is
imposed, and (b) a single transformation, which implements the enforcement of the constraint. Note that, despite the similarity in the name, ETL constraints are different modeling elements from the well-known UML constraints. An ETL constraint is graphically depicted as a set of solid edges starting from the involved attributes and targeting the facilitator transformation.

Notes. Exactly as in UML modeling, notes are informal tags to capture extra comments that the designer wishes to make during the design phase or render UML constraints attached to an element or set of elements [3] (e.g., a runtime constraint specifying that the overall execution time for the loading of DW.PARTSUPP cannot take longer than 4 hours).

Part-of Relationships. We bring up part-of relationships, not to redefine UML part-of relationships, but rather to emphasize the fact that a concept is composed of a set of attributes, since we need attributes as first class citizens in the inter-attribute mappings. In general, standard ER modeling does not treat this kind of relationship as a first-class citizen of the metamodel; UML modeling, on the other hand, hides attributes inside classes and treats part-of relationships with a much broader meaning. Naturally, we do not preclude the usage of the part-of relationship for other purposes, as in standard UML modeling.

Candidate relationships. A set of candidate relationships captures the fact that a certain data warehouse concept (e.g., S2.PARTSUPP) can be populated by more than one candidate source concept (e.g., AnnualPartSupp, RecentPartSupp). This information is normally discovered quite early in the data warehouse project and should be traced for the possibility of redesign or evolution of the warehouse.

Active candidate relationships. An active candidate relationship denotes the fact that out of a set of candidates, a certain one (in our case, RecentPartSupp) has been selected for the population of the target concept.

Provider relationships.
A provider relationship can be used at two abstraction levels. At the concept level, it maps provider to target concepts. At the attribute level, it maps a set of input attributes to a set of output attributes through a relevant transformation. Practically, a provider relationship at the concept level implies that the target concept will be populated from the provider one. This kind of provider relationship at the concept level can be further refined to provider relationships at the attribute level (through simple part-of relationships). In the simple case, provider relationships capture the fact that an input attribute in the source side populates an output attribute in the data warehouse side. If the attributes are semantically and physically compatible, no transformation is required. If this is not the case though, we pass this mapping through the appropriate transformation (e.g., European to American data format, not null check, etc.). In general, it is possible that some form of schema restructuring takes place; thus, the formal definition of provider relationships comprises (a) a finite set of input attributes; (b) a finite set of output attributes, and (c) an appropriate transformation (i.e., one whose input and output attributes can be mapped one to one to the respective attributes of the relationship). In the case of N:M relationships the graphical representation obscures the linkage between provider and target attributes. To compensate for this shortcoming, we annotate the link of a provider relationship with each of the involved attributes with a tag, so that there is no ambiguity about the actual provider of a target attribute.
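
The formal definitions above (a concept as a name plus attributes, a transformation as inputs, outputs and a symbol, a provider relationship as input attributes, output attributes and an optional transformation) can be illustrated with a minimal Python sketch. The class names, the `populate` method, the sample row, and the assumed date formats (MM/DD/YYYY in, DD/MM/YYYY out) are hypothetical and only serve the illustration; the attribute names follow the motivating example as we read it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Concept:
    """A concept: a name plus a finite set of attributes."""
    name: str
    attributes: Tuple[str, ...]


@dataclass(frozen=True)
class Transformation:
    """A graphical symbol plus an illustrative function over attribute values."""
    symbol: str
    apply: Callable[[tuple], tuple]


@dataclass(frozen=True)
class ProviderRelationship:
    """Maps input attributes to output attributes, optionally via a transformation."""
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    transformation: Optional[Transformation] = None

    def populate(self, row: Dict[str, object]) -> Dict[str, object]:
        values = tuple(row[name] for name in self.inputs)
        if self.transformation is not None:
            values = self.transformation.apply(values)
        return dict(zip(self.outputs, values))


def american_to_european_date(values: tuple) -> tuple:
    # Assumption: American dates arrive as MM/DD/YYYY text.
    (date_text,) = values
    parsed = datetime.strptime(date_text, "%m/%d/%Y")
    return (parsed.strftime("%d/%m/%Y"),)


S2_PARTSUPP = Concept(
    "S2.PARTSUPP", ("PKEY", "DEPARTMENT", "SUPPKEY", "DATE", "QTY", "COST")
)
DW_PARTSUPP = Concept("DW.PARTSUPP", ("PKEY", "SUPPKEY", "DATE", "QTY", "COST"))

# DATE needs a transformation; QTY is compatible and is mapped directly.
date_mapping = ProviderRelationship(
    ("DATE",), ("DATE",), Transformation("f", american_to_european_date)
)
qty_mapping = ProviderRelationship(("QTY",), ("QTY",))

source_row = {"PKEY": 10, "DEPARTMENT": "D1", "SUPPKEY": 7,
              "DATE": "12/31/2002", "QTY": 5, "COST": 9.5}
target_row: Dict[str, object] = {}
for relationship in (date_mapping, qty_mapping):
    target_row.update(relationship.populate(source_row))

print(target_row)  # {'DATE': '31/12/2002', 'QTY': 5}
```

The direct mapping and the transformed mapping share one structure, which mirrors the statement that a transformation is only inserted when the attributes are not semantically and physically compatible.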

Transformation Serial Composition. It is used when we need to combine several transformations in a single provider relationship (e.g., the combination of SK and γ in our motivating example).

4 Methodology for the usage of the conceptual model

In this section, we will present the proposed methodology, by giving the sequence of steps for a designer to follow, during the construction of the data warehouse. Each step of this methodology will be presented in terms of our motivating example (also hoping that it clarifies the nature of the employed modeling entities). As already mentioned, the ultimate goal of this design process is the production of inter-attribute mappings, along with any relevant auxiliary information.

Step 1: Identification of the proper data stores. The first thing that a designer faces during the requirements and analysis period of the data warehouse process is the identification of relevant data sources. Assume that for a particular subset of the data warehouse, we have identified the concept DW.PARTSUPP, which is a fact table of how parts were distributed according to their respective suppliers. Assume also that we have decided that for completeness reasons, we need to populate the target table with data from both sources S1 and S2, i.e., we need the union of the data of these two sources. Moreover, in order to populate the concept S1.PARTSUPP, we need an outer join of the data of the concepts PS1 and PS2. Then, the diagram for the identified data sources and their interrelationships is depicted in Fig. 4.

[Figure: join and union (U) feeding DW.PARTSUPP, with the note "Necessary providers: S1 and S2"]
Fig. 4. Identification of the proper data stores

Step 2: Candidates and active candidates for the involved data stores. As already mentioned, during the requirements and analysis stage the designer can possibly find out that more than one data store could be a candidate to populate a certain concept.
The way the decision is taken cannot be fully automated; for example, our personal experience suggests that at least the size, the data quality and the availability of the sources play a major role in this kind of decision. For the purpose of our motivating example, let us assume that source S2 has more than one production system (e.g., COBOL files), which are candidates for concept S2.PARTSUPP. Assume that the available sources are:

- A concept AnnualPartSupp's (practically representing a file F1), that contains the full annual history about part suppliers; it is used basically for reporting purposes
and contains a superset of the fields required for the purpose of the data warehouse.
- A concept RecentPartSupp's (practically representing a file F2), containing only the data of the last month; it is used on-line by end-users for the insertion or update of data as well as for some reporting applications.

The candidate concepts for concept S2.PARTSUPP and the (eventually selected) active candidate RecentPartSupp's, along with the tracing of the decision for the choice, are depicted in Fig. 5.

[Figure: candidates AnnualPartSupp's and RecentPartSupp's ({XOR}) with a note tracing the choice to accuracy and small size (update window), and the note "Necessary providers: S1 and S2"]
Fig. 5. Candidates and active candidates for the motivating example

Once a decision on the active candidate has been reached, a simplified working copy of the scenario that hides all other candidates can be produced. There are two ways to achieve this: (a) by ignoring all the information on candidates for the target concept (Fig. 6), or (b) by replacing the target with the active candidate (Fig. 7). In other words, in order to avoid cluttering the diagram with information not directly useful, the designer can choose to work in a simplified version. The hidden candidates are not eliminated; on the contrary, they remain part of the conceptual model of the data warehouse, to be exploited in later stages of the warehouse lifecycle (e.g., in the case of changes to the active candidate that may possibly lead to the reconsideration of this choice). The only difference is that they are not shown to the designer.

Fig. 6. Working copy transformation: ignoring candidates
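
The two working-copy transformations (Fig. 6 and Fig. 7) can be sketched in Python as follows. This is only an illustration under stated assumptions: the diagram is reduced to a dictionary from each target concept to its provider concepts, and the names `CandidateSet`, `ignore_candidates` and `replace_with_active` are hypothetical. The concept names follow the motivating example as we read it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CandidateSet:
    """Candidates for one target concept; at most one is active ({XOR})."""
    target: str
    candidates: List[str]
    active: Optional[str] = None
    decision_note: str = ""

    def select(self, candidate: str, note: str) -> None:
        # Record the choice and the reason, so the decision stays traceable.
        if candidate not in self.candidates:
            raise ValueError(f"{candidate} is not a candidate for {self.target}")
        self.active = candidate
        self.decision_note = note


def ignore_candidates(edges: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Working copy (a): show provider edges only, with no candidate information."""
    return {target: list(providers) for target, providers in edges.items()}


def replace_with_active(
    edges: Dict[str, List[str]], candidate_sets: List[CandidateSet]
) -> Dict[str, List[str]]:
    """Working copy (b): show each active candidate in place of its target."""
    renamed = {cs.target: cs.active for cs in candidate_sets if cs.active}
    return {
        renamed.get(target, target): [renamed.get(p, p) for p in providers]
        for target, providers in edges.items()
    }


# Full model: DW.PARTSUPP is provided by both source concepts.
edges = {"DW.PARTSUPP": ["S1.PARTSUPP", "S2.PARTSUPP"]}
part_supp_candidates = CandidateSet(
    "S2.PARTSUPP", ["AnnualPartSupp's", "RecentPartSupp's"]
)
part_supp_candidates.select("RecentPartSupp's", "Due to accuracy and small size")

working_copy = replace_with_active(edges, [part_supp_candidates])
print(working_copy)  # {'DW.PARTSUPP': ['S1.PARTSUPP', "RecentPartSupp's"]}
```

Both functions return new dictionaries and leave `edges` and `part_supp_candidates` untouched, which reflects the point that hidden candidates are not eliminated from the model.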

Fig. 7. Working copy transformation: replace targets with active candidates

Step 3: Attribute mapping between the providers and the consumers. The most difficult task of the data warehouse designer is to determine the mapping of the attributes of the sources to the ones of the data warehouse. This task involves several discussions with the source administrators to explain the codes and implicit rules or values, which are hidden in the data and the source programs. Moreover, it involves quite a few data preview attempts (in the form of sampling, or simple counting queries) to discover the possible problems of the provided data. Naturally, this process is interactive and error-prone. The support that a tool can provide the designer with lies mainly in the area of inter-attribute mapping, with focus on the tracing of transformation and cleaning operations for this mapping.

For each target attribute, a set of provider relationships must be defined. In simple cases, the provider relationships are defined directly among source and target attributes. In the cases where a transformation is required, we pass the mapping through the appropriate transformation. In the cases where more than one transformation is required, we can insert a sequence of transformations between the involved attributes, all linked through composition relationships. ETL constraints are also specified in this part of the design process. An example of inter-attribute relationships for the concepts PS1, PS2, S1.PARTSUPP, S2.PARTSUPP and DW.PARTSUPP is depicted in Fig. 8.

Fig. 8. Attribute mapping for the population of recordset DW.PARTSUPP

Step 4: Annotating the diagram with runtime constraints. Apart from the job definition for an ETL scenario, which specifies how the mapping from sources to the
data warehouse is performed, along with the appropriate transformations, there are several other parameters that possibly need to be specified for the runtime environment. Runtime constraints of this kind include:

- Time/Event based scheduling. The designer needs to determine the frequency of the ETL process, so that data are fresh and the overall process fits within the refreshment time window.
- Monitoring. On-line information about the progress/status of the process is necessary, so that the administrator can be aware of what step the load is on, its start time, duration, etc. File dumps, notification messages on the console, printed pages, or visual demonstration can be employed for this purpose.
- Logging. Off-line information, presented at the end of the execution of the scenario, on the outcome of the overall process. All the relevant information (e.g., the tracing information described above) is shown, by employing the aforementioned techniques.
- Exception handling. The row-wise treatment of database/business rules violation is necessary for the proper function of the ETL process. Information like which rows are problematic, or how many rows are acceptable (acceptance rate, that is), characterizes this aspect of the ETL process.
- Error handling. Crash recovery and stop/start capabilities (e.g., committing a transaction every few rows) are absolutely necessary for the robustness and the efficiency of the process.

Fig. 9. Annotation of Fig. 5 with constraints on the runtime execution of the ETL scenario

To capture all this information, we adopt notes, tagged to the appropriate concepts, transformations or relationships. In Fig. 9 we depict the diagram of Fig.
5, annotated with a runtime constraint specifying that the overall execution time for the loading of DW.PARTSUPP (that involves the loading of S1.PARTSUPP and S2.PARTSUPP) cannot take longer than 4 hours.

5 Related Work

There have been quite a few attempts around the conceptual modeling of data warehouses. Most of them are focused around the front-end of the data warehouse [4, 5, 8, 9, 10, 13, 14, 17, 19, 23, 24, 25, 26]. A variety of commercial ETL tools exist in the market [11, 12, 16, 20] (see also a recent review [6] for more details). Naturally,
there also exist research efforts including [1, 2, 7, 15, 18, 21, 22, 28, 30] mostly targeting logical modeling and implementation issues of the ETL process (like duplicate elimination, logical modeling of ETL workflows, etc.).

[Figure: informal data flow with steps such as Surrogate Key, American to European Date, Group over Date, Add System Date, Not Null Check, and Join, annotated with Duration<4h and loading DW.PARTSUPP]
Fig. 10. Informal conceptual model following the [14] notation

The only methodology that we know of is found in [14], mainly grouped as a set of tips and technical guidelines; on the contrary, in our approach we try to give a more disciplined methodology. The major deficiency of the [14] methodology is the lack of conceptual support for the overall process. According to the [14] proposal (e.g., p.613), the closest thing to a conceptual model for our motivating example is the informal description depicted in Fig. 10. As far as inter-attribute mappings are concerned, tables of matching between source and warehouse attributes are suggested. The spirit of [14] is more focused towards the logical design of ETL workflows; thus, rather than systematically tracing system and user requirements, transformations and constraints, the proposed methodology quickly turns into a description of data flows for common cases of ETL transformations (e.g., surrogate keys, duplicate elimination and so on).

6 Conclusions

In this paper, we have proposed a methodology for the earliest stages of the data warehouse design, with the goal of tracing the analysis of the structure and content of the existing data sources and their intentional mapping to the common data warehouse model. As already suggested in previous research efforts [29], the major deliverable of this effort is a model of the attribute interrelationships. In this paper, we present a set of design steps that constitute the methodology for the design of the conceptual part of the overall ETL process.
These steps could be summarized as follows: (a) identification of the proper data stores; (b) candidates and active candidates for the involved data stores; (c) attribute mapping between the providers and the consumers, and (d) annotating the diagram with runtime constraints. Clearly, a lot of work remains to be done for the completion of our research approach. The main challenge is the practical application of this disciplined approach in real world cases and its further tuning to accommodate extra practical problems.
Moreover, a second challenge involves the linkage of our conceptual model and methodology to a logical model for ETL workflows.

References

[1] M. Bouzeghoub, F. Fabret, M. Matulovic. Modeling Data Warehouse Refreshment Process as a Workflow Application. In Proc. Intl. Workshop on Design and Management of Data Warehouses (DMDW 99), Heidelberg, Germany, (1999).
[2] V. Borkar, K. Deshmukh, S. Sarawagi. Automatically Extracting Structure from Free Text Addresses. Bulletin of the Technical Committee on Data Engineering, 23(4), (2000).
[3] G. Booch, I. Jacobson, J. Rumbaugh. The Unified Modeling Language User Guide. Addison-Wesley Pub Co, 1st edition, October.
[4] D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati. Information integration: Conceptual modeling and reasoning support. In Proc. of the International Conference on Cooperative Information Systems (COOPIS), New York, USA, (August 1998).
[5] D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati. A principled approach to data integration and reconciliation in data warehousing. In Proc. Intl. Workshop on Design and Management of Data Warehouses (DMDW 99), Heidelberg, Germany, (1999).
[6] Gartner. ETL Magic Quadrant Update: Market Pressure Increases.
[7] H. Galhardas, D. Florescu, D. Shasha, E. Simon. Ajax: An Extensible Data Cleaning Tool. In Proc. ACM SIGMOD Intl. Conf. on the Management of Data, pp. 590, Dallas, Texas, (2000).
[8] M. Golfarelli, D. Maio, S. Rizzi. The Dimensional Fact Model: a Conceptual Model for Data Warehouses. Invited Paper, International Journal of Cooperative Information Systems, vol. 7, n. 2&3.
[9] M. Golfarelli, S. Rizzi. Methodological Framework for Data Warehouse Design. In ACM First International Workshop on Data Warehousing and OLAP (DOLAP 98), pp. 3-9, November 1998, Bethesda, Maryland, USA.
[10] B. Husemann, J. Lechtenborger, G. Vossen. Conceptual data warehouse modeling. In Proc. 2nd Intl. Workshop on Design and Management of Data Warehouses (DMDW), Stockholm, Sweden (2000).
[11] IBM. IBM Data Warehouse Manager. Available at data/db2/datawarehouse/
[12] Informatica. PowerCenter 6. Available at: data+integration/powercenter/default.htm
[13] R. Kimball. A Dimensional Modeling Manifesto. DBMS Magazine, August 1997.
[14] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. John Wiley & Sons, February.
[15] W. Labio, J.L. Wiener, H. Garcia-Molina, V. Gorelik. Efficient Resumption of Interrupted Warehouse Loads. In Proc. of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA (2000).
[16] Microsoft Corp. MS Data Transformation Services.
[17] D.L. Moody, M.A.R. Kortink. From enterprise models to dimensional models: a methodology for data warehouse and data mart design. Proceedings of the Second Intl. Workshop on Design and Management of Data Warehouses, DMDW 2000, Stockholm, Sweden, June 5-6. Available at http://sunsite.informatik.rwth-aachen.de/publications/CEUR-WS/Vol-28/
[18] A. Monge. Matching Algorithms Within a Duplicate Detection System. Bulletin of the Technical Committee on Data Engineering, 23(4), (2000).
[19] T. B. Nguyen, A Min Tjoa, R. R. Wagner. An Object Oriented Multidimensional Data Model for OLAP. Proc. of the First International Conference on Web-Age Information Management (WAIM-00), Shanghai, China, June.
[20] Oracle Corp. Oracle9i Warehouse Builder User's Guide, Release November. Product web page available at content.html
[21] E. Rahm, H. Do. Data Cleaning: Problems and Current Approaches. Bulletin of the Technical Committee on Data Engineering, 23(4), (2000).
[22] V. Raman, J. Hellerstein. Potter's Wheel: An Interactive Data Cleaning System. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy, (2001).
[23] C. Sapia, M. Blaschka, G. Höfling, B. Dinter. Extending the E/R Model for the Multidimensional Paradigm. In ER Workshops 1998, Lecture Notes in Computer Science 1552, Springer, 1999.
[24] J. Trujillo, M. Palomar, J. Gómez. Applying Object-Oriented Conceptual Modeling Techniques to the Design of Multidimensional Databases and OLAP Applications. Proc. of the First International Conference on Web-Age Information Management (WAIM-00), Shanghai, China, June.
[25] J. Trujillo, M. Palomar, J. Gómez, I. Song. Designing Data Warehouses with OO Conceptual Models. IEEE Computer 34(12).
[26] N. Tryfona, F. Busborg, J.G.B. Christiansen. starER: A Conceptual Model for Data Warehouse Design. In ACM Second International Workshop on Data Warehousing and OLAP (DOLAP 99), pp. 3-8, November 1999, Missouri, USA.
[27] P. Vassiliadis. Gulliver in the land of data warehousing: practical experiences and observations of a researcher. In Proc. 2nd Intl. Workshop on Design and Management of Data Warehouses (DMDW), Sweden (2000).
[28] P. Vassiliadis, A. Simitsis, S. Skiadopoulos. Modeling ETL activities as graphs. In Proc. of Design and Management of Data Warehouses (DMDW'2002), 4th Intl Workshop in conjunction with CAiSE 02, Toronto, Canada.
[29] P. Vassiliadis, A. Simitsis, S. Skiadopoulos. Conceptual Modeling for ETL processes. In Proc. of Data Warehousing and OLAP (DOLAP'2002), ACM 5th Intl Workshop in conjunction with CIKM 02, McLean, USA, November 8.
[30] P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, T. Sellis. ARKTOS: Towards the modeling, design, control and execution of ETL processes. Information Systems, 26(8), December 2001, Elsevier Science Ltd.
