Modeling ETL Activities as Graphs *
Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos
National Technical University of Athens, Dept. of Electrical and Computer Eng., Computer Science Division, Iroon Polytechniou 9, Athens, Greece
{pvassil,asimi,spiros}@dbnet.ece.ntua.gr

Abstract. Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we focus on the logical design of the ETL scenario of a data warehouse. Based on a formal logical model that includes the data stores, activities and their constituent parts, we model an ETL scenario as a graph, which we call the Architecture Graph. We model all the aforementioned entities as nodes and four different kinds of relationships (instance-of, part-of, regulator and provider relationships) as edges. In addition, we provide simple graph transformations that reduce the complexity of the graph. Finally, in order to support the engineering of the design and the evolution of the warehouse, we introduce specific importance metrics, namely dependence and responsibility, to measure the degree to which entities are bound to each other.

1. Introduction

Related literature has characterized data warehouse processes as complex [BoFM99], costly and critical (covering thirty to eighty percent of the effort and expenses of the overall data warehouse construction) [ShTy98, Vass00]. In order to facilitate and manage these data warehouse operational processes, specialized tools are already available in the market [Arde01, Data01, Micr01, ETI01], under the general title Extraction-Transformation-Loading (ETL) tools. To give a general idea of the functionality of these tools we mention their most prominent tasks, which include (a) the identification of relevant information at the source side; (b) the extraction of this information; (c) the customization and integration of the information coming from multiple sources into a common format; (d) the cleaning of the resulting data set, on the basis of database and business rules, and (e) the propagation of the data to the data warehouse and/or data marts. In the sequel, we will not discriminate between the tasks of ETL and Data Cleaning and adopt the name ETL for both these kinds of activities.
[KRRT98] gives an informal but detailed methodology for the management of ETL activities. Research has also provided preliminary results on the modeling and optimization of ETL activities. The AJAX data cleaning tool [GFSS00] deals with typical data quality problems, such as the object identity problem, errors due to mistyping and data inconsistencies between matching records. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of mapping, matching, clustering and merging transformations over some input data. [RaHe01] present the Potter's Wheel data cleaning system, which offers the possibility of performing several algebraic operations over an underlying data set in an interactive/iterative way. [RaDo00] provides an extensive overview of the field of data cleaning, along with research issues and a review of some commercial tools. [Mong00] discusses a special case of the data cleaning process, namely the detection of duplicate records, and extends previous algorithms on the issue.
[BoDS00] focuses on another subproblem, namely that of breaking address fields into different elements, and suggests training a Hidden Markov Model to solve the problem. Moreover, in previous lines of research [VQVJ01, VVS+01] there was a first effort to cover the design aspects of ETL by trying (a) to show how data warehouse processes can be linked to a metadata repository; (b) to construct a running tool and (c) to cover some quality aspects of the data warehouse process.

* This research has been partially funded by the European Union's Information Society Technologies Programme (IST) under project EDITH (IST ).
We believe that these approaches have not considered the inner structure of the ETL activities in sufficient depth. In this paper, we start from a general framework for the logical design of an ETL scenario and focus on the internal structure of ETL activities (practically exploiting their data-centric nature). We employ a uniform paradigm, namely graph modeling, for both the scenario at large and the internals of each activity. Our contributions can be listed as follows:
- First, we briefly present a logical model that includes the data stores, ETL activities and their constituent parts. An activity is defined as an entity with (possibly more than one) input schema(ta), an output schema, a rejection schema for the rows that do not pass the criteria of the activity and a parameter schema, so that the activity is populated each time with its proper parameter values.
- Second, we show how this model is reduced to a graph, which we call the Architecture Graph. We model all the aforementioned entities as nodes and four different kinds of relationships as edges. These relationships involve (a) type checking information (i.e., which type an entity corresponds to), (b) part-of relationships (e.g., which activity an attribute belongs to), (c) regulator relationships, covering the population of the parameters of the activities from attributes or constant values and (d) provider relationships, covering the flow of data from providers to consumers.
- Finally, we provide results on the exploitation of the Architecture Graph. First, we provide several simple graph transformations that reduce the complexity of the graph. For example, we give a simple algorithm for zooming out the graph, a transformation that can be very useful for the visualization of the graph. Second, we measure the importance and vulnerability of the nodes of the graph through specific importance metrics, namely dependence and responsibility. Dependence stands for the degree to which a node is bound to other entities that provide it with data, and responsibility measures the degree up to which other nodes of the graph depend on the node under consideration.
The reader is strongly encouraged to refer to the long version of the paper [VaSS02a] for further results and discussions that we cannot present here for lack of space.
This paper is organized as follows. Section 2 presents how an ETL scenario can be modeled as a graph. Section 3 describes the exploitation of the architecture graph through several useful transformations and treats the design quality of an ETL scenario via this graph. In Section 4 we conclude our results.

2 The Architecture Graph of an ETL Scenario

In this section, we first give a formal definition of activities, recordsets and other constituents of an ETL scenario. The full layout of such an ETL scenario can be modeled by a graph, which we call the Architecture Graph. Apart from this static description, the architecture graph also captures the data flow within the ETL environment. At the same time, the information on the typing of the involved entities and the regulation of the execution of a scenario, through specific parameters, is also covered.

2.1 Preliminaries

Being a graph, the Architecture Graph of an ETL scenario comprises nodes and edges. The involved data types, function types, constants, attributes, activities, recordsets and functions constitute the nodes of the graph. In the sequel, we summarize the definition of these elements and refer the interested reader to [VaSS02a] for more details.
- Data types. Each data type T is characterized by a name and a domain, i.e., a countable set of values. The values of the domains are also referred to as constants.
- Attributes. Attributes are characterized by their name and data type. Attributes and constants are uniformly referred to as terms.
- Schemata. A schema is a finite list of attributes. Each entity that is characterized by one or more schemata will be called a Structured Entity.
- RecordSets. A recordset is characterized by its name, its (logical) schema and its (physical) extension (i.e., a finite set of records under the recordset schema). As mentioned in
[VQVJ01], we can treat any data structure as a recordset provided that there are the means to logically restructure it into a flat, typed record schema. In the rest of this paper, we will mainly deal with the two most popular types of recordsets, namely relational tables and record files.
- Functions. A Function Type comprises a name, a finite list of parameter data types, and a single return data type. A function is an instance of a function type.
- Elementary Activities. In our framework, activities are logical abstractions representing parts, or full modules, of code. We employ an abstraction of the source code of an activity, in the form of an SQL statement, in order to avoid dealing with the peculiarities of a particular programming language. An Elementary Activity is formally described by the following elements:
  - Name: a unique identifier for the activity.
  - Input Schemata: a finite set of one or more input schemata that receive data from the data providers of the activity.
  - Output Schema: the schema that describes the placeholder for the rows that pass the check performed by the elementary activity.
  - Rejections Schema: a schema that describes the placeholder for the rows that do not pass the check performed by the activity, or whose values are not appropriate for the performed transformation.
  - Parameter List: a set of pairs which act as regulators for the functionality of the activity (the target attribute of a foreign key check, for example). The first component of the pair is a name and the second is a schema, an attribute, a function or a constant.
  - Output Operational Semantics: an SQL statement describing the content passed to the output of the operation, with respect to its input. This SQL statement defines (a) the operation performed on the rows that pass through the activity and (b) an implicit mapping between the attributes of the input schema(ta) and the respective attributes of the output schema.
  - Rejection Operational Semantics: an SQL statement describing the rejected records, in a sense similar to the Output Operational Semantics. This statement is by default considered to be the negation of the Output Operational Semantics, except if explicitly defined differently.
To fully capture the characteristics and interactions of the static entities mentioned previously, we model the different kinds of their relationships as the edges of the graph. Here, we list these types of relationships along with the related terminology that we will employ for the rest of the paper.
- Part-of relationships. These relationships involve attributes and parameters and relate them to the respective activity, recordset or function to which they belong.
- Instance-of relationships. These relationships are defined among a data/function type and its instances.
- Provider relationships. These are 1:N relationships that involve attributes with a provider-consumer relationship. The flow of data from the data sources towards the data warehouse is performed through the composition of activities in a larger scenario. In this context, the input for an activity can be either a persistent data store, or another activity, i.e., any structured entity under a specific schema. Provider relationships capture the mapping between the attributes of the schemata of the involved entities. Note that a consumer attribute can also be populated by a constant, in certain cases.
- Regulator relationships. These relationships are defined among the parameters of activities and the terms that populate these activities.
- Derived provider relationships. A special case of provider relationships that occurs whenever output attributes are computed through the composition of input attributes and parameters. Derived provider relationships can be deduced from a simple rule and do not originally constitute a part of the graph.
We assume the infinitely countable, mutually disjoint sets of names listed in column Model-specific of Fig. 2.1. As far as a specific scenario is concerned, we assume their respective finite subsets, depicted in column Scenario-specific of Fig. 2.1. Data types, function types and constants are considered built-ins of the system, whereas the rest of the entities are provided by the user (user-provided). Formally, let
G(V,E) be the Architecture Graph of an ETL scenario. Then, V = D ∪ F ∪ C ∪ Ω ∪ Φ ∪ S ∪ RS ∪ A and E = Pr ∪ Po ∪ Io ∪ Rr ∪ Dr. The graphical notation for the Architecture Graph is depicted in Fig. 2.2.

  Entity                           Model-specific   Scenario-specific
  Built-in:
    Data Types                     D                D^I
    Function Types                 F                F^I
    Constants                      C                C^I
  User-provided:
    Attributes                     Ω                Ω^I
    Functions                      Φ                Φ^I
    Schemata                       S                S^I
    RecordSets                     RS               RS^I
    Activities                     A                A^I
    Provider Relationships         Pr               Pr^I
    Part-Of Relationships          Po               Po^I
    Instance-Of Relationships      Io               Io^I
    Regulator Relationships        Rr               Rr^I
    Derived Provider Relationships Dr               Dr^I
Fig. 2.1 Formal definition of domains and notation

  Data types: black ellipses (e.g., Integer)
  RecordSets: cylinders (e.g., R)
  Function types: black squares (e.g., $2)
  Functions: gray squares (e.g., my$2)
  Constants: black circles (e.g., 1)
  Parameters: white squares (e.g., rate)
  Attributes: hollow ellipsoid nodes
  Activities: triangles (e.g., SK)
  Part-of relationships: simple edges annotated with a diamond*
  Instance-of relationships: dotted arrows (from the instance towards the type)
  Regulator relationships: dotted edges
  Provider relationships: bold solid arrows (from provider to consumer)
  Derived provider relationships: bold dotted arrows (from provider to consumer)
  * We annotate the part-of relationship among a function and its return type with a directed edge, to distinguish it from the rest of the parameters.
Fig. 2.2 Graphical notation for the Architecture Graph

2.2 Constructing the Architecture Graph

In this subsection we will describe the precise structure of the architecture graph, based on the theoretical foundations and the graphical notation of the previous subsection. Clearly, we do not anticipate a manual construction of the entire graph by the designer; rather, we anticipate that the designer will need to specify only the high-level parts of the Architecture Graph during the construction of an ETL scenario. In general, this process can be facilitated by a graphical tool or a declarative language; we refer the interested reader to [VVS+01] for an example of both these alternatives.
To motivate our discussion we will present an example involving (a) two source databases S1 and S2; (b) a central data warehouse DW and (c) a Data Staging Area (DSA), where all the transformations take place. The scenario involves the propagation of data from the table PARTSUPP(,,,) of source S1 as well as from the table PARTSUPP(,,
) of source S2 to the data warehouse. Table DW.PARTSUPP(, SUP,,,) stores information for the available quantity () and cost () of parts () per supplier (SUP). Practically, the two data sources S1 and S2 stand for the two suppliers of the data warehouse. The full-scale example can be found in the long version of the paper [VaSS02a]. Here we will employ only a small part of it, where the data from source S1 have just been transferred to the DSA in an intermediate table DS.PS1(,,,). We will particularly focus on two activities of the flow from table DS.PS1 towards the target table DW.PARTSUPP.
1. In order to keep track of the supplier of each row (and ultimately populate attribute SUP of the table DW.PARTSUPP), we need to add a flag attribute, namely SUP, indicating 1 or 2 for the respective supplier. In our case, this is achieved through the activity Add_SPK1.
2. Then, we need to assign a surrogate key on attribute . In the data warehouse context, it is common practice to replace the keys of the production systems with a uniform key, which we call a surrogate key [KRRT98]. The basic reasons for this replacement are performance and semantic homogeneity. Textual attributes are not the best candidates for indexed keys and thus need to be replaced by integer keys. At the same time, different production systems might use different keys for the same object, or the same key for different objects, resulting in the need for a global replacement of these values in the data warehouse. This replacement is performed through a lookup table of the form L(PRODKEY,, SKEY). The column is due to the fact that there can be synonyms in the different sources, which are mapped to different objects in the data warehouse. In our case, the activity that performs the surrogate key assignment for the attribute is SK1, using the lookup table LOOKUP_PS(,, SKEY).
Attributes and part-of relationships. The first thing to incorporate in the architecture graph is the structured entities (activities and recordsets) along with all the attributes of their schemata. We choose to avoid overloading the notation by incorporating the schemata per se; instead, we apply a direct part-of relationship between an activity node and the respective attributes. We annotate each such relationship with the name of the schema (by default, we assume an IN, OUT, PAR or REJ tag to denote whether the attribute belongs to the input, output, parameter or rejection schema of the activity). Naturally, if the activity involves more than one input schemata, the relationship is tagged with an INi tag for the i-th input schema. Then, we incorporate the functions along with their respective parameters and the part-of relationships among the former and the latter. We annotate the part-of relationship with the return type with a directed edge, to distinguish it from the rest of the parameters.
Fig. 2.3a depicts the decomposition of the recordsets DS.PS1, LOOKUP_PS and the activities Add_SPK1 and SK1 into the attributes of their input and output schemata. Note the tagging of the schemata of the involved activities. We do not consider the rejection schemata in order to avoid crowding the picture. At the same time, the function Add_const1 is decomposed into its parameters. This function belongs to the function type ADD_CONST and comprises two parameters: in and out. The former receives an integer as input and the latter propagates it towards the SUP attribute, in order to trace the fact that the rows come from source S1.
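The entities and relationships described so far lend themselves to a very direct encoding. The following is a minimal, illustrative sketch in Python (not part of the original paper) of how the Architecture Graph could be represented, with nodes identified by strings and one edge set per relationship kind; the names DS.PS1, Add_SPK1, SK1, LOOKUP_PS, Add_const1, SUP and SKEY appear in the running example, whereas PKEY, QTY and COST are hypothetical stand-ins for attribute names elided in the text.

    # Illustrative sketch only: the Architecture Graph as typed edge sets over string node ids.
    class ArchitectureGraph:
        def __init__(self):
            self.nodes = set()
            # one edge set per relationship kind:
            # Po = part-of, Io = instance-of, Pr = provider, Rr = regulator, Dr = derived provider
            self.edges = {kind: set() for kind in ("Po", "Io", "Pr", "Rr", "Dr")}

        def add_edge(self, kind, source, target, tag=None):
            """Add a directed edge of the given kind; tag carries schema annotations (IN, OUT, PAR, REJ)."""
            self.nodes.update((source, target))
            self.edges[kind].add((source, target, tag))

    g = ArchitectureGraph()

    # part-of: the recordset DS.PS1 owns its attributes (PKEY, QTY, COST are hypothetical names)
    for attr in ("PKEY", "QTY", "COST"):
        g.add_edge("Po", "DS.PS1", "DS.PS1." + attr)

    # part-of: the activity Add_SPK1 owns its input and output schema attributes
    g.add_edge("Po", "Add_SPK1", "Add_SPK1.IN.PKEY", tag="IN")
    g.add_edge("Po", "Add_SPK1", "Add_SPK1.OUT.PKEY", tag="OUT")
    g.add_edge("Po", "Add_SPK1", "Add_SPK1.OUT.SUP", tag="OUT")

    # provider: data flows from the recordset attribute to the activity input,
    # and from the input schema to the output schema
    g.add_edge("Pr", "DS.PS1.PKEY", "Add_SPK1.IN.PKEY")
    g.add_edge("Pr", "Add_SPK1.IN.PKEY", "Add_SPK1.OUT.PKEY")

    # regulator: the constant 1 populates the in parameter of the function Add_const1
    g.add_edge("Rr", "1", "Add_const1.in")

Under such a representation, each relationship kind of Section 2.1 corresponds to one edge set, and the transformations and metrics of Section 3 reduce to simple set manipulations over these sets.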
Note also how the parameters of the two activities are incorporated in the architecture graph. For the case of activity Add_SPK1 the involved parameters are the parameters in and out of the employed function. For the case of activity SK1 we have five parameters: (a), which stands for the production key to be replaced; (b), which stands for an integer value that characterizes which source's data are processed; (c) LU_, which stands for the attribute of the lookup table which contains the production keys; (d) LU_, which stands for the attribute of the lookup table which contains the source value (corresponding to the aforementioned parameter); (e) LU_SKEY, which stands for the attribute of the lookup table which contains the surrogate keys.
Data types and instance-of relationships. Instantiation relationships are depicted as dotted arrows that stem from the instances and head towards their data/function types. In Fig. 2.3b, we observe the attributes of the two activities of our example and their correspondence to two data types, namely Integer and US_Date. For reasons of presentation, we merge several instantiation edges so that the figure does not become too crowded. At the bottom of Fig. 2.3b, we can also see that function Add_const1 is an instance of the function type ADD_CONST.
Parameters and regulator relationships. In this case, we link the parameters of the activities to the terms (attributes or constants) that populate them. We depict regulator relationships with simple dotted
edges. In the example of Fig. 2.3a we can observe how the parameters of the two activities are populated. First, we can see that activity Add_SPK1 receives an integer (1) as its input and uses the function Add_const1 to populate its attribute SUP. The parameters in and out are mapped to the respective terms through regulator relationships. The same applies also for activity SK1. All its parameters, namely ,, LU_, LU_ and LU_SKEY, are mapped to the respective attributes of either the activity's input schema or the employed lookup table LOOKUP_PS.
Fig. 2.3 Relationships of the architecture graph: (a) Part-of, provider and regulator relationships; (b) Instance-of relationships
The parameter LU_SKEY deserves particular attention. This parameter is (a) populated from the attribute SKEY of the lookup table and (b) used to populate the attribute SKEY of the output schema of the activity. Thus, two regulator relationships are related with parameter LU_SKEY, one for each of the aforementioned attributes. The existence of a regulator relationship among a parameter and an output attribute of an activity normally denotes that some external data provider is employed in order to derive a new attribute through the activity, via the respective parameter.
Provider relationships. As already mentioned, the provider relationships capture the data flow from the sources towards the target recordsets in the data warehouse. Provider relationships are depicted with bold solid arrows that stem from the provider and end in the consumer attribute. Observe Fig. 2.3a. The flow starts from table DS.PS1 of the data staging area. Each of the attributes of this table is mapped to an attribute of the input schema of activity Add_SPK1. The attributes of the input schema of the latter are subsequently mapped to the attributes of the output schema of the activity. The flow continues from activity Add_SPK1 towards the activity SK1 in a similar manner. Note that, for the moment, we have not covered how the output of function Add_const1 populates the output attribute SUP of the activity Add_SPK1, or how the parameters of activity SK1 populate the output attribute SKEY. This shortcoming is compensated through the usage of derived provider relationships, which we will introduce in the sequel.
Another interesting observation is that during the data flow, new attributes are generated, resulting in new streams of data, whereas the flow seems to stop for other attributes. Observe the rightmost part of Fig. 2.3a, where the values of attribute are not further propagated (remember that the reason for the application of a surrogate key transformation is to replace the production keys of the source data with a homogeneous surrogate for the records of the data warehouse, which is independent of the source they have been collected from). Instead of the values of the production key, the values from the attribute SKEY will be used to denote the unique identifier for a part in the rest of the flow.
Derived provider relationships. As we have already mentioned, there are certain output attributes that are computed through the composition of input attributes and parameters. A derived provider relationship is another form of provider relationship that captures the flow from the input to the
respective output attributes. Formally, assume that source is a term in the architecture graph, target is an attribute of the output schema of an activity A, and x, y are parameters in the parameter list of A (x and y need not necessarily be different from each other). Then, a derived provider relationship pr(source, target) exists iff the following regulator relationships (i.e., edges) exist: rr1(source, x) and rr2(y, target).
Fig. 2.4 Derived provider relationships of the architecture graph
Intuitively, the case of derived relationships models the situation where the activity computes a new attribute in its output. In this case, the produced output depends on all the attributes that populate the parameters of the activity, resulting in the definition of the corresponding derived relationship. Observe Fig. 2.4, where we depict a small part of our running example. The legend on the left side of Fig. 2.4 depicts how the attributes that populate the parameters of the activity are related through derived provider relationships with the computed output attribute SKEY. The meaning of these five relationships is that SK1..SKEY is not computed only from attribute LOOKUP_PS.SKEY, but from the combination of all the attributes that populate the parameters.

3. Exploitation of the Architecture Graph

In this section, we provide results on the exploitation of the Architecture Graph for several tasks. Specifically, in Section 3.1, we give a simple algorithm for zooming out the graph, a transformation that can be very useful for its visualization. Also, we give a simple algorithm that returns a subgraph involving only the critical entities for the population of the target recordsets of the scenario. Then, in Section 3.2, we measure the importance and vulnerability of the nodes of the graph through specific importance metrics, namely dependence and responsibility. Dependence stands for the degree to which an entity is bound to other entities that provide it with data, and responsibility measures the degree up to which other nodes of the graph depend on the node under consideration. Dependence and responsibility are crucial measures for the engineering of the evolution of the ETL environment.

3.1 Graph Transformations

In this subsection, we will show how we can employ trivial transformations in order to eliminate the detailed information on the attributes involved in an ETL scenario. Each transformation that we discuss provides a view on the graph. These views are simply subgraphs with a specific purpose. One would normally expect that the simplest view is produced by restricting the subgraph to only one of the four major types of relationships (provider, part-of, instance-of, regulator). We consider this as trivial, and we proceed to present two transformations, involving (a) how we can zoom out the architecture
graph, in order to eliminate the information overflow which can be caused by the vast number of involved attributes in a scenario, and (b) how we can obtain a critical subgraph of the Architecture Graph that includes only the entities necessary for the population of the target recordsets of the scenario.
Zooming In and Out the Architecture Graph. We give a practical zoom-out transformation that involves provider and regulator relationships. We constrain the algorithm of Fig. 3.1 to a local transformation, i.e., we consider zooming out only a single activity or recordset. This can easily be generalized for the whole scenario, too. Assume a given structured entity (activity or recordset) A. The transformation Zoom_Out of Fig. 3.1 detects all the edges of its attributes. Then all these edges are transferred to link the structured entity A (instead of its attributes) with the corresponding nodes. We consider only edges that link an attribute of A to some node external to A, in order to avoid local cycles in the graph. Finally, we remove the attribute nodes of A and the remaining internal edges. Note that there is no loss of information due to this transformation, in terms of interdependencies between objects. Moreover, we can apply this local transformation to the full extent of the graph, involving the attributes of all the recordsets and activities of the scenario (a small executable sketch of this transformation is given below).
Major Flow. In a different kind of zooming, we can follow the major flow of data from the sources to the targets. We follow a backward technique. Assume a set T of target recordsets. Then, by recursively following the provider and regulator edges we can deduce the critical subgraph that models the flow of data from the sources towards the critical part of the data warehouse. We incorporate the part-of relationships too, but we choose to ignore instantiation information. The transformation Major_Flow is shown in Fig. 3.2.

Transformation Zoom_Out
Input:  the architecture graph G(V,E) and a structured entity A
Output: a new architecture graph G'(V',E')
Begin
  G' = G;
  ∀ node t ∈ V s.t. (A,t) ∈ Po, ∀ node x ∈ V s.t. (A,x) ∉ Po {
    ∀ edge (t,x) ∈ Pr : Pr' = Pr' ∪ {(A,x)} − {(t,x)};
    ∀ edge (x,t) ∈ Pr : Pr' = Pr' ∪ {(x,A)} − {(x,t)};
    ∀ edge (t,x) ∈ Rr : Rr' = Rr' ∪ {(A,x)} − {(t,x)};
  }
  ∀ node t ∈ V s.t. (A,t) ∈ Po, ∀ node x ∈ V s.t. (A,x) ∈ Po {
    ∀ edge (t,x) ∈ Pr : Pr' = Pr' − {(t,x)};
    ∀ edge (x,t) ∈ Pr : Pr' = Pr' − {(x,t)};
    ∀ edge (t,x) ∈ Rr : Rr' = Rr' − {(t,x)};
    remove t;
  }
End
Fig. 3.1 Zoom_Out transformation

Transformation Major_Flow
Input:  the architecture graph G(V,E) and the set of target recordsets T
Output: a subgraph G'(V',E') containing information for the major flow of data from sources to targets
Begin
  Let TΩ be the set of attributes of all the recordsets of T;
  V' = T ∪ TΩ; E' = ∅;
  do {
    V'' = ∅;
    ∀ t ∉ V', ∀ a ∈ V', edge e = (t,a) ∈ Pr : { V'' = V'' ∪ {t}; E' = E' ∪ {e} };
    ∀ t ∉ V', ∀ a ∈ V', edge e = (t,a) ∈ Po : { V'' = V'' ∪ {t}; E' = E' ∪ {e} };
    ∀ t ∉ V', ∀ a ∈ V', edge e = (a,t) ∈ Po : { V'' = V'' ∪ {t}; E' = E' ∪ {e} };
    ∀ t ∉ V', ∀ a ∈ V', edge e = (t,a) ∈ Rr : { V'' = V'' ∪ {t}; E' = E' ∪ {e} };
    V' = V' ∪ V'';
  } while V'' ≠ ∅;
End
Fig. 3.2 Major_Flow transformation

3.2 Importance Metrics

One of the major contributions that our graph-modeling approach offers is the ability to treat the scenario as the skeleton of the overall environment. If we treat the problem from its software engineering perspective, the interesting problem is how to design the scenario in order to achieve effectiveness, efficiency and tolerance to the impacts of evolution. In this subsection, we will assign simple importance metrics to the nodes of the graph, in order to measure how crucial their existence is for the successful execution of the scenario.
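Before turning to these metrics, the Zoom_Out transformation of Fig. 3.1 can be made concrete with a small Python sketch. It is illustrative only, not the paper's implementation: it operates on plain sets of (source, target) pairs, one per relationship kind (a simplified variant of the representation sketched in Section 2.2), and PKEY is again a hypothetical attribute name.

    def zoom_out(po, pr, rr, entity):
        """Local Zoom_Out (cf. Fig. 3.1): replace the attributes of `entity` by the entity node,
        redirecting provider (pr) and regulator (rr) edges that cross the entity boundary and
        dropping edges internal to the entity. po, pr, rr are sets of (source, target) pairs."""
        attrs = {target for (source, target) in po if source == entity}

        def lift(edges):
            lifted = set()
            for (source, target) in edges:
                s = entity if source in attrs else source
                t = entity if target in attrs else target
                if s != t:                      # edges internal to the entity disappear
                    lifted.add((s, t))
            return lifted

        return lift(pr), lift(rr)

    # toy fragment of the running example
    po = {("Add_SPK1", "Add_SPK1.IN.PKEY"), ("Add_SPK1", "Add_SPK1.OUT.PKEY")}
    pr = {("DS.PS1.PKEY", "Add_SPK1.IN.PKEY"),
          ("Add_SPK1.IN.PKEY", "Add_SPK1.OUT.PKEY"),
          ("Add_SPK1.OUT.PKEY", "SK1.IN.PKEY")}
    new_pr, new_rr = zoom_out(po, pr, set(), "Add_SPK1")
    # new_pr now contains ('DS.PS1.PKEY', 'Add_SPK1') and ('Add_SPK1', 'SK1.IN.PKEY')

The Major_Flow transformation of Fig. 3.2 can be obtained in the same spirit, as a backward fixpoint over the provider, part-of and regulator edge sets starting from the target recordsets.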
Consider the subgraph G'(V',E') that includes only the part-of, provider and derived provider relationships among attributes. In the rest of this subsection, we will not discriminate between provider and derived provider relationships and will use the term provider for both. For each node A, we can define the following measures:
- Local dependency: the in-degree of the node with respect to the provider edges;
- Local responsibility: the out-degree of the node with respect to the provider edges;
- Local degree: the degree of the node with respect to the provider edges (i.e., the sum of the previous two entries).
Intuitively, the local dependency characterizes the number of nodes that have to be activated in order to populate a certain node. The local responsibility has the reverse meaning, i.e., how many nodes wait for the node under consideration to be activated, in order to receive data. The sum of the aforementioned quantities characterizes the total involvement of the node in the scenario. Except for the local interrelationships, we can always consider what happens with the transitive closure of relationships. Assume the set (Pr ∪ Dr)+, which contains the transitive closure of provider edges. We define the following measures:
- Transitive dependency: the in-degree of the node with respect to the provider edges;
- Transitive responsibility: the out-degree of the node with respect to the provider edges;
- Transitive degree: the degree of the node with respect to the provider edges;
- Total dependency: the sum of the local and transitive dependency measures;
- Total responsibility: the sum of the local and transitive responsibility measures;
- Total degree: the sum of the local and transitive degrees.
Fig. 3.3 Transitive closure for the entities of Fig. 2.4 and corresponding importance metrics (local, transitive and total dependency, responsibility and degree per attribute, with sums and averages per attribute and per entity)
Consider the example of Fig. 3.3, where we have computed the transitive closure of provider edges for the example of Fig. 2.4. We use standard notation for provider and derived provider edges and depict the edges of transitive relationships with simple dotted arrows. Fig. 3.3 also depicts the aforementioned metrics for the involved attributes. By comparing the individual values with the average ones, one can clearly see that attribute SK1..SKEY, for example, is the most vulnerable attribute of all, since it depends directly on several provider attributes. Other interesting usages of the aforementioned measures include:
- Detection of inconsistencies for attribute population. For example, if some output attribute in a union activity is not populated by two input attributes, or, more importantly, if an integrity constraint is violated with regard to the population of an attribute of a target data store (i.e., having in-degree equal to zero).
- Detection of important data stores. Clearly, data stores whose attributes have a positive out-degree are used for data propagation. Data stores with attributes having both positive in- and out-degrees are transitive recordsets, which we use during the flow. Once a zoom-out operation has been performed, data stores with in-degree greater than one are hot-spots, since they are related with more than one application.
- Detection of useless (source) attributes. Any attribute having total responsibility equal to zero is useless for the cause of data propagation towards the targets.
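As an illustration of these measures, the sketch below is one possible reading of the definitions above (not the authors' code): it computes the local measures from a set of direct (provider, consumer) attribute pairs and the transitive measures over the transitive closure of those pairs; the attribute names reuse the hypothetical ones of the earlier sketches.

    def transitive_closure(edges):
        """All (a, b) pairs such that b is reachable from a over the given provider edges."""
        succ = {}
        for (s, t) in edges:
            succ.setdefault(s, set()).add(t)
        closure = set()
        for start in succ:
            stack, seen = [start], set()
            while stack:
                node = stack.pop()
                for nxt in succ.get(node, ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        closure.add((start, nxt))
                        stack.append(nxt)
        return closure

    def importance_metrics(provider_edges, node):
        """Local measures count direct provider edges; transitive measures count edges of the closure."""
        closure = transitive_closure(provider_edges)
        local_dep = sum(1 for (s, t) in provider_edges if t == node)   # in-degree
        local_resp = sum(1 for (s, t) in provider_edges if s == node)  # out-degree
        trans_dep = sum(1 for (s, t) in closure if t == node)
        trans_resp = sum(1 for (s, t) in closure if s == node)
        return {"local dependency": local_dep,
                "local responsibility": local_resp,
                "local degree": local_dep + local_resp,
                "transitive dependency": trans_dep,
                "transitive responsibility": trans_resp,
                "total dependency": local_dep + trans_dep,
                "total responsibility": local_resp + trans_resp}

    # toy usage: three attributes feeding the surrogate-key output of SK1
    edges = {("LOOKUP_PS.SKEY", "SK1.OUT.SKEY"),
             ("DS.PS1.PKEY", "SK1.IN.PKEY"),
             ("SK1.IN.PKEY", "SK1.OUT.SKEY")}
    print(importance_metrics(edges, "SK1.OUT.SKEY"))

Under this reading, attributes whose total responsibility is zero are candidates for the useless-attribute check above, while attributes with high total dependency, such as the surrogate-key output in the toy example, are the most vulnerable to changes in their providers.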
4. Conclusions

In this paper, we have focused on the logical design of the ETL scenario of a data warehouse. Based on a formal logical model that includes the data stores, activities and their constituent parts, we model an ETL scenario as a graph, which we call the Architecture Graph. We model all the aforementioned entities as nodes and four different kinds of relationships (instance-of, part-of, regulator and provider relationships) as edges. We have provided several simple graph transformations that reduce the complexity of the graph. Finally, we have introduced specific importance metrics, namely dependence and responsibility, to measure the degree to which an entity is bound to other entities that provide it with data and the degree up to which other nodes of the graph depend on the node under consideration.
The results of this paper are part of a larger project on the design and management of ETL activities [VaSS02]. As future work, we already have preliminary results for the optimization of ETL scenarios under certain time and throughput constraints. A set of loosely coupled tools is also under construction for the purposes of visualization of ETL scenarios and optimization of their execution.

References

[Arde01] Ardent Software. DataStage Suite.
[BoDS00] V. Borkar, K. Deshmukh, S. Sarawagi. Automatically Extracting Structure from Free Text Addresses. Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
[BoFM99] M. Bouzeghoub, F. Fabret, M. Matulovic. Modeling Data Warehouse Refreshment Process as a Workflow Application. In Proc. Intl. Workshop on Design and Management of Data Warehouses (DMDW'99), Heidelberg, Germany, 1999.
[Data01] DataMirror Corporation. Transformation Server.
[ETI01] Evolutionary Technologies Intl. ETI*EXTRACT.
[GFSS00] H. Galhardas, D. Florescu, D. Shasha, E. Simon. AJAX: An Extensible Data Cleaning Tool. In Proc. ACM SIGMOD Intl. Conf. on the Management of Data, p. 590, Dallas, Texas, 2000.
[KRRT98] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. John Wiley & Sons, 1998.
[Micr01] Microsoft Corp. MS Data Transformation Services.
[Mong00] A. Monge. Matching Algorithms Within a Duplicate Detection System. Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
[RaDo00] E. Rahm, H. Do. Data Cleaning: Problems and Current Approaches. Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
[RaHe01] V. Raman, J. Hellerstein. Potter's Wheel: An Interactive Data Cleaning System. In Proc. 27th Intl. Conf. on Very Large Data Bases (VLDB), Roma, Italy, 2001.
[ShTy98] C. Shilakes, J. Tylman. Enterprise Information Portals. Enterprise Software Team, 1998.
[Vass00] P. Vassiliadis. Gulliver in the Land of Data Warehousing: Practical Experiences and Observations of a Researcher. In Proc. 2nd Intl. Workshop on Design and Management of Data Warehouses (DMDW), Stockholm, Sweden, 2000.
[VaSS02] P. Vassiliadis, A. Simitsis, S. Skiadopoulos. The Arktos II Project.
[VaSS02a] P. Vassiliadis, A. Simitsis, S. Skiadopoulos. Modeling ETL Activities as Graphs (extended version).
[VQVJ01] P. Vassiliadis, C. Quix, Y. Vassiliou, M. Jarke. Data Warehouse Process Management. Information Systems, 26(3), June 2001.
[VVS+01] P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, T. Sellis.
ARKTOS: Towards the modeling, design, control and execution of ETL processes. Information Systems, 26(8), December 2001, Elsevier Science Ltd.
