Capturing Database Transformations for Big Data Analytics David Sergio Matusevich University of Houston
Organization Introduction Classification Program Case Study Conclusions
Extending ER Models to Capture Database Transformations to Build Data Sets for Data Mining Analysis
Carlos Ordonez, Sofian Maabout, David Sergio Matusevich, Wellington Cabrera. Extending ER Models to Capture Database Transformations to Build Data Sets for Data Mining, Data & Knowledge Engineering (DKE), 2014, Elsevier.
Data Mining Projects Data mining projects usually require the preparation of a data set specially created to answer the particular question asked by the user. For example, given the database of a cellular phone company, we might ask: what percentage of users will change data plans with the advent of a new smartphone? This will require the creation of a data set where at least one of the columns will be CHANGED_PLAN and another one could be BEFORE_NEW_DEVICE. Many of the intermediate tables created for this project might remain in the database, using resources.
Motivation: Saving Work Different users might ask similar questions, leading to the creation of tables that are virtually identical, cluttering the system. For instance, a researcher might want to answer the question: what percentage of men between the ages of 18 and 35 will change data plans when a new device is introduced? If the researcher is not aware of the previous user's project, some of the intermediate tables created might be exact duplicates of the ones created before.
Contribution In this work we present: A classification of the most common transformations used in data mining; An extension of the ER Model to keep track of the intermediate tables created; and A tool designed to: Simplify the use of naming conventions, Keep track of attributes and keys, and Facilitate the recognition of duplicate tables.
Building a data set for data mining Building a data mining dataset involves successive rounds of aggregation and denormalization.
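These successive rounds can be illustrated with a toy example (the schema, table names, and data below are our own assumptions for illustration, not the paper's example; SQLite stands in for the target DBMS):

```python
import sqlite3

# Assumed toy schema: a customer table and an order table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (cid INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE orders (oid INTEGER PRIMARY KEY,
                     cid INTEGER REFERENCES customer(cid),
                     amount REAL);
INSERT INTO customer VALUES (1,'Houston'),(2,'Austin');
INSERT INTO orders VALUES (10,1,5.0),(11,1,7.5),(12,2,3.0);
""")

# Round 1 -- denormalization: bring customer attributes into the order rows.
cur.execute("""
CREATE TABLE t1 AS
SELECT o.oid, o.cid, c.city, o.amount
FROM orders o JOIN customer c ON o.cid = c.cid
""")

# Round 2 -- aggregation: one row per customer, the grain of the data set X.
cur.execute("""
CREATE TABLE x AS
SELECT cid, city, COUNT(*) AS n_orders, SUM(amount) AS total
FROM t1 GROUP BY cid, city
""")

rows = cur.execute("SELECT * FROM x ORDER BY cid").fetchall()
print(rows)  # [(1, 'Houston', 2, 12.5), (2, 'Austin', 1, 3.0)]
```

The intermediate table t1 is exactly the kind of transformation table that may linger in the database and that the paper proposes to track.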
Note: If the database is not static, the transformation tables must also be updated. This could be resource intensive, so updating can be deferred until the last minute; that is, a transformation table is only updated when it is reused. We also limit ourselves to transformations that happen inside the database. Transformations happening outside the database, such as those performed by Extract-Transform-Load (ETL) tools, are not considered here.
Model vs Theory
Entity-Relationship (ER) Model: Entities, Relationships.
Relational Model: Tables, Foreign Keys.
Database: D = (S, I), where S = {S1, S2, ..., Sn} is the set of tables and I is the set of integrity constraints.
The set of tables S is one of the inputs for the tool we wrote.
Well Formed Queries We define a well formed query as one that complies with the following requirements: Always produces a table with a primary key and a potentially empty set of non-key attributes. Each join operator is computed based on a foreign key and primary key from the referencing table and the referenced table, respectively.
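These two requirements can be checked mechanically. A minimal sketch (the query/schema description format and function name below are our own assumptions, not the paper's implementation):

```python
# A query description is well formed here if it declares a primary key and
# every join pairs a foreign key of the referencing table with the primary
# key of the table it references.

def is_well_formed(query, schema):
    """query: dict with 'pk' (list of key columns) and 'joins'
       (list of (fk_table, fk_col, referenced_table) triples).
       schema: dict table -> {'pk': [...], 'fks': {col: referenced_table}}."""
    if not query["pk"]:
        return False  # must produce a table with a primary key
    for fk_table, fk_col, referenced in query["joins"]:
        if schema[fk_table]["fks"].get(fk_col) != referenced:
            return False  # join condition is not FK -> referenced PK
    return True

# Schema mirroring the paper's sample database S1, S2, S3.
schema = {
    "S1": {"pk": ["I"], "fks": {}},
    "S2": {"pk": ["I", "J"], "fks": {"I": "S1"}},
    "S3": {"pk": ["K2"], "fks": {}},
}
q1 = {"pk": ["I", "J"], "joins": [("S2", "I", "S1")]}   # S1 JOIN S2 ON S1.I=S2.I
bad = {"pk": ["I"], "joins": [("S2", "A4", "S1")]}      # join on a non-key column
print(is_well_formed(q1, schema))   # True
print(is_well_formed(bad, schema))  # False
```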
Database Transformation Queries
Goal: to produce a single table X that will be used as input for a data mining algorithm.
Given S = {S1, S2, ..., Sn}, a set of source tables, and Q = q1, q2, ..., qm, a sequence of queries, the sequence of queries produces a set of transformation tables T = {T1, T2, ..., Tm}, where X = Tm (the result of qm).
Data Sets
Clearly, different projects will require the creation of different data sets. This results in a number of different data sets Xi, each associated with a query plan Qi:
Q1 = q1, q2, ..., qm -> X1
...
Qk = q1, q2, ..., qm -> Xk
The Transformation Tables
In order to allow for easy reuse, transformation tables must incorporate into their metadata:
- The query that created them.
- An indication of whether the entities come from a source table or another transformation table (provenance).
[ER diagram: aggregation entity T9 with attributes CustomerID, EmailPromotion, SalesOrderID, OdrMonth, StateProvinceCode, TerritoryID, MakeFlag, MaxProductLine, ..., Style; grouping columns marked PK]
Example SQL (aggregation producing T9):
SELECT CustomerID, EmailPromotion, SalesOrderID, OdrMonth, StateProvinceCode, TerritoryID, MakeFlag,
       max(taxamt) AS TaxGrpByOdrID, max(freight) AS FreightGrpByOdrID, max(totaldue) AS TotalDueGrpByOdrID,
       sum(orderqty) AS OdrQtyCycleGrpByOdrID, sum(linetotal) AS LineTotGrpByOdrID, sum(standardcost) AS StdCostGrpByOdrID,
       max(productline) AS MaxProductLine, max(size) AS SizeCycle, max(class) AS ClassCycle, max(style) AS Style
INTO T9
FROM T7
GROUP BY CustomerID, EmailPromotion, SalesOrderID, OdrMonth, StateProvinceCode, TerritoryID, MakeFlag;
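One possible way to record such metadata is a small catalog table; the sketch below is our own assumption (the catalog name and columns are illustrative, not the paper's implementation):

```python
import sqlite3

# Hypothetical metadata catalog: one row per transformation table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE transformation_catalog (
    table_name     TEXT PRIMARY KEY,
    creating_query TEXT,  -- the SQL that produced the table
    source_tables  TEXT,  -- provenance: comma-separated input tables
    source_kind    TEXT   -- 'source' or 'transformation'
)""")

# Register T9, which was aggregated from transformation table T7.
cur.execute("INSERT INTO transformation_catalog VALUES (?,?,?,?)",
            ("T9",
             "SELECT CustomerID, ..., max(style) AS Style INTO T9 "
             "FROM T7 GROUP BY CustomerID, ...",
             "T7",
             "transformation"))

row = cur.execute("""
SELECT source_tables, source_kind
FROM transformation_catalog WHERE table_name = 'T9'
""").fetchone()
print(row)  # ('T7', 'transformation')
```

A later project can then query the catalog before creating a table, which is the reuse scenario the slides motivate.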
Transformations
Proposition: Let T = T1 ⋈ T2 ⋈ ... ⋈ Tn be a transformation joining on appropriate foreign keys. Every query used to compute T either:
1. includes the primary key of T, which comes from some Ti, or
2. does not include the primary key of T, but includes a subset of k primary keys of k tables Ti, used later to compute group-by aggregations.
Proof sketch: All aggregation queries are assumed to have grouping columns in order to identify the object of study; that is, they are GROUP BY queries in SQL. Therefore every query must include the primary key of one of the joined tables in order to produce a data set. An aggregation must use keys to group rows (otherwise records cannot be identified and further processed), and the only available keys are foreign keys.
Classification of transformations
We distinguish two mutually exclusive database transformations:
1. Denormalization, which brings attributes from other entities into the transformation entity or simply combines existing attributes.
2. Aggregation, which creates a new attribute by grouping rows and computing a summarization.
Taxonomy:
- Denormalization: Direct; Derivation (Expression: Arithmetic, String, Date; Case)
- Aggregation: Count/Sum; Max/Min
The CASE statement
Example: SELECT ..., CASE WHEN A1='married' OR A2='employed' THEN 1 ELSE 0 END AS binaryvariable, ... FROM ...
The CASE statement does not have a relational algebra translation. It derives a binary attribute not present before in the database, and might even introduce NULLs.
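A runnable sketch of this behavior (using SQLite from Python; the toy table and its rows are assumptions for illustration, with attribute names following the slide's example):

```python
import sqlite3

# Assumed toy table with a NULL value to show CASE's NULL handling.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE s1 (i INTEGER PRIMARY KEY, a1 TEXT, a2 TEXT)")
cur.executemany("INSERT INTO s1 VALUES (?,?,?)",
                [(1, 'married', 'unemployed'),
                 (2, 'single',  'employed'),
                 (3, 'single',  None)])      # NULL in a2

rows = cur.execute("""
SELECT i,
       CASE WHEN a1='married' OR a2='employed' THEN 1 ELSE 0 END AS y
FROM s1 ORDER BY i
""").fetchall()
print(rows)  # [(1, 1), (2, 1), (3, 0)] -- the NULL comparison falls to ELSE
```

Note that for row 3 the condition evaluates to NULL (not true), so CASE falls through to the ELSE branch, which is one way NULLs silently shape the derived attribute.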
Sample Database
Source tables:
S1(I, A1, A2, A3, K1, K2): PK(I)
S2(I, J, A4, A5, A6, A7, K3): PK(I, J), FK(I references S1.I)
S3(K2, A8): PK(K2)
In this simple example, S1 could be a table of transactions, S2 a table of products, and S3 could contain details about the product.
Sample Script
/* q0: T0, universe (entry point) */
SELECT I, /* I is the record id, or point id mathematically */
  CASE WHEN A1='married' OR A2='employed' THEN 1 ELSE 0 END AS Y, /* binary target variable */
  A3 AS X1 /* 1st variable */
INTO T0 FROM S1;
/* q1: denormalize and filter valid records */
SELECT S2.I, S2.J, A4, A5, A6, A7, K2, K3
INTO T1 FROM S1 JOIN S2 ON S1.I=S2.I WHERE A6>10;
/* q2: aggregate */
SELECT I, sum(A4) AS X2, sum(A5) AS X3, max(1) AS K /* K is FK */
INTO T2 FROM T1 GROUP BY I;
/* q3: get min, max */
SELECT 1 AS K, min(X3) AS minX3, max(X3) AS maxX3
INTO T3 FROM T2;
/* q4: math transform */
SELECT I, log(X2) AS X2, /* 2nd variable */
  (X3-minX3)/(maxX3-minX3) AS X3 /* 3rd variable, range [0,1] */
INTO T4 FROM T2 JOIN T3 ON T2.K=T3.K; /* get the min/max */
/* q5: denormalize, gather attribute from referenced table S3 */
SELECT I, J, A7, A8
INTO T5 FROM T1 JOIN S3 ON T1.K2=S3.K2;
/* q6: aggregate with CASE */
SELECT I, sum(CASE WHEN A7='Y' THEN A8 ELSE 0 END) AS X4
INTO T6 FROM T5 GROUP BY I;
/* q7: data set, star join; this data set can be used for logistic regression, decision trees, SVM */
SELECT T0.I, X1, X2, X3, X4, Y
INTO X FROM T0 JOIN T4 ON T0.I=T4.I JOIN T6 ON T0.I=T6.I; /* output */
Transformations Tool The transformations tool consists of: A query parser that determines if a query is a denormalization or an aggregation, An attribute tracker, that determines the provenance of all attributes, A set of rules to determine keys and foreign keys of the transformation tables. The input for the tool is a query script and a list of source tables. The output is a list of transformation tables with type, attributes, keys and provenances clearly marked.
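The parser's type decision can be sketched as follows (an illustrative regular-expression heuristic written by us, not the actual C++ implementation; queries follow the slides' SQL style):

```python
import re

def classify_query(sql: str) -> str:
    """Crude classifier in the spirit of the tool: a query with a GROUP BY
    clause or aggregate functions is an Aggregation; otherwise it is a
    Denormalization. A real parser would build a syntax tree; this sketch
    only pattern-matches on the query text."""
    s = sql.lower()
    if re.search(r'\bgroup\s+by\b', s) or \
       re.search(r'\b(sum|count|min|max|avg)\s*\(', s):
        return "Aggregation"
    return "Denormalization"

# Queries q2 and q1 from the sample script:
print(classify_query(
    "SELECT I, sum(A4) AS X2, sum(A5) AS X3 INTO T2 FROM T1 GROUP BY I"))
# Aggregation
print(classify_query(
    "SELECT S2.I, S2.J, A4 INTO T1 FROM S1 JOIN S2 ON S1.I=S2.I"))
# Denormalization
```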
Remarks The tool was written in C++. For simplicity, it requires that queries follow a regular structure so that they can be parsed with regular expressions. The tool does not yet provide any feedback to the user regarding the queries being used; in a future iteration we plan to incorporate suggestions regarding naming and the existence of similar tables.
Tool Development (Flow Chart)
Input: script Q = {q1, q2, ..., qm}.
For each query qi:
1. Parse query qi and determine its type.
2. Check the database for a similar table; if one exists, rename the table and entities (if needed).
3. Otherwise, execute the query, write the new query to the log, and update the entity model with the new table.
4. After the last query, normalize names and create the query plan, producing the data set X.
Planned: the program should create a database of queries.
Tool Output
Denormalization: T0(I,Y,X1, PK(I), FK(S1.I));
Denormalization: T1(I,J,A4,A5,A6,A7,K2,K3, PK(I,J), FK(S2.I,S2.J), FK(S3.K2));
Aggregation: T2(I,X2,X3,K, PK(I), FK(S1.I));
Aggregation: T3(K,minX3,maxX3, PK(K));
Aggregation: T4(I,X2,X3, PK(I), FK(S1.I));
Denormalization: T5(I,J,A7,A8, PK(I,J), FK(S2.I,S2.J));
Aggregation: T6(I,X4, PK(I), FK(S1.I));
Denormalization: X(I,X1,X2,X3,X4,Y, PK(I), FK(S1.I));
Program Detail
Script:
SELECT I, CASE WHEN (A1='married' OR A2='employed') THEN 1 ELSE 0 END AS Y, A3 AS X1 INTO TABLE0 FROM S1;
Output:
Denormalization: T0(I,Y,X1,PK(I),FK(S1.I));
The output of the code identifies the type of transformation (denormalization or aggregation) and the attributes present in the new table, as well as information about keys and foreign keys. Furthermore, it changes the name of the table to a normalized name (TABLE0 becomes T0).
Denormalization
[ER diagram: source entities S1, S2, S3 and denormalization entities T0(I, Y, X1) and T1(I, J, A4, A5, A6, A7, K2, K3), with PK/FK attributes marked]
SELECT I, CASE WHEN A1='married' OR A2='employed' THEN 1 ELSE 0 END AS Y, A3 AS X1 INTO T0 FROM S1;
SELECT S2.I, S2.J, A4, A5, A6, A7, K2, K3 INTO T1 FROM S1 JOIN S2 ON S1.I=S2.I WHERE A6>10;
Aggregation
[ER diagram: source entities S1, S2, S3 and aggregation entities T2(I, K, X2, X3) and T6(I, X4), with PK/FK attributes marked]
SELECT I, sum(A4) AS X2, sum(A5) AS X3, max(1) AS K INTO T2 FROM T1 GROUP BY I;
SELECT I, sum(CASE WHEN A7='Y' THEN A8 ELSE 0 END) AS X4 INTO T6 FROM T5 GROUP BY I;
Future Extensions
- Extend the program to search the database for transformation tables that might have already been created. This would allow considerable savings in time and resources when preparing data sets.
- Incorporate the tool as a plugin for a major DBMS.
- Create a plugin for modeling software to show the new tables created, as well as the metadata stored when using the program.
- Introduce a work-flow chart for the query plan.
Conclusions
- A minimal extension to the ER model to represent data transformations in an ER diagram.
- An algorithm to extend an existing ER model, keeping the data set in mind as the final goal.
- Helps analysts reuse existing tables or views.
- Helps in understanding complex SQL queries at a high level.
- Our work bridges the gap between a logical database model, represented by a standard ER model, and a physical database model, represented by SQL queries.
The AdventureWorks Database