Prototype of a Web ETL Tool

Matija Novak, Kornelije Rabuzin
Faculty of Organization and Informatics, University of Zagreb, Varaždin, Croatia

Abstract - Extract, transform and load (ETL) is a process that makes it possible to extract data from operational data sources, to transform data in the way needed for data warehousing purposes and to load data into a data warehouse (DW). The ETL process is the most important part of building a data warehouse. Because the ETL process is very complex and time consuming, this paper presents a prototype of a web ETL tool that offers the end user step-by-step guidance through the entire process. The tool is designed as a web application, so users save the time (and space) otherwise required for installation.

Keywords - ETL; data warehouse; web; ETL tool

I. INTRODUCTION

Databases (DB) have been used for many years and it is hard to imagine any (transaction) application that would not use some database. Over time people realized that databases, although they support daily operations, are not a good source when complex analysis must be made on data. Merging data from multiple tables, the complexity of the model as such, the inability of end users to generate reports and the (in)effectiveness of such an approach resulted in the need to reorganize (transform) data into a form that is suitable for analysis. This form is called a data warehouse [1, p. 85]. The basic idea of data warehouses is to store data in such a way that users can understand and analyze it. R. Kimball and J. Caserta define the data warehouse as follows: "A data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making." [2, p. 23]. This definition says that data warehouses are used to support decision-making. The data warehouse would not be useful without the iterative process of extracting, cleaning, conforming and loading data (the so-called ETL process) from various sources into the star schema model.
When we talk about data organization in the data warehouse, we distinguish between fact and dimension tables. While dimension tables contain a large number of attributes that we use when analyzing (filtering) data, fact tables contain measures that quantify business processes (number of product units sold, number of orders, number and duration of calls, etc.). End users find such a model understandable and can create the necessary reports independently.

Fig. 1. ROLAP model [3]

Basically there are two mechanisms that can be used to store data in the data warehouse (Fig. 1): relational online analytical processing (ROLAP) and multidimensional online analytical processing (MOLAP) [4, p. 165]. While ROLAP stores data in tables, MOLAP stores data in special structures also known as cubes. Each has advantages and disadvantages, but we do not discuss them in this paper (more can be found in [9]).

On today's market one can find various Business Intelligence (BI) tools that produce reports from data stored in data warehouses. Although data warehouses are very valuable sources of data, the main problem in the construction of a data warehouse is the so-called ETL process. Systems for extraction, transformation and loading of data (ETL systems for short) are the foundation of data warehouses. When constructing a data warehouse, 70 percent of time and resources is used for ETL purposes (80 percent according to Inmon [5, p. 295]). Building a data warehouse is an expensive, time-consuming and complex job, and the ETL phase is the most critical one. For that reason this paper presents an ETL tool that should facilitate and accelerate the ETL process. The tool offers the user step-by-step guidance through the entire ETL process. In addition, it is designed as a web application, so users save the time (and space) required for installation.
The tool can start from heterogeneous data sources, and the result is a dimensional model stored in a relational database, which can be used for other purposes (primarily for building reports by means of some BI tool).

This paper is structured as follows: the second section describes the related work and the third section the basics of ETL. Next, the model of the ETL tool is shown and several screenshots are given. At the end of the paper the conclusion is presented and some open questions are addressed (future work).
II. ETL TOOLS

There are various professional tools that can assist the user in the ETL process; however, the problem with these tools is their complexity and/or price. For example, the free tool Talend Open Studio has rich features and lets the user execute complex operations. But the tool can be very difficult and confusing, especially if the user is not familiar with the ETL process. Because the tool offers many possibilities, it is necessary to examine what its individual elements (or rather objects) allow us to do and what their attributes are. In addition, there is a service-oriented architecture (SOA) ETL framework, described in [7], that tries to split the tightly coupled functionalities of an ETL tool into separate parts that can be used as services.

To the authors' knowledge there is no completely web-based tool that integrates the learning of the ETL process into the tool itself. The ETL tool described in this article is completely web based (it can be easily accessed through a web browser, no installation is needed and multiple users can use it at the same time). To avoid problems with the ETL process, the tool guides the user through the process and teaches along the way, so the basic idea is that it can be used by people who are not familiar with ETL.

III. ETL

The ETL process is a set of activities that are not visible to the end user and that take place in the background. In addition to retrieving information from different sources, many activities need to be performed on the data [2, p. xxi]: mistakes have to be corrected, data needs to be structured, etc. The ETL process (Fig. 2) has three steps [6, p. 139]:
- Data extraction - accessing data sources in order to retrieve the (required) data.
- Data transformation - data collected from various sources is checked, cleaned and conformed, i.e. it undergoes a series of activities that improve its quality [4, p. 375].
- Data loading - extracted and transformed data is loaded into the data warehouse (dimension and fact tables).
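The three steps listed above can be illustrated with a minimal sketch. It is not the tool's implementation: the class `EtlStepsSketch`, its method names and the in-memory list standing in for a warehouse table are all assumptions made for illustration, and the only transformations shown are trimming values and dropping rows with empty fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the extract, transform and load steps.
class EtlStepsSketch {

    // Extraction: read raw records from a source (here, delimited text lines).
    static List<String[]> extract(List<String> lines, String delimiter) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            // Limit -1 keeps trailing empty fields so they can be checked later.
            rows.add(line.split(Pattern.quote(delimiter), -1));
        }
        return rows;
    }

    // Transformation: trim every value and drop rows that contain an empty field.
    static List<String[]> transform(List<String[]> rows) {
        List<String[]> cleaned = new ArrayList<>();
        for (String[] row : rows) {
            String[] out = new String[row.length];
            boolean complete = true;
            for (int i = 0; i < row.length; i++) {
                out[i] = row[i].trim();
                if (out[i].isEmpty()) {
                    complete = false;
                }
            }
            if (complete) {
                cleaned.add(out);
            }
        }
        return cleaned;
    }

    // Loading: append the rows to the target table (an in-memory stand-in for the DW).
    static int load(List<String[]> rows, List<String[]> targetTable) {
        targetTable.addAll(rows);
        return rows.size();
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(" 1 ;Ana", "2;", "3; Ivan ");
        List<String[]> warehouse = new ArrayList<>();
        int loaded = load(transform(extract(source, ";")), warehouse);
        System.out.println("Rows loaded: " + loaded);
    }
}
```

Only the transformation step changes the data; extraction and loading merely move it, which matches the distinction drawn in the text.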
While extraction and loading only transfer data, transformations actually change it. Kimball and Caserta propose the so-called Extracting, Cleaning, Conforming, and Delivering (ECCD) process instead of ETL, but either way the data has to end up loaded in the data warehouse. ECCD consists of four steps [2, pp. 18-19]:
- Extraction - the first step is to take data from different sources and store it in the ETL environment so that the necessary processing can be done.
- Cleaning - performing the first transformation of data in order to enhance the quality of the original data.
- Conforming - this step is necessary if there are two or more data sources. Various sources tend to have differently shaped and stored data, and there is a need to synchronize data (resolve conflicts when different names are used, resolve the problem of duplicates, etc.).
- Delivery - the last step is the same as for ETL (loading data into the data warehouse). This loading can be further divided into two parts [2, pp. 161-254]:
  - Loading data into dimension tables - they contain information that allows understanding (interpreting) the data in fact tables.
  - Loading data into fact tables - central tables that contain numerical values.

Fig. 2. ETL process steps (data sources: flat files, DB, XML files; Extract; Transform: date format, attribute merge, not null; Load; DW)

There is also a fifth step (so-called management), which is not part of the data processing flow but is used for system and process management of the ETL environment. ETL and ECCD describe the same data processing activities and the end result is the same, yet ECCD is somewhat better because its steps define in more detail the activities to be carried out on the original data, and it separates activities related to a single data source from activities that involve multiple data sources. Nevertheless, the term ETL is so well established that it is not reasonable to expect it to be replaced in the near future.

A. Metadata

During the ETL process various metadata is generated.
The ETL metadata is divided into four main categories [2, pp. 367-368]:
- ETL job metadata - an ETL job is a container of the transformations that manipulate the data. Every ETL task is captured here.
- Transformation metadata - contains information about every transformation that is used inside the ETL jobs.
- Batch metadata - in the ETL process, batches are used to run collections of jobs together. Batches can contain sub-batches, and schedules can be made to run batches periodically. All that information is stored in batch metadata.
- Process metadata - generated when batches are executed. It records whether the loading of data (into the DW) was successful or not.
1) Logical data map

At the beginning of the ETL process it is necessary to make a logical data map. The logical data map documents the links between the columns (fields) in the source and the columns in the destination table (in the data warehouse). It is one of the most important and most useful pieces of metadata generated by the ETL. The header of the logical data map is shown in Table I [2, pp. 56-71]. Once created, the logical data map tells what needs to be extracted, from where, how to process the data and where to save it after processing. The logical data map is useful throughout the entire ETL process.

TABLE I. HEADER OF THE LOGICAL DATA MAP [2, p. 60]
Target: Table name | Column name | Data type | Table type | SCD type
Source: Database name | Table name | Column name | Data type
Transformation

2) Data sources

A data warehouse often uses different data sources (Enterprise Resource Planning (ERP) systems, extensible markup language (XML) files, databases and flat files). No matter which source is used, specific metadata is required. The following metadata attributes are minimally required [2, p. 362]:
- Database or file system - the name commonly used when referring to a source system or file [2, p. 362].
- Table specification - the ETL team needs to know the purpose of the table, its volume, its primary key and alternate key, and a list of its columns [2, p. 362].
- Exception-handling rules - necessary information about the quality of the data and how the ETL process should handle data-quality problems.
- Business definitions - a business definition of two or three sentences is very useful when the data needs to be understood.
- Business rules - every table should come with a set of business rules. Business rules are required to understand data and to test for anomalies [2, p. 362].

Types of data sources can be:
- Flat files - in most data warehouses regular files cannot be avoided. Flat files can be used in the ETL system for at least three reasons [2, pp. 90-91]: delivery of source data, working/staging tables, or preparation for bulk load. There are two types of files [2, pp.
91-93]: fixed-length flat files and delimited flat files.
- XML files - XML has been used widely in recent years. XML files are good for the ETL process because, unlike ordinary files, they are self-documenting. XML files are often used for data exchange and provide independence from specific computational implementations [4, p. 126].
- Operational databases - the most common source of data for the data warehouse. Benefits of databases regarding the ETL phase are [2, pp. 40-41]: apparent metadata, relational abilities (e.g. referential integrity), an open repository (data can easily be accessed by any structured query language (SQL) compliant tool), DBA support (there is a group responsible for the data in the database management system (DBMS)), an SQL interface, etc.
- Other sources:
  - ERP systems - systems that are quite common in organizations.
  - Master data management (MDM) systems - centralized resources designed to hold the main copy of a key entity, such as a customer or a product.
  - Web log - for example, a control document that is automatically created by the web server.

IV. THE MODEL OF THE ETL TOOL

Fig. 3 shows the high-level architecture of the proposed ETL tool. The user uses the web interface to define the metadata that the ETL processing will use (i.e. the user creates a project, process, group, destination, etc.). When all data is entered, the user runs the thread, which extracts information from one source, performs the defined transformations (as necessary) and finally loads the data into the data warehouse. After one source is completed, the thread proceeds to the next source. A possible improvement is to implement multithreading in order to process multiple sources at once.

A. ETL thread

Data processing is done by the thread that starts after the metadata is entered. Fig. 4 shows the class diagram of this part of the tool. When the thread starts, the class Main logic is instantiated, and it in turn instantiates the classes Extraction, Transformation and Load. After that the methods of the class Load are called to create the destination (dimension and fact tables).
Then the logical data map is read and its information is stored in two vectors. The first vector contains the metadata relating to data for dimension tables and the second vector the metadata for fact tables. The thread then processes the dimensions one by one: for each, an SQL query for extraction is created and run. After that, the data is transformed as described in the metadata entered by the user; once the transformations are done, the data is loaded into the data warehouse (row by row). When the dimensions are finished, the fact tables are processed in the next step (the procedure is the same, but one has to keep in mind that fact tables have to be connected to specific dimension tables).
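The processing order described above (split the logical data map into dimension and fact entries, handle dimensions first, build one extraction query per target table) can be sketched as follows. This is an illustrative sketch only: the class `EtlThreadSketch`, the simplified `MapEntry` fields and the table names are hypothetical, each target table is assumed to draw on a single source table, and plain lists are used where the tool itself uses vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the order in which the ETL thread processes tables.
class EtlThreadSketch {

    enum TableType { DIMENSION, FACT }

    // One simplified row of a logical data map (field set is an assumption).
    static final class MapEntry {
        final String targetTable;
        final TableType tableType;
        final String sourceTable;
        final String sourceColumn;

        MapEntry(String targetTable, TableType tableType,
                 String sourceTable, String sourceColumn) {
            this.targetTable = targetTable;
            this.tableType = tableType;
            this.sourceTable = sourceTable;
            this.sourceColumn = sourceColumn;
        }
    }

    // Splits the map into dimension and fact entries, then plans dimensions first.
    static List<String> planRun(List<MapEntry> logicalDataMap) {
        List<MapEntry> dimensions = new ArrayList<>();
        List<MapEntry> facts = new ArrayList<>();
        for (MapEntry entry : logicalDataMap) {
            if (entry.tableType == TableType.DIMENSION) {
                dimensions.add(entry);
            } else {
                facts.add(entry);
            }
        }
        List<String> steps = new ArrayList<>(queriesFor(dimensions));
        steps.addAll(queriesFor(facts));
        return steps;
    }

    // Builds one extraction query per target table, in first-seen order.
    static List<String> queriesFor(List<MapEntry> entries) {
        Map<String, List<String>> columns = new LinkedHashMap<>();
        Map<String, String> sourceOf = new LinkedHashMap<>();
        for (MapEntry entry : entries) {
            columns.computeIfAbsent(entry.targetTable, k -> new ArrayList<>())
                   .add(entry.sourceColumn);
            sourceOf.putIfAbsent(entry.targetTable, entry.sourceTable);
        }
        List<String> queries = new ArrayList<>();
        for (Map.Entry<String, List<String>> table : columns.entrySet()) {
            queries.add(table.getKey() + ": SELECT " + String.join(", ", table.getValue())
                    + " FROM " + sourceOf.get(table.getKey()));
        }
        return queries;
    }

    public static void main(String[] args) {
        List<MapEntry> map = Arrays.asList(
                new MapEntry("fact_sales", TableType.FACT, "orders", "quantity"),
                new MapEntry("dim_customer", TableType.DIMENSION, "customers", "first_name"),
                new MapEntry("dim_customer", TableType.DIMENSION, "customers", "last_name"));
        for (String step : planRun(map)) {
            System.out.println(step);
        }
    }
}
```

Planning dimensions before facts reflects the constraint mentioned in the text: fact rows reference dimension rows, so the dimension tables must be populated first.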
Fig. 3. ETL tool high-level architecture (sources: plain file, PostgreSQL, MySQL; HTML/JSP screens; Start ETL; main ETL logic with extraction, transformation (date format, attribute merge, not null, upper/lower case) and load; PostgreSQL (metadata); PostgreSQL (DW))

Fig. 4. Class diagram of the ETL thread (elements shown: Thread, Main logic, Extraction, Transformation, Load, interfaces Extraction I and Transformation I, Work DB ETL, Work buffer, Work DB, Work DW, Source Flat File, Source PostgreSQL, Source MySQL, Source Query Execution, UN merge attr source, UN conn DIM_FAC, FIL not null, FIL upper lower case, FIL format date, Query execution, Global params)
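The transformation names that appear in Fig. 3 (date format, attribute merge, not null, upper/lower case) can be illustrated with a small sketch. The paper does not show how the tool implements them, so the class `TransformationsSketch`, its method signatures and the choice of ISO dates as the target format are assumptions.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the four transformations named in Fig. 3.
class TransformationsSketch {

    // Date format: parse a date written in sourcePattern and emit it as yyyy-MM-dd.
    static String formatDate(String value, String sourcePattern) {
        LocalDate date = LocalDate.parse(value, DateTimeFormatter.ofPattern(sourcePattern));
        return date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // Attribute merge: join several source attributes into one target attribute.
    static String mergeAttributes(String separator, String... parts) {
        return String.join(separator, parts);
    }

    // Not null: keep only the rows whose checked column has a value.
    static List<String[]> notNull(List<String[]> rows, int column) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (row[column] != null) {
                kept.add(row);
            }
        }
        return kept;
    }

    // Upper/lower case: normalize the letter case of a value.
    static String changeCase(String value, boolean upper) {
        return upper ? value.toUpperCase(Locale.ROOT) : value.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(formatDate("31.12.2010", "dd.MM.yyyy"));
        System.out.println(mergeAttributes(" ", "Ana", "Horvat"));
        List<String[]> rows = Arrays.asList(new String[]{"1", "x"}, new String[]{"2", null});
        System.out.println(notNull(rows, 1).size());
        System.out.println(changeCase("Zagreb", true));
    }
}
```

Three of the methods change values while `notNull` filters rows, which is one plausible reading of the FIL and UN prefixes in Fig. 4, although the paper does not define those prefixes.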
B. Dynamic loading

In order to create a flexible tool that can be upgraded, dynamic loading of classes and JSP files has been implemented in two places:
- In the source extraction part - one class has been made for every source type.
- In the data transformation part - one class has been made for every transformation that the tool can perform.

Since each type of source and each transformation has its own class, it is possible to add new types of sources or new transformations. All that is needed is to create a class (and, if needed, a JSP file) and add metadata about it. Three important things enable dynamic loading of classes:
- Each source type (or transformation) class must have a method that returns an instance of the class itself (Fig. 5).
- The interfaces implemented by the source type or transformation classes (Fig. 6).
- The class that has methods to search for the required class by its name and load it dynamically into memory, and methods to search within the retrieved class for the function that returns an instance of the desired class (Fig. 7).

In addition to dynamic class loading, transformations also use dynamic loading of JSP files, which contain the fields (if any) that the user must fill in when choosing a particular transformation. JSP loading is done with AJAX.

C. Tool configuration

In order to use the ETL tool, the administrator has to preconfigure it. The most important parts are the following:
- Source types (Fig. 8) - the source types that the tool can work with (for now PostgreSQL, MySQL and flat file with delimiter).

    import java.util.Vector;

    public class Source_MySQL implements Extraction_I {
        // Factory method: returns an instance of the class itself.
        public static Source_MySQL get_instance(String args[]) {
            Source_MySQL instance = new Source_MySQL();
            return instance;
        }
        public boolean load_parameters(String address, String name,
                int port, String user, String password) { ... }
        public Vector get_table_columns() { ... }
        public Vector execute_query(String query, Vector info) { ... }
    }

Fig. 5. Example 1 - Example of a dynamically loaded class

    import java.util.Vector;

    public interface Extraction_I {
        public boolean load_parameters(String address, String name,
                int port, String user, String password);
        public Vector get_table_columns();
        public Vector execute_query(String query, Vector info);
    }

Fig. 6. Example 2 - Example of the interface that a dynamically loaded class must implement

    import java.lang.reflect.Method;

    public class Extraction {
        private Extraction_I extraction_i;

        public boolean set_class_instance(String src_class) {
            // Load the class by name through the thread's context class loader.
            Thread t = Thread.currentThread();
            ClassLoader c = t.getContextClassLoader();
            Class toRun = null;
            try { toRun = c.loadClass("subsys_ext." + src_class); ...
            // Look up the static factory method by its name.
            Method mainMethod = null;
            try { mainMethod = findMain(toRun, "get_instance"); ...
            // Invoke the factory (null target because the method is static).
            Object instance = null;
            try { instance = mainMethod.invoke(null,
                    new Object[]{new String[1]}); ...
            extraction_i = (Extraction_I) instance;
            return true;
        }

        private Method findMain(Class my_class, String function_name) {
            Method[] methods = my_class.getMethods();
            for (int i = 0; i < methods.length; i++) {
                if (methods[i].getName().equals(function_name))
                    return methods[i];
            }
            return null;
        }

        public void some_method() {
            ...
            extraction_i.load_parameters(address, file, port, user, password);
            ...
        }
    }

Fig. 7. Example 3 - Example of a class that dynamically loads another class [8, p. 11]
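The listings in Figs. 5-7 elide their method bodies and exception handling, so a compact runnable variant of the same idea may help. It is a sketch under stated assumptions, not the tool's code: the nested classes `SourceFlatFile` and `SourceMySql`, the one-method `Extraction` interface and the `load` helper are hypothetical, and `Class.forName` is used in place of the context class loader of Fig. 7 so that the example stays in a single file.

```java
import java.lang.reflect.Method;

// Hypothetical single-file variant of the dynamic loading shown in Figs. 5-7.
class DynamicLoadingSketch {

    // Trimmed-down stand-in for the Extraction_I interface.
    interface Extraction {
        String describe();
    }

    // Each source type exposes a static factory named get_instance.
    static class SourceFlatFile implements Extraction {
        public static SourceFlatFile get_instance(String[] args) {
            return new SourceFlatFile();
        }
        public String describe() {
            return "delimited flat file";
        }
    }

    static class SourceMySql implements Extraction {
        public static SourceMySql get_instance(String[] args) {
            return new SourceMySql();
        }
        public String describe() {
            return "MySQL database";
        }
    }

    // Finds a class by its short name, calls its factory and returns the instance.
    static Extraction load(String shortName) {
        // Nested classes have binary names of the form Outer$Inner.
        String fullName = DynamicLoadingSketch.class.getName() + "$" + shortName;
        try {
            Class<?> cls = Class.forName(fullName);
            Method factory = cls.getMethod("get_instance", String[].class);
            // Target is null because the factory is static; the cast keeps the
            // array from being spread across the varargs parameter.
            Object instance = factory.invoke(null, (Object) new String[0]);
            return (Extraction) instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load source type " + shortName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(load("SourceFlatFile").describe());
        System.out.println(load("SourceMySql").describe());
    }
}
```

Because the caller depends only on the interface and a class name held as a string, supporting a new source type means adding one class and one metadata entry, with no change to the loading code.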
- Transformations (Fig. 9) - defines which transformations the tool supports, the names of the classes that implement each particular functionality and the corresponding JSP file that is loaded when the user chooses the transformation.
- Checkpoints or steps (Fig. 10) - the administrator has to define the steps that the user follows when filling in the metadata (the administrator must define the page (a JSP file) that opens when the user is on a particular step, as well as the checkpoints).

As mentioned earlier, the program guides the user through the entire process. Fig. 10 shows the steps (checkpoints) for the user; the user has to define how many sources are going to be used. After that (Fig. 11) we see the input form that is used to define a new data source (here a new PostgreSQL source is defined). It is always possible to choose from already existing sources. The tool uses that information to connect to the source and retrieve its metadata as well. When all the source metadata is available and the dimension and fact tables have been defined with their attributes, the user must define all merges of attributes (Fig. 12), for example the merge of first and last names into the attribute buff sur. After this is defined, the user can connect the attributes from the source to the destination attributes and define the transformations that need to be done. When this step is done for all dimension and fact tables and the corresponding sources, the last step starts the thread for ETL processing. Before starting the thread the user can change the entered data and go back to previous steps.

V. CONCLUSION

The ETL process is the most important and most problematic part of creating data warehouses. In order to speed up the whole process and to make it easier for users, we built a tool that leads the users through the whole process.
Although this ETL tool is far from perfect and cannot be measured against professional tools on the market, its major advantage is that it is web based: no installation is needed, it is available right away, several users can use it at the same time and users can learn the ETL process while using the tool. The tool is well suited to users who are not that familiar with the ETL process and who have no time to analyze new ETL tools but want to summarize data, move data into the data warehouse and analyze it. The tool is flexible and can therefore be easily upgraded.

VI. FUTURE WORKS

Because this tool is only a prototype, there are many possible improvements. Some parts have already been improved: some complex queries were written that extract more data at once, and some filters were implemented to retrieve only relevant data (to speed up the tool). In the future we plan to optimize the tool (speed, design, source code, DB queries, security), add new features (new data sources, new transformations, etc.), test the tool with a larger set of data and compare the results to other tools. It is also planned to take data from two grocery stores (data from a small data warehouse that was implemented a few years ago), test the ETL tool with that data and compare it first to manual ETL and later with other tools. When this is done and the tool is optimized, it is planned to conduct research in which experts give feedback about the usage of the tool in comparison to the tools they are using right now.

Fig. 8. Administration view of source types

Fig. 9. Menu of checkpoints (steps) for the user (left) and form to select the number of sources (right)
Fig. 10. Administration view of existing transformations and corresponding screen dimension

Fig. 11. Form for entering new source

Fig. 12. Form to define attribute merges

REFERENCES

[1] K. Rabuzin and M. Novak, "Data warehouses and ETL", Methods and Tools for Information and Business Systems development (Case22), Zagreb, Jun. 2010, pp. 85-89.
[2] R. Kimball and J. Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data, Indianapolis: Wiley Publishing, Inc., 2004.
[3] C. White, "OLAP in the Database - Intelligent Business Strategies", June 2003. http://www.information-management.com/issues/20030601/6807-1.html?pg=2. [Accessed 3 August 2010].
[4] R. Kimball, M. Ross, W. Thornthwaite, J. Mundy and B. Becker, The Data Warehouse Lifecycle Toolkit, Second Edition, Indianapolis: Wiley Publishing, Inc., 2008.
[5] W. H. Inmon, Building the Data Warehouse, Third Edition, New York: John Wiley & Sons, Inc., 2002.
[6] F. Silvers, Building and Maintaining a Data Warehouse, Boca Raton: CRC Press, 2008.
[7] I. M. M. Awad, S. M. Abdullah and M. A. B. Ali, "Extending ETL framework using service oriented architecture", Procedia Computer Science, vol. 3, 2011, pp. 110-114.
[8] T. Neward, "Understanding Class.forName - Loading Classes Dynamically from within Extensions", 2000. http://media.techtarget.com/tss/static/articles/content/dm_classForName/DynLoad.pdf. [Accessed 5 July 2010].
[9] P. Ponniah, Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals, New York: John Wiley & Sons, Inc., 2001.