ETL Tools in Enterprise Data Warehouse
*Amanpartap Singh Pall, **Dr. Jaiteg Singh
E-mail: amanpall@hotmail.com
* Assistant Professor, School of Information Technology, APJIMTC, Jalandhar
** Associate Professor, Chitkara Institute of Engineering and Technology, Rajpura

ABSTRACT
An enterprise data warehouse (EDW), also known as a data warehouse (DW), is a system used for reporting and data analysis. DWs are central repositories of integrated data from one or more disparate sources. Extraction-Transformation-Loading (ETL) processes are responsible for all the operations taking place at the warehouse. These processes are performed by specialized tools known as ETL tools, also called data integration tools. ETL tools are available in the market either as closed-source or as open-source software, and tools in both categories have their advantages as well as limitations. The main objective of this paper is to highlight the importance of data integration (ETL) tools in a business intelligence environment.

Index Terms: Business Intelligence, Data warehouse, ETL tools, Enterprise data, Data Integration.

I. INTRODUCTION
Extraction-Transformation-Loading (ETL) tools are specialized tools that deal with data warehouse homogenization, cleaning and loading problems. ETL and data cleaning tools are estimated to account for at least one third of the effort and expense in the budget of a data warehouse [1][2]. First, data is extracted from source data stores, which can be On-Line Transaction Processing (OLTP) or legacy systems, files in any format, web pages, various kinds of documents (e.g., spreadsheets and text documents), or even data arriving in a streaming fashion. After this phase, the extracted data is propagated to a special-purpose area of the warehouse, called the Data Staging Area (DSA), where its transformation, homogenization and cleansing take place.
The most frequently used transformations include filters and checks that ensure the data propagated to the warehouse respects business rules and integrity constraints, as well as schema transformations that ensure the data fits the target data warehouse schema. Finally, the data is loaded into the central data warehouse (DW) and all its counterparts (e.g., data marts and views). Nowadays, business necessities and demands require near real-time data warehouse refreshment, and significant attention is drawn to this kind of technological advancement [3]. The design, development and deployment of ETL processes, currently performed in an ad hoc, in-house fashion, need modeling, design and methodological foundations. One of the most important components during the design and deployment phase of a data warehouse is the flow of data from the source relations towards the target data warehouse relations; this flow is provided by the ETL tools. ETL tools are pieces of software responsible for the extraction of data from several sources, its cleansing, customization and insertion into a data warehouse. There are currently many commercial tools available in the market, e.g. Oracle Warehouse Builder (OWB), IBM Information Server (DataStage) 9.1, SAS Data Integration Studio 4.21 (SAS Institute), SQL Server Integration Services (SSIS) 10 (Microsoft), DataFlow Manager 6.5 (Pitney Bowes Business Insight), CloverETL 3.0.1 (Javlin), DB2 Warehouse Edition 9.1 (IBM) and Pentaho Data Integration 4.1 (Pentaho).

2015, IJAFRSE and ICCICT 2015. All Rights Reserved. www.ijafrse.org
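To make the extract, stage, transform and load phases described above concrete, the following is a minimal illustrative sketch; the source rows, column names and the "orders" target table are hypothetical examples, not taken from any cited tool.

```python
import sqlite3

# -- Extract: rows as they might arrive from an OLTP source or a flat file.
source_rows = [
    {"id": "1", "amount": " 19.99", "region": "north"},
    {"id": "2", "amount": "5.00",   "region": "SOUTH"},
    {"id": "2", "amount": "5.00",   "region": "SOUTH"},   # duplicate record
]

# -- Transform (staging area): cleanse, homogenize, enforce constraints.
seen, staged = set(), []
for row in source_rows:
    if row["id"] in seen:                 # integrity check: drop duplicates
        continue
    seen.add(row["id"])
    staged.append((int(row["id"]),
                   float(row["amount"].strip()),    # type homogenization
                   row["region"].strip().upper()))  # value standardization

# -- Load: insert the cleansed rows into the warehouse table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, region TEXT)")
dw.executemany("INSERT INTO orders VALUES (?, ?, ?)", staged)
print(dw.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # -> 2
```

Real ETL tools perform the same three phases over far richer sources and rule sets, but the division of labor is the same: extraction, cleansing in a staging area, and a final load into the target schema.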
II. LITERATURE REVIEW
The 1990s focused on straightforward, hand-built methods of creating data warehouses. Populating these data warehouses was a tedious task, generally done with end-user tools that were neither sophisticated nor efficient, owing to the heterogeneity of the data to be populated. There was a lack of good-quality tools, and the task was performed by system integrators. As a result the task was error-prone and highly frustrating, often being abandoned midway, which resulted in huge losses to the organization. The data tools that were developed were primitive in nature and were mainly built to support OLAP and decision support systems (DSS). Some of the key problems concerning ETL tools are primarily complexity, usability, maintainability and price [4]. Owing to the great complexity of the tools of the time, Raman and Hellerstein [5] proposed Potter's Wheel, an interactive data cleaning system that integrates discrepancy detection and transformation, wherein users can specify transforms through graphical operations or through examples and see the effect instantaneously. Different methodologies have been used to remove the limitations of ETL tools. One example is query-based ETL (QELT), which reads the mapping rules defined in a metadata repository to create the transformation process [6]. Several research areas remain open, among them the efficient and reliable execution and optimization of an ETL scenario and the question of optimal algorithms for ETL tasks [7]. Henry et al. [8] identified comprehensive ETL criteria and testing procedures and applied them to commercial ETL tools. However, they stressed that companies can use and modify evaluation methods to serve their own needs; hence, no universal criteria could be reached for choosing a tool, and each company can form its own set of criteria.
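The metadata-driven idea behind query-based ETL [6] can be sketched as follows: transformations are not hard-coded but derived from mapping rules held in a metadata repository. The repository layout, rule names and sample record below are invented for illustration only.

```python
# A metadata repository entry maps a source column to a target column
# together with the transformation to apply (hypothetical rules).
mapping_rules = [  # (source column, target column, transformation)
    ("cust_nm",  "customer_name", str.title),
    ("cntry_cd", "country_code",  str.upper),
]

def apply_mappings(record, rules):
    """Build a target record by applying each metadata-defined rule."""
    return {tgt: fn(record[src]) for src, tgt, fn in rules}

row = {"cust_nm": "jane doe", "cntry_cd": "us"}
print(apply_mappings(row, mapping_rules))
# -> {'customer_name': 'Jane Doe', 'country_code': 'US'}
```

Changing the warehouse mapping then means editing metadata, not code, which is the maintainability advantage such approaches claim.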
Tool accuracy could be further improved by incorporating UML and EMF modeling technology and by adding simple Java-based operators to a transformation tool [9]. A study on ETL versus E-LT [10] is based on three approaches: full pushdown, target pushdown and source pushdown. It was observed that there is no performance difference when running a job to load data into data warehouse tables if the complete pushdown powers of E-LT jobs are not used. The existing commercial ETL tools only support the implementation of ETL flows given an existing design. Regarding the optimization of ETL processes, despite its importance, few efforts have been made at either the logical or the physical level [11]. The current ETL tools propose specific languages for expressing processes, which differ among tools and have different expressive power [12]. It is often argued that incremental loading is more efficient than full reloading unless the operational data sources happen to change dramatically [13]. An ETL process can also be guided by a domain ontology so that data sources are matched semantically and the transformation of data into the warehouse becomes more efficient. Reddy et al. [14] presented a GUI-based ETL procedure/tool for the continuous loading of data into an active data warehouse. The tool takes less time in preparing the procedures, functions and triggers; only the mappings and transformations need to be prepared. The weaknesses of the traditional ETL tool architecture were analyzed with respect to openness and repeated development, and a three-layer architecture based on metadata was proposed on the basis of this analysis [15]. Commercial ETL tools cannot directly extract and load XML documents into a data warehouse.
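The incremental-loading argument [13] rests on extracting only the delta since the last load rather than reloading everything. A minimal sketch of the idea, using an invented high-water-mark convention and sample rows:

```python
# Source rows, each carrying a last-modified timestamp (ISO dates,
# so plain string comparison orders them correctly).
source = [
    {"id": 1, "updated_at": "2015-01-01"},
    {"id": 2, "updated_at": "2015-02-10"},
    {"id": 3, "updated_at": "2015-03-05"},
]

def incremental_extract(rows, last_load):
    """Return only rows modified after the previous load (the delta)."""
    return [r for r in rows if r["updated_at"] > last_load]

# Only rows changed since the last recorded load are propagated.
delta = incremental_extract(source, last_load="2015-02-01")
print([r["id"] for r in delta])  # -> [2, 3]
```

When the sources change dramatically the delta approaches the full data set and the bookkeeping overhead no longer pays off, which is exactly the caveat noted in [13].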
However, this limitation was addressed [16] through an analysis of the characteristics of semi-structured data; following the actual example of BokeDataInfo.xml, a large volume of financial data in XML structure was loaded into the data warehouse, laying the foundation for data integration across different application fields. A metadata-driven ETL service model and framework was proposed [17] that has strong flexibility and extensibility and can process large-scale data efficiently. The model takes full advantage of the platform and of a variety of metadata, and can effectively design and share ETL processes in a way that neither open-source nor commercial ETL tools possess. A data integration and data analysis based ETL tool [18] focused on the extraction phase, by implementing a technique that semi-automatically defines mappings between a data warehouse schema and a new data source, and on the transformation phase, by proposing a new function based on relevant values, particularly useful for supporting drill-down operations. The tool was tested on real-world data and qualitatively effective results were obtained [18]. However, research is still required to identify a benchmark and a set of measures in order to perform a complete evaluation of the technique. Zhao [19] showed through a case study that query optimization techniques make SETL overtake tools based on existing programs; the system is able to generate new transformations automatically, so no extra updates are needed to enable evolution. Key issues related to the creation, migration and harvesting of knowledge repositories using open-source tools have also been studied; their success lies in awareness among stakeholders of Open Access and knowledge repositories [20]. SETL [21] performs the extraction, transformation and loading of heterogeneous data sources into a data warehouse, and has been designed and implemented using Perl subroutine attributes and data partitioning.
SETL can implement and execute ETL jobs easily and efficiently; its plug-in design gives it high scalability, and its design of running each ETL job in its own ETL pipeline gives it support for distributed environments.

III. IMPORTANCE OF ETL TOOLS IN BUSINESS INTELLIGENCE
Business intelligence (BI) is a broad set of applications, technologies and knowledge for gathering and analyzing data for the purpose of helping users make better business decisions. The challenge of BI is to gather and serve all the relevant factors that enable end users to drive the decision-making process efficiently. Business intelligence covers data warehousing, the ETL process, reporting, OLAP (online analytical processing on multidimensional data), data cleansing, performance management, data quality management, data mining, statistical analysis and forecasting. The primary role in all these activities is played by ETL tools. As can be seen from the literature review, each ETL tool has a different way of working; not all ETL tools work the same way. ETL tools aggregate, consolidate, cleanse and finally validate the data so it can be used effectively for business decisions in BI. The use of an ETL tool increases productivity by handling the complexities of load balancing, logging, data distribution, system scalability and interfaces. It is because of ETL tools that large volumes of data (gigabytes at a time) can be accessed. BI produces analysis reports and provides in-depth knowledge about parameters that are important performance indicators; these parameters concern customers, competitors, operations, etc.
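The aggregate-cleanse-validate role that ETL plays for BI can be illustrated with a toy sketch; the validation rule, column names and regions below are invented, and a real tool would apply many such rules declaratively.

```python
from collections import defaultdict

# Hypothetical sales records arriving from operational sources.
rows = [
    {"region": "EAST", "sales": 100.0},
    {"region": "EAST", "sales": 250.0},
    {"region": "WEST", "sales": -5.0},   # fails the validation rule below
    {"region": "WEST", "sales": 80.0},
]

# Validate: reject rows that violate a business rule (no negative sales).
valid = [r for r in rows if r["sales"] >= 0]

# Aggregate: produce the per-region totals a BI report would consume.
totals = defaultdict(float)
for r in valid:
    totals[r["region"]] += r["sales"]

print(dict(totals))  # -> {'EAST': 350.0, 'WEST': 80.0}
```

Only after such validation and consolidation do the figures become trustworthy inputs for the performance indicators BI reports are built on.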
IV. CONCLUSION AND FUTURE WORK
The literature review clearly suggests that the ETL tool plays a pivotal role in business intelligence, as effective analysis and decision making are based on the data populated into the data warehouse by the ETL tool. However, no single tool can suffice for the needs of all organizations. There is a lack of standardization, and sophisticated ETL tools are quite costly. Future work requires the identification or design of a standard process that meets the needs of the organization in data warehousing.

V. REFERENCES
[1] Shilakes, C., & Tylman, J. (1998). Enterprise Information Portals. Enterprise Software Team.
[2] Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000). AJAX: An Extensible Data Cleaning Tool. In Proc. ACM SIGMOD (Dallas, Texas), 590. At http://www.eti.com/
[3] Vassiliadis, P., & Simitsis, A. (2009). Extraction, Transformation, and Loading. Encyclopedia of Database Systems, 32.
[4] Vassiliadis, P., Vagena, Z., Skiadopoulos, S., Karayannidis, N., & Sellis, T. (2001). ARKTOS: Towards the Modeling, Design, Control and Execution of ETL Processes. Information Systems, 26(8), 537-561.
[5] Raman, V., & Hellerstein, J. M. (2001). Potter's Wheel: An Interactive Data Cleaning System. In Proceedings of the International Conference on Very Large Data Bases (381-390).
[6] Rifaieh, R., & Benharkat, N. A. (2002). Query-Based Data Warehousing Tool. In Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP (35-42). ACM.
[7] Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., & Skiadopoulos, S. (2005). A Generic and Customizable Framework for the Design of ETL Scenarios. Information Systems, 30(7), 492-525.
[8] Henry, S., Hoon, S., Hwang, M., Lee, D., & DeVore, M. D. (2005). Engineering Trade Study: Extract, Transform, Load Tools for Data Migration. In Systems and Information Engineering Design Symposium, 2005 IEEE (1-8). IEEE.
[9] Morris, H., Liao, H., Sriram, P., Srinivasan, S., Lau, P., Shan, J., & Wisnesky, R.
(2008). Bringing Business Objects into Extract-Transform-Load (ETL) Technology. In e-Business Engineering, 2008, ICEBE '08, IEEE International Conference on (709-714). IEEE.
[10] Ranjan, V. (2009). A Comparative Study Between ETL (Extract, Transform, Load) and ELT (Extract, Load and Transform) Approach for Loading Data into a Data Warehouse. Viewed 2010-03-05, http://www.ecst.csuchico.edu/~juliano/csci693/presentations/2009w/materials/ranjan/ranjan.pdf
[11] Castellanos, M., Simitsis, A., Wilkinson, K., & Dayal, U. (2009). Automating the Loading of Business Process Data Warehouses. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, 612-623. ACM.
[12] El Akkaoui, Z., & Zimányi, E. (2009). Defining ETL Workflows Using BPMN and BPEL. In Proceedings of the ACM Twelfth International Workshop on Data Warehousing and OLAP (41-48). ACM.
[13] Jörg, T., & Dessloch, S. (2009). Formalizing ETL Jobs for Incremental Loading of Data Warehouses. Business Technology and Web, 57-64.
[14] Reddy, V. M., Jena, S. K., & Rao, M. N. (2010). Active Data Warehouse Loading by GUI Based ETL Procedure.
[15] Jian, L., & Bihua, X. (2010). ETL Tool Research and Implementation Based on Drilling Data Warehouse. In Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on (Vol. 6, 2567-2569). IEEE.
[16] Guohua, Y., & Jingting, W. (2010). The Design and Implementation of XML Semi-Structured Data Extraction and Loading into the Data Warehouse. In Information Technology and Applications (IFITA), 2010 International Forum on (Vol. 3, 30-33). IEEE.
[17] Xu, L., Liao, J., Zhao, R., & Wu, B. (2011). A PaaS Based Metadata-Driven ETL Framework. In Cloud Computing and Intelligence Systems (CCIS), 2011 IEEE International Conference on (477-481). IEEE.
[18] Bergamaschi, S., Guerra, F., Orsini, M., Sartori, C., & Vincini, M. (2011). A Semantic Approach to ETL Technologies. Data and Knowledge Engineering, 70(8), 717-731.
[19] Zhao, C., & Zhao, T. (2012, November). A New Tool for ETL Process. In Image Analysis and Signal Processing (IASP), 2012 International Conference on (1-5). IEEE.
[20] Muthukumar, P., Suresh, P., Shalini Punithavathani, S., & Nafeesa Begum, J. (2012). A Realistic Approach for the Deployment of National Knowledge Repositories by Leveraging ETL Tools. In Recent Trends in Information Technology (ICRTIT), 2012 International Conference on (542-547). IEEE.
[21] Sun, K., & Lan, Y. (2012). SETL: A Scalable and High Performance ETL System. In System Science, Engineering Design and Manufacturing Informatization (ICSEM), 2012 3rd International Conference on (Vol. 1, 6-9). IEEE.