INTEGRATION OF HETEROGENEOUS DATABASES IN ACADEMIC ENVIRONMENT USING OPEN SOURCE ETL TOOLS




Azwa A. Aziz, Abdul Hafiz Abdul Wahid, Nazirah Abd. Hamid, Azilawati Rozaimee
Fakulti Informatik, Universiti Sultan Zainal Abidin (UniSZA), 21300 Kuala Terengganu, Terengganu, Malaysia
{azwaaziz,nazirah,azila}@unisza.edu.my, hafizwahid@live.it

ABSTRACT

A data warehouse (DW) can be considered the most significant tool for strategic decision making. In the academic environment, however, the components of a data warehouse have still not been fully utilized. The aim of this paper is to design and implement ETL processes for the integration of heterogeneous databases in an academic environment using TOS. Using TOS to design and semi-automatically implement ETL tasks in Java enables fast adaptation to new data sources. We have already managed to integrate various heterogeneous academic data samples and populate them in one central repository using TOS. It is important to ensure that the capabilities of open source ETL tools are equal to those of commercial products; consequently, they will help implement DW projects at lower cost.

KEYWORDS

Extraction, Transformation, and Loading (ETL), Business Intelligence (BI), Data Warehouse (DW), Heterogeneous DBMS, Talend Open Studio (TOS)

1 INTRODUCTION

A data warehouse (DW) can be considered the most significant tool for strategic decision making in business. A well-developed DW can dramatically improve an organization's decision-making capabilities. In the early years, developing a DW was very expensive. Lately, however, the costs of developing and maintaining a DW have dropped significantly, so it has progressively become a practical repository of information to support managerial decision making [1], [2], [3], [4]. The academic environment, however, has only recently shown interest in integrating a DW into decision-making processes.
Academic institutions are still exploring the possibilities and benefits of data warehouses; therefore, in the development of decision support systems, the components of a data warehouse have still not been fully utilized [5]. The factors affecting the optimal management of an institution, especially in decision making, are the same factors involved in business processes, so the management of an academic institution can be considered as critical as the management of a large business company [6]. To achieve optimal management of the institution, the data warehouse can be integrated with Business Intelligence (BI). The goal of the data warehouse is not only to support BI but to do so effectively. BI is a term that refers to a variety of software applications used to analyze an organization's raw data. BI comprises several related activities, including data mining, online analytical processing, and querying and reporting [7]. The main goal of BI is to produce correct and accurate information for effective decisions. BI gives users the ability to transform data into usable

information, thus taking apparently useless data and turning it into valuable information. The aim of this paper is to explain the design and implementation of extraction, transformation, and loading (ETL) processes, during the initial design and deployment stage, for the integration of heterogeneous databases in an academic environment using open source tools. In the academic environment, ETL is a valuable process because information in academic institutions comes from many dissimilar sources, such as academic systems, co-curriculum systems, hostel systems, and many more. In this research, ETL tools are used to extract data from different sources, then clean the data and make it uniform for the transformation process. The output of the transformation process is loaded into a data mart. Merging the data into the data mart gives decision makers the power to look through data from different locations and increases their ability to filter it [8]. The paper is organized as follows. Section 2 describes related research on the components of a data warehouse. Section 3 explains the design of the proposed system, and the experimental design is described in Section 4. We present the result analysis and discussion in Section 5, and we conclude this work in Section 6.

2 LITERATURE REVIEW

Much research has been done to discuss and explain the practices, tools, and standards involved in ETL, DW, BI, and related technologies. Many strategies can be applied in the deployment of a DW. In [9], the authors proposed a framework for the design, development, and deployment of a DW that combines a meta-model with an ontology. The main outcome was that the framework could improve the interoperability of ETL processes. Reed et al.
[10] proposed a robust yet economical means of undertaking a stated goal by utilizing Pentaho tools for ETL. The goal of this research was to combine different databases into a single data repository that could be used to view and examine domestic violence victim and offender data across organizations and to provide reports. In addition, by using Pentaho, the authors intended to remove data conflicts and to generate a demographic profile across the criminal justice system. The final data mart represented integrated and reliable information from the different data sources. Dell'Aquila et al. [6] explained the practices involved in designing and modeling an academic DW. The objectives of this academic DW were to produce an exclusive structure of analysis and reporting for administrative structures, such as student-facing departments, and to supply real-time data to external agencies. The outcome of this research was a DW providing a centralized source of information accessible to diverse academic units. Piedade and Santos [11] discussed the concepts, practices, and architectures of a Student Relationship Management (SRM) system. The main objective of this research was to provide a technical tool to support higher education institutions in gaining the knowledge vital to the decision-making process. To validate the proposed concepts and activities, they adopted a research methodology comprising a set of interviews. The results of this research involved two stages. The first stage verified

that no suitable technology existed to support SRM concepts and practices. The second stage proved that the proposed framework permitted the definition of the SRM system's architecture and its main functionalities. Sahay and Mehta [12] developed a system to support higher education institutions in evaluating and predicting critical matters related to student success. The objective of this study was to use data mining techniques for classification, categorization, estimation, and visualization. They also intended to use predictive models to predict critical student issues and to derive a prioritized list of critical factors. Thomsen and Pedersen conducted two investigations of ETL tools, in 2005 and 2008 [13], [14]. From these investigations, the authors found many existing open source ETL tools, such as OpenSrcETL, OpenETL, CloverETL, KETL, Kettle, Octopus, and Talend. Most of the tools could meet fundamental data processing requirements, such as extracting data from heterogeneous data sources and loading the data into a ROLAP or MOLAP system. The authors stated that most of these open source ETL tools were not very powerful, with the exception of Talend Open Studio (TOS). Talend operates on an open source model, where services and ancillary features are offered on a subscription basis. From an affordability standpoint, Talend opens up the transformation and integration marketplace to all customers, regardless of size and data integration needs [15].

3 CONCEPTUAL FRAMEWORK

Our proposed framework is based on previous work on the concepts and practices of DW and ETL. In a DW, it is common practice to separate back room and front room entities. The back room holds and handles the data, while the front room allows data access. In an academic environment, the back room can be labeled the data management or data preparation section.
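The back-room/front-room separation can be pictured as two small functions: a back-room loader that cleans raw data into the repository, and a front-room query that serves analysis. The sketch below is illustrative only, using Python's built-in sqlite3 as a stand-in repository; the table and column names (stu_results, matricno) are assumptions for the example, not TOS output.

```python
import sqlite3

def back_room_load(conn, raw_rows):
    """Back room: clean heterogeneous raw rows and store them in the repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stu_results (matricno TEXT, subject TEXT, mark REAL)")
    for row in raw_rows:
        # Cleaning step: drop incomplete records and normalise matric numbers.
        if row.get("mark") is None:
            continue
        conn.execute("INSERT INTO stu_results VALUES (?, ?, ?)",
                     (row["matricno"].strip().upper(), row["subject"], float(row["mark"])))
    conn.commit()

def front_room_average(conn, subject):
    """Front room: answer an analytical query against the prepared data."""
    cur = conn.execute("SELECT AVG(mark) FROM stu_results WHERE subject = ?", (subject,))
    return cur.fetchone()[0]

conn = sqlite3.connect(":memory:")
raw = [
    {"matricno": " a001 ", "subject": "PROG1", "mark": 80},
    {"matricno": "A002", "subject": "PROG1", "mark": 60},
    {"matricno": "A003", "subject": "PROG1", "mark": None},  # rejected by cleaning
]
back_room_load(conn, raw)
print(front_room_average(conn, "PROG1"))  # 70.0
```

The point of the split is that the front room never sees the raw, dirty rows; it only ever queries data the back room has already prepared.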
In the existing applications, several Database Management Systems (DBMS) were used to support transaction systems, including MySQL, Informix, Oracle, and Microsoft Access. Integrating those heterogeneous DBMS was a complex task, especially using 3GL languages, and forcing system developers to use a single DBMS for all applications was not an option, as each DBMS had been chosen for a specific purpose. Thus, ETL plays a vital role in every integration project, speeding up development and helping achieve good results. For the ETL tasks, an open source application known as Talend Open Studio (TOS) was used to build jobs that transferred data from multiple external heterogeneous data sources, then transformed, cleaned, and loaded the data into the application's repository [16]. In this environment, the front room section enables a user or client application to access the data held in the warehouse. The key task of the front room is to map the heterogeneous low-level data, usually stored in a DW, to other forms [17]. The front room manages the queries issued from outside, then schedules and plans them to deliver the results with acceptable performance; this can be referred to as Business Intelligence. Golfarelli et al. [18] described Business Intelligence as the process of turning data into information and then into knowledge. The front room may offer data mining, text mining, or classical statistical methods that can be

performed on Data Marts (DM) and multidimensional cubes.

4 SIMULATION DESIGN

Figure 1: The main components of the proposed framework.

The proposed ETL framework consists of several main features. One of them is the conversion of the given file or database formats, which must be fitted to the structure needed by the loading processes that store the data in the repository. Figure 1 shows the ETL process embedded in the environment, excluding the front room. In the data source stage, data comes from multiple sources such as student personal details and academic records. This information arrives in different formats, including simple flat files, more complex XML files, and databases such as Microsoft Excel and Access, MySQL, and IBM Informix. The loading section uses Java code and ETL jobs created with the TOS tool to import data from the particular columns identified during data preparation. These jobs can later be reused in the ETL module to read, customize, and store the data. Data conversion consumes about 70-80% of the time needed to build a DW [19]; the conversion and transformation steps, including the Java classes created by TOS, are therefore the first software components to be designed and planned, enabling early user interaction in establishing the warehouse. The purpose of this simulation is to prove that TOS is as capable of performing jobs as any commercial ETL tool. To test the proposed framework, we use dummy data from two different DBMS, Oracle and MySQL, to test the ETL process in an academic environment. In addition, a file in Microsoft Excel format has been added as a data source to integrate with both DBMS. Oracle has been chosen as the target database because most enterprise companies use Oracle in their enterprise applications.
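As a rough illustration of the integration just described, the three dummy sources (the MySQL results table, the Oracle personal-information table, and the Excel assessment file) can be modelled as plain records and merged on the shared matric number. The attribute names below follow the paper's source descriptions, but the data values and the merge logic are hypothetical; in the actual simulation this step is performed by TOS-generated Java jobs.

```python
# Dummy rows standing in for the three heterogeneous sources of the simulation.
mysql_r1 = [  # MySQL: student results
    {"matricno": "A001", "name": "Ali", "gender": "M", "subjectcode": "CSC101", "gred": "A"},
    {"matricno": "A002", "name": "Siti", "gender": "F", "subjectcode": "CSC101", "gred": "B"},
]
oracle_stuinfo = [  # Oracle: student personal information
    {"matricno": "A001", "address": "Kuala Terengganu", "state": "Terengganu", "parinc": 3500},
    {"matricno": "A002", "address": "Kuantan", "state": "Pahang", "parinc": 2800},
]
excel_marks = [  # Excel: ongoing assessment marks per section
    {"matricno": "A001", "Section A": 18, "Section B": 25, "Section C": 30},
    {"matricno": "A002", "Section A": 12, "Section B": 20, "Section C": 22},
]

def integrate(results, personal, marks):
    """Join the three sources on matricno into one unified record per student."""
    personal_by_id = {r["matricno"]: r for r in personal}
    marks_by_id = {r["matricno"]: r for r in marks}
    unified = []
    for row in results:
        rec = dict(row)                                   # start from the results row
        rec.update(personal_by_id.get(row["matricno"], {}))  # add demographic attributes
        m = marks_by_id.get(row["matricno"], {})
        rec["total_mark"] = sum(v for k, v in m.items() if k != "matricno")
        unified.append(rec)
    return unified

repository = integrate(mysql_r1, oracle_stuinfo, excel_marks)
print(repository[0]["state"], repository[0]["total_mark"])  # Terengganu 73
```

Each unified record now carries results, demographics, and assessment totals together, which is exactly what makes demographic analysis of student results possible downstream.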
In this simulation, we have designed a multidimensional schema consisting of fact and dimension tables. The aim of the design is to analyze student results, based on ongoing assessments, and demographics using BI tools. A fact table containing students' results, specifically for programming subjects, is created in the target database and is known as fctstu. Two dimension tables are connected to fctstu: dimpro and dimasses. dimpro consists of students' personal information, such as name and gender, together with geographic information, while dimasses contains detailed assessment results for particular subjects. Figure 2 shows the Entity Relationship Diagram (ERD) of the target system. The source design involves Oracle and MySQL. In MySQL, a table known as r1 is created to store records of students' results, with information such as student names and gender. Meanwhile, in Oracle, a table known as stuinfo is created to store students' personal information, such as address, state, and parental income. The Microsoft

Excel file contains student assessment marks for the whole semester.

Table 1: Source system attributes

DBMS        DB NAME    TABLE            ATTRIBUTES
MySQL       RPS        r1               name, matricno, gender, semester, subjectcode, gred
Oracle      STUPRO     stuinfo          matricno, name, gender, address, state, spmres, parinc
Excel file  finalmark  Assessment Mark  matricno, Section A, Section B, Section C

Figure 2: Multidimensional design for the target system (student results fact table).

Figure 3: Creating a connection to the Oracle and MySQL DBMS.

These three heterogeneous data sources are then populated into the target database in a star schema using TOS. Table 1 shows the detail of the source columns in each DBMS and text file. The first step in developing ETL jobs is to ensure that a successful connection has been established between TOS and the respective databases. A GUI interface helps to perform this task, as in Figure 3. Then, using the Structured Query Language (SQL) Builder, a test connection can be conducted by viewing data through TOS. The SQL statement can be manipulated to choose entities with particular attributes on which to perform analysis. Figure 4 shows the result of an SQL statement issued to the Oracle DBMS against the stuinfo table; the result lists all of its data. Figure 5 shows the SQL Builder interface generating the result of a query made to the MySQL DBMS: a detailed listing of the r1 records from the source system.
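The star schema described above can be sketched in SQL, with Python's built-in sqlite3 standing in for the Oracle target: fctstu referencing dimpro and dimasses, populated with a dummy row and queried the way a BI front room might. The exact column lists are assumptions inferred from the description of the tables, not the paper's actual DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimpro (matricno TEXT PRIMARY KEY, name TEXT, gender TEXT, state TEXT);
CREATE TABLE dimasses (assess_id INTEGER PRIMARY KEY, subjectcode TEXT,
                       section TEXT, mark REAL);
CREATE TABLE fctstu (
    matricno  TEXT    REFERENCES dimpro(matricno),
    assess_id INTEGER REFERENCES dimasses(assess_id),
    gred      TEXT
);
""")
# Load one dummy student and one assessment into the star schema.
conn.execute("INSERT INTO dimpro VALUES ('A001', 'Ali', 'M', 'Terengganu')")
conn.execute("INSERT INTO dimasses VALUES (1, 'CSC101', 'Section A', 18)")
conn.execute("INSERT INTO fctstu VALUES ('A001', 1, 'A')")
conn.commit()

# A demographic-analysis query of the kind the BI front room would issue:
cur = conn.execute("""
    SELECT p.state, AVG(a.mark)
    FROM fctstu f
    JOIN dimpro p ON p.matricno = f.matricno
    JOIN dimasses a ON a.assess_id = f.assess_id
    GROUP BY p.state
""")
print(cur.fetchall())  # [('Terengganu', 18.0)]
```

Because the fact table holds only keys and measures while the dimensions hold descriptive attributes, grouping results by state, gender, or subject is a simple join, which is the payoff of the star-schema design.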

Figure 4: Result from stuinfo (Oracle DBMS).

Once the connections have been successfully established, ETL jobs can be designed using the ETL jobs menu. This menu provides a friendly interface in which we can simply drag and drop to map data from the data sources to the target database/DW, as shown in Figure 6.

Figure 6: ETL job from source to target.

TOS provides several functionalities for performing data extraction and transformation. One of the basic ones is tmap, which is used to develop a simple mapping from source to target; tmap also provides a feature to perform data transformation while extracting data from the sources. Figure 7 shows the tmap interface when populating the ETL jobs in the simulated academic environment.

Figure 7: Mapping using tmap.

The final process is to compile and execute the jobs. Log files provide feedback on whether a run succeeded or failed. In this simulation, all data from the heterogeneous sources were extracted to the target database and loaded into the respective DW tables.

5 EXPECTED OUTCOME

The implementation of a DW is crucial when dealing with various application systems, each with its own characteristics. Nowadays, most organizations need one central repository able to summarize all transaction data for use as guidance in decision making. Successful DW implementations have been demonstrated in various sectors. However, the main challenge of DW projects is the cost involved in implementing the technology; open source ETL is an option to reduce that cost. The usage of TOS for designing

and semi-automatically implementing ETL tasks in Java enables fast adaptation to new data sources. We have already managed to integrate various heterogeneous academic data samples and populate them in one central repository using TOS, and we hope the same results will be achieved when real-life data are used. It is important to ensure that the capabilities of open source ETL tools are equal to those of any commercial product; consequently, they will help implement DW projects at lower cost.

6 CONCLUSION & FUTURE WORK

This paper explained the design and implementation of open source ETL tools for the integration of heterogeneous databases. Our contribution is a DW architecture built with open source technologies in an academic environment. We expect the architecture to evolve as the project matures, which should help fit open source technologies into the data warehouse. The main challenge in completing this research is to implement the framework in real-life cases and to accommodate additional practical problems, including BI.

7 REFERENCES

1. Inmon, W. H.: Building the Data Warehouse. John Wiley & Sons, 1996.
2. Chaudhuri, S., Dayal, U. and Ganti, V.: Database technology for decision support systems. IEEE Computer, Vol. 34, No. 12, 2001.
3. Jarke, M., Lenzerini, M., Vassiliou, Y. and Vassiliadis, P.: Fundamentals of Data Warehouses. Springer-Verlag, 2003.
4. Kimball, R. and Ross, M.: The Data Warehouse Toolkit, 2nd edition. John Wiley & Sons, 2002.
5. Wierschem, D., McMillen, J. and McBroom, R.: What Academia Can Gain from Building a Data Warehouse. EDUCAUSE Quarterly, Vol. 26, No. 1, 2003.
6. Dell'Aquila, C., Di Tria, F., Lefons, E. and Tangorra, F.: An Academic Data Warehouse. In: Proceedings of the 7th WSEAS International Conference on Applied Informatics and Communications, Athens, Greece, August 24-26, 2007.
7. Mulcahy, R.: Business Intelligence Definition and Solutions. CIO.com. Web, accessed 10 Oct 2011.
8. Kimball, R., Reeves, L., Ross, M. and Thornthwaite, W.
: The Data Warehouse Lifecycle Toolkit. Wiley, New York, 1998.
9. Hoang, A. T. and Nguyen, B.: An Integrated Use of CWM and Ontological Modeling Approaches towards ETL Processes. In: IEEE International Conference on e-Business Engineering, 2008.
10. Reed, S. E., Na, D. Y., Mayo, T. C., Shapiro, L. W. J., Duty, B., Conklin, J. H. and Brown, D. E.: Implementing and Analyzing a Data Mart for the Arlington County Initiative to Manage Domestic Violence Offenders. In: Proceedings of the 2010 IEEE Systems and Information Engineering Design Symposium, University of Virginia, Charlottesville, VA, USA, April 23, 2010.
11. Piedade, M. B. and Santos, M. Y.: Student Relationship Management: Concept, Practice and Technological Support. IEEE, 2008.
12. Sahay, A. and Mehta, K.: Assisting Higher Education in Assessing, Predicting, and Managing Issues Related to Student Success: A Web-based Software Using Data Mining and Quality Function Deployment. Academic and Business Research Institute Conference, Las Vegas, 2010.
13. Thomsen, C. and Pedersen, T. B.: A Survey of Open Source Tools for Business Intelligence. International Journal of Data Warehousing and Mining, 2009.
14. Thomsen, C., Pedersen, T. B. and Lehner, W.: RiTE: Providing On-demand Data for Right-time Data Warehousing. In: Proc. of ICDE, 2008.
15. Inmon, W. H.: The Evolution of Integration. A White Paper by W. H. Inmon, 2007.
16. Talend Open Studio. [Online]: http://www.talend.com.
17. Fayyad, U. M., Piatetsky-Shapiro, G. and Smyth, P.: From Data Mining to Knowledge Discovery: An Overview. In: Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.
18. Golfarelli, M., Rizzi, S. and Cella, I.: Beyond Data Warehousing: What's Next in Business Intelligence? In: Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP, November 2004.
19. Schönbach, C., Kowalski-Saunders, P. and Brusic, V.: Data Warehousing in Molecular Biology. Brief. Bioinform., Vol. 1, No.
1, pp. 190-198, May 2000.