An Integration Adaptation for Real-Time Datawarehousing
Imane Lebdaoui 1, Ghizlane Orhanou 2 and Said Elhajji 3
Laboratory of Mathematics, Computing and Applications, Department of Mathematical and Computer Sciences, Faculty of Sciences, University of Mohammed V-Rabat, BP.1014 RP, Rabat, Morocco.
1 [email protected], 2 [email protected], 3 [email protected]

Abstract
Managing and storing big data makes use of powerful systems mainly based on the concept of data warehousing. Data changes usually occur in operational data sources that are mostly heterogeneous and remote from each other. To gather them in a single location called the data warehouse, they may undergo many treatments and processes according to the organization's policies and rules. Once in the data warehouse, they become ready for decision-making and analytical tools. To make fast and sound decisions based on fresh data, it is necessary that the data warehouse reflects the real operational data changes and provides the freshest data to the analytical systems. This paper addresses the problem of integrating big data into the data warehouse in a short time and proposes a new model called DJ-DI. Based on the division of data changes and the adaptation of table joins, this model increases the data integration rate, and thus data reach the data warehouse in a shorter time. We have conducted different simulations of the DJ-DI model on our experimental platform. The obtained results show that the DJ-DI model offers a remarkable improvement of the data integration rate.

Keywords: Big data, data changes, division, data warehouse, real-time, data integration, join

1. Introduction
The era of big data induces organizations to utilize robust systems to manage data efficiently in order to turn them into useful and valuable information. Data are first generated and handled by operational data sources (ODS), which are usually heterogeneous and far from each other. In order to make them ready for decision-making processes and analytical tools, data need to be extracted from the operational sources, transformed, cleansed and integrated into one repository called the data warehouse, which remains the main and best-known instrument for managing big data. Actually, a Data Warehouse (DW) gathers both the new and the historical data that have existed, or still exist, in some form in the operational data sources. Data integration is the most important stage in data warehousing since it constitutes 80% of the work in warehousing projects [7]. The following schema (Figure 1) summarizes the process of data warehousing and data integration and shows a comparison between the ODS and the DW.
Figure 1. Comparison between ODS and DW

Since real time (RT) has become a strong requirement (for instance in e-business, stock broking and online telecommunications) and large amounts of data are a fact, it is necessary to manage data that come from different operational sources and deliver them in a timely manner to the data warehouse. In fact, real-time business data lose value as they become older [5]. Real-time data warehousing (RTDW) is one potential trend that proposes solutions to manage large amounts of data in RT fashion. Indeed, to fulfill real-time requirements [8], systems use real-time based technologies such as real-time or near real-time ETL, real-time data integration, real-time change data capture and real-time partitions [8, 9]. However, such technologies impose additional requirements on the system [8] and further costs that can be heavy for small organizations. To get around this situation and save time, data sometimes do not undergo all the treatments they should receive. Some organizations mistakenly skip treatments that are essential; essential because they should be respected in the same way they were first established. The sequencing of these treatments provides the level of data quality the organization aims at. Some newcomers to real-time data warehousing propose bypassing the whole data integration process by moving data directly from the operational sources to the business reports [4] (Figure 2). Such a process certainly ensures quick access to data for reports, but it is risky since it jeopardizes data quality and therefore data integrity. Actually, integrity may be hampered because the report would need specific privileges to access the data in the operational sources, and the existing security rules in the operational systems would have to be reviewed accordingly.

Figure 2. Both Possible Ways of Feeding Business Intelligence (BI) Reports
In our paper, we present a new methodology for improving data integration while preserving data integrity by respecting the existing security rules. Our approach, called Divide-Join Data Integration (DJ-DI), behaves according to the size of the data changes. We have structured the rest of the paper as follows: Section 2 addresses related works on data integration improvement and its issues. In Section 3, we introduce our methodology of data integration based on division and join adaptation, then Section 4 presents the experimental results. Finally, Section 5 concludes with the paper's findings and discussion.

2. Related Works
To allow decision systems to react efficiently to the real world's requirements, real-world data have to feed and update them correctly and in a timely manner. A data warehouse, which is a component of decision systems, needs to be refreshed with the operational data changes to include the most recent operational records. This mechanism is commonly called data integration (DI). The conventional approach to DI highlights two types of systems: source systems and the target system. Under this approach, all data are first extracted from the source systems and then integrated into the target system through an incremental strategy [5]. For any kind of data warehousing (traditional, near real-time or real-time), the goal of DI is to gather data from different operational data sources that are likely heterogeneous [6], in order to combine them and then present them in such a way that they appear to be a unified whole of all the available information [6]. In business terms, data integration is what allows the 360-degree view of the business [6]. Other meanings of DI include the ability to access, parse, normalize, standardize, integrate, cleanse, extract, match, classify, mask, and deliver data [7]. DI is important and crucial at the same time. It is important since it has been strongly demanded by organizations since 2008 [2], which have started to use different variants of DI. Furthermore, it constitutes 80% of the work in data warehousing and, when it is done correctly, it assures that data comply with policies for data access, use, security, privacy, and data standards [4]. It becomes crucial when DI delivers inconsistent, incomplete and inaccurate data to the business. In that case, serious problems may occur, especially those related to data quality [7]. Consequently, these issues involve extra delays and poor customer service, and the most critical drawback is jeopardizing decision-making processes, which may compromise the organization's future as well as the customers' loyalty and trust [7]. Thus, many works [1, 7] have presented mechanisms to accelerate data integration in order to save time and make data available for decision systems in real-time or at least near real-time fashion. The authors of [1] present a model of refreshing the DW with the necessary changes in a near real-time manner, while managing three indicators that help in deciding when and what to update. In [7], the author describes a real-time loading methodology that duplicates the schema of the DW to store the temporarily updated information without defining any type of index, primary key, or constraints. This approach may enable the DW to obtain continuous data integration with minimum On-Line Analytical Processing (OLAP) response time; however, data integrity becomes a serious issue that cannot be neglected, as we mentioned in our previous work [3].
3. Data Integration by Division and Join Adaptation Mechanism
To illustrate our methodology, we consider N operational data sources (ODS 1, ODS 2, ..., ODS N) whose data should be integrated into a fact table in the DW, in real time. In other words,
data integration should be performed, in real time, as the operational data changes occur. We assume that a map's job that performs transformations, joining and checking ensures the minimum DI routines. Figure 3 schematizes our simplified configuration.

Figure 3. A Simplified Schema of Data Integration

In the above schema (Figure 3), data flow into a map's job where they undergo transformation, cleansing and checking, and may need some table joins. Afterwards, they are integrated into the target fact table, where they become available for decisional reports and analytical tools.

Assumptions. Given:
- N operational data sources ODS 1, ODS 2, ..., ODS N, with N ≥ 1;
- the operational Tables T i,j, where T i,j is the j-th Table of the i-th operational data source, with i, j ≥ 1;
- at the instant t when a data change is observed, we denote by DC i,j,k,l (t) the data changes that occur in T i,j and in T k,l, which are joined (T i,j ⋈ T k,l), and by S i,j (t) and S k,l (t) their respective sizes.

The key idea of our model starts from the values of the change sizes S i,j (t) and S k,l (t), which drive its behavior. Our methodology is based on three (3) main axes:
- instantaneous data change measurement;
- integration adaptation;
- OLAP query adaptation.

We appeal to the divide-and-conquer principle, in a restricted way: we assume that dividing indefinitely may threaten the system's performance, which is the opposite of our model's goal.

3.1 Instantaneous Data Change Measurement
When a change occurs in the operational Tables, our model launches a trigger that records the size of the data changes. At a specific instant t, for each data change that is recorded into a separate Table, we have created the following function, which instantaneously records the size of the change.
Algorithm n 1: Data Change Measurement

FUNCTION getsize (t_table_name VARCHAR2) RETURN NUMBER IS
  l_size NUMBER;
BEGIN
  l_size := 0;
  -- xxx denotes the segment dictionary view (with segment_name and bytes columns)
  -- queried to obtain the Table's size in MB
  SELECT SUM(bytes) / (1024 * 1024)
    INTO l_size
    FROM xxx
   WHERE segment_name = t_table_name;
  RETURN l_size;
END getsize;

According to the recorded value, one of the two modes of the map's job is launched. One way of dividing the data changes would be to measure the change in each Table that has received a new change and to divide all of the new changes in all Tables. This solution implies, each time, the creation of many Tables, each of which receives one division of the data change. Unfortunately, experimenting with this solution showed its complexity, while it does not bring a remarkable improvement of the data integration time. Thus, we have based our model on the method shown in Figure 4.

Figure 4. DJ-DI Model Switches between the Original Configuration and the Adapted One

The method consists of the following points:
- We assume that new changes may concern many Tables of the ODS, at least two Tables (t 1 and t 2). We concentrate our interest, and so the model's effort, on the Table that has received the smaller change volume. To simplify the explanation of our methodology, we call this Table t 2 and the other one t 1.
- Thus, we divide the data change volume of t 2 into two portions, as follows:
Given i an integer, we insert the i-th division of the changes (the one that occurred in Table t 2) into t 2i. The data changes of the other Tables do not receive any division or partitioning.
- We create two Tables t 21 and t 22 that have the same structure as t 2 and the same integrity constraints, and we insert each data change division into t 2i, where 1 ≤ i ≤ 2.
- We rename t 2 to another name, and we rename t 2i to t 2. In this way, at the instant t when a data change is captured, t 2 contains just one division of the data change.
- Then, the model launches the map's job on t 2, which has joins with the other Table t 1.
- Thus, only one portion of the data change is integrated each time. When the map's job is over, t 2 is renamed back to t 2i and t 2(i+1) is renamed to t 2.
- The other portions are integrated thereafter in the same way.
- At the end, the original Table gets its name back and the t 2i are emptied.

The following simplified algorithm shows the main routines of the DJ-DI methodology.

Algorithm n 2: DJDI

PROCEDURE DJDI IS
  h NUMBER;
  i NUMBER;
BEGIN
  h := getsize(data_change());
  IF h <= t_h THEN
    execute map's job();
  ELSE
    divide data_change(t_2) into t_21 and t_22;
    i := 1;
    WHILE i <> 3 LOOP
      EXECUTE IMMEDIATE 'alter table t_2' || i || ' rename to t_2';
      COMMIT;
      execute map's job();
      EXECUTE IMMEDIATE 'alter table t_2 rename to t_2' || i;
      COMMIT;
      i := i + 1;
    END LOOP;
  END IF;
  COMMIT;
END;
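Section 3.1 mentions such a change-capture trigger without listing it; the following minimal PL/SQL sketch illustrates one possible form, where the log table dc_log, the trigger name and the monitored Table t_2 are illustrative assumptions and getsize is the function of Algorithm n 1.

CREATE TABLE dc_log (
  table_name  VARCHAR2(30),
  change_time DATE,
  size_mb     NUMBER
);

CREATE OR REPLACE TRIGGER trg_t2_change
AFTER INSERT OR UPDATE OR DELETE ON t_2
BEGIN
  -- record the instantaneous size (in MB) of the changed Table;
  -- segment names are stored in upper case in the Oracle dictionary
  INSERT INTO dc_log (table_name, change_time, size_mb)
  VALUES ('T_2', SYSDATE, getsize('T_2'));
END;
/

The DJDI procedure (Algorithm n 2) can then read the latest recorded size_mb value and compare it to the threshold t_h in order to choose which mode of the map's job to launch.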
3.2 Integration Adaptation
Relying on the divide-and-conquer principle, our methodology proposes performing data integration while respecting the existing organization's rules. Actually, the principle that governs our methodology in the integration phase rests on two pillars:
- Pillar n 1: it is faster to insert new rows into a Table that has little or no content than into a large Table [8].
- Pillar n 2: it is faster to integrate small amounts of data than to integrate one big amount [8].

Compared to the work presented in [8], our methodology proposes to preserve data integrity by maintaining the same existing integrity constraints. For this axis, our model appeals to renaming the original fact Table and creating a duplicate of it that bears the original name; this duplicated fact Table receives the newest data. We avoid creating and dropping Tables each time in order not to disturb the system's performance. The same key idea governs this axis and consists of switching automatically between direct integration, when the data changes are small, and fact Table renaming, when the data changes are important. We have defined a threshold t h that determines whether the data changes are important or not. Thus, our model integrates the new changes in one of the following ways:

Way 1: when the data changes are not important. If the data change is less than the threshold value t h, it is integrated normally according to the original configuration.

Way 2: when the data changes are important, i.e., greater than t h, the model activates a mechanism that renames the original fact Table, duplicates it and gives the duplicate the real name of the fact Table. In this case, the map's job is executed, but towards an empty fact Table, which means faster data insertion (according to the aforementioned pillars). The following schema shows the change we have brought to the original configuration in case of important data changes.

Figure 5. Fact Table Renaming and Data Change Division

In this latter case, OLAP query adaptation is needed. The following algorithm shows the main actions of this step.
Algorithm n 3: Fact_Table_Duplication

PROCEDURE FctDuplication IS
  h NUMBER;
BEGIN
  h := getsize(data_change());
  IF h <= t_h THEN
    execute map's job();
  ELSE
    rename Fct 0 to a temporary name and rename Fct 1 to Fct 0;
    execute map's job();
  END IF;
  COMMIT;
END;

3.3 OLAP Query Adaptation
We resort to OLAP query adaptation when the data changes are important. Under these circumstances, our model duplicates the fact Table. Thus, data are inserted into the fact Tables (the primary fact Table and the secondary fact Table) and they can be made available to OLAP systems by adjusting the queries as follows:

Select (field 1, ..., field n) from Fct0 where condition1, condition2

becomes

Select (field 1, ..., field n) from Fct0 where condition1, condition2
join
Select (field 1, ..., field n) from Fct1 where condition1, condition2

Consequently, the new data changes are available on the DW side more quickly and can be retrieved by querying two Tables instead of one. The original Table contains the historical data while the second one contains the most recent data changes.
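Concretely, assuming (as stated above) that Fct0 and Fct1 share the same structure, one possible way to combine the two result sets in standard SQL is a UNION ALL; the field and condition names below are the placeholders used in the queries above.

SELECT field1, field2, fieldn
  FROM Fct0                    -- historical data
 WHERE condition1 AND condition2
UNION ALL
SELECT field1, field2, fieldn
  FROM Fct1                    -- most recent data changes
 WHERE condition1 AND condition2;

A view defined over this UNION ALL and bearing the original fact Table's name would let existing reports keep their queries unchanged; this is a possible refinement rather than a requirement of the model.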
4. Performance Evaluation of the DJ-DI Model
In this section, we report the performance evaluation of our method. The algorithms are implemented in PL/SQL. All the experiments were conducted on an HP PC with an Intel i CPU at 2.50 GHz and 2.89 GB of memory, running Oracle 11g [9] under Windows 7. We implement the following map's job under Oracle Warehouse Builder (OWB) 11g [9]:

Figure 6. The Initial Data Integration Schema

The ROLAP approach is used to store data within the DW (the fact Table) because it relies on conventional SQL techniques. We generate datasets automatically and run our model multiple times while varying the number of datasets and thus the size of the data changes. We have run the new model for many months. We presume that the data changes concern many fields of two Tables (Table pers and Table sit). When the data change size is important, i.e., greater than t h, we execute the map's job twice, as previously explained in Section 3. We have run three (3) scenarios that we compare thereafter:
- Normal mapping (1);
- Fact Table duplication only (2);
- Data change division only (3).

In order to follow the performance of the model, we have defined two metrics:
- the elapsed time for each scenario, expressed in seconds (s);
- the rate of data integration, expressed in megabytes per second (MB/s).

Furthermore, we have also defined an indicator to compare the performance between the scenarios. This indicator is expressed as follows: Improvement (1)/(3) = 1 - Intgr_rt(1) / Intgr_rt(3), where Intgr_rt(i) denotes the integration rate of scenario i.

Table 1 shows the percentage of improvement of the integration rate between scenarios (1) and (3). We observe that the integration rate achieved with the newly adapted configuration (change division) is significantly greater than the rate obtained under the normal configuration. This difference becomes more important when the size of the data changes increases, and therefore the time saved by the new configuration is evident. We deduce from Table 1 that when the data change size reaches 64 MB, it becomes useful to launch scenario n 3. Thus, we set the threshold t h of our model to 64 MB.
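For illustration, applying this indicator to the 64 MB row reported in Table 1 below gives Improvement (1)/(3) = 1 - 14,49 / 27,05 ≈ 0,46, i.e., an improvement of about 46%.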
According to the value of t h, the system behaves differently, as explained in the previous section. For important data changes, we execute the map twice on small quantities of data instead of executing it once on a big quantity of data (we split the changes into two pieces). To test our model, we conduct experiments according to two situations:
- Situation n 1: inserting data into an empty fact Table;
- Situation n 2: inserting data into a non-empty fact Table.

Table 1. Integration Rate Comparison (Scenarios 1 and 3), with t h = 64 MB

Change size (MB)   Direct mapping (MB/s)   Change division (MB/s)   Improvement (%)
16                 16,00                   16,00                     0%
40                 31,11                   40,00                    22%
64                 14,49                   27,05                    46%
72                 18,00                   36,00                    50%
80                 20,67                   30,00                    31%
...                ...,68                  30,22                    61%
...                ...,20                  58,67                    67%
...                ...,35                  41,60                    58%
...                ...,04                  43,91                    59%
...                ...,22                  30,55                    37%
...                ...,06                  34,30                    44%
...                ...,86                  28,80                    31%
...                ...,23                  33,53                    37%
...                ...,05                  29,15                    31%
...                ...,31                  29,41                    24%
...                ...,28                  20,82                    22%

Situation n 1: inserting data into an empty fact Table
We observe the speed of execution of the map's job and the rate of data integration while running the three scenarios:
- Normal mapping (1);
- Fact Table duplication only (2);
- Data change division only (3).
We handle 500 MB of operational data changes, which generate about 1600 MB in the target Tables (the data warehouse).

Figure 7. Elapsed Times (in seconds) when Integrating 500 MB of Data Changes

As shown in the abovementioned chart, our model improves the data integration time by 12.5% (between scenarios (1) and (3)). Indeed, the same data changes need 102 s to be integrated into the DW using the normal configuration (1), while they need just 87 s using our model (3). Scenario (2), which consists of duplicating the fact Table, does not bring a real improvement since the original fact Table is already empty. Another representation of the results is shown in Figure 8, which displays a comparison between the rates of data integration obtained with the three scenarios, each of them being run four times:

Figure 8. Rates of Data Integration when Integrating 500 MB of Data Changes - Empty Fact Table
When doubling the volume of the data changes, i.e., by handling 1 GB of data changes that generate around 6.3 GB of data in the data warehouse, we obtain the following chart:

Figure 9. Elapsed Times (in seconds) when Integrating 1 GB of Data Changes

Situation n 2: inserting data into a non-empty fact Table
We consider that the fact Table contains 1600 MB of data. We run each of the abovementioned scenarios four times. In addition to the three abovementioned scenarios, we have added a fourth scenario (4), which is a combination of scenarios (2) and (3); we call it the mixed scenario. This scenario summarizes our full model. We show the results of comparing the rates of data integration in Figure 10. Clearly, we observe that our model, based on data change division and join adaptation, allows faster data integration. It offers up to 67% improvement compared to the normal configuration. Furthermore, when combining data change division with the fact Table duplication option, we observe that, together, they do not lead to much better results than the data change division scenario alone. The scenario of dividing the data changes remains the fastest.

Figure 10. Rates of Data Integration when Integrating 1600 MB of Data Changes - Non-empty Fact Table
5. Conclusion and Future Work
In this paper, we have presented the DJ-DI model for data integration, based on division and join adaptation. This model, built on dividing the volume of the data changes and on fact Table duplication, shows a real improvement of the data integration rate compared to the normal configuration. The model behaves according to the size of the data changes: the map's job is launched once or multiple times in order to integrate data into the destination in a shorter time while preserving data integrity. We consider that our methodology makes data available in the DW in a short time while respecting the existing organization's rules. It respects all the steps of data warehousing and especially their sequencing. Future work may include experimenting with this model while managing the different levels of data categories that we presented in our previous work [3].

References
[1] L. Chen, J. W. Rahayu and D. Taniar, "Towards Near Real-Time Data Warehousing", Proceedings of the 24th IEEE International Conference on Advanced Information Networking and Applications, (2010).
[2] Oracle White Paper, "Real-Time Data Integration for Data Warehousing and Operational Business Intelligence", (2010).
[3] I. Lebdaoui, G. Orhanou and S. El Hajji, "Data Integrity in Real-time Data Warehousing", Proceedings of the World Congress on Engineering (WCE 2013), vol. III, (2013).
[4] P. Russom, "Data Integration for Real-Time Data Warehousing and Data Virtualization", TDWI Checklist Report, (2010).
[5] Oracle White Paper, "Best Practices for Real-Time Data Warehousing", (March 2014).
[6] R. Kimball and J. Caserta, "The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data", Wiley Publishing, Canada, (2004).
[7] A. Reeve, "Managing Data in Motion: Data Integration Best Practice Techniques and Technologies", Kindle Edition, (2013).
[8] R. J. Santos and J. Bernardino, "Real-Time Data Warehouse Loading Methodology", Proceedings of the International Symposium on Database Engineering & Applications (IDEAS '08), (2008).
[9] Oracle Corporation, (2014).

Authors

Imane Lebdaoui received in 2005 the State Engineer Diploma in Information Systems from the Hassania School of Public Works (EHTP, Morocco). She is a PhD student in the Laboratory of Mathematics, Computing and Applications, Department of Mathematical and Computer Sciences, Faculty of Sciences, University of Mohammed V-Rabat, Morocco. Her research interests include database management, database security, data warehouses, real-time and big data.

Ghizlane Orhanou is an Associate Professor in the Computing Sciences Department at the University Mohammed V Agdal, Morocco. She received her Ph.D. degree in computer sciences from the University Mohammed V Agdal, Morocco, and, in 2001, a Telecommunication Engineer diploma from the Telecommunication Engineering Institute (INPT, Morocco). Her main research interests include networked and information systems security.

Said El Hajji has been a Professor in the Mathematics Department at the University Mohammed V Agdal, Morocco, since 1991. He is responsible for the Mathematics, Computing and Applications Laboratory. He received his PhD degree from Laval University, Canada. His main research interests include modeling and numerical simulations, and security in networked and information systems.