
Volume 6, Issue 3, March 2016 ISSN: X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper
Available online at:
Special Issue on 5th National Conference on Recent Trends in Information Technology 2016
Conference Held at P.V.P. Siddhartha Institute of Technology, Kanuru, Vijayawada, India

A Relative Study on Traditional ETL and ETL with Apache Hadoop

1 Y. Ramu, 2 C P Pavan Kumar Hota, 3 Dr. B. V. Subba Rao
1 Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India
2 Assistant Professor, Department of IT, Shri Vishnu Engineering College for Women, Andhra Pradesh, India
3 Professor, Dept of IT, PVP Siddhartha Institute of Technology, Vijayawada, Andhra Pradesh, India

Abstract
The ETL process is the backbone of a data warehouse or data processing system, as it supplies integrated data from heterogeneous and distributed data sources. Before data reaches the data processing system, the ETL process extracts it from multiple sources, then cleanses, formats and loads it into the system for analysis. Big data refers to data sets whose size is beyond the ability of conventional software tools to capture, manage and process within a tolerable elapsed time. In this paper, we present various ETL operations for handling big data with Apache Hadoop, the distinct Hadoop components, and a brief comparative study of traditional ETL and ETL with Hadoop systems.

Keywords
ETL, Big data, Hadoop.

I. INTRODUCTION
Web-based organizations collect a lot of data from multiple sources. Nowadays, web-scale companies are everywhere in the market, so the amount of data held by these companies becomes huge over time. Developing applications on top of this data has its own challenges: as the data keeps growing, applications become too complex to develop, too costly to operate and too slow to execute.

Big Data: Big Data is a term that describes large volumes of high-velocity, complex and variable data that require advanced techniques and technologies for the capture, storage, distribution, management and analysis of the information. Handling it requires a computing infrastructure that can take in, validate and analyze high volumes of diverse data (structured and unstructured) from multiple sources.

Figure 1: 3 V's of Big Data

Big data are large-scale heterogeneous data in terms of quantity, complexity, semantics, distribution and processing costs in computer science, information science, cognitive informatics, web-based computing, cloud computing and computational intelligence. Big data has become a popular phrase that can be understood through various definitions based on Volume, Velocity and Variety (as shown in Figure 1). Later studies found that the 3 V's definition is insufficient to explain the big data we face now. Thus, Veracity, Validity, Value and Variability were added too.

Volume: Volume refers to the huge amount of data that is generated from various sources. Overall, more than 3 exabytes of data are generated every day, and the figure keeps growing.

Velocity: Velocity refers to the speed of data generation, i.e. how fast the data is generated and processed to meet the demands and challenges of enterprise growth.

Variety: Variety refers to the format of data. Nowadays, data generated from various sources can be in three different formats:
a. Structured Data: data that has a pre-defined schema, such as tables in an RDBMS.
b. Semi-Structured Data: data that may have a pre-defined schema, but whose schema is often ignored, such as XML and JSON.
c. Unstructured Data: data that does not have any pre-defined format, such as text files, images, audio files and video files.

2016, IJARCSSE All Rights Reserved Page 74

Hadoop: Developers usually store application data in a database of their choice, such as Oracle or SQL Server. This works well when users interact with the application and the data volume is small. With huge amounts of data, however, a single database becomes a bottleneck. To overcome this problem, Google proposed a solution called "MapReduce". The MapReduce algorithm divides a task into smaller components, assigns those smaller tasks to many computers, and integrates the individual results to form the final result set. Using the MapReduce algorithm, Doug Cutting and his team developed open source software called Hadoop, in which data is processed in parallel, making it possible to develop applications that perform statistical analysis on huge amounts of data. In other words, Hadoop allows distributed processing and storage of datasets across clusters of computers. The Hadoop architecture mainly consists of two layers (as shown in Figure 2):
1. Processing layer (MapReduce)
2. Storage layer (HDFS)

Figure 2: Major Layers of Hadoop

II. TRADITIONAL ETL
ETL is a popular architectural pattern for data warehouses [5]. The ETL process extracts data from multiple sources, then cleanses, formats and loads it into a data warehouse for analysis (as shown in Figure 3 and Figure 4). When the source data sets are large, fast-changing and unstructured, the process becomes too complex to develop, too expensive to operate and too slow to execute.

Figure 3: Traditional ETL

The Extract phase pulls data from the various sources that produce it; data warehousing systems integrate data from various logically related sources. The Transformation phase converts the source data into a standardized form. The Load phase then moves or uploads the data into the target system.

Figure 4: ETL Process

III. ETL WITH HADOOP
An ETL workflow with Hadoop comprises the following tasks:
1. Ingest the input data from the various data sources, which hold data in different formats, into HDFS.
2. Map the consolidated data to a table so that it can be queried.
3. Transform the target data into a finalized format and map it to the destination.
4. Convert the information from all input data sources into the target format and make it available at a central location.
5. Use the finalized data available at the central location for reporting or analytics.
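The workflow tasks listed above can be illustrated with a small, self-contained sketch. It is not Hadoop code: plain Python strings stand in for files landed on HDFS, a dictionary stands in for the central store, and the names `CSV_SOURCE`, `JSON_SOURCE`, `extract`, `transform` and `load` are hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical raw inputs in two different formats, standing in for
# source files that have been copied into HDFS.
CSV_SOURCE = "id,name,amount\n1,alice,10.5\n2,bob,4\n"
JSON_SOURCE = '[{"id": 3, "name": "Carol", "amount": "7.25"}]'


def extract():
    """Read every source into one consolidated list-of-dicts 'table'."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(json.loads(JSON_SOURCE))
    return rows


def transform(rows):
    """Convert each record to a single target schema (int, str, float)."""
    return [
        {
            "id": int(r["id"]),
            "name": str(r["name"]).title(),
            "amount": float(r["amount"]),
        }
        for r in rows
    ]


def load(rows, store):
    """Place the finalized records in the central store, keyed by id."""
    for r in rows:
        store[r["id"]] = r
    return store


central_store = load(transform(extract()), {})

# Reporting over the central data: total amount across all sources.
total_amount = sum(r["amount"] for r in central_store.values())
print(total_amount)
```

In a real deployment each of these functions would be replaced by a Hadoop component (for example, ingestion tools for the extract step and MapReduce jobs for the transform step); the sketch only shows how the tasks feed into one another.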

Extract, Transform and Load operations can be performed with Apache Hadoop [1]. It includes various components such as MapReduce, HDFS, Apache Flume and Apache Sqoop (as shown in Figure 5).

Figure 5: ETL Process with Hadoop and its Components

MapReduce: The MapReduce technique contains two phases: i) the Map phase and ii) the Reduce phase. The overall MapReduce approach can be described as a sequence of internal tasks, as shown in Figure 6. The Map phase accepts a data set and converts it into another data set by breaking individual elements into (key, value) pairs. The Reduce phase runs after the Map phase: it accepts the output of the Map phase as its input and combines the (key, value) pairs into a smaller set of tuples. The Combiner groups similar data from the Map phase into separate sets; it accepts intermediate keys as input from the Map phase and aggregates their values within a smaller scope. The Shuffle and Sort phase shuffles the individual (key, value) pairs based on their similarity, and the pairs are then sorted by key into a larger data list. In the Reducer phase, data can be aggregated, filtered and combined in a number of ways, resulting in zero or more (key, value) pairs. Finally, the Output phase translates the (key, value) pairs from the reducer function and writes them to a file using a record writer.

Figure 6: Map-Reduce Work Flow

HDFS: The Hadoop Distributed File System is used for storing data [3] on a Hadoop cluster. Files can be read from HDFS and written to HDFS. Data files are split into blocks before being stored on the cluster. The size of each data block is either 64 MB (the default size) or 128 MB. The Hadoop file system is based on a Master/Slave architecture, as shown in Figure 7, which consists of a Name Node and Data Nodes.

Figure 7: Hadoop File System

Name Node: It is the master node, which stores metadata such as file names, the number of blocks, and details of which data nodes hold which blocks.
If the Name Node fails, the cluster becomes inaccessible. For this reason, the use of a Secondary Name Node is encouraged.
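The MapReduce phases described earlier in this section (map, combiner, shuffle and sort, reduce) can be sketched with a word count run entirely in memory. This is a minimal illustration and not the Hadoop API: the function names `map_phase`, `combine`, `shuffle_and_sort`, `reduce_phase` and `run_job` are hypothetical, each inner list in `splits` plays the role of one mapper's input, and the final output phase (writing to a file with a record writer) is omitted.

```python
from collections import defaultdict
from itertools import groupby


def map_phase(line):
    """Map: break one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate values per key within one mapper's output."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_and_sort(mapper_outputs):
    """Shuffle and sort: merge all mapper outputs, sort by key,
    and group the values that share a key."""
    merged = sorted(pair for output in mapper_outputs for pair in output)
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=lambda kv: kv[0])
    ]


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Run map + combine per split, then shuffle/sort and reduce."""
    mapper_outputs = [
        combine(pair for line in split for pair in map_phase(line))
        for split in splits
    ]
    grouped = shuffle_and_sort(mapper_outputs)
    return dict(reduce_phase(key, values) for key, values in grouped)


# Two input splits, as if the file had been divided between two mappers.
splits = [["big data etl", "etl with hadoop"], ["hadoop etl"]]
word_counts = run_job(splits)
print(word_counts)
```

The combiner reduces the volume of data passed to the shuffle step: the first split emits "etl" twice, but only one pair, ("etl", 2), leaves that mapper.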

Secondary Name Node: It performs periodic checkpoints. Data nodes cannot connect to the Secondary Name Node; it is used only for the recovery of the Name Node.

Data Node: It acts as a slave node that stores blocks of data on its local file system. Each data node periodically sends reports to the Name Node. A block report contains a list of all blocks that are available on a data node.

Apache Flume: It is a distributed system used for collecting, aggregating and moving large amounts of data from multiple sources into the Hadoop File System (HDFS), as shown in Figure 8. It contains the following components:
a. Source: feeds data to one or more channels.
b. Channel: the location that holds data events.
c. Sink: acts as the transporter that moves data events from a channel to the destination.

Figure 8: Apache Flume
Figure 9: Channel Processor

Flume helps in integrating and aggregating various data sources. It facilitates reading data from internal and external data sources, and it supports channels that hold the data events raised by data sources according to a pre-configured approach. Initially, clients send data events to agents. The sources operating within an agent receive these events and pass them through interceptors. Events that are not filtered out are put on the channels identified by the channel selector. Often there is more than one channel, so that events can be routed effectively. By default, the channel selector gives every channel a copy of each event; alternatively, a channel can be picked based on a header value. Interceptors are configured on the source element, and one source can have many interceptors. Interceptors can be used for tagging, filtering and routing, as shown in Figure 9.

Apache Sqoop: The name "Sqoop" itself stands for "SQL to Hadoop and Hadoop to SQL". It is a tool for transferring data between Hadoop and transactional databases.
In other words, using Sqoop, one can import data from other database storage systems into HDFS, apply MapReduce to that data, and then export the data back to the RDBMS. Sqoop automatically performs imports and exports in parallel, with fault tolerance. Apache Sqoop's import and export processes are described as follows:

i) Apache Sqoop import process:
Step-1: Sqoop examines the database to gather the necessary metadata for the data being imported.
Step-2: Sqoop submits a map-only Hadoop job to the cluster. This map-only job performs the data transfer using Hadoop.
Apache Sqoop imports individual tables from an RDBMS into HDFS. Each row in a table is treated as a record in the Hadoop File System. By default, the imported files contain delimited fields, with new lines separating different records. These records are stored either as text files or as binary files.

ii) Apache Sqoop export process:
Step-1: Sqoop examines the database for metadata.
Step-2: Sqoop transfers the data.
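The two import steps described above can be imitated in a few lines of Python. This is only a sketch under stated assumptions and not Sqoop's implementation: an in-memory SQLite database stands in for the RDBMS, a list of strings stands in for the files written to HDFS, the function `import_table` and the `orders` table are hypothetical, and dividing the work by key range between a fixed number of "map tasks" is an assumption made for illustration.

```python
import sqlite3


def import_table(conn, table, split_column, num_mappers=2, delimiter=","):
    """Copy a table into delimited text 'part files', one per map task.

    Assumes a non-empty table with an integer split column.
    """
    # Step-1: examine the database to gather metadata
    # (column names and the key range of the split column).
    cursor = conn.execute(f"SELECT * FROM {table} LIMIT 0")
    columns = [description[0] for description in cursor.description]
    low, high = conn.execute(
        f"SELECT MIN({split_column}), MAX({split_column}) FROM {table}"
    ).fetchone()

    # Step-2: a map-only job; each "map task" transfers one key range.
    step = (high - low) // num_mappers + 1
    part_files = []
    for i in range(num_mappers):
        start = low + i * step
        end = start + step  # exclusive upper bound of this task's range
        rows = conn.execute(
            f"SELECT * FROM {table} "
            f"WHERE {split_column} >= ? AND {split_column} < ? "
            f"ORDER BY {split_column}",
            (start, end),
        ).fetchall()
        # Each row becomes one delimited record; new lines separate records.
        part_files.append(
            "\n".join(delimiter.join(str(value) for value in row) for row in rows)
        )
    return columns, part_files


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "pen", 3), (2, "ink", 1), (3, "pad", 5), (4, "cap", 2)],
)
columns, part_files = import_table(conn, "orders", "id")
print(columns)
print(part_files)
```

With four rows and two map tasks, each task produces one part file holding two comma-delimited records.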

Sqoop divides the input data set into splits and uses an individual map task to push each split to the database. Each map task performs many transactions to ensure optimal throughput and minimal resource utilization.

IV. COMPARATIVE STUDY OF TRADITIONAL ETL & ETL WITH HADOOP
Here, we present a brief comparative study of the ETL approach in traditional and Hadoop systems. A traditional ETL approach focuses on the extraction, transformation and loading of various OLTP (On-Line Transactional Processing) data sources into a central data store. Before loading, it transforms all the data into a standardized format after performing various data cleaning tasks. The finalized data in the central store is generally in row-column form. Traditional ETL helps data professionals to query various types of data and retrieve valuable information. Because data across all operational and analytical systems is in a single uniform format, data professionals can carry out various data operations from time to time. However, they must accept the time overhead of data staging.

In recent times, Hadoop systems with HDFS allow data professionals to store all data sources as HDFS-mapped tables, which are very flexible for data queries. Source data can also be transformed into a desired format and stored at a required remote location, so storing and querying data become easy and flexible. A brief comparative study of traditional ETL and ETL with Hadoop on five different parameters is shown in Table 1.

Table 1: Traditional ETL vs. ETL with Hadoop
Comparative Parameter | Traditional ETL | ETL with Hadoop
Can the system handle unstructured data? | |
Is the system scalable to the desired extent? | |
Is the system available at low cost? | |
Does the system provide high security? | |
Does the system support dynamic real-time data analysis? | |

Building an ETL system on HDFS will certainly improve the way data is handled.
ETL with Hadoop has the following advantages over traditional ETL:
a. An improved arrangement with a centralised data centre for all the data sources.
b. No burden of data movement across multiple clusters.
c. Provision of data access in the source format.
d. Data transformations are expressed in terms of source platform features and can refer to any of the Hadoop-resident data sources as well.

We therefore believe that data professionals will find it easier to access any required data from any remote machine for their data processing or data analytics purposes. ETL with Hadoop systems will help data professionals in the current big-data era.

V. CONCLUSIONS
The effectiveness of any decision support system depends on its ETL process. In the current big-data era, the ETL approach should be refined from time to time to cope with emerging data trends. In this paper, our comparative study indicates that adopting ETL with Apache Hadoop will help data professionals: its rich set of components and tools provides improved performance in data automation, schema updates and business intelligence applications, with better scalability, security and real-time features.

REFERENCES
[1] Jaswender Malik, Kavita, "Framework for ETL with Hadoop MapReduce", International Journal of Technology Enhancements and Emerging Engineering Research, Vol. 3, Issue 07, ISSN
[2] Marcel Kornacker, Lenni Kuff, "From Raw Data to Analytics with No ETL", Cloudera, Inc.
[3] M. Bala, O. Boussaid, and Z. Alimazighi, "Big-ETL: Extracting, Transforming, Loading Approach for Big Data", Int'l Conf. Par. and Dist. Proc. Tech. and Appl., PDPTA'15
[4] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, vol. 51, no. 1, pp ,
[5] P. Vassiliadis, A. Simitsis, and S. Skiadopoulos, "Conceptual modeling for ETL processes", in Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP. ACM, 2002, pp ,