Towards a Semantic Extract-Transform-Load (ETL) framework for Big Data Integration

2014 IEEE International Congress on Big Data Towards a Semantic Extract-Transform-Load (ETL) framework for Big Data Integration Srividya K Bansal Dept. of Engineering & Computing Systems Arizona State University Mesa, AZ, USA srividya.bansal@asu.edu Abstract Big Data has become the new ubiquitous term used to describe massive collection of datasets that are difficult to process using traditional database and software techniques. Most of this data is inaccessible to users, as we need technology and tools to find, transform, analyze, and visualize data in order to make it consumable for decision-making. One aspect of Big Data research is dealing with the Variety of data that includes various formats such as structured, numeric, unstructured text data, email, video, audio, stock ticker, etc. Managing, merging, and governing a variety of data is the focus of this paper. This paper proposes a semantic Extract- Transform-Load (ETL) framework that uses semantic technologies to integrate and publish data from multiple sources as open linked data. This includes - creation of a semantic data model to provide a basis for integration and understanding of knowledge from multiple sources; creation of a distributed Web of data using Resource Description Framework (RDF) as the graph data model; extraction of useful knowledge and information from the combined data using SPARQL as the semantic query language. Keywords Big data; Data integration; Ontology; Semantic technolgies; I. INTRODUCTION There has been an exponential growth and availability of data, both structured and unstructured. Big Data has become the new ubiquitous term used to describe massive collection of datasets that is so large that it's difficult to process using traditional database and software techniques. Big Data may comprise of petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people - all from different sources (e.g. Web, sales, customer contact center, social media, mobile data, etc). The data is typically loosely structured data that is often incomplete and inaccessible. Big Data is transforming science, engineering, medicine, healthcare, finance, business, and ultimately society itself. Massive amounts of data are available to be harvested for competitive business advantage, sound government policies, and new insights in a broad array of applications (including healthcare, biomedicine, energy, smart cities, genomics, transportation, etc.). Yet, most of this data is inaccessible for users, as we need technology and tools to find, transform, analyze, and visualize data in order to make it consumable for decision-making [1]. Research community also agrees that it is important to engineer Big Data meaningfully [2]. Meaningful data integration in a schema-less, and complex Big Data world of databases is a big open challenge. Big Data research is usually discussed in the areas of the 3V s Volume, Velocity, and Variety. Volume being the storage of massive amount of data streaming in from social media, sensors, and machine-to-machine data being collected; determination of relevance within large data volumes; and use of analytics to create value from relevant data. Velocity is streaming in at unprecedented speed at which the data is streaming in and reacting quickly enough to deal with it in near-real time. Variety deals with the various types of formats in which data comes in (structured, numeric, unstructured text data, email, video, audio, stock ticker, etc.). Big Data challenges are not only in storing and managing this variety of data but also extracting and analyzing consistent information from it. Researchers are working on creating a common conceptual model for the integrated data [3]. Managing, merging, and governing heterogeneous data is an open research challenge that this paper focuses on. There has been a tremendous increase of published data on the Web. Linked Open Data community effort has led to a huge data space with 31 billion RDF triples as shown in [4]. This data can be used in a number of interesting Web applications, mobile applications, and for analytics. An interesting example application would be a Smart City project that would integrate and use information from various sources such as transportation, weather, social media streams, maps, energy, real estate, policies, crime reports, etc. Government agencies are increasingly making their data accessible through initiatives such as data.gov to promote transparency and economic growth [5]. For example, a traffic jam that emerges due to an unplanned protest may be captured through a twitter stream, but missed when examining weather conditions, event databases, reported roadwork, etc. Additionally, weather sensors in the city tend to miss localized events such as flooding. These views of the city combined however, can provide a richer and more complete view of the state of the city, by merging traditional data sources with messy and unreliable social media streams thereby contributing to smart living, people, environment, economy, mobility, and governance. We need ways to organize a variety of data such that common things are represented together, while the things that are distinct can be represented as well. This will allow effective and creative use of search/query engines and analytic tools for Big Data, which is absolutely essential to 978-1-4799-5057-7/14 $31.00 2014 IEEE DOI 10.1109/BigData.Congress.2014.82 535 522

create smart and sustainable environments. This project focuses on dealing with the Variety V of Big Data more specifically creating a semantic extract-transform-load framework for data integration. This paper proposes the use of semantic technologies to connect, link, and load data into a data warehouse. This includes: (i) creation of a semantic data model via ontologies to provide a basis for integration and understanding knowledge from multiple sources; (ii) creation of integrated semantic data using Resource Description Framework (RDF) as the graph data model [6]; (iii) extracting useful knowledge and information from the combined web of data using SPARQL as the semantic query language [7]. This paper presents the preliminary implementation and results using a few sample public datasets that provide household travel data, vehicle data, and fuel economy data. The rest of the paper is organized as follows: Section 2 presents related work in this area. Section 3 presents a motivating scenario showing an application of meaningful data integration. Section 4 presents the conceptual model for a semantic ETL framework. Section 5 presents the system architecture and prototype implementation followed by conclusions and future work. II. RELATED WORK One of the popular approaches to data integration has been Extract-Transform-Load (ETL) as shown in [8], [9]. Authors of this work have described a taxonomy of activities in ETL and a framework that uses a workflow approach to design ETL activities. They used a declarative database programming language called LDL to define the semantics of ETL activities. Similarly there are other groups that have used various other approaches such as UML and data mapping diagrams for representing ETL activities, quality metrics driven design for ETL, and scheduling of ETL activities [10] [13]. The focus in all of these papers has been on the design of ETL workflow and not about generating meaningful/semantic data. A semantic approach to ETL technologies was proposed in [14], [15]. In this approach semantic technologies are used to further enhance definitions of the ETL activities involved in the process rather than the data itself. A tool that allowed semi-automatic definition of inter-attribute semantic mappings, by identifying parts of data source schemas, which are related to the data warehouse schema was proposed in [15]. This supported the extraction phase of ETL. The use of semantics here was to facilitate the extraction process and workflow generation with semantic mappings. In contrast, our approach uses ontologies to provide a common vocabulary for the integrated data and generates semantic data as part of the transformation phase of ETL. This semantic data is then loaded in the data store or warehouse for querying and analytics. There is related work on creating a common conceptual model for data integration to handle the heterogeneous data coming from multiple sources. Reference [3] uses processmining techniques to handle the heterogeneity by computing the mismatch among data sources being integrated. Work has been done on data abstraction and visualization tools for Big Data [16] that is applicable after the data integration phase. Numerous studies are being done on the analysis of Big Data using various algorithms such as influence-based module mining (IBMM) algorithm [17], online association rule mining [18], graph analytics [19], and provenance analysis support framework [20]. The method of publishing and linking structured data on the web is called Linked Data. This data is machinereadable, its meaning is explicitly defined, it is linked to other external data sets, and it can be linked to from other data sets as well [4]. Reference [21] uses these interlinked RDF data stores from the Linked Open Data (LOD) cloud and queries them using SPARQL to perform analysis. They use statistical Learning predictive models to learn classifiers from the data. III. MOTIVATING SCENARIO There has been a growing demand and interest in mobile applications for automotive industry that enhance the driver s experience through personalization, provide feedback on optimizing vehicle performance, diagnosis and part failure detection, and driver assistance. Most of these applications rely on Big Data that is available to the public through the cloud or as linked open data. Consider the following scenario where a typical driver, John gets into his car in the morning and turns on the ignition. A built-in innovative application in the car greets him and asks him if he is going to work based on the time of the day. John responds by saying yes and the app replies that the vehicle performance has been optimized for the trip. The built-in system uses the GIS system, road grade data, and speed limits data to create an optimal velocity profile. As John starts driving and is approaching Recker road to turn left, the app informs John about a road repair on Recker road for a 1-mile stretch on his route up to 3pm that day. The app suggests that John should continue driving and take the next left on Power road. John follows the suggestion and answers a text message that he receives from his collaborator. As he is answering to the text message, his car drifts into the neighboring lane. The app immediately notifies John that he is departing from his lane and John quickly adjusts his driving. As John approaches his workplace he drives towards Green Lot 1 where he usually parks. The app informs John that there are only 2 parking spots open in Green Lot 1. As John is already running late for a meeting by 5 minutes, he decides to directly drive to the next parking lot Green Lot 2 and avoid spending the time looking for the 2 empty spots in Lot 1. As John enters Lot 2 and is driving towards one of the empty spots, he gets too close to one of the parked cars. The app immediately warns John of a collision. John quickly adjusts his car away from the parked cars and parks in one of the empty spots. The app logs tracking data about John s style of driving on the server for future use. In order to build such apps for automobiles, access to a number of data sets from various sources is required. Some of this data is real-time data that is continuously being updated. Data related to traffic, road repairs, emergencies, accidents, driving habits, maps, parking, fuel economy data, household data, diagnosis data, etc. would be required. 536 523

Various other apps could be built that focus on reducing energy consumption, reducing emissions, and using existing related data to analyze and provide useful feedback. Another useful app is for fuel economy guidance that is based on actual vehicle data, road conditions, traffic, and most importantly personal driving habits. This app would show the car s fuel efficiency, and provide feedback on how good or bad it is based on data from others in the community. It also provides guidance on improving one s personal driving habits and thereby saving on fuel. It is important to effectively integrate data such that the data is tied to a meaningful and rich data model that can be queried by these innovative applications. IV. CONCEPTUAL MODEL In this section we present our conceptual model by describing existing ETL process, our proposed idea of a Semantic ETL process and describe the semantic technology stack that will be used in our model. A. Extract-Transform-Load (ETL) Process Extract-Transform-Load (ETL) process in computing has been in use for integration of data from multiple sources or applications, possibly from different domains. It refers to a process in data warehousing that extracts data from outside sources, transforms it to fit operational needs, which can include quality checks, and loads it into the end target database, more specifically, operational data store, data mart, or data warehouse. A number of tools facilitate the ETL process, namely IBM Infosphere [22], Oracle Warehouse Builder [23], Microsoft SQL Server Integration Services [24], and Informatica Powercenter for Enterprise Data Integration [25]. Talend Open Studio [26] and Pentaho Kettle [27] are two open source ETL products. The three phases of this process are described as follows: Extract: this is the first phase of the process that involves data extraction from appropriate data sources. Data is usually available in flat file formats such as csv, xls, and txt or is available through a RESTful client. Transform: this phase involves the cleansing of data to comply with the target schema. Some of the typical transformation activities involve normalizing data, removing duplicates, checking for integrity constraint violations, filtering data based on some regular expressions, sorting and grouping data, applying built-in functions where necessary, etc. [8] Load: this phase involves the propagation of the data into a data mart or a data warehouse that serves Big Data. Most ETL tools provide a graphical interface to create a workflow of ETL activities and automate their execution. Figure 1 shows an example workflow created by Talend to transform and load NASA Aviation Safety Reporting System data into a database. Figure 1: Example ETL workflow B. Semantic ETL In this paper we introduce the use of semantic technologies in the Transform phase of an ETL process to create a semantic data model and generate semantic linked data (RDF triples) to be stored in a data mart or warehouse. The transform phase will still continue to perform other activities such as normalizing and cleansing of data. Extract and Load phases of the ETL process would remain the same. Figure 2 shows the overview of activities in semantic ETL. Transform phase will involve a manual process of analyzing the datasets, the schema and their purpose. Based on the findings, the schema will have to be mapped to an existing domain-specific ontology or an ontology will have to be created from scratch. If the data sources belong to disparate domains, multiple ontologies will be required and alignment rules will have to be specified for any common or related data fields. C. Technology Stack Semantic web technologies facilitate: the organization of knowledge into conceptual spaces, based on their meanings; extraction of new knowledge via querying; and maintenance of knowledge by checking for inconsistencies. These technologies can therefore support the construction of an advanced knowledge management system [28], [29]. The following Semantic technologies and tools are used as part of our Semantic ETL framework: Uniform Resource Identifier (URI), a string of characters used to identify a name or a web resource. Such identification enables interaction with representations of the web resource over a network (typically the Web) using specific protocols. Resource Description Framework (RDF) [6], a general method for data interchange on the Web, which allows the sharing and mixing of structured and semi-structured data across various applications. As the name suggests, RDF is a language for describing web resources. It is 537 524

Figure 2: Overview of Semantic ETL Process used for representing information, especially metadata, about web resources. RDF is designed to be machinereadable so that it can be used in software applications for intelligent processing of information. Web Ontology Language (OWL) [30] is a markup language that is used for publishing and sharing ontologies. OWL is built upon RDF and an ontology created in OWL is actually a RDF graph. Individuals with common characteristics can be grouped together to form a class. OWL provides different types of class descriptions that can be used to describe an OWL class. OWL also provides two types of properties: object properties and data properties. Object properties are used to link individuals to other individuals while data properties are used to link individuals to data values. OWL enables users to define concepts in a way that allows them to be mixed and matched with other concepts for various uses and applications. SPARQL a RDF Query Language [7], which is designed to query, retrieve, and manipulate data stored in RDF format. SPARQL allows for a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns. SPARQL allows users to write queries against data that can loosely be called "key-value" data, as it follows the RDF specification of the W3C. The entire database is thus a set of "subject-predicate-object" triples. Protégé Ontology Editor - Protégé is an open-source ontology editor and framework for building intelligent systems [31]. It allows users to create ontologies in W3C s Web Ontology Language. It is used in our framework to provide semantic metadata to the schema of datasets from various sources. V. IMPLEMENTATION In this section we present our prototype implementation of the semantic ETL process using public datasets on National Household travel survey data and EPA s Fuel Economy data. These data sets were chosen to build a prototype for a proof-of-concept of the semantic ETL framework. Table 1: Sample data fields from National Household data set 538 525

A. Data sets Our first data source is the National Household Travel survey (NHTS) data for the year 2009 published by the U.S. Department of Transportation. This data was collected to assist transportation planners and policy makers who needed transportation patterns in the United States. The dataset consists of daily travel data of trip taken in a 24-hour period with information on the purpose of the trip (work, grocery shopping, school dropoff, etc.), means of transportation used (bus, car, walk, etc.), how long the trip took, day of week when it took place, and additional information in case a private vehicle was used. For private vehicle driver characteristics and vehicle attributes were all also collected. This data was collected for all areas of the country, urban and rural. This dataset has been used by research community to study relationships between demographics and travel, correlation of the modes of transportation, amount, or purpose of travel with the time of the day and day of the week. We chose this dataset to integrate the private vehicles used by members of the household and fuel economy of those corresponding vehicle models by getting that information from another dataset on Fuel Economy. The second data source is the U.S. Environmental Protection Agency s (EPA) data on Fuel Economy of vehicles. This dataset provided detailed information about vehicles produced by all manufacturers (make) and models. It provided detailed vehicle description, mileage data for both city drive and highway drive, mileage for different fuel types, emissions information, and fuel prices that included regular, midgrade, and premium gasoline, diesel, electric, lpg, e85, and cng. This data was obtained as a result of vehicle testing conducted by EPA s National Vehicle and Fuel Emissions Lab and the manufacturers with an oversight by EPA. Table 1 & 2 show sample data fields from the National Household Travel dataset. Table 3 shows sample data fields from the Fuel Economy dataset. B. Semantic Data Model generation We used Protégé OWL [31] editor for ontology engineering and created a data model that comprised of primary classes Person, Household, Vehicle at the top most level. All the fields in the datasets were modeled as classes, subclasses, or properties and connected to the primary classes via suitable relationships. The ontology had both object properties and data properties depending on the type of data field being represented. Figure 3 shows a partial semantic data model with the primary classes and properties and their relationships. C. Semantic Instance Data generation The datasets were obtained from the sources in CSV format. An XML tool called Oxygen XML editor [32] was used to convert the CSV data into OWL instance data conforming to the vocabulary defined by the ontology. Table 4 shows a couple of sample instances from the Household dataset and table 5 shows a couple of sample instances from the Fuel Economy dataset. Table 2: Sample fields from National Household Vehicle data Table 3: Sample data fields for EPA Fuel Economy data D. Semantic Querying We tested the integrated data by querying it using the RDF Query language SPARQL [7]. The Java open source framework for building Semantic Web and Linked data applications, Apache Jena [33], was used for executing SPARQL queries. An application was built that loads the ontologies and data, checks for conformity of data with the vocabulary, and then run SPARQL queries against the data. A sample list of queries is provided in Table 6. This is an ongoing project and this initial version was built to test the proposed concept of a semantic ETL framework. We will continue building in this data set by pulling information from a variety of sources and analyzing the data for interesting results. 539 526

Figure 3: Semantic data model Table 5: Sample OWL Fuel Economy Instance data Table 4: Sample OWL Household Instance data Table 6: Sample Queries 540 527

VI. CONCLUSIONS AND FUTURE WORK Integration of data from various heterogeneous sources into a meaningful data model that allows intelligent querying is an important open issue in the area of Big Data. Traditionally, Extract-Transform-Load (ETL) process has been used for data integration in the industry. This paper proposes a semantic ETL framework that uses semantic technologies to produce rich and meaningful knowledge around data integration and produces semantic data that can possible be published on the web and contribute to the Web of data. Successful creation of such a framework will be of tremendous use to various innovative Big Data applications as well as analytics. In order to test the proposed technique, we used the semantic ETL process to integrate a few public data sets with information on vehicles, household transportation, and fuel economy. We created a simple java application that ran SPARQL queries against the combined semantic data. One of the challenges of this project is ontology engineering that needs a fairly good understanding of the data from different sources. A human expert has to perform this step and it is often a time-consuming process. This is an ongoing project with scope for numerous interesting web and mobile applications that will use a variety of data to enhance user experience. Future work also includes looking into real-time data and how semantic ETL can help with its integration. Apps can be built for various domains such as automotive, aerospace, healthcare, education to list a few. REFERENCES [1] E. Kandogan, M. Roth, C. Kieliszewski, F. Ozcan, B. Schloss, and M.-T. Schmidt, Data for All: A Systems Approach to Accelerate the Path from Data to Insight, in 2013 IEEE International Congress on Big Data (BigData Congress), 2013, pp. 427 428. [2] C. Bizer, P. Boncz, M. L. Brodie, and O. Erling, The Meaningful Use of Big Data: Four Perspectives Four Challenges, SIGMOD Rec., vol. 40, no. 4, pp. 56 60, Jan. 2012. [3] A. Azzini and P. Ceravolo, Consistent Process Mining over Big Data Triple Stores, in 2013 IEEE International Congress on Big Data (BigData Congress), 2013, pp. 54 61. [4] C. Bizer, T. Heath, and T. Berners-Lee, Linked datathe story so far, International journal on semantic web and information systems, vol. 5, no. 3, pp. 1 22, 2009. [5] F. Lecue, S. Kotoulas, and P. Mac Aonghusa, Capturing the Pulse of Cities: Opportunity and Research Challenges for Robust Stream Data Reasoning, in Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012. [6] P. Hayes and B. McBride, Resource description framework (RDF), 2004. [Online]. Available: http://www.w3.org/tr/rdf-mt/. [Accessed: 28-Feb- 2014]. [7] SPARQL Query Language for RDF. [Online]. Available: http://www.w3.org/tr/rdf-sparql-query/. [8] P. Vassiliadis, A. Simitsis, and E. Baikousi, A Taxonomy of ETL Activities, in Proceedings of the ACM Twelfth International Workshop on Data Warehousing and OLAP, New York, NY, USA, 2009, pp. 25 32. [9] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, and S. Skiadopoulos, A generic and customizable framework for the design of ETL scenarios, Information Systems, vol. 30, no. 7, pp. 492 525, Nov. 2005. [10] S. Luján-Mora, P. Vassiliadis, and J. Trujillo, Data mapping diagrams for data warehouse design with UML, in Conceptual Modeling ER 2004, Springer, 2004, pp. 191 204. [11] J. Trujillo and S. Luján-Mora, A UML based approach for modeling ETL processes in data warehouses, in Conceptual Modeling-ER 2003, Springer, 2003, pp. 307 320. [12] A. Simitsis, K. Wilkinson, M. Castellanos, and U. Dayal, QoX-driven ETL design: reducing the cost of ETL consulting engagements, in Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, 2009, pp. 953 960. [13] A. Karagiannis, P. Vassiliadis, and A. Simitsis, Macro-level Scheduling of ETL Workflows, Submitted for publication, 2009. [14] D. Skoutas and A. Simitsis, Ontology-Based Conceptual Design of ETL Processes for Both Structured and Semi-Structured Data:, International Journal on Semantic Web and Information Systems, vol. 3, no. 4, pp. 1 24, 34 2007. [15] S. Bergamaschi, F. Guerra, M. Orsini, C. Sartori, and M. Vincini, A semantic approach to ETL technologies, Data & Knowledge Engineering, vol. 70, no. 8, pp. 717 731, Aug. 2011. [16] S. K. Bista, S. Nepal, and C. Paris, Data Abstraction and Visualisation in Next Step: Experiences from a Government Services Delivery Trial, in 2013 IEEE International Congress on Big Data (BigData Congress), 2013, pp. 263 270. [17] Y. Guo, X. Shang, J. Li, and Z. Li, Revealing the Causes of Dynamic Change in Protein-Protein Interaction Network, in 2013 IEEE International Congress on Big Data (BigData Congress), 2013, pp. 189 194. [18] E. Olmezogullari and I. Ari, Online Association Rule Mining over Fast Data, in 2013 IEEE International Congress on Big Data (BigData Congress), 2013, pp. 110 117. [19] M. U. Nisar, A. Fard, and J. A. Miller, Techniques for Graph Analytics on Big Data, in Big Data (BigData Congress), 2013 IEEE International Congress on, 2013, pp. 255 262. 541 528

[20] Y.-W. Cheah, R. Canon, B. Plale, and L. Ramakrishnan, Milieu: Lightweight and Configurable Big Data Provenance for Science, in Big Data (BigData Congress), 2013 IEEE International Congress on, 2013, pp. 46 53. [21] H. T. Lin and V. Honavar, Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores, in Big Data (BigData Congress), 2013 IEEE International Congress on, 2013, pp. 94 101. [22] IBM InfoSphere Platform big data, information integration, data warehousing, master data management, lifecycle management and data security. [Online]. Available: http://www- 01.ibm.com/software/data/infosphere/. [Accessed: 28- Feb-2014]. [23] Warehouse Builder 11gR2: Home Page on OTN. [Online]. Available: http://www.oracle.com/technetwork/developertools/warehouse/overview/introduction/index.html. [24] Integration Services Microsoft SQL Server 2012. [Online]. Available: http://www.microsoft.com/enus/sqlserver/solutions-technologies/enterpriseinformation-management/integration-services.aspx. [25] Enterprise Data Integration Informatica. [Online]. Available: http://www.informatica.com/us/products/dataintegration/enterprise/. [26] Talend Open Studio Talend. [Online]. Available: http://www.talend.com/products/talend-open-studio. [27] Data Integration Pentaho Business Analytics Platform. [Online]. Available: http://www.pentaho.com/product/data-integration. [28] N. Shadbolt, W. Hall, and T. Berners-Lee, The semantic web revisited, Intelligent Systems, IEEE, vol. 21, no. 3, pp. 96 101, 2006. [29] G. Antoniou and F. Van Harmelen, A semantic web primer. the MIT Press, 2004. [30] OWL - Semantic Web Standards. [Online]. Available: http://www.w3.org/2001/sw/wiki/owl. [Accessed: 23-Jan-2014]. [31] Protégé Ontology Editor. [Online]. Available: http://protege.stanford.edu/. [32] Oxygen XML Editor. [Online]. Available: http://www.oxygenxml.com/. [Accessed: 28-Feb- 2014]. [33] Apache Jena - Home. [Online]. Available: https://jena.apache.org/. 542 529