A Comparative Study on Operational Database, Data Warehouse and Hadoop File System

T. Jalaja 1, M. Shailaja 2
1,2 (Department of Computer Science, Osmania University/Vasavi College of Engineering, Hyderabad, India)




Abstract: A computer database is a collection of logically related data stored in a computer system, so that a computer program or a person using a query language can use it to answer queries. An operational database (OLTP) contains up-to-date, modifiable, application-specific data. A data warehouse (OLAP) is a subject-oriented, integrated, time-variant, non-volatile collection of data used to make business decisions. The Hadoop Distributed File System (HDFS) allows large amounts of data to be stored on a cloud of machines. In this paper, we survey the literature related to operational databases, data warehouses and Hadoop technology.

Keywords: Operational database, Data warehouse, Hadoop, OLTP, OLAP, MapReduce.

I. INTRODUCTION

One of the largest technological challenges in software systems research today is to provide mechanisms for storage, manipulation and information retrieval of large amounts of data. A database is a collection of related data, and a database system is a database together with database software.

Operational databases are transactional databases that support on-line transaction processing (OLTP), which includes insertions, updates and deletions, while also supporting information query requirements. An operational database is designed to make transactional systems run efficiently and is used to store detailed, current data. The main emphasis of such a system is very fast query processing and maintaining data integrity in multi-access environments. These databases must therefore strike a balance between efficiency in transaction processing and support for query requirements; they cannot be further optimized for applications such as OLAP, DSS and data mining.

A data warehouse [1] is a database that is designed to facilitate querying and analysis. In contrast to operational databases, data warehouses generally contain very large amounts of data from multiple sources, which may include databases built on different data models and sometimes files acquired from independent systems and platforms. It is often designed as an OLAP (On-Line Analytical Processing) system.

Big data comes from relatively new types of data sources: social media; public filings; content available in the public domain through agencies or subscriptions; documents and e-mails, including both structured and unstructured texts; and digital devices and sensors, including location-based smartphone, weather and telemetric data. There are several approaches to collecting, storing, processing and analyzing big data. Hadoop [5] is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop is not a type of database, but rather a software ecosystem that allows massively parallel computing.

The main focus of this paper is to present a clear understanding of operational databases, data warehouses and Hadoop technology. The paper is organized as follows: in Section II, we present a literature survey on operational databases, data warehouses and the Hadoop Distributed File System; in Section III, we present a comparative

study among the works cited; in Section IV, we summarize the work.

II. LITERATURE SURVEY

Day by day, the technology behind software systems improves, and storage, manipulation and information retrieval become very challenging. Different types of storage systems can hold data ranging from a few megabytes to petabytes. In part A of this section we discuss the traditional/operational database, which stores the transactional data of an organization. In part B we discuss the data warehouse, which supports analysis of the business data used for decision making. In part C we turn to a way of storing large volumes of structured as well as unstructured data, which is further utilized for analytics.

A. Operational Database (OLTP)

An operational database contains data about the things that go on inside an organization or enterprise. For example, an operational database might contain fact and dimension data describing transactions, data on customer complaints, employee information, orders taken and fulfilled in a store, records of payments and inventory, credit-card amounts, and so on. Operational systems maintain records of daily business transactions. Traditional databases support on-line transaction processing (OLTP), which includes insertions, updates and deletions, while also supporting information query requirements. An operational database contains enterprise data that is up to date and modifiable. As the name implies, it is the database that is currently in active use, capturing real-time data and supplying data for real-time computations and other analytical processes.

The database structure most commonly used (in GIS and elsewhere) is the relational one, in which data is stored in two-dimensional tables and multiple relationships between data elements can be defined and established in an ad hoc manner. A Relational Database Management System (RDBMS) is a database system made up of files with data elements arranged in two-dimensional arrays (rows and columns). Such a system can recombine data elements to form different relations, giving great flexibility in data usage. SQL is used to manipulate relational databases. The relational model contains the following components:
- a collection of objects or relations;
- a set of operations to act on the relations;
- integrity rules for accuracy and consistency.
All other database structures can be reduced to a set of relational tables, and the relational model is easy to use compared to other database systems.

Traditional databases are optimized to process queries that may touch only a small part of the database, and transactions that insert or update a few tuples per relation. Because of the very dynamic nature of an operational database, certain issues must be addressed appropriately: an operational database can grow very fast in size and bulk, so database administrators and IT analysts must purchase high-powered computer hardware and top-notch database management systems. The operational database is the source of data for the data warehouse.
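To make the OLTP workload concrete, the following is a minimal sketch of a transactional write in plain JDBC. The schema (orders, inventory), connection URL and credentials are hypothetical, not taken from the paper; any JDBC-compliant operational database would behave similarly.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OrderTransaction {
    public static void main(String[] args) throws SQLException {
        // Connection details are placeholders for a real operational database.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/shop", "app", "secret")) {
            conn.setAutoCommit(false); // group both updates into one OLTP transaction
            try (PreparedStatement insertOrder = conn.prepareStatement(
                     "INSERT INTO orders (customer_id, item_id, qty) VALUES (?, ?, ?)");
                 PreparedStatement updateStock = conn.prepareStatement(
                     "UPDATE inventory SET qty = qty - ? WHERE item_id = ?")) {
                insertOrder.setInt(1, 42);
                insertOrder.setInt(2, 7);
                insertOrder.setInt(3, 1);
                insertOrder.executeUpdate();
                updateStock.setInt(1, 1);
                updateStock.setInt(2, 7);
                updateStock.executeUpdate();
                conn.commit(); // both changes become visible together
            } catch (SQLException e) {
                conn.rollback(); // keep the database consistent on failure
                throw e;
            }
        }
    }
}

The point of the sketch is the unit of work: either both statements commit together or neither does, which is exactly the data-integrity guarantee in multi-access environments described above.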
B. Data Warehouse (OLAP)

According to Surajit Chaudhuri and Umeshwar Dayal [6], a data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making [7]. Typically, the data warehouse is maintained separately from the organization's operational databases. The data warehouse supports on-line analytical processing (OLAP), whose functional and performance requirements are quite different from those of the on-line transaction processing (OLTP) applications traditionally supported by operational databases. A data warehouse is a centralized repository that stores data from multiple information sources and transforms them into a common, multidimensional data model for efficient querying and analysis, as shown in Figure 1.

Figure 1: Data Warehouse Architecture

Historical, summarized and consolidated data is more important here than detailed, individual records. Since data warehouses contain consolidated data, perhaps from several operational databases, over potentially long periods of time, they tend to be orders of magnitude larger than operational databases; enterprise data warehouses are projected to be hundreds of gigabytes to terabytes in size. The workloads are query intensive, with mostly ad hoc, complex queries that can access millions of records and perform many scans, joins and aggregates. Query throughput and response times are more important than transaction throughput.

To facilitate complex analyses and visualization, the data in a warehouse is typically modeled multidimensionally. A 3-D data cube of sales data with respect to time, item and location is shown in Figure 2.

Figure 2: Multidimensional data (Data Cube)

Typical OLAP operations include roll-up (increasing the level of aggregation) and drill-down (decreasing the level of aggregation, i.e. increasing detail) along one or more dimension hierarchies, slice and dice (selection and projection), and pivot (re-orienting the multidimensional view of the data). Decision support usually requires consolidating data from many heterogeneous sources, which might include external sources such as stock market feeds in addition to several operational databases. The different sources might contain data of varying quality, or use inconsistent representations, codes and formats, which have to be reconciled. Finally, supporting the multidimensional data models and operations typical of OLAP requires special data organization, access methods and implementation methods not generally provided by commercial DBMSs targeted at OLTP. It is for all these reasons that data warehouses are implemented separately from operational databases. A popular conceptual model that influences the front-end tools, database design and query engines for OLAP is the multidimensional view of data in the warehouse.
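As a small illustration of the roll-up operation just described, the Java snippet below issues a standard SQL GROUP BY ROLLUP query against a hypothetical sales fact table. Table, column names and connection details are illustrative only, and ROLLUP syntax support varies across warehouse engines.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SalesRollup {
    public static void main(String[] args) throws SQLException {
        // Hypothetical star-schema fact table: sales(time_id, item, location, amount).
        String rollup =
            "SELECT item, location, SUM(amount) AS total " +
            "FROM sales GROUP BY ROLLUP (item, location)";
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/warehouse", "analyst", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(rollup)) {
            while (rs.next()) {
                // NULLs in item/location mark the higher aggregation levels.
                System.out.printf("%s | %s | %s%n",
                    rs.getString("item"), rs.getString("location"),
                    rs.getBigDecimal("total"));
            }
        }
    }
}

Rows where location is NULL carry the per-item subtotals, and the row where both grouping columns are NULL carries the grand total: the successive aggregation levels of a roll-up.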

C. Hadoop Distributed File System (HDFS)

According to P. Sarada Devi, V. Visweswara Rao and K. Raghavender [8], a data warehouse is a collection of heterogeneous data which can effectively and easily manage data from multiple databases in a uniform fashion. In addition to data warehousing and ETL, many technologies such as ERP (Enterprise Resource Planning), SCM, SAP and BI tools are used in the world market for handling structured data efficiently and effectively. In order to handle the large amounts of structured, semi-structured and unstructured data generated by social networks and industries, Hadoop is used. Hadoop is a framework developed as open source software, based on papers published by Google Inc. in 2003 and 2004 that describe the MapReduce distributed processing model and the Google File System, a system Google used to scale its data processing needs. Hadoop is "an open source framework for writing and running distributed applications that process large amounts of data". It consists of two main components:
- Storage: the Hadoop Distributed File System (HDFS)
- Processing: MapReduce

Figure 3: Hadoop Architecture

HDFS is the storage component of Hadoop. It is a distributed file system modeled after the Google File System (GFS) paper [11]. Files in HDFS are stored across one or more blocks, each typically 64 MB or larger, and blocks are replicated across multiple hosts in the Hadoop cluster to help with availability and fault tolerance. HDFS is able to store huge amounts of information, scale up incrementally, and survive the failure of significant parts of the storage infrastructure without losing data. Hadoop creates clusters of machines and coordinates work among them, and clusters can be built with inexpensive computers. If one machine fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines. HDFS manages storage on the cluster by breaking incoming files into pieces, called blocks, and storing each block redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers [9].
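The block and replication behaviour just described can be observed through Hadoop's own Java client API. The sketch below is illustrative only: the NameNode URI and file paths are hypothetical, and it assumes the hadoop-client libraries are on the classpath.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutAndInspect {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode address is a placeholder for a real cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path local = new Path("/tmp/events.log");
        Path remote = new Path("/data/events.log");
        fs.copyFromLocalFile(local, remote); // file is split into blocks and replicated
        FileStatus status = fs.getFileStatus(remote);
        System.out.println("replication factor: " + status.getReplication());
        for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
            // each block reports the hosts holding one of its replicas
            System.out.println("block @" + b.getOffset()
                + " on " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}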

MapReduce is a batch-based, distributed computing framework modeled after Google's MapReduce paper [12]. The MapReduce data processing model has two main phases, mapping and reducing:
- Mapping phase: MapReduce takes the input data and feeds each data element to the mapper, which filters and transforms the input into the desired format for business use.
- Reducing phase: the reducer processes all the outputs from the mapper and arrives at a final result by applying the business logic needed to produce the desired output.

Figure 4: MapReduce Architecture

The processing pillar in the Hadoop ecosystem is the MapReduce framework. The framework allows the specification of an operation to be applied to a huge data set, divides the problem and the data, and runs the work in parallel. From an analyst's point of view, this can occur on multiple dimensions; for example, a very large dataset can be reduced into a smaller subset where analytics can be applied [10]. In Hadoop, these kinds of operations are written as MapReduce jobs in Java, and a number of higher-level languages such as Hive and Pig make writing these programs easier. The outputs of these jobs can be written back to HDFS or placed in a traditional data warehouse.
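The two phases described above map directly onto Hadoop's Java API. The following is the classic word-count job, a minimal but complete MapReduce program; the input and output HDFS paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Mapping phase: transform each input line into (word, 1) pairs.
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            // Reducing phase: aggregate all counts emitted for the same word.
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A jar built from this class can be submitted to the cluster with the standard hadoop jar command, and the per-word totals are written back to the HDFS output directory.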

III. COMPARISON

TABLE 1: COMPARISON OF OLTP, OLAP AND HDFS

Type of data
- Database (OLTP): transactional data; detailed data.
- Data warehouse (OLAP): historical data, derived data, metadata.
- Hadoop (HDFS): real-time, derived and enterprise data; simple and complex data.

Storage
- Database (OLTP): stores operational data (structured).
- Data warehouse (OLAP): stores historical information (structured).
- Hadoop (HDFS): stores enterprise data (structured, semi-structured and unstructured).

Modeling techniques
- Database (OLTP): network, hierarchical, object-oriented, object-relational and relational modeling techniques (schema-on-write).
- Data warehouse (OLAP): conceptual, logical and physical modeling techniques.
- Hadoop (HDFS): file system (schema-on-read).

Optimization
- Database (OLTP): efficient for read-write operations on single-point transactions.
- Data warehouse (OLAP): efficient for reading/retrieving large data sets and aggregating data.
- Hadoop (HDFS): efficient for reading and writing large files (gigabytes and larger).

Access techniques
- Database (OLTP): access to a few records at a time through predefined transactions.
- Data warehouse (OLAP): access to many records per query through ad hoc queries and periodic reports.
- Hadoop (HDFS): high-throughput, streaming access to large volumes of data.

Data warehousing is faster and more cost-effective than Apache Hadoop for structured workloads. Unlike traditional ETL tools, however, Hadoop persists the raw data and can re-process it repeatedly in a very efficient manner. Hadoop can also process semi-structured and unstructured data, whereas a data warehouse is best suited to processing only structured data efficiently.

Figure 5: OLTP, OLAP and Hadoop

IV. CONCLUSION

Operational/traditional databases are used to store application-specific data pertaining to an organization or enterprise, and this data is used for query purposes. To store larger amounts of data than a traditional database can handle, data that is further used for analysis on which business decisions are based, data warehouses are maintained. In contrast, to store zettabytes of structured and unstructured dynamic data, Hadoop Distributed File Systems are used; these are more efficient and fault tolerant.

REFERENCES
1. Wided Oueslati and Jalel Akaichi, "A Survey on Data Warehouse Evolution", International Journal of Database Management Systems (IJDMS), Vol. 2, No. 4, November 2010.
2. T. K. Das and Arati Mohapatro, "A Study on Big Data Integration with Data Warehouse", International Journal of Computer Trends and Technology (IJCTT), Volume 9, Number 4, March 2014.
3. Pushpa Suri and Meenakshi Sharma, "A Comparative Study between the Performance of Relational and Object Oriented Database in Data Warehouse".
4. Inmon, W. et al. (2000), Exploration Warehousing: Turning Business Information into Business Opportunity, John Wiley & Sons.
5. Hadoop. http://hadoop.apache.org
6. Surajit Chaudhuri and Umeshwar Dayal, "An Overview of Data Warehousing and OLAP Technology", ACM SIGMOD Record, March 1997.
7. Inmon, W. H., Building the Data Warehouse, John Wiley, 1992.
8. P. Sarada Devi, V. Visweswara Rao and K. Raghavender, "Emerging Technology Big Data - Hadoop Over Data Warehousing Using ETL", Proceedings of the 2nd IRF International Conference, 21 September 2014, Vizag, India.
9. Mrigank Mridul, Akashdeep Khajuria, Snehasish Dutta and Kumar N, "Analysis of Big Data using Apache Hadoop and Map Reduce", Volume 4, Issue 5, May 2014.
10. Harshawardhan S. Bhosale and Devendra P. Gadekar, International Journal of Scientific and Research Publications, Volume 4, Issue 10, October 2014.
11. Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung, "The Google File System".
12. Jeffrey Dean and Sanjay Ghemawat, Google, Inc., "MapReduce: Simplified Data Processing on Large Clusters".