A Novel Cloud Computing Data Fragmentation Service Design for Distributed Systems




Ismail Hababeh
School of Computer Engineering and Information Technology, German-Jordanian University, Amman, Jordan

Abstract: As many distributed database applications contain online information that changes continuously and expands incrementally, comprehensive cloud Application Programming Interfaces (APIs) are required to monitor and control the accuracy of the information and the proliferation of data. This cloud software can be viewed as a set of integrated cloud computing services: data fragmentation, clustering of network sites, and fragment allocation, which support transactional database applications. In this paper, we describe our data Fragmentation as a Service (FaaS) within the construction of a cloud computing software system. Specifically, we design a novel data fragmentation service to facilitate the processing of enormous amounts of data, and we introduce enhancements to data distribution that improve cloud system performance. This research presents our attempt to implement a data fragmentation service in a cloud computing system, with large-scale data mining as the targeted application.

Keywords: SaaS, FaaS, CaaS, AaaS, DFA, API.

1 Introduction

Cloud computing is web-based system development in which highly scalable computing services are provided to users over the Internet. The cloud computing system includes web communications, Software as a Service (SaaS), and up-and-coming tools, and it has attracted increasing attention from researchers in different technology areas.

Many cloud computing providers spread their data centers worldwide to maintain data availability, which is typically achieved through replication. Amazon's cloud Simple Storage Service [1] replicates data across different geographical regions so that data and applications remain available even when their home location fails. This is likely to help in running applications on data warehouses, but not transactional data management systems [2]. Yahoo [3] and Amazon [4] both implement data replication through the PNUTS and SimpleDB cloud data services over distributed network sites; these services are designed to run analytical applications on data warehouses, not transactional data applications. Similarly, Google [5] implements a replicated database, but it does not offer a complete relational Application Programming Interface and weakens data atomicity. The cloud API is written as a series of XML-based messages and executed on the cloud servers to utilize remote web-based applications and reduce the number of calls between the client and the distributed servers [6]. Microsoft SQL Server [7] offers a cloud data service implemented over distributed network sites; however, because it does not apply commit protocols, the distributed system lacks data consistency. Researchers in [8] designed the H-Store project to minimize the number of transactions that access data from multiple locations; however, the project is still in the theoretical phase, and its feasibility on real-world distributed database systems has not been verified.

In distributed relational database systems, the transactions issued by applications usually access subsets of relations (fragments), so generating these fragments and distributing them over the network sites increases system throughput by means of parallel execution.
Therefore, an efficient cloud API fragmentation web service is presented to access and manage data relationships and to enhance both the speed and simplicity of the distributed database functionality. This web service is used by external programs, such as Java applications, to retrieve raw data from the cloud data centers. Moreover, it helps to reduce the cost of accessing data over distributed network sites and increases the distributed system performance through the data allocation processes.
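The paper does not give the service interface itself, but since the cloud API is described as a series of XML-based messages consumed by external programs such as Java applications, a minimal hypothetical client might look like the sketch below. The endpoint URL, request element names, and response format are illustrative assumptions only, not part of the published FaaS design.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

/** Hypothetical FaaS client: posts an XML fragmentation request to a cloud endpoint. */
public class FaasClient {
    public static void main(String[] args) throws Exception {
        // Endpoint and message schema are illustrative assumptions.
        URL endpoint = new URL("https://cloud.example.com/faas/fragment");
        String request =
            "<fragmentationRequest>" +
            "  <relation>CUSTOMER</relation>" +
            "  <query>SELECT * FROM CUSTOMER WHERE region = 'EU'</query>" +
            "</fragmentationRequest>";

        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/xml");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(request.getBytes(StandardCharsets.UTF_8));
        }

        // Read the XML response describing the disjoint fragments assigned to cloud servers.
        try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}
```

Keeping the interaction to a single coarse-grained XML message per request is consistent with the stated goal of reducing the number of calls between the client and the distributed servers.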

The remainder of the paper is organized as follows: related work is discussed in Section 2; Section 3 describes the data fragmentation architecture; the data fragmentation design is presented in Section 4; Section 5 depicts the performance evaluation and experimental results; and finally, Section 6 draws conclusions and outlines future work.

2 Related Work

Various strategies have already been used to partition data across distributed systems. Several approaches identify three main partitioning categories: vertical, horizontal, and hybrid [9, 10]. Some have defined vertical fragmentation as a process of generating fragments of data records [11, 12, 13, 14]. Other researchers have addressed the necessity of horizontal fragmentation [15, 16, 17], which makes the processes of data backup and restore much easier. Mixed or hybrid fragmentation, vertical fragmentation followed by horizontal or vice versa, has been covered by few studies [12, 18] due to the intractable nature of this type of fragmentation in relational distributed database systems.

The studies in [16] and [17] are by far the closest to our fragmentation method. The method in [16] treats each record in a relation as a fragment, so a large number of database fragments is generated and more communication cost is required for fragment processing. In contrast, the approach in [17] uses the whole relation as a fragment, so not all records of the fragment have to be retrieved or updated, and a selectivity matrix that indicates the percentage of a fragment accessed by a transaction is considered; however, more redundant fragments are produced, and the generated fragments overlap. A key difference between our cloud API fragmentation method and the others is that it produces the minimum number of disjoint fragments that can be generated for each relation according to the query requirements. This fragmentation service is designed for cloud computing systems in order to reduce the communication cost over the cloud sites and increase the distributed system throughput. Moreover, the generated disjoint fragments are then allocated to the cloud servers, which saves further communication costs.

3 Data Fragmentation Architecture

In a distributed relational database system, the complete database is not a suitable unit of distribution because it is too big, especially when considering the information relevant to different cloud data centers. Therefore, it is appropriate to develop a web application programming interface service, specifically FaaS, that extracts the minimum number of disjoint sets of data records to be allocated to the cloud servers. The architecture of FaaS is organized around the domain knowledge and three main processes: eliminating data redundancy, defining transactions, and fragmenting data records. Figure 1 describes the Data Fragmentation Architecture (DFA) service that will be used for generating data fragments, supporting the use of knowledge extraction, and helping to achieve the effective use of small fragments.

The domain knowledge in the DFA describes and categorizes the essential and representative elements of the distributed database systems, specifically for database fragmentation. The purpose of the DFA domain knowledge is to ensure that all data elements are available and consistent for the database fragmentation process. In addition, it is used to prepare data elements that remain valid from one transaction to another, from one application to another, and from one database to another in distributed database systems. The details of the DFA are described and illustrated in the following section.

Figure 1. Data Fragmentation Architecture

4 Data Fragmentation Design

The requested data in the DFA are identified by means of transactions triggered as queries, which determine the specific information that should be extracted from the cloud database servers. The transactions are executed and may return redundant data records, as two or more different queries may require the same data records. The redundant data are eliminated, and the remaining data records are then partitioned so as to generate the minimum number of fragments that are neither replicated nor intersecting (disjoint).

4.1 Eliminating Data Redundancy

Different cloud APIs are developed to remove data redundancy from the distributed cloud servers. The following algorithm is designed to prevent replicated data, identified by the Primary Code Number (PCN), from being entered into a database application running over the cloud servers.

Eliminating Data Redundancy Algorithm:
Step 1: Build a map that stores the unique list of leads being inserted or updated, using the Primary Code Number as the key.
Step 2: Check every lead whose key is inserted or updated.
Step 3: If the key is a replication of another lead in this group, then do Steps 4 and 5.
Step 4: Create a single database query, using the lead map, to find all the leads in the database that have the same key.
Step 5: Issue a non-validation message.
Step 6: Else, if the key is not replicated, add this lead to the new lead map.
Step 7: End.

In this cloud API, all transactions are processed and the redundant data records are eliminated. Database applications therefore run faster and more efficiently, as only the required data records are accessed, processed, and allocated to the distributed cloud servers.
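A minimal sketch of how such a redundancy filter could be realized in Java is shown below: incoming records ("leads") are keyed by their PCN in a map, and duplicates only trigger a non-validation message. The Lead type, the filter method, and the shape of the existing-leads lookup are illustrative assumptions, not the paper's API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the redundancy-elimination step: leads are keyed by their
 * Primary Code Number (PCN); duplicates are rejected before they reach
 * the distributed cloud servers.
 */
public class RedundancyFilter {

    /** A lead identified by its Primary Code Number (illustrative shape). */
    public record Lead(String pcn, String payload) {}

    private final Map<String, Lead> uniqueLeads = new HashMap<>();

    /**
     * @param incoming leads being inserted or updated in one batch
     * @param existing leads already stored for the same PCNs, obtained with a
     *                 single database query over the collected keys (Step 4)
     * @return the leads that are safe to insert
     */
    public List<Lead> filter(List<Lead> incoming, Map<String, Lead> existing) {
        for (Lead lead : incoming) {
            boolean duplicateInBatch = uniqueLeads.containsKey(lead.pcn());
            boolean duplicateInDb = existing.containsKey(lead.pcn());
            if (duplicateInBatch || duplicateInDb) {
                // Step 5: issue a non-validation message, do not insert.
                System.out.println("Non-validation: duplicate PCN " + lead.pcn());
            } else {
                // Step 6: the key is not replicated, add the lead to the new lead map.
                uniqueLeads.put(lead.pcn(), lead);
            }
        }
        return List.copyOf(uniqueLeads.values());
    }
}
```

Because the algorithm collects all keys first and checks them with a single database query, the duplicate test does not require one round trip per lead.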

4.2 Defining Database Transactions

The data records requested by the clients determine the specific information extracted by the database queries. Database queries are executed as transactions from the applications at the distributed database system sites. The results of the transactions are sets of data records that may be fully intersecting, partially intersecting, or non-intersecting; each data set consists of complete records. Figures 2a and 2b illustrate an example of defining and generating different sets of data records according to the definition of each transaction.

Figure 2a. Defining Transactions
Figure 2b. Data Records Transactions

In this example, there is a full intersection between data sets (4, 10) over relation 3, and partial intersections between data sets (1, 5), (3, 9), and (6, 7) over relations 4, 2, and 1, respectively. On the other hand, there is no intersection between the data sets over relation 5. Therefore, a cloud API fragmentation method is developed that partitions the database records and generates the minimum number of disjoint fragments to be allocated to the distributed cloud servers. The details of this approach are given in the following section.

4.3 Fragmenting Data Records

The fragmentation process starts by looking for any two data record sets over the same relation that have intersecting records. From any two intersecting data sets, three disjoint fragments are generated: the intersection fragment, which represents the records common to both sets; the fragment that represents the records in the first set but not in the second; and the fragment that represents the records in the second set but not in the first. The two intersecting sets are then deleted from the list of data sets. This process continues until no intersections between the data sets remain. The following fragmentation algorithm describes the generation of disjoint fragments from the intersecting data records for each relation in the distributed database system (DDBS).

Intersected Data Fragmentation Algorithm:
k ← number of the last fragment in the database (0 at the beginning)
Repeat for all relations in the database
    Repeat for all data record sets S_i, S_j in each relation, where i ≠ j
        If S_i ∩ S_j ≠ Ø
            k ← k + 1
            F_k   ← S_i ∩ S_j
            F_k+1 ← S_i − F_k
            F_k+2 ← S_j − F_k
            Delete S_i, S_j
        End if
    Until all intersected data record sets in each relation have been processed
Until all relations in the database have been processed
Rename the final fragments sequentially

The data record sets over each relation that do not intersect with any other set are renamed as fragments and considered in the further fragmentation process. The following algorithm handles the non-intersected data records and adds them to the list of relation fragments.

Non-intersected Data Fragmentation Algorithm:
k ← number of the last fragment
Repeat for all relations R in the database
    k ← k + 1
    F_k ← R − ∪ F_i (for all fragments F_i in relation R)
    If F_k ≠ Ø Then
        Add F_k to the collected fragments of relation R
    End if
Until all relations in the database have been processed

The same fragmentation process is applied to any two intersecting fragments over the same relation. This re-fragmentation continues until no intersections between data fragments remain and the fragments are totally disjoint.

Figure 3 shows an example of generating disjoint fragments from the data sets over relation 2 in a DDBS. In this figure, data sets 1 and 2 over relation 2 share the common data records 2, 3, and 4, which are considered redundant. The fragmentation process isolates the shared records from both data sets and generates the following disjoint fragments: F1, which contains the shared data records 2, 3, 4, and F2, which contains the records 1, 5, 6, 7 that are in data set 1 but not in data set 2. As the third fragment that would be generated contains the same data records as fragment F1, it is not created, and data sets 1 and 2 are deleted.

Figure 3. Data Sets Fragmentation

The performance evaluation of the fragmentation method is presented in the following section.
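As a concrete illustration of the intersected-fragmentation step, the following Java sketch represents each transaction's result over a relation as a set of record identifiers and repeatedly splits intersecting pairs into disjoint pieces. It is a minimal sketch under that representation, not the paper's implementation, and it omits the residual fragment R − ∪ F_i handled by the non-intersected algorithm, which would require the full relation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Sketch of disjoint fragmentation: intersecting record sets of one relation
 * are repeatedly split into (intersection, left-only, right-only) pieces
 * until all fragments are pairwise disjoint.
 */
public class Fragmenter {

    /** Splits the given record sets of one relation into disjoint fragments. */
    public static List<Set<Integer>> fragment(List<Set<Integer>> recordSets) {
        List<Set<Integer>> work = new ArrayList<>();
        for (Set<Integer> s : recordSets) {
            work.add(new HashSet<>(s));
        }

        boolean changed = true;
        while (changed) {
            changed = false;
            outer:
            for (int i = 0; i < work.size(); i++) {
                for (int j = i + 1; j < work.size(); j++) {
                    Set<Integer> si = work.get(i);
                    Set<Integer> sj = work.get(j);

                    Set<Integer> common = new HashSet<>(si);
                    common.retainAll(sj);             // F_k   = S_i ∩ S_j
                    if (common.isEmpty()) {
                        continue;                     // no intersection, try next pair
                    }
                    Set<Integer> onlyI = new HashSet<>(si);
                    onlyI.removeAll(common);          // F_k+1 = S_i − F_k
                    Set<Integer> onlyJ = new HashSet<>(sj);
                    onlyJ.removeAll(common);          // F_k+2 = S_j − F_k

                    // Delete S_i and S_j, keep only the non-empty disjoint pieces.
                    work.remove(j);                   // remove the higher index first
                    work.remove(i);
                    work.add(common);
                    if (!onlyI.isEmpty()) work.add(onlyI);
                    if (!onlyJ.isEmpty()) work.add(onlyJ);

                    changed = true;
                    break outer;                      // rescan the updated list
                }
            }
        }
        return work;                                  // pairwise disjoint fragments
    }

    public static void main(String[] args) {
        // Consistent with the Figure 3 description: sets {1..7} and {2,3,4}
        // over one relation yield the disjoint fragments {2,3,4} and {1,5,6,7}.
        List<Set<Integer>> sets = List.of(
                Set.of(1, 2, 3, 4, 5, 6, 7),
                Set.of(2, 3, 4));
        System.out.println(fragment(sets));
    }
}
```

Each split strictly reduces the total number of record memberships across the sets, so the loop terminates, and running main reproduces the two disjoint fragments of the Figure 3 example.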

5 Performance Evaluation and Experimental Results

The database fragmentation performance evaluation is based on dividing the reduced storage size of each relation by the size of the records returned by that relation's queries. The reduced storage size is computed as the difference between the size of the query records and the size of the generated fragments of each relation in the DDBS; that is, the reduction ratio equals (query records size − generated fragments size) / query records size.

The experimental results demonstrate the usability and efficiency of the cloud API fragmentation method. When the method was tested with 45 queries over the 5 database relations that constitute the whole database, 27 disjoint fragments were generated. As a result, 18 data record sets were omitted from the database, which eliminates data redundancy, saves data storage, minimizes the amount of data transferred and processed, and thus increases the overall system performance. The fragmentation methods of [16] and [17] were implemented within our framework, and their performance results are compared with our cloud API fragmentation method in Figure 4.

Figure 4. Fragmentation Performance Evaluation

As shown in this figure, the performance achieved by our approach outperforms that of the methods in comparison, which will help in reducing the communication costs in the data allocation phase.

6 Conclusions

Data fragmentation is one of the primary techniques used in partitioning and in developing cloud computing services for distributed database systems. This research discussed the efficiency, usefulness, and performance improvement achieved by the API fragmentation service in a cloud computing system. The experimental results emphasize the ability of this fragmentation method to minimize the data processed and transferred between the distributed database network sites, reduce storage size by eliminating data redundancy, and deliver significant performance improvements that increase the throughput of the distributed database network system. Moreover, this cloud API fragmentation approach satisfies the desirable properties of data fragmentation in a distributed database system: the relation fragments include all relation records, the union of all relation fragments reconstructs the original relation, and the relation fragments are disjoint. In addition, it generates the minimum number of fragments for each relation according to the query requirements, which reduces the communication cost over the distributed network sites and increases the distributed system performance.

In future work, the research will focus on the resulting disjoint fragments as objects for distribution over the network sites. They should be distributed in such a way that the requirements of the database queries are satisfied and the system performance is enhanced. Therefore, developing further web API services for cloud computing distributed systems, such as Clustering of distributed network sites as a Service (CaaS) and fragment Allocation as a Service (AaaS), will have an important impact on transactional database applications and improve distributed system throughput.

References

[1] Amazon Simple Storage Service (Amazon S3). http://aws.amazon.com/s3 [Accessed 29 April 2011].
[2] D. J. Abadi. Data Management in the Cloud: Limitations and Opportunities. IEEE Data Engineering Bulletin, 32(1), March 2009, pp. 3-12.
[3] B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Proceedings of VLDB, 2008.
[4] Amazon SimpleDB. http://aws.amazon.com/simpledb/ [Accessed 19 February 2011].
[5] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. Proceedings of OSDI, 2006.
[6] A. Velte, T. Velte, and R. Elsenpeter. Cloud Computing: A Practical Approach. McGraw-Hill, 2010.
[7] Microsoft SQL Server for Cloud Servers. http://www.rackspace.com/cloud/blog/2010/12/01/announcing-sql-server-licenses-for-cloud-servers [Accessed 7 March 2011].
[8] M. Stonebraker, S. R. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era (it's time for a complete rewrite). VLDB, Vienna, Austria, 2007.

[9] M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems, 2nd ed. Prentice-Hall, Englewood Cliffs, NJ, 1999.
[10] N. Khalil, D. Eid, and M. Khair. Availability and Reliability Issues in Distributed Databases Using Optimal Horizontal Fragmentation. In T. J. M. Bench-Capon, G. Soda, and A. M. Tjoa (eds.), DEXA '99, LNCS 1677, Springer, 1999, pp. 771-780.
[11] J. Son and M. Kim. An Adaptable Vertical Partitioning Method in Distributed Systems. The Journal of Systems and Software, 73(3), 2004, pp. 551-561.
[12] S. Agrawal, V. Narasayya, and B. Yang. Integrating Vertical and Horizontal Partitioning into Automated Physical Database Design. SIGMOD 2004, Paris, France, ACM, 2004, pp. 359-370.
[13] A. Tamhankar and S. Ram. Database Fragmentation and Allocation: An Integrated Methodology and Case Study. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 28(3), 1998, pp. 288-305.
[14] S. Lim and Y. Ng. Vertical Fragmentation and Allocation in Distributed Deductive Database Systems. The Journal of Information Systems, 22(1), 1997, pp. 1-24.
[15] R. Costa and S. Lifschitz. Database Allocation Strategies for Parallel BLAST Evaluation on Clusters. Distributed and Parallel Databases, 13, 2003, pp. 99-127.
[16] H. Ma, K. Schewe, and Q. Wang. Distribution design for higher-order data models. Data and Knowledge Engineering, 60, 2007, pp. 400-434.
[17] Y. Huang and J. Chen. Fragment Allocation in Distributed Database Design. Journal of Information Science and Engineering, 17, 2001, pp. 491-506.
[18] S. Navathe, K. Karlapalem, and R. Minyoung. A mixed fragmentation methodology for initial distributed database design. Journal of Computer and Software Engineering, 3(4), 1995, pp. 395-425.