CHAPTER 1 INTRODUCTION

This chapter provides a high-level overview of distributed databases. It first describes their characteristics, general architecture, challenges and problem areas. It then presents the primary objectives of our research. The chapter ends with an outline of how the rest of the thesis is organized.

1.1 DISTRIBUTED DATABASE

Today's business environment has an increasing need for distributed database and client/server applications, as the demand for reliable, scalable and accessible information is steadily rising. Distributed database systems improve communication and data processing because their data is distributed across different network sites. Not only is data access faster, but a single point of failure is less likely to occur and users retain local control of their data. However, managing and controlling a distributed database system involves some complexity.

Distributed network computing environments have become a cost-effective and popular choice for achieving high performance and solving large-scale computational problems. Unlike past supercomputers, a distributed database computing system can be used as a multi-purpose computing platform to run diverse high-performance parallel applications.

Developments in computer networking technology and database systems technology led to the development of distributed databases in the mid-1970s. It was felt that many applications would be distributed in the future and that databases therefore had to be distributed as well. Although many definitions of a distributed database system have been given, there is no standard definition. A distributed database system includes a Distributed Database Management System (DDBMS), a distributed database and a network for interconnection. The objective of a DDBMS is to manage a Distributed Database (DDB) in such a way that it appears to the user as a centralized database.

In general terms, a database is a collection of data that is stored and maintained at one central location. A database is controlled by a database management system. The user interacts with the database management system in order to use the database and transform data into information. A database also offers many advantages over a simple file system with regard to speed, accuracy and accessibility, such as shared access, minimal redundancy, data consistency, data integrity and controlled access [CPP2003]. All of these aspects are enforced by the database management system.

Dr. Edgar F. Codd designed the relational model while at IBM in the late 1960s to solve problems with the pre-existing models. The model is built on mathematical principles, which he set out in a paper entitled A Relational Model of Data for Large Shared Data Banks [SSO2004]. A relational database is a set of tables (also called relations) that are separated into predefined categories. Each table contains records, the horizontal rows, each of which holds one related group of data. The vertical columns are known as attributes. Data stored in two or more tables is linked through one or more field values that the tables have in common. A relational database [TMC2004a] uses a standard user and application program interface called Structured Query Language (SQL). This language uses statements to query and modify the data in the database. Relational databases are the most commonly used because it is reasonably easy to create and access information and to add new data categories.

When dealing with intricate data or complex relationships, object databases are more commonly used. Object databases, in contrast to relational databases, store objects rather than data such as integers, strings or real numbers. Each object consists of attributes, which define the characteristics of the object. Objects also contain methods (also known as procedures and functions), which define the behavior of the object.

There are two main techniques for storing data in an object database. The first labels each object with a unique ID. Every unique ID is defined in a subclass of its own base class, where inheritance is used to determine attributes. The second uses virtual memory mapping for object storage and management. Compared with relational databases, object databases offer better concurrency control, reduced paging and easy navigation. However, they also have disadvantages: they are less effective with simple data and relationships, their access speed is slow, and relational databases have suitable standards whereas object database systems do not [AAM2004].

Hierarchical databases are organized in a tree-like structure in which one table acts as the root of the database and other tables branch out from it. Relationships in such a system are described in terms of parents and children: a child may have only one parent, but a parent can have multiple children. Parents and children are connected by links called pointers, and a parent may hold a pointer to each of its children. This arrangement assumes that the data is accessible to the user. However, hierarchical database systems are complex to use and require application developers to program the navigation through the linked records. All possible access paths must be determined in advance and followed accordingly; access patterns that were not anticipated can be extremely difficult to implement [EIC1994].

Network databases alleviate some of the problems associated with hierarchical databases, such as data redundancy. The network model represents data as a network of records and sets that are related to each other through links [PAF1992]. Relationships are expressed in terms of records, record types and sets rather than a hierarchy. A record is a set of related data values, equivalent to a row in the relational model. A record type is a set of records, and a set type is a relationship between one or more record types. Unlike the hierarchical model, the network model allows many-to-many relationships. Unfortunately, the network model is far more difficult to implement and maintain than end users needed in order to solve real problems [GDP1988].

Each database may involve different database management systems and different architectures that distribute the execution of transactions. A distributed database system consists of loosely coupled sites that share no physical component. The database systems that run on each site are independent of each other.

Providing the appearance of a centralized database system is one of the many objectives of a distributed database system. This image is achieved through the following transparencies: Location Transparency, Performance Transparency, Copy Transparency, Naming Transparency, Transaction Transparency, Fragment Transparency, Schema Change Transparency and Local DBMS Transparency. These transparencies are believed to cover the desired functions of a distributed database system. Another goal of a successful distributed database is free object naming, which allows different users to access the same object under different names, or different objects under the same internal name. Users therefore have complete freedom in naming objects while sharing data without naming conflicts.

1.2 CLASSIFICATION OF DISTRIBUTED DATABASE

Distributed databases can be classified as follows:

Homogeneous DBMS: It has multiple data collections and integrates multiple data resources. A homogeneous system is similar to a centralized system, but instead of keeping all data in a single place, the data is distributed among several places that communicate over the network. There are no local users; all users access the database through a global interface.

Heterogeneous DBMS: It is a system that interconnects already existing autonomous database systems to support global applications that access data items in more than one database. Other names proposed for such a system include federated database system, multidatabase system and decentralized system. These systems are characterized by the autonomy of the individual sites as well as the cooperation among them. Three different aspects of this autonomy are:

Design Autonomy: The individual sites may differ with respect to data models, physical design, data definition and manipulation languages, query processing strategies, concurrency control, recovery mechanisms and so on. One reason for this heterogeneity is that the individual systems were designed without knowledge of the intended interconnection with other sites.

Execution Autonomy: Each site executes its own local transactions as well as sub-transactions of global transactions. All these transactions are treated in the same way. Thus, the site is entitled to decide when and how to execute a sub-transaction, and to commit it as soon as its execution is complete without waiting for the commitment of the entire global transaction.

Communication Autonomy: The sites may be willing to share only some, not all, of their data and transaction processing information with other sites.
Each site also communicates with other sites only when it finds it convenient in terms of bandwidth availability and data locality. Consequently, a site might be inaccessible to the other sites for long periods of time.

1.3 CHARACTERISTICS OF DISTRIBUTED DATABASE

Availability and Reliability: Availability is defined as the probability that the system will be up continuously during a given time period. Reliability is defined as the probability that the system will be up at a given time. A DDBS improves both of these important system parameters. In a centralized database system, if any component of the database goes down, the entire system goes down, whereas in a distributed database only the affected site is down and the rest of the system is not affected. Furthermore, if the data is replicated at different sites, the effect of a failure is greatly reduced.

Performance Improvement: When a large database is distributed over a number of sites, the local subset of the database at each site is much smaller, which reduces the size of transactions and their processing time. For transactions that need access to more than one site, processing can proceed in parallel, improving response time.

Communication via Computer Network: Sites can send data and queries to, and receive them from, other sites on the network.

DDBMS Catalog Maintenance: The DDBMS catalog keeps track of how the database is distributed and replicated among the different sites.

Distributed Transactions: A distributed transaction is a transaction that operates on data located at more than one site. It is divided (by the transaction manager of the originating site) into a number of sub-transactions, which are executed by several nodes. Adopting distributed transactions makes it possible to devise a strategy for executing a transaction that accesses more than one site.

Replicated Data Consistency: The ability to maintain the consistency of replicated data across the network.

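The effect of replication on availability can be illustrated with a small calculation. The sketch below assumes, for illustration only, that sites fail independently and that every site is up with the same probability p; neither assumption comes from the discussion above, and the function name replicated_availability is hypothetical. Under these assumptions, a data item replicated at n sites is unreachable only when all n sites are down, so it is reachable with probability 1 - (1 - p)^n.

```python
def replicated_availability(site_up_probability: float, replicas: int) -> float:
    """Probability that at least one replica site is up.

    Assumes independent site failures and an identical up-probability
    for every site (illustrative assumptions, not taken from the text).
    """
    if not 0.0 <= site_up_probability <= 1.0:
        raise ValueError("site_up_probability must lie in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The item is lost only if every one of the replica sites is down.
    return 1.0 - (1.0 - site_up_probability) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, round(replicated_availability(0.9, n), 6))
```

With a per-site probability of 0.9, one, two and three replicas give 0.9, 0.99 and 0.999 after rounding. The calculation covers read availability only; keeping the copies consistent has its own cost, which this sketch ignores.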
1.4 GENERAL ARCHITECTURE OF DISTRIBUTED DATABASE

A distributed database is a set of databases stored on multiple computers that typically appears to applications as a single database [ZSZ2009]. Consequently, an application can simultaneously access and modify the data in several databases in a network. The computers in a distributed system communicate with one another through various communication media, such as high-speed networks or telephone lines. They do not share main memory or disks. The computers in a distributed system may vary in size and function, ranging from workstations to mainframe systems.

A database link connection allows local users to access data on a remote database. For this connection to occur, each database in the distributed system must have a unique global database name in the network domain. The global database name uniquely identifies a database server in a distributed system. As a result, users have access to the database at their location and can access the data relevant to their tasks without interfering with the work of others.

[Figure 1.1 Conceptual View of Distributed Database. Labels: Database Technology, Computer Networks, Integration, Distribution, Distributed Database Systems, Centralization]

As shown in Figure 1.1, distributed database systems are simply a matter of integrating database technologies over a computer network. Thus, the tradeoff is between the integration and the centralization of the data.

[Figure 1.2 Centralized DBMS on a Network. Labels: Site 1 to Site 5, Communication Network]

[Figure 1.3 Distributed DBMS Environment. Labels: Site 1 to Site 5, Communication Network]

The main difference between centralized and distributed databases is that distributed databases are typically geographically separated, separately administered and connected by slower links. Figure 1.2 and Figure 1.3 illustrate the difference between a centralized and a distributed DBMS. In distributed databases, we also differentiate between local and global transactions. A local transaction is one that accesses data only at the site where the transaction originated. A global transaction, on the other hand, is one that either accesses data at a site different from the one at which it was initiated or accesses data at several different sites.

Components of Distributed Database

A DDBMS comprises the following components:

Database Manager: the software responsible for processing a segment of the distributed database, as shown in Figure 1.4.

Distributed Database Management System: the software that governs a Distributed Database System. It gives the user the illusion of using a centralized database.

User Request Interface: sometimes known as a customer user interface, this is usually a client program that acts as an interface to the distributed transaction manager. A customizable user interface is provided for entering the parameters of a database query. It presents parameter entry dialogs or windows that correspond to a data view (e.g. a form or report) produced by the query, and the parameters entered may modify that data view. The manager of the database may also structure data views so that they automatically prompt for parameters before the database returns results. These prompts may be customized by the manager and presented as pop-ups, pull-down menus, fly-outs or a variety of other user interface components.

Distributed Transaction Manager: a program that translates requests from the user into actionable requests for the database managers, which are typically distributed. A distributed database system is made up of both the Distributed Transaction Manager (DTM) and the Database Manager (DBM).

1.5 CHALLENGES IN DISTRIBUTED DATABASE

Distributed Database Design (Fragmentation, Replication and Allocation):

Data Fragmentation: Fragmentation breaks a single object into two or more segments, or fragments. Each fragment can be stored at any site on a computer network. Information about the fragmentation is stored in the distributed data catalog, from which the transaction processor retrieves it in order to process user requests. Data fragmentation strategies operate at the table level and consist of dividing a table into logical fragments. There are three types of data fragmentation:
Horizontal fragmentation refers to the division of a relation into subsets of rows.
Vertical fragmentation refers to the division of a relation into subsets of attributes.
Mixed fragmentation refers to a combination of the horizontal and vertical strategies.

Data Replication: Data replication refers to the storage of data copies at multiple sites served by a computer network. Copies of fragments can be stored at several sites to serve specific information requirements. Because fragment copies can improve data availability and response time, they can help to reduce communication and total query costs. Replicated data is subject to the mutual consistency rule, which requires that all copies of a data fragment be identical. Three replication scenarios exist:
A fully replicated database stores multiple copies of each database fragment at multiple sites. It can be impractical because of the overhead it imposes.
A partially replicated database stores multiple copies of some database fragments at multiple sites. Most database systems handle this scenario well.
A non-replicated database stores each database fragment at a single site.

Data Allocation: Data allocation is the process of deciding where to locate data. The data allocation strategies are as follows:
With centralized data allocation, the entire database is stored at one site.

With partitioned data allocation, the database is divided into several disjoint parts that are stored at several sites.
With replicated allocation, copies of one or more database fragments are stored at several sites.
Data distribution over a computer network is achieved through data partitioning, through data replication or through a combination of both.

Distributed Query Processing: Query processing deals with designing algorithms that analyze queries and convert them into a series of data manipulation operations. The problem is to decide on a strategy for executing each query over the network in the most cost-effective way. The factors to be considered are the distribution of data, communication costs and the lack of sufficient locally available information.

Heterogeneous Databases: When the databases at the various sites are not homogeneous, either in the way they logically structure data (data models) or in the mechanisms they provide for accessing it (data languages), a translation mechanism between the database systems becomes necessary.

Distributed Concurrency Control: Concurrency control is essential for correctness in any system where two or more database transactions can access the same data concurrently. A well-established concurrency control theory exists for database systems: serializability theory, which makes it possible to design and analyze concurrency control methods and mechanisms effectively. To ensure correctness, a DBMS usually guarantees that only serializable transaction schedules are generated, unless serializability is intentionally relaxed. To maintain correctness when transactions fail (which can always happen), schedules also need to have the recoverability property. The design of concurrency control and recovery in a distributed database system has to consider aspects beyond those of a centralized database system.
These aspects include:
Concurrency control has to keep the multiple copies of the data consistent, and recovery has to make a copy consistent with the others whenever a site recovers from a failure.
Failure of communication links.
Failure of individual sites.
Deadlocks across multiple sites.
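As an illustration of the serializability requirement discussed above, the following sketch tests a schedule for conflict serializability, one common form of serializability, by building a precedence graph and checking it for cycles. The schedule encoding (tuples of transaction, operation and data item) and all function names are assumptions made for this example; it ignores commits, aborts and the distribution of data across sites.

```python
from collections import defaultdict


def precedence_graph(schedule):
    """Build edges Ti -> Tj for every pair of conflicting operations.

    schedule: list of (transaction, op, item) with op in {"r", "w"}.
    Two operations conflict if they belong to different transactions,
    touch the same item, and at least one of them is a write.
    """
    edges = defaultdict(set)
    for i, (t1, op1, item1) in enumerate(schedule):
        for t2, op2, item2 in schedule[i + 1:]:
            if t1 != t2 and item1 == item2 and "w" in (op1, op2):
                edges[t1].add(t2)
    return edges


def has_cycle(edges):
    """Depth-first search with three colours to detect a directed cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every node starts WHITE

    def visit(node):
        colour[node] = GREY
        for nxt in edges.get(node, ()):
            if colour[nxt] == GREY:
                return True  # back edge: cycle found
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(edges))


def is_conflict_serializable(schedule):
    """A schedule is conflict serializable iff its precedence graph is acyclic."""
    return not has_cycle(precedence_graph(schedule))


if __name__ == "__main__":
    interleaved = [("T1", "r", "x"), ("T2", "r", "x"),
                   ("T1", "w", "x"), ("T2", "w", "x")]
    serial = [("T1", "r", "x"), ("T1", "w", "x"),
              ("T2", "r", "x"), ("T2", "w", "x")]
    print(is_conflict_serializable(interleaved))  # False
    print(is_conflict_serializable(serial))       # True
```

In the interleaved schedule both transactions read x before either writes it, so each must precede the other and the graph contains a cycle. In a distributed setting the same test would have to be applied to the combined schedule of all sites, which is part of what makes distributed concurrency control harder than the centralized case.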

If concurrent transactions are allowed to run in an uncontrolled manner, unexpected results may occur. Some typical examples are:

Lost Update Problem: A second transaction writes a new value of a data item on top of a value written by a first concurrent transaction, so the first value is lost. Other concurrently running transactions that need the first value end with incorrect results.

Dirty Read Problem: Transactions read a value written by a transaction that is later aborted. This value disappears from the database upon the abort and should not have been read by any transaction (a dirty read). The reading transactions end with incorrect results.

Incorrect Summary Problem: While one transaction computes a summary over the values of a repeated data item, a second transaction updates some instances of that data item. The resulting summary does not match the result of either serial order of the two transactions (one executed entirely before the other), which is usually what correctness requires. Instead, it is an arbitrary result that depends on the timing of the updates and on which updated values were included in the summary.

1.6 PROBLEM AREAS IN DISTRIBUTED DATABASE

Reliability of Distributed DBMS: When a failure occurs and various sites become either inoperable or inaccessible, the databases at the operational sites must remain consistent and up to date. Furthermore, when the computer system or the network recovers from the failure, the distributed database system should be able to recover and bring the databases at the failed sites up to date.

Distributed Directory Management: A directory contains information (such as descriptions and locations) about the data items in the database. A directory may be global to the entire distributed database system or local to each site. It can be centralized at one site or distributed over several sites.

Distributed Deadlock Management: When the synchronization mechanism is based on locking, the competition among users for access to a set of resources can result in a deadlock.
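The text above does not prescribe a deadlock handling method. As one possible illustration, the sketch below uses a wait-for graph, a common textbook technique in which an edge from T1 to T2 means that T1 is waiting for a lock held by T2 and a cycle indicates a deadlock. The site names, the graph encoding and the functions merge_wait_for and find_cycle are all hypothetical. The example shows a situation specific to distributed systems: neither site sees a cycle locally, but the merged global graph contains one.

```python
def merge_wait_for(*local_graphs):
    """Union several per-site wait-for graphs into one global graph."""
    merged = {}
    for graph in local_graphs:
        for waiter, blockers in graph.items():
            merged.setdefault(waiter, set()).update(blockers)
    return merged


def find_cycle(wait_for):
    """Return one cycle as a list of transactions, or None if there is none."""
    path, finished = [], set()

    def visit(txn):
        if txn in finished:
            return None
        if txn in path:
            # txn is already on the current path: the tail of the path is a cycle.
            return path[path.index(txn):] + [txn]
        path.append(txn)
        for blocker in sorted(wait_for.get(txn, ())):
            cycle = visit(blocker)
            if cycle:
                return cycle
        path.pop()
        finished.add(txn)
        return None

    for txn in sorted(wait_for):
        cycle = visit(txn)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    site_a = {"T1": {"T2"}}  # at site A, T1 waits for a lock held by T2
    site_b = {"T2": {"T1"}}  # at site B, T2 waits for a lock held by T1
    print(find_cycle(site_a))                          # None
    print(find_cycle(site_b))                          # None
    print(find_cycle(merge_wait_for(site_a, site_b)))  # ['T1', 'T2', 'T1']
```

Merging every local graph at one place is only one way to organize detection; it is used here because it keeps the example short, not because the thesis adopts it.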

Security of Distributed DBMS: The major issues in security are authentication, identification and the enforcement of appropriate access controls. Databases provide many layers and types of information security, typically specified in the data dictionary, including:
Access control: Access control is a system that enables an authority to control access to areas and resources in a given physical facility or computer-based information system. Within the field of physical security, an access control system is generally seen as the second layer of security.
Authentication: Authentication is the act of establishing or confirming something (or someone) as authentic, i.e. that the claims made by or about the subject are true.
Encryption: In cryptography, encryption is the process of transforming information (referred to as plaintext) using an algorithm (called a cipher) to make it unreadable to anyone except those possessing special knowledge, usually referred to as a key.
Integrity

Distributed Query Optimization: A database feature that reduces the amount of data transfer required between sites when a transaction retrieves data from remote tables referenced in a distributed SQL statement. Distributed query optimization uses cost-based optimization to find or generate SQL expressions that extract only the necessary data from remote tables, process that data at a remote site (or sometimes at the local site) and send the results to the local site for final processing. This reduces the amount of data transferred compared with moving all the table data to the local site for processing.

Load Balancing: A load balancing scheme comprises three phases: information collection, decision making based on that information, and data migration. Load balancing, or load distribution, refers to the general practice of distributing a load evenly. In a networked setting, it is the process by which inbound Internet Protocol (IP) traffic is distributed across multiple servers.
Typically, two or more web servers are employed in a load balancing scheme. If one of the servers begins to get overloaded, requests are forwarded to another server. Load balancing reduces service time by allowing multiple servers to handle the requests; a load balancer identifies which server has the availability to receive the traffic.

Checkpointing and Recovery (Fault Tolerance): The probability that a computing process fails increases greatly as the scale of the system grows. If a failure occurs in a computing process and there is no appropriate method to protect it, restarting the program wastes considerable cost. Checkpointing and rollback recovery are techniques that allow distributed computing to progress in spite of a failure and provide fault tolerance in distributed systems.

Cache Management: Caching popular objects close to clients is a fundamental technique for improving the performance and scalability of a system. Caching enables requests to be satisfied by a nearby copy and hence reduces not only the access latency but also the burden on the network and the server. A cache mechanism consists of two basic procedures: cache access algorithms and cache replacement policies. Cache access algorithms describe how clients and servers exchange messages and maintain consistency between the cached copies at clients and the original copies at servers. Caches are widely used in distributed systems to improve system performance, especially access latency. A replacement policy describes which data items are evicted from the cache when there is no space available to store a copy of a newly accessed data item. Replacement policies are important to the effectiveness of cache mechanisms, and a well-designed replacement policy can significantly improve system performance. Caching frequently asked queries is an effective way to improve the performance of both centralized and distributed database systems.

1.7 RESEARCH OBJECTIVES

The objective of this research is to improve the understanding of the distributed database environment and to contribute to advances in the areas of concurrency control, load balancing, network traffic management, checkpointing and security strategies in distributed databases. The present research contributes as follows:

Concurrency control in distributed databases is analyzed, including its characteristics, challenges, basic model and performance.
Related work on existing concurrency control algorithms is investigated.

A priority-based load balancing algorithm is proposed and implemented in Java. It balances the load on different nodes working in a homogeneous environment in a fragmented distributed database. A priority method based on memory and CPU utilization is used, and data locality is taken into consideration along with process waiting time and data transmission time.

A mobile, network-efficient, cost-effective, multilayer, peer-to-peer distributed model for an E-Polling System is proposed. The traditional voting system is modified by incorporating a system-generated unique ID in order to reduce the chances of duplicate or bogus voting. This system can cast and count votes with higher accuracy and efficiency, greatly reducing the rate of mistakes made with manual methods.

The problem of dynamic page generation delays in web sites is addressed by the proposed Dynamic Content Acceleration (DCA) solution. A fragment-level caching approach is used, which focuses on re-using HTML fragments of dynamic pages. The result is evaluated in terms of processing time.

A decentralized and cost-effective checkpointing algorithm suitable for cluster federation is proposed and implemented in Java. A single-message communication strategy for cluster federation in distributed databases is proposed, evaluated in terms of the communication cost incurred and compared with existing algorithms.

Addressing security demands under fixed budgets and deadline constraints is becoming extremely challenging, time consuming and resource intensive. A framework is proposed that embeds security capabilities into a distributed database by replicating different predefined security policies at different sites using a multilevel secure database management system.

Furthermore, a new optimal-bandwidth checkpointing algorithm that involves only active processes, suitable for applications prone to network failure in distributed systems, is presented and implemented in Java. The overhead of the algorithm in terms of communication cost and execution time is evaluated and compared with other existing algorithms.

1.8 THESIS ORGANIZATION

Chapter 2 starts with a brief description of the problem areas of distributed databases and then discusses the research work carried out in areas such as concurrency control, load balancing, query optimization, traffic control, checkpointing and recovery. Chapter 3 presents a dynamic load balancing algorithm for fragmented distributed databases based on memory utilization, CPU utilization and data locality. Chapter 4 proposes a mobile, network-efficient, cost-effective, multilayer, peer-to-peer distributed model for an E-Polling System. Chapter 5 proposes a query optimization model that works on the concept of caching entire pages of dynamically generated content. Chapter 6 develops a checkpointing algorithm for cluster federation that results in reduced transmission delay and communication cost, better bandwidth utilization and faster execution. Chapter 7 introduces a framework that embeds autonomic capabilities into a distributed database by replicating different predefined security policies at different sites using a multilevel secure database management system.