Chapter 7: Distributed Database Management Systems

Distributed Database Management System

When an organization is geographically dispersed, it may choose to store its databases on a central database server, to distribute them to local servers, or to use a combination of both. A distributed database is a single logical database that is spread physically across computers in multiple locations that are connected by a data communications network. We emphasize that a distributed database is truly a database, not a loose collection of files. The distributed database is still centrally administered as a corporate resource while providing local flexibility and customization.

A DDBMS is a centralized application that manages a distributed database. It synchronizes data periodically and ensures that any change users make to the data is reflected throughout the database.

Distributed DBMS

To have a distributed database, there must be a database management system that coordinates access to data at the various nodes. We will call such a system a distributed DBMS. Although each site may have a DBMS managing the local database at that site, a distributed DBMS will also perform the following functions:

1. Keep track of where data are located in a distributed data dictionary. This means, in part, presenting one logical database and schema to developers and users.
2. Determine the location from which to retrieve requested data and the location at which to process each part of a distributed query, without any special actions by the developer or user.
3. If necessary, translate the request at one node using a local DBMS into the proper request to another node using a different DBMS and data model, and return data to the requesting node in the format accepted by that node.
4. Provide data management functions, such as security, concurrency and deadlock control, global query optimization, and automatic failure recording and recovery.
5. Provide consistency among copies of data across the remote sites (e.g., by using multiphase commit protocols).

6. Be scalable. Scalability is the ability to grow, reduce in size, and become more heterogeneous as the needs of the business change. Thus, a distributed database must be dynamic and able to change within reasonable limits without having to be redesigned. Scalability also means that there are easy ways for new sites to be added (or to subscribe) and to be initialized (e.g., with replicated data).

Homogeneous distributed database

In a homogeneous distributed database system, all sites have identical database-management system software, are aware of one another, and agree to cooperate in processing users' requests.

Heterogeneous distributed database

In a heterogeneous distributed database, different sites may use different schemas and different database-management system software. The sites may not be aware of one another, and they may provide only limited facilities for cooperation in transaction processing.

Distributed Data Storage

Consider a relation r that is to be stored in the database. There are two approaches to storing this relation in the distributed database:

Replication. The system maintains several identical replicas (copies) of the relation and stores each replica at a different site. The alternative to replication is to store only one copy of relation r.

Fragmentation. The system partitions the relation into several fragments and stores each fragment at a different site.

Fragmentation and replication can be combined: a relation can be partitioned into several fragments, and there may be several replicas of each fragment. In the following subsections, we elaborate on each of these techniques.
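As a rough illustration of how fragmentation and replication combine, the following Python sketch models a toy data dictionary that records which sites hold a replica of each fragment. The Catalog class, its method names, and the site names are hypothetical and do not come from any real DDBMS. The read rule (prefer a local replica) and the write rule (reach every replica) are simplifying assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy data dictionary: fragment name -> sites holding a replica (hypothetical)."""
    placement: dict[str, list[str]] = field(default_factory=dict)

    def add_replica(self, fragment: str, site: str) -> None:
        # Record that `site` stores a copy of `fragment` (ignore repeats).
        sites = self.placement.setdefault(fragment, [])
        if site not in sites:
            sites.append(site)

    def read_site(self, fragment: str, local_site: str) -> str:
        # Assumed policy: read locally when a replica exists, else use the first site.
        sites = self.placement[fragment]
        return local_site if local_site in sites else sites[0]

    def write_sites(self, fragment: str) -> list[str]:
        # An update has to reach every replica of the fragment.
        return list(self.placement[fragment])


catalog = Catalog()
# Relation r is split into fragments r1 and r2; r1 is replicated at two sites.
catalog.add_replica("r1", "london")
catalog.add_replica("r1", "manchester")
catalog.add_replica("r2", "manchester")

print(catalog.read_site("r1", "manchester"))  # read served by the local replica
print(catalog.write_sites("r1"))              # every site that must see an update
```

In this sketch a query only names a fragment; the catalog decides which site serves it, which is the location-tracking role the function list assigns to the distributed data dictionary.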

Data Replication

If relation r is replicated, a copy of relation r is stored at two or more sites. In the most extreme case, we have full replication, in which a copy is stored at every site in the system. Replication has a number of advantages and disadvantages.

Availability. If one of the sites containing relation r fails, then relation r can be found at another site. Thus, the system can continue to process queries involving r despite the failure of one site.

Increased parallelism. When the majority of accesses to relation r only read the relation, several sites can process queries involving r in parallel. The more replicas of r there are, the greater the chance that the needed data will be found at the site where the transaction is executing. Hence, data replication minimizes movement of data between sites.

Increased overhead on update. The system must ensure that all replicas of a relation r are consistent; otherwise, erroneous computations may result. Thus, whenever r is updated, the update must be propagated to all sites containing replicas. The result is increased overhead. For example, in a banking system where account information is replicated at various sites, it is necessary to ensure that the balance in a particular account agrees at all sites.

Data Fragmentation

If relation r is fragmented, r is divided into a number of fragments r1, r2, ..., rn. These fragments contain sufficient information to allow reconstruction of the original relation r. There are two different schemes for fragmenting a relation: horizontal fragmentation and vertical fragmentation. Horizontal fragmentation splits the relation by assigning each tuple of r to one or more fragments. Vertical fragmentation splits the relation by decomposing the scheme R of relation r.

Horizontal fragmentation

In horizontal fragmentation, a relation r is partitioned into a number of subsets r1, r2, ..., rn. Each tuple of relation r must belong to at least one of the fragments, so that the original relation can be reconstructed if needed.
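A minimal sketch of horizontal fragmentation, assuming each fragment is defined by a selection predicate and the original relation is rebuilt as the union of the fragments. The account tuples, the branch names, and the fragment names r1 and r2 are hypothetical illustration data.

```python
# Hypothetical account tuples: (account_number, branch, balance).
accounts = [
    ("A-101", "London", 500),
    ("A-102", "Manchester", 400),
    ("A-103", "London", 900),
]

# One selection predicate per fragment; every tuple must satisfy at least one.
predicates = {
    "r1": lambda row: row[1] == "London",
    "r2": lambda row: row[1] == "Manchester",
}

# Each fragment keeps the tuples that satisfy its predicate.
fragments = {
    name: [row for row in accounts if keep(row)]
    for name, keep in predicates.items()
}

# Reconstruction is the union of the fragments. Using a set drops the
# duplicates that would appear if a tuple belonged to several fragments.
reconstructed = set().union(*fragments.values())

print(fragments["r1"])
print(reconstructed == set(accounts))
```

If some tuple satisfied none of the predicates, it would be missing from the union, which is why every tuple must belong to at least one fragment.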

Vertical fragmentation

Vertical fragmentation refers to the division of a relation into attribute (column) subsets. Each subset (fragment) is stored at a different node, and each fragment has unique columns with the exception of the key column, which is common to all fragments.

Example. Consider the following table:

Customer_Id  Name  Area        Payment Type  Sex
1            BOB   London      Credit Card   Male
2            Mike  Manchester  Cash          Male
3            Ruby  London      Cash          Female

Horizontal fragments are subsets of tuples (rows).

Fragment 1

Customer_Id  Name  Area        Payment Type  Sex
1            BOB   London      Credit Card   Male
2            Mike  Manchester  Cash          Male

Fragment 2

Customer_Id  Name  Area        Payment Type  Sex
3            Ruby  London      Cash          Female
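The vertical split defined in this section can be sketched in Python on the example Customer table. This is an illustration only: the helper names vertical_fragment and rejoin are hypothetical, and the choice of which columns go into which fragment is an assumption. Each fragment repeats the key column Customer_Id, and the original rows are rebuilt by joining the fragments on that key.

```python
# The example Customer table, one dict per row.
customers = [
    {"Customer_Id": 1, "Name": "BOB", "Area": "London",
     "Payment Type": "Credit Card", "Sex": "Male"},
    {"Customer_Id": 2, "Name": "Mike", "Area": "Manchester",
     "Payment Type": "Cash", "Sex": "Male"},
    {"Customer_Id": 3, "Name": "Ruby", "Area": "London",
     "Payment Type": "Cash", "Sex": "Female"},
]

KEY = "Customer_Id"


def vertical_fragment(rows, columns):
    """Project each row onto the key column plus the requested columns."""
    return [{KEY: row[KEY], **{col: row[col] for col in columns}} for row in rows]


def rejoin(left, right):
    """Rebuild full rows by joining two fragments on the shared key."""
    right_by_key = {row[KEY]: row for row in right}
    return [{**row, **right_by_key[row[KEY]]} for row in left]


personal = vertical_fragment(customers, ["Name", "Area", "Sex"])
payment = vertical_fragment(customers, ["Payment Type"])

rebuilt = rejoin(personal, payment)
print(payment)
print(rebuilt == customers)
```

Without the shared key column there would be no way to tell which payment row belongs to which customer, so the join could not restore the original table.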

Vertical fragments are subsets of attributes (columns).

Fragment 1

Customer_Id  Name  Area        Sex
1            BOB   London      Male
2            Mike  Manchester  Male
3            Ruby  London      Female

Fragment 2

Customer_Id  Payment Type
1            Credit Card
2            Cash
3            Cash

Components of Distributed Database System

Hardware. Each processing site (or node) that forms the database system can consist of various types of hardware. Nodes can be mainframes, minicomputers or microcomputers. Homogeneous nodes combine the same type of hardware, whereas heterogeneous nodes combine a mixture of hardware.

Communications Media. Network hardware and software allow each node to communicate and exchange data with the other nodes that make up the network. Local area networks typically use cables to transmit data from node to node, whereas telephone lines or satellites are used for more widely dispersed sites.

Software. A distributed database management system is a collection of data processors and transaction processors.

Data processors (DPs). These are programs that store and retrieve data at local sites. A DP could be an independent database management system such as Access or Oracle, or it could be a subset of the distributed database management system.

Transaction processors (TPs). These are programs that control and coordinate query and transaction data requests from local or remote sites. The TP analyses each data request to determine the update or retrieval locations it requires. TPs do this by accessing the Distributed Data Catalog (DDC), which contains a description of the entire database. Once specific data locations are determined, the TP transfers the data requests to the appropriate data processors. A TP could already exist as part of the distributed database management system, or it could be specifically written. TP features can also be manually incorporated into queries and transactions, and you will see examples of this when we explore distributed database transparency features in the next section.

Distributed Database Design

Designing a distributed computing system involves making decisions on the placement of data and programs across the nodes of a computer network, as well as on the design of the network itself. In the case of distributed databases, assuming that the network has already been designed and that there is a copy of the DBMS software on each node where data are stored, it remains to focus our attention on the distribution of data. There are in general two design alternatives.

Top-down approach: the general concepts and the global framework are defined first, and the details afterwards.

Bottom-up approach: the detailed modules are defined first, and the global framework afterwards.

If the system is built from scratch, the top-down method is more widely accepted. If the system has to fit in with existing systems, or some modules are already available, the bottom-up method is usually used.