Clustering

References:
Oracle Server Concepts Manual
Database System Concepts, Silberschatz/Korth, Sec. 10.7
Fundamentals of Database Systems, Elmasri/Navathe, Sec. 5.10

Stephen Mc Kearney, 2001.

Overview

What types of clustering exist? Intra-file Clustering, Inter-file Clustering
How does clustering work? Definition, Clustering in Pages
Compare clustered and unclustered? Unclustered Relations, Clustered Relations
Advantages & Disadvantages? Advantages, Disadvantages
When is it used? Applications
How is it implemented? Clustering Index
How is it implemented in Oracle? Clustering in Oracle
How do you decide to cluster data? Criteria for Clustering
How does clustering compare to B+-Trees? Comparison

Definition

Clustering means that records related to each other are stored physically beside each other. (Frank)

Clustering is a method of storing data on a disc. A cluster is used to store tuples from one or more relations physically close to other tuples in the database. The purpose of clustering is to speed up certain types of queries: tuples that are physically close to each other are retrieved more quickly than tuples that are not.

Because clustering affects how the data is actually stored on the disc, the decision to use clustering is part of the physical database design process. Clustering does not affect the applications that access the clustered relations. Clustered and unclustered relations appear the same to users of the system.

Intra-file Clustering

Data items in a single file are stored together.

    Supplier 1 | Supplier 2 | Supplier 3 | ... | Supplier n

Suppliers are stored in the order they are most often retrieved.

In intra-file clustering, records in a single file are stored close to related records in the same file. For example, if suppliers are normally ordered by their supplier number, then each supplier would be stored next to the supplier with the next highest supplier number.
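The effect of intra-file clustering on a range query can be sketched with a small simulation. This is only an illustration under stated assumptions: the page size, the arrival order of the supplier numbers and the helper names `pack_into_pages` and `pages_touched` are all hypothetical and do not come from the slides.

```python
PAGE_SIZE = 4  # records per page (assumed for illustration)


def pack_into_pages(records, page_size=PAGE_SIZE):
    """Split a list of records into consecutive fixed-size pages."""
    return [records[i:i + page_size] for i in range(0, len(records), page_size)]


def pages_touched(pages, wanted):
    """Count the pages that hold at least one of the wanted records."""
    return sum(1 for page in pages if any(rec in wanted for rec in page))


# Supplier numbers in the (hypothetical) order in which they were inserted.
arrival_order = [1, 6, 9, 12, 2, 5, 10, 7, 3, 8, 11, 4]

unclustered = pack_into_pages(arrival_order)        # stored as they arrived
clustered = pack_into_pages(sorted(arrival_order))  # stored in supplier-number order

wanted = {1, 2, 3, 4}  # a query for suppliers 1 to 4
print(pages_touched(unclustered, wanted))
print(pages_touched(clustered, wanted))
```

In this model the clustered file answers the query from a single page, while the unclustered file has to read every page, because suppliers 1 to 4 are spread across all of them.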

Inter-file Clustering

Data items in two or more files are stored together.

    Supplier 1
        Shipment A
        Shipment B
    Supplier 2
        Shipment C
        Shipment D
        Shipment E
    Supplier 3
        Shipment F
        Shipment G

Shipments from one file are stored beside suppliers in another file.

In inter-file clustering, records from one file are stored close to records from another file. For example, a shipment from a shipments file would be stored close to the supplier of the shipment.

Clustering Data in Pages

[Figure: a disc. Pages stored far apart are slower to retrieve because the disc must rotate further to read each page. Pages stored close together are quicker to retrieve because the disc must rotate less to read each page.]

Data that is stored close together will be quicker to retrieve.

Clustering affects the physical position of data on the disc. There are three cases:

Same page. When two data items are stored on the same page, they can be read with one page read operation. Because the computer reads one page at a time, data items stored on the same page are read at the same time.

Adjacent pages. When two data items are stored on pages that are close to each other on the disc, they can be read with two page read operations. Because the pages occur one after another, there is no disc head movement between reads (no seek time).

Separate locations. When two data items are stored in separate locations on the disc, they can be read with two page read operations and a seek operation. Because the pages are at separate locations, the disc head must move to a new position to read the second page.
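The three cases can be written down as a tiny cost function. This is a minimal sketch, not a model of a real disc: it assumes pages are numbered consecutively along the disc and treats "close to each other" as "page numbers differ by one". The function name `read_cost` is hypothetical.

```python
def read_cost(page_a, page_b):
    """Return (page_reads, seeks) needed to read two items.

    page_a and page_b are the page numbers holding each item.
    Assumption: pages with consecutive numbers sit one after another
    on the disc, so reading both needs no head movement.
    """
    if page_a == page_b:
        return (1, 0)  # same page: one read fetches both items
    if abs(page_a - page_b) == 1:
        return (2, 0)  # adjacent pages: two reads, no seek
    return (2, 1)      # separate locations: two reads and one seek


print(read_cost(5, 5))
print(read_cost(5, 6))
print(read_cost(5, 90))
```

Clustering aims to move as many frequently combined items as possible into the first case, or failing that the second.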

Unclustered Relations

[Figure adapted from the Oracle7 Server Concepts Manual: the emp and dept tables stored separately.]

Unclustered relations are stored in their own pages on the disc. That is, each page contains tuples from one relation only. The pages may be positioned anywhere on the disc. Therefore, to join two relations, at least two pages must be read from the disc: one page for each relation. In the figure, the emp relation (table) is stored at one location on the disc and the dept relation (table) is stored at another location.

Clustered Relations

[Figure adapted from the Oracle7 Server Concepts Manual: the emp and dept tables stored together in a cluster keyed on deptno.]

Clustered relations are stored using a cluster key. Each relation belonging to the cluster has an attribute corresponding to the cluster key. Each block stores tuples with a particular cluster key value. In the figure, the cluster key is deptno, and all the departments and employees with deptno=10 are stored together. This type of cluster will improve the performance of queries that join the emp and dept relations.

Note that each distinct cluster key value is stored only once. For example, the value deptno=10 is stored once and all tuples with deptno=10 are stored together with it.
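As a rough picture of what "one block per cluster key value" means, the sketch below groups dept and emp tuples by deptno. It is an in-memory illustration only, not Oracle's storage format; the sample rows and the function name `build_cluster` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows.
dept = [(10, "ACCOUNTING"), (20, "RESEARCH")]        # (deptno, deptname)
emp = [(101, 10), (102, 20), (103, 10), (104, 20)]   # (empno, deptno)


def build_cluster(dept_rows, emp_rows):
    """Group dept and emp tuples into one block per cluster key value.

    The cluster key (deptno) appears once, as the dictionary key,
    instead of being repeated in every tuple.
    """
    blocks = defaultdict(lambda: {"dept": [], "emp": []})
    for deptno, deptname in dept_rows:
        blocks[deptno]["dept"].append(deptname)
    for empno, deptno in emp_rows:
        blocks[deptno]["emp"].append(empno)
    return dict(blocks)


cluster = build_cluster(dept, emp)

# Joining emp and dept for deptno=10 only needs the block for key 10.
print(cluster[10])
```

In this model the join for one department is a single block lookup, which is the behaviour the slide describes for queries that join emp and dept.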

Advantages

Speeds up some queries. Uses less space.

    Supplier 1
        Shipment A    (these shipments are for supplier 1)
        Shipment B
    Supplier 2
        Shipment C
        Shipment D
        Shipment E
    Supplier 3
        Shipment F
        Shipment G

A query for all shipments of supplier 1 will be quick because all the shipments for supplier 1 follow immediately after supplier 1.

Clustering will speed up some database queries. For example, a cluster consisting of suppliers and shipments will speed up queries that request all the shipments for a particular supplier. The cluster improves the supplier/shipment query because the data for each shipment is stored on the same page as the corresponding supplier. Hence, when the supplier record is read, the set of shipments is read as well.

The cluster key value that is used to cluster relations is stored only once in each page. This may save disc space.

Disadvantages

Slows down some queries. Slows down writes.

    Supplier 1
        Shipment A
        Shipment B
    Supplier 2
        Shipment C
        Shipment D
        Shipment E
    Supplier 3
        Shipment F
        Shipment G

To read all the shipment records, the supplier records must also be read. A query for all shipments will be slow because the shipments are not stored together on the disc.

Clustering will slow down certain types of queries. For example, the cluster on suppliers and shipments will slow down queries that ask for all shipments, because the shipments are stored with each supplier. To read all the shipments, the DBMS must also read the supplier data.

Inserting new records into a cluster may also be slow. For example, adding a new shipment for supplier 1 will involve making space after shipment B.
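The trade-off between the two kinds of query can be made concrete by counting page reads for the supplier/shipment example. The sketch below is a simplified model under assumed values: four records per page, and every page that holds a wanted record costs one read. The helper names are hypothetical.

```python
PAGE_SIZE = 4  # records per page (assumed for illustration)


def paginate(records, page_size=PAGE_SIZE):
    """Split a list of records into consecutive fixed-size pages."""
    return [records[i:i + page_size] for i in range(0, len(records), page_size)]


def pages_holding(pages, wanted):
    """Count the pages that hold at least one wanted record."""
    return sum(1 for page in pages if any(rec in wanted for rec in page))


suppliers = ["S1", "S2", "S3"]
shipments = {"S1": ["A", "B"], "S2": ["C", "D", "E"], "S3": ["F", "G"]}

# Clustered: one file, each supplier followed by its shipments.
clustered = paginate([rec for s in suppliers for rec in [s] + shipments[s]])

# Unclustered: suppliers and shipments in separate files.
supplier_file = paginate(suppliers)
shipment_file = paginate([sh for s in suppliers for sh in shipments[s]])

s1_query = {"S1", "A", "B"}  # supplier 1 and all of its shipments
all_shipments = {sh for group in shipments.values() for sh in group}

clustered_s1 = pages_holding(clustered, s1_query)
unclustered_s1 = (pages_holding(supplier_file, s1_query)
                  + pages_holding(shipment_file, s1_query))
clustered_scan = pages_holding(clustered, all_shipments)
unclustered_scan = pages_holding(shipment_file, all_shipments)

print(clustered_s1, unclustered_s1)      # supplier 1 with its shipments
print(clustered_scan, unclustered_scan)  # every shipment
```

With these assumed sizes the cluster halves the reads for the supplier 1 query but needs an extra page for the scan of all shipments, because supplier records take up room in the pages being scanned.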

Applications 1 - Hierarchies

[Figure: an ER diagram (Customer, Order, Order Line), an ER instance in which each customer has orders and each order has order lines, and the corresponding cluster, in which every order is followed by its order lines and every customer is followed by its orders.]

A hierarchy of customer to orders to order lines.

Clustering is used when the data has a hierarchical structure. In the figure, the cluster would be used when the most common queries retrieve all the orders and order lines for a customer. A cluster for this structure would store all the order lines with their corresponding orders, and then store the orders and order lines with their corresponding customer.
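The storage order described for a hierarchy is a depth-first walk of the customer, order, order line tree. The sketch below shows that walk on a small hypothetical instance; the data and the function name `cluster_order` are illustrative assumptions, not taken from the figure.

```python
# Hypothetical instance: customer -> order -> list of order lines.
hierarchy = {
    "Customer 1": {
        "Order 1": ["Order Line 1", "Order Line 2"],
        "Order 2": ["Order Line 1", "Order Line 2"],
    },
    "Customer 2": {
        "Order 3": ["Order Line 1"],
    },
}


def cluster_order(customers):
    """Return the records in the order a hierarchical cluster would store them.

    Each customer is followed by its orders, and each order is
    followed immediately by its own order lines (depth-first order).
    """
    layout = []
    for customer, orders in customers.items():
        layout.append(customer)
        for order, lines in orders.items():
            layout.append(order)
            layout.extend(f"{order} / {line}" for line in lines)
    return layout


layout = cluster_order(hierarchy)
for record in layout:
    print(record)
```

Everything belonging to Customer 1 forms one contiguous run in the layout, so a query for all of that customer's orders and order lines reads neighbouring records only.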

Applications 2 - Lists

    List of Products: Product 1, Product 2, Product 3
    Cluster:          Product 1 | Product 2 | Product 3

A cluster may be used when queries retrieve lists of data items. In the example above, the cluster of products will improve queries requesting all the products.

Applications 3 - SQL Joins

Equi-joins

    SELECT name, address, deptname
    FROM emp, dept
    WHERE emp.deptno = dept.deptno

The emp and dept relations may be clustered on the deptno attribute.

A cluster may be used to cluster relations that are frequently joined together. In the example above, the relations emp and dept may be clustered on the deptno attribute. The value of each deptno will be stored once, together with all the corresponding emp and dept tuples.

Clustering Index

    Index on Deptno    Page    Records
    10                 P1      Dept, Employee, Employee, Employee (all records with deptno=10)
    20                 P2      Dept, Employee, Employee, Employee (all records with deptno=20)
    30                 P3      Dept, Employee, Employee, Employee (all records with deptno=30)

The DBMS uses a clustering index when it implements a cluster. The clustering index indexes the cluster key, which allows the DBMS to access the data in the cluster efficiently. The cluster index contains an entry for each cluster key value. The index may be a B+-Tree (ref: Elmasri, Sec. 6.1.2).
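A clustering index can be pictured as a sorted list of cluster key values, each pointing at the page that holds the records for that key. The sketch below uses a sorted list with binary search as a stand-in for a B+-Tree; it is not how any particular DBMS stores its index, and the function name `lookup` is hypothetical. The keys and page names follow the diagram above.

```python
import bisect

# One index entry per cluster key value, kept in key order.
index_keys = [10, 20, 30]
index_pages = ["P1", "P2", "P3"]


def lookup(deptno):
    """Return the page holding all records with this deptno, or None."""
    i = bisect.bisect_left(index_keys, deptno)
    if i < len(index_keys) and index_keys[i] == deptno:
        return index_pages[i]
    return None  # no cluster entry for this key value


print(lookup(20))
print(lookup(25))
```

Because there is exactly one entry per key value, one index probe finds the page that contains both the dept tuple and all of its employee tuples.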

Clustering in Oracle

There are three steps required to create a cluster in Oracle:

1. Create the cluster. The space for the cluster is allocated on the disc.

    CREATE CLUSTER emp_dept (deptno NUMBER(3));

2. Create the cluster index. Oracle requires a cluster index to be able to access the cluster. Therefore, the cluster index must exist before data can be added to the cluster.

    CREATE INDEX emp_dept_index ON CLUSTER emp_dept;

3. Create the tables. When the tables are created, a CLUSTER clause is added to the CREATE TABLE command to name the cluster to which the table will belong. Constraints are declared inside the column list, before the CLUSTER clause.

    CREATE TABLE dept (
        deptno NUMBER(3) PRIMARY KEY,
        ...
    ) CLUSTER emp_dept (deptno);

    CREATE TABLE emp (
        empno NUMBER(5),
        deptno NUMBER(3) REFERENCES dept,
        ...
    ) CLUSTER emp_dept (deptno);

Once the cluster has been created, the normal data manipulation commands (INSERT, DELETE, UPDATE, SELECT) may be used. Therefore, using a cluster to improve the performance of a database does not affect the application programs that access the data.


Criteria for Clustering

Query requirements: joins, lists, hierarchies. Space requirements: clustering may save space. Update requirements: clustering may slow updates.

Deciding to cluster a set of relations depends on three factors:

Query requirements. Clustering improves joins between relations because it stores related tuples together in the same page. When the most common queries involve joining two relations, a cluster may improve performance.

Space requirements. Because each cluster key value is stored only once, storing relations in a cluster can use less storage space than storing the same relations separately. If storage space is restricted, clustering the data may save space.

Update requirements. Clusters are difficult to update because space must be left to allow for additional clustered tuples. If space is not available, it may be necessary to move tuples between pages.
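The update cost can be illustrated by what happens when the block for one cluster key fills up. The sketch below uses an overflow chain, which is one common way to handle a full block and is a substitute here for the "move tuples between pages" approach mentioned above. The block capacity, the class `ClusterBlock` and the function `insert` are hypothetical.

```python
BLOCK_CAPACITY = 3  # tuples per block (assumed for illustration)


class ClusterBlock:
    """One block of tuples sharing a cluster key value."""

    def __init__(self, key):
        self.key = key
        self.tuples = []
        self.overflow = None  # next block in the chain, created when this one is full


def insert(block, row):
    """Insert row into the chain starting at block.

    Returns the number of blocks visited, as a rough measure of
    how much work the insert needed.
    """
    visited = 1
    while len(block.tuples) >= BLOCK_CAPACITY:
        if block.overflow is None:
            block.overflow = ClusterBlock(block.key)
        block = block.overflow
        visited += 1
    block.tuples.append(row)
    return visited


dept_10 = ClusterBlock(10)
costs = [insert(dept_10, f"emp{i}") for i in range(7)]
print(costs)
```

In this model the first few inserts touch one block, and later inserts for the same key visit more blocks as the chain grows. Reads of that key then also span several blocks, which is why free space is normally reserved in a cluster up front.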

Comparison with Other Techniques

    B+-Tree                                Cluster
    Fast access to individual tuples       Fast access across relations
    Does not affect the order of data      Changes the order of the data
    Can be ignored if not useful           Must be searched to access data
    Easy to create and delete              Difficult to create and delete

A B+-Tree is designed to provide fast access to individual tuples in a relation. A cluster is designed to improve the performance of queries that join two or more relations together.

A B+-Tree does not affect the order of the actual data. Although the index may be ordered, the actual data remains unordered. A cluster orders the actual data.

A B+-Tree does not have to be used to answer a query. It is possible to access the data directly if using the B+-Tree is too inefficient. As a cluster affects the physical ordering of the data, the cluster must be accessed to retrieve the data. Hence, a cluster will slow down certain queries.

A B+-Tree index is easy to create and delete because it is separate from the data. A cluster is difficult to create or change because it must be created before the data is added to the database. Deleting a cluster will destroy the data.

Partitioned Table

    CREATE TABLE sales (
        acct_no        NUMBER(5),
        acct_name      CHAR(30),
        amount_of_sale NUMBER(6),
        week_no        INTEGER
    )
    PARTITION BY RANGE ( week_no )
    (
        PARTITION sales1  VALUES LESS THAN ( 4 )  TABLESPACE ts0,
        PARTITION sales2  VALUES LESS THAN ( 8 )  TABLESPACE ts1,
        ...
        PARTITION sales13 VALUES LESS THAN ( 52 ) TABLESPACE ts12
    );

(Oracle Concepts Manual)
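Range partitioning routes each row to the first partition whose upper bound is greater than the row's week_no. The sketch below mimics that routing rule in Python. It assumes that the partitions elided in the statement continue in steps of four weeks (4, 8, ..., 52); that spacing is a guess made for illustration and is not stated in the slide. The function name `partition_for` is hypothetical.

```python
import bisect

# Exclusive upper bounds, one per partition (VALUES LESS THAN).
# ASSUMPTION: the elided partitions follow the 4-week spacing of the
# ones that are written out, giving 13 bounds from 4 to 52.
BOUNDS = list(range(4, 53, 4))
NAMES = [f"sales{i}" for i in range(1, 14)]


def partition_for(week_no):
    """Return the name of the partition that would hold this week_no."""
    i = bisect.bisect_right(BOUNDS, week_no)
    if i == len(BOUNDS):
        # No partition has an upper bound above this value.
        raise ValueError(f"week_no {week_no} is beyond the last partition bound")
    return NAMES[i]


print(partition_for(3))
print(partition_for(4))
print(partition_for(51))
```

Because each bound is exclusive, week 4 falls into the second partition rather than the first, and a value of 52 or more has no partition to go to under these bounds.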

Partitioned Index 1

[Figure from the Oracle Concepts Manual.]

Equipartitioned Tables

[Figure from the Oracle Concepts Manual.]

Better availability and reliability.

Disc Striping

[Figure from the Oracle Concepts Manual.]