Record Storage, File Organization, and Indexes

Similar documents
Overview of Storage and Indexing. Data on External Storage. Alternative File Organizations. Chapter 8

Overview of Storage and Indexing

Physical Data Organization

Chapter 8: Structures for Files. Truong Quynh Chi Spring- 2013

Lecture 1: Data Storage & Index

DATABASE DESIGN - 1DL400

File Management. Chapter 12

6. Storage and File Structures

Practical Database Design and Tuning

Database Systems. Session 8 Main Theme. Physical Database Design, Query Execution Concepts and Database Programming Techniques

CIS 631 Database Management Systems Sample Final Exam

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

Record Storage and Primary File Organization

Chapter 13. Disk Storage, Basic File Structures, and Hashing

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Unit Storage Structures 1. Storage Structures. Unit 4.3

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Physical Design. Phases of database design. Physical design: Inputs.

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1

Physical Database Design and Tuning

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13

1. Physical Database Design in Relational Databases (1)

Chapter 6: Physical Database Design and Performance. Database Development Process. Physical Design Process. Physical Database Design

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

Data storage Tree indexes

Storage and File Structure

INTRODUCTION The collection of data that makes up a computerized database must be stored physically on some computer storage medium.

Big Data and Scripting. Part 4: Memory Hierarchies

Chapter 12 File Management. Roadmap

Chapter 12 File Management

Chapter 12 File Management

B-Trees. Algorithms and data structures for external memory as opposed to the main memory B-Trees. B -trees

Data Structures and Algorithm Analysis (CSC317) Intro/Review of Data Structures Focus on dynamic sets

Physical DB design and tuning: outline

Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Krishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C

Storage in Database Systems. CMPSCI 445 Fall 2010

CHAPTER 17: File Management

University of Massachusetts Amherst Department of Computer Science Prof. Yanlei Diao

DATABASDESIGN FÖR INGENJÖRER - 1DL124

MS SQL Performance (Tuning) Best Practices:

Previous Lectures. B-Trees. External storage. Two types of memory. B-trees. Main principles

Databases and Information Systems 1 Part 3: Storage Structures and Indices

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

B+ Tree Properties B+ Tree Searching B+ Tree Insertion B+ Tree Deletion Static Hashing Extendable Hashing Questions in pass papers

Chapter 13: Query Processing. Basic Steps in Query Processing

Storing Data: Disks and Files. Disks and Files. Why Not Store Everything in Main Memory? Chapter 7

Database Design Patterns. Winter Lecture 24

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems

File Management. Chapter 12

CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team

SQL Query Evaluation. Winter Lecture 23

Outline. File Management Tanenbaum, Chapter 4. Files. File Management. Objectives for a File Management System

Heaps & Priority Queues in the C++ STL 2-3 Trees

Comp 5311 Database Management Systems. 16. Review 2 (Physical Level)

Binary Heap Algorithms

Two Parts. Filesystem Interface. Filesystem design. Interface the user sees. Implementing the interface

A. TRUE-FALSE: GROUP 2 PRACTICE EXAMPLES FOR THE REVIEW QUIZ:

Chapter 5: Logical Database Design and the Relational Model Part 2: Normalization. Introduction to Normalization. Normal Forms.

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

Principles of Database Management Systems. Overview. Principles of Data Layout. Topic for today. "Executive Summary": here.

From Last Time: Remove (Delete) Operation

Concurrency Control. Chapter 17. Comp 521 Files and Databases Fall

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file?

Deduplication Demystified: How to determine the right approach for your business

æ A collection of interrelated and persistent data èusually referred to as the database èdbèè.

Index Internals Heikki Linnakangas / Pivotal

Efficient Processing of Joins on Set-valued Attributes


In-Memory Databases MemSQL

Review of Hashing: Integer Keys

Outline BST Operations Worst case Average case Balancing AVL Red-black B-trees. Binary Search Trees. Lecturer: Georgy Gimel farb

Brief background What the presentation is meant to cover: Relational, OLTP databases the type that 90% of our applications use

Ordered Lists and Binary Trees

Part 5: More Data Structures for Relations

Storage Management for Files of Dynamic Records

Advanced Oracle SQL Tuning

Data Structures for Databases

Query Processing C H A P T E R12. Practice Exercises

Tune That SQL for Supercharged DB2 Performance! Craig S. Mullins, Corporate Technologist, NEON Enterprise Software, Inc.

Symbol Tables. Introduction

Question #1: What are the main issues considered during physical database design?

Normalisation and Data Storage Devices

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

The Classical Architecture. Storage 1 / 36

Learning Outcomes. COMP202 Complexity of Algorithms. Binary Search Trees and Other Search Trees

CSE 326: Data Structures B-Trees and B+ Trees

R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

1Z0-117 Oracle Database 11g Release 2: SQL Tuning. Oracle

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

DATA STRUCTURES USING C

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Analysis of Algorithms I: Binary Search Trees

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Data Deduplication and Tivoli Storage Manager

Algorithms for merged indexes

Guide to Performance and Tuning: Query Performance and Sampled Selectivity

Optimizing Performance. Training Division New Delhi

Transcription:

Record Storage, File Organization, and Indexes ISM6217 - Advanced Database Updated October 2005 1 Physical Database Design Phase! Inputs into the Physical Design Phase " Logical (implementation) model " Documentation/Definitions " DBMS Characteristics " Response time requirements " Security, Backup, Recovery, Retention, Integrity requirements! Logical Design Phase focuses on What; Physical Design Phases focuses on How. Updated October 2005 2

Physical Database Design Phase! Outputs: " Produces a Description of the Implementation of the Database on Secondary Storage. " Describes the storage structures and access methods used to achieve efficient access to the data.! Focus on Data Processing Efficiency Updated October 2005 3 Steps in Physical Database Design! Translate Logical (Implementation) Data Model for Target DBMS " Design base relations and constraints Updated October 2005 4

Steps in Physical Database Design! Design Physical Representation " Analyze transactions " Choose file organizations " Choose secondary indexes Our focus today " Introduction of controlled redundancy (Denormalization) " Estimate disk space requirements! There are additional steps that we ll talk about next week. Updated October 2005 5 Files and Records! Records " Contain fields (fixed and variable length fields) " Fixed-length " Variable-length " Spanned " Unspanned Updated October 2005 6

Files and Records! Block (page): The amount of data read or written in one I/O operation.! Blocking Factor: The number of physical records per block.! File Organization: How the physical records in a file are arranged on the disk.! Access Method: How the data can be retrieved based on the file organization. Updated October 2005 7 Index Classification! Primary vs. secondary: If search key contains primary key, then called primary index. " Unique index: Search key contains a candidate key.! Clustered vs. unclustered: If order of data records is the same as, or `close to, order of data entries, then called clustered index. " A file can be clustered on at most one search key. " Cost of retrieving data records through index varies greatly based on whether index is clustered or not! Updated October 2005 8

Hash-Based Indexes! Hash function calculates the address of the page on which the record is stored.! The field that s used in the hash function is called the hash field. " Called a hash key if the hash field is also a key field.! Good for equality searches! Not good for range searches Updated October 2005 9 Hash File Updated October 2005 10

Tree Indexes! Many DBMS use tree data structures to hold data/indexes.! Tree contains a hierarchy of nodes! Root node has no parent! Other nodes have one parent! Nodes can (but don t have to) have child nodes.! A node with no children is called a leaf node.! Balanced tree: depth (number of levels) between root and each leaf node is the same. Updated October 2005 11 B+ Tree! A B-tree that follows certain rules: " If the root is not a leaf node, it must have at least two children. " For a tree of order n, each node (other than root/leaf nodes) must have between n/2 and n pointers and children. " (other rules) " Tree must always be balanced! Every path from the root to a leaf must have same length. # This is what matters.! This means that it always takes about the same time to access any record. Updated October 2005 12

B+ Tree Indexes Non-leaf Pages Leaf Pages $ Leaf pages contain data entries, and are chained (prev & next) $ Non-leaf pages contain index entries and direct searches: index entry P 0 K 1 P 1 K 2 P 2 K m P m Updated October 2005 13 Example B+ Tree Root 17 Entries <= 17 Entries > 17 5 13 27 30 2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39*! Insert/delete: Find data entry in leaf, then change it. Need to adjust parent sometimes. " And change sometimes bubbles up the tree Updated October 2005 14

Ordered Indexes! An ordered file that stores the index field and a pointer.! Requires a binary search on the index file.! Types of Ordered Indexes " Primary " Clustering " Secondary Updated October 2005 15 Primary Indexes! An ordered, fixed-length file that stores the primary key and a pointer.! Each record in the index file corresponds to each block in the ordered data file.! Sparse / nondense index Updated October 2005 16

A Primary Index Updated October 2005 17 Clustering Indexes! When the records of a data file are ordered on a non-key field, this field is called a clustering field.! The clustering field does not have a distinct value for each record in the data file.! A clustering index is an ordered file that stores the clustering field and a pointer. Updated October 2005 18

Clustering Indexes! There is one record in the index file for each distinct value of the clustering field.! Sparse / nondense index Updated October 2005 19 Updated October 2005 20

Secondary Indexes! A secondary index is an ordered file that stores the nonordering field of a data file and a pointer.! If the indexing field is a secondary or candidate key, then the index is dense.! If the indexing field is a nonkey field, then the index is sparse.! Introduces the idea of adding a level of indirection Updated October 2005 21 Secondary Index on a Secondary Key Updated October 2005 22

Secondary Index on a Nonkey field Updated October 2005 23 Indexed Sequential Data File Organization! Ordered data file with a multilevel primary index file based on the ordering key field.! IBM s ISAM is an indexed sequential file organization with a two-level index. Updated October 2005 24

Understanding the Workload! For each query in the workload: " Which relations does it access? " Which attributes are retrieved? " Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?! For each update in the workload: " Which attributes are involved in selection/join conditions? How selective are these conditions likely to be? " The type of update (INSERT/DELETE/UPDATE), and the attributes that are affected. Updated October 2005 25 Choice of Indexes! What indexes should we create? " Which relations should have indexes? What field(s) should be the search key? Should we build several indexes?! For each index, what kind of an index should it be? " Clustered? Hash/tree? Updated October 2005 26

Choice of Indexes (Contd.)! One approach: Consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it. " Obviously, this implies that we must understand how a DBMS evaluates queries and creates query evaluation plans! " For now, we discuss simple 1-table queries.! Before creating an index, must also consider the impact on updates in the workload! " Trade-off: Indexes can make queries go faster, updates slower. Require disk space, too. Updated October 2005 27 Index Selection Guidelines! Attributes in WHERE clause are candidates for index keys. " Exact match condition suggests hash index. " Range query suggests tree index.! Clustering is especially useful for range queries; can also help on equality queries if there are many duplicates. Updated October 2005 28

Index Selection Guidelines (2)! Multi-attribute search keys should be considered when a WHERE clause contains several conditions. " Order of attributes is important for range queries. " Such indexes can sometimes enable index-only strategies for important queries.! For index-only strategies, clustering is not important!! Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering. Updated October 2005 29 Rules-of-Thumb for Indexes! Index primary key fields " DBMS may do this automatically " Composite keys require composite indexes! Index foreign key fields! Index other fields frequently used in: " WHERE clauses " GROUP BY clauses " ORDER BY clauses! Consider composite indexes when fields are used together in conditions Updated October 2005 30

Index Method Benefits/Costs! Heap file: " Efficient storage; fast scanning; fast inserts " Slow searches; slow deletes! Sorted file: " Efficient storage; faster searches than heap " Slow inserts and deletes Updated October 2005 31 Index Method Benefits/Costs (2)! Clustered file: " Efficient storage; faster searches than heap " Faster inserts/deletes than sorted file " Only one clustering factor per file! Unclustered tree: " Fast searches, fast inserts/deletes " Slow scans and range searches with many matches! Hash: " Same as unclustered, but faster on equality searches " Does not support range searches Updated October 2005 32

Summary! Many alternative file organizations exist, each appropriate in some situation.! If selection queries are frequent, sorting the file or building an index is important. " Hash-based indexes only good for equality search. " Sorted files and tree-based indexes best for range search; also good for equality search. (Files rarely kept sorted in practice; B+ tree index is better.)! Index is a collection of data entries plus a way to quickly find entries with given key values. Updated October 2005 33 Summary (Contd.)! Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.! Can have several indexes on a given file of data records, each with a different search key.! Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs. sparse. Differences have important consequences for utility/performance. Updated October 2005 34

Summary (Contd.)! Understanding the nature of the workload for the application, and the performance goals, is essential to developing a good design. " What are the important queries and updates? What attributes/relations are involved?! Indexes must be chosen to speed up important queries (and perhaps some updates!). " Index maintenance overhead on updates to key fields. " Choose indexes that can help many queries, if possible. " Build indexes to support index-only strategies. " Clustering is an important decision; only one index on a given relation can be clustered! " Order of fields in composite index key can be important. Updated October 2005 35