BM307 File Organization




BM307 File Organization Gazi University Computer Engineering Department 9/24/2014 1

Index
Sequential File Organization
  Binary Search
  Interpolation Search
  Self-Organizing Sequential Search
Direct File Organization
  Locating Information
  Hashing Functions
  Collision Resolution
  Coalesced Hashing

File Organization
Goal: organizing files efficiently in terms of both space and performance.

File Organization      File Access
sequential             sequential
indexed sequential     sequential & direct
direct                 direct (random)

File Access Types
Sequential: accessing multiple records (often an entire file), usually according to a predefined order.
Direct (random): locating a single record.
Question: How can we have an effective organization?
Answer: By matching the type of organization with the type of intended access.

Sequential File Organization: Background
Fields (e.g., employee name, number)
Records: contain data about individual entities
Files (e.g., employee list)
Primary key: the field(s) that uniquely distinguish a record from all others
Secondary keys: all the remaining fields

Sequential File Organization
A file consists of records of the same format:
  Fixed-length records
  Variable-length records
Sequential file organization: the (i+1)-st element of a file is stored immediately after the i-th element.

Sequential File Organization
Sequential access: moving from one record in the file to the next by incrementing the address of the current record by the record size.
Direct access: processing a single record directly, which is possible if we know its subscript.

Sequential File Organization: Sequential Search
Probe: an access to a distinct location.
In a file of N records, about N/2 probes are needed on average for a successful retrieval.
The entire file must be probed for an unsuccessful retrieval.
Computational complexity: O(N).
Appropriate when N is small.
Performance improvement? Sorting.
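A minimal sketch of sequential search with a probe counter, to make the N/2 average concrete. The function name and the 100-key file are hypothetical and chosen only for illustration.

```python
def sequential_search(records, key):
    """Return (index, probes); index is -1 when the key is absent."""
    probes = 0
    for index, record_key in enumerate(records):
        probes += 1  # each comparison touches one distinct location
        if record_key == key:
            return index, probes
    return -1, probes


keys = list(range(1, 101))  # hypothetical file of N = 100 record keys

# Average over all successful searches is (N + 1) / 2, i.e. about N/2.
average = sum(sequential_search(keys, k)[1] for k in keys) / len(keys)
print(average)                      # 50.5
print(sequential_search(keys, 0))   # (-1, 100): a miss probes the whole file
```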

E.g. - Sequential Search
100,000 records, each 400 bytes; block size is 2,400 bytes.
Sequential search time for retrieving 10,000 records?
Each probe reads one block of data: (100,000 * 400) / 2,400 = 16,667 blocks.
Reading time for one block: 0.84 ms (IBM 3380).
Time required for each record: (16,667 / 2) * 0.84 ms = 7 sec.
For 10,000 records: 7 sec * 10,000 = 70,000 sec, about 19 hours.
A better organization is needed!
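The arithmetic of this example can be checked with a few lines. The constants are the ones quoted on the slide; the variable names are only illustrative.

```python
import math

RECORDS = 100_000
RECORD_SIZE = 400        # bytes
BLOCK_SIZE = 2_400       # bytes
BLOCK_READ_MS = 0.84     # per-block read time quoted for the IBM 3380
LOOKUPS = 10_000

blocks = math.ceil(RECORDS * RECORD_SIZE / BLOCK_SIZE)
# On average half of the blocks are read before the record is found.
seconds_per_lookup = (blocks / 2) * BLOCK_READ_MS / 1000
total_hours = seconds_per_lookup * LOOKUPS / 3600

print(blocks)                        # 16667
print(round(seconds_per_lookup, 1))  # 7.0
print(round(total_hours, 1))         # 19.4
```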

Sequential File Organization: Binary Search
Requires the file to be sorted.
Compares the key of the sought record with the key of the middle record of the file.
Half of the remaining file is eliminated at each step.
Computational complexity: O(log_2 n).
E.g., the key of the sought record is 17.

Sequential File Organization: Binary Search (Algorithm)
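The algorithm figure from this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the standard iterative binary search; it is not necessarily the exact formulation shown on the slide, and the key list is hypothetical.

```python
def binary_search(keys, sought):
    """Return (index, probes) for `sought` in the sorted list `keys`, or (-1, probes)."""
    lower, upper = 0, len(keys) - 1
    probes = 0
    while lower <= upper:
        middle = (lower + upper) // 2
        probes += 1
        if keys[middle] == sought:
            return middle, probes
        if keys[middle] < sought:
            lower = middle + 1   # discard the lower half
        else:
            upper = middle - 1   # discard the upper half
    return -1, probes


keys = [2, 5, 8, 11, 17, 23, 29, 31, 40]  # hypothetical sorted keys
print(binary_search(keys, 17))   # (4, 1): 17 happens to be the middle record
print(binary_search(keys, 2))    # (0, 3)
```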

Sequential File Organization: Interpolation Search
Approximates the relative position of the sought key. E.g., searching for a name in a telephone book.
Chooses the next position for a comparison based upon the estimated position of the sought key relative to the remainder of the file to be searched:

$$\mathrm{NEXT} := \mathrm{LOWER} + \frac{\mathrm{key}[\mathrm{sought}] - \mathrm{key}[\mathrm{LOWER}]}{\mathrm{key}[\mathrm{UPPER}] - \mathrm{key}[\mathrm{LOWER}]}\,(\mathrm{UPPER} - \mathrm{LOWER})$$

Worst-case computational complexity: O(n).
Average-case computational complexity: O(log_2 log_2 n).
Its performance improves as the distribution of keys becomes more uniform.
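A minimal sketch of interpolation search using the NEXT formula above. Two details are assumptions, not taken from the slide: the estimated position is truncated to an integer with floor division, and a guard handles the case where all remaining keys are equal.

```python
def interpolation_search(keys, sought):
    """Return (index, probes) for `sought` in the sorted numeric list `keys`."""
    lower, upper = 0, len(keys) - 1
    probes = 0
    while lower <= upper and keys[lower] <= sought <= keys[upper]:
        if keys[upper] == keys[lower]:
            nxt = lower  # all remaining keys are equal; avoid dividing by zero
        else:
            # NEXT := LOWER + (sought - key[LOWER]) / (key[UPPER] - key[LOWER]) * (UPPER - LOWER)
            nxt = lower + (sought - keys[lower]) * (upper - lower) // (keys[upper] - keys[lower])
        probes += 1
        if keys[nxt] == sought:
            return nxt, probes
        if keys[nxt] < sought:
            lower = nxt + 1
        else:
            upper = nxt - 1
    return -1, probes


uniform = list(range(10, 1010, 10))          # hypothetical keys 10, 20, ..., 1000
print(interpolation_search(uniform, 730))    # (72, 1)
```

With perfectly uniform keys the first estimate lands on the sought record, which is the best case behind the O(log_2 log_2 n) average.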

Binary search should be preferred when the data is stored in primary memory. Why?
Interpolation search should be preferred when the data is stored in auxiliary memory. Why?

Binary search should be preferred when the data is stored in primary memory: the additional calculations needed for interpolation search cancel any savings gained from fewer probes.
Interpolation search should be preferred when the data is stored in auxiliary memory: one access to auxiliary storage takes an order of magnitude longer than the additional calculations, so fewer probes pay off.

Sequential File Organization: Self-Organizing Sequential Search
Modifies the order of the records.
Moves the most frequently retrieved records toward the beginning of the file.
Most popular algorithms:
  Move_to_front
  Transpose
  Count

Sequential File Organization: Move_to_front
The sought record is moved to the front position of the file.
It can make big mistakes: a record may be accessed, moved to the front of the file, and then rarely if ever accessed again.
A linked implementation is preferable even though it takes more storage.
Appropriate when space is not limited and locality of access is important.
Essentially the same as the LRU (least recently used) paging algorithm used by operating systems.

E.g. - Move_to_front
The records are accessed in the order of the letters in "fileediting".

Access     Resulting order
(initial)  a b c d e f g h i j k l m n o p r q s t v w y z
f          f a b c d e g h i j k l m n o p r q s t v w y z
i          i f a b c d e g h j k l m n o p r q s t v w y z
l          l i f a b c d e g h j k m n o p r q s t v w y z
e          e l i f a b c d g h j k m n o p r q s t v w y z
e          e l i f a b c d g h j k m n o p r q s t v w y z
d          d e l i f a b c g h j k m n o p r q s t v w y z
i          i d e l f a b c g h j k m n o p r q s t v w y z
t          t i d e l f a b c g h j k m n o p r q s v w y z
i          i t d e l f a b c g h j k m n o p r q s v w y z
n          n i t d e l f a b c g h j k m o p r q s v w y z
g          g n i t d e l f a b c h j k m o p r q s v w y z
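The Move_to_front trace can be replayed with a short sketch. A plain Python list is used for brevity, which is an assumption of this sketch; the slides recommend a linked implementation so that moving a record does not shift the others.

```python
def move_to_front(records, key):
    """Search `records` sequentially; on a hit, move the record to the front.

    Returns the number of probes used (len(records) on a miss).
    """
    for position, record in enumerate(records):
        if record == key:
            records.insert(0, records.pop(position))
            return position + 1
    return len(records)


records = "a b c d e f g h i j k l m n o p r q s t v w y z".split()
for key in "fileediting":
    move_to_front(records, key)
print(" ".join(records))   # g n i t d e l f a b c h j k m o p r q s v w y z
```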

Sequential File Organization: Transpose
Interchanges the sought record with its immediate predecessor.
More stable than the Move_to_front algorithm: a record needs to be accessed many times before it reaches the front of the list.
Easily implemented.
Does not need additional space.
Should be used when space is at a premium.

E.g. - Transpose
The records are accessed in the order of the letters in "fileediting".

Access     Resulting order
(initial)  a b c d e f g h i j k l m n o p r q s t v w y z
f          a b c d f e g h i j k l m n o p r q s t v w y z
i          a b c d f e g i h j k l m n o p r q s t v w y z
l          a b c d f e g i h j l k m n o p r q s t v w y z
e          a b c d e f g i h j l k m n o p r q s t v w y z
e          a b c e d f g i h j l k m n o p r q s t v w y z
d          a b c d e f g i h j l k m n o p r q s t v w y z
i          a b c d e f i g h j l k m n o p r q s t v w y z
t          a b c d e f i g h j l k m n o p r q t s v w y z
i          a b c d e i f g h j l k m n o p r q t s v w y z
n          a b c d e i f g h j l k n m o p r q t s v w y z
g          a b c d e i g f h j l k n m o p r q t s v w y z
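A minimal sketch of the Transpose rule, applied to the same access sequence. The function name is hypothetical; the only rule implemented is the one stated above, swapping the sought record with its immediate predecessor.

```python
def transpose(records, key):
    """Search sequentially; on a hit, swap the record with its predecessor.

    Returns the number of probes used (len(records) on a miss).
    """
    for position, record in enumerate(records):
        if record == key:
            if position > 0:  # the first record has no predecessor
                records[position - 1], records[position] = record, records[position - 1]
            return position + 1
    return len(records)


records = "a b c d e f g h i j k l m n o p r q s t v w y z".split()
for key in "fileediting":
    transpose(records, key)
print(" ".join(records))
```

Unlike Move_to_front, each access moves a record by only one position, so a single stray access disturbs the ordering very little.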

Sequential File Organization: Count
Keeps a count of the number of accesses of each record.
The file is always kept in decreasing order of access frequency.
Requires extra storage to keep the counts.
Use it only when the counts are needed for another purpose.

Direct File Organization
Ideally, we want to go directly to the address where the record is stored.
If the key itself can serve as a unique address, a record is located in one probe.
1-1 correspondence: key space 0 to 999-99-9999 maps onto address space 0 to 999-99-9999.
This needs more address space than necessary, e.g., 1 billion addresses for 300 million people.

Direct File Organization
Converting information into a unique address.
E.g., an airline reservation system:
  Flight numbers run from 1 to 999.
  Days are numbered from 1 to 366.
  The flight number and the day of the year can be concatenated to determine the location.
Location = flight number followed by day of the year; address range 001001-999366 (addresses ???367-???999 would not exist).
Location = day of the year followed by flight number; address range 001001-366999.
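The two concatenation orders can be sketched as below. The function names are hypothetical, and treating the concatenation as a six-digit decimal number is an assumption of this sketch.

```python
def location_flight_day(flight, day):
    """Address formed as flight number followed by day of the year (FFFDDD)."""
    if not (1 <= flight <= 999 and 1 <= day <= 366):
        raise ValueError("flight must be 1..999 and day 1..366")
    return flight * 1000 + day


def location_day_flight(flight, day):
    """Address formed as day of the year followed by flight number (DDDFFF)."""
    if not (1 <= flight <= 999 and 1 <= day <= 366):
        raise ValueError("flight must be 1..999 and day 1..366")
    return day * 1000 + flight


print(f"{location_flight_day(1, 1):06d}")       # 001001
print(f"{location_flight_day(999, 366):06d}")   # 999366
print(f"{location_day_flight(999, 366):06d}")   # 366999
```

Putting the day first gives a smaller upper bound (366999 instead of 999366), so fewer addresses at the top of the range go unused.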

Direct File Organization
The key converts to a probable address.
If we remove most of the empty spaces in the address space, we lose the 1-1 correspondence between keys and addresses.
Hashing functions map the wider range of key values into the narrower range of address values.
Hash(key) -> probable address
The initial probable address is the home address.
A hashing function should:
  Distribute the keys evenly among the addresses
  Execute efficiently

Direct File Organization
A collision occurs when two distinct keys map to the same address.
Key space 0 to 999-99-9999 maps into address space 0 to 1200.
Hashing is then composed of two aspects:
  The function
  The collision resolution method

Direct File Organization: Hashing Functions

Direct File Organization: Hashing Functions
Squaring: taking the square of a key and then substringing or truncating a portion of the result.
Radix conversion: the key is considered to be in a base other than 10 and is then converted into a number in base 10.
E.g., in base 11: 1234 = 1 * 11^3 + 2 * 11^2 + 3 * 11^1 + 4 * 11^0 = 1331 + 242 + 33 + 4 = 1610
Substringing or truncation could then be used.
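A minimal sketch of both functions. The slide does not say which portion of the square to keep, so taking the middle digits is an assumption here, as are the function names and the default of four digits.

```python
def squaring_hash(key, digits=4):
    """Square the key and keep `digits` digits from the middle of the result."""
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    return int(squared[start:start + digits])


def radix_conversion(key, base=11):
    """Read the decimal digits of `key` as a number written in `base`."""
    value = 0
    for digit in str(key):
        value = value * base + int(digit)
    return value


print(radix_conversion(1234))   # 1610
print(squaring_hash(1234))      # 1234**2 = 1522756 -> digits 2..5 -> 5227
```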

Direct File Organization: Hashing Functions
Polynomial hashing: the key is divided by a polynomial.
  f(information area) -> cyclic check bytes
Alphabetic keys: alphabetic or alphanumeric key values can be input to a hashing function if the values are interpreted as integers.

Direct File Organization: Collisions
For a given set of data, one hashing function may distribute the keys more evenly over the address space than another.
A hashing function that has a large number of collisions is said to exhibit primary clustering.
It is better to have a slightly more expensive hashing function for data that need to be stored on auxiliary storage.
Another method for reducing collisions is reducing the packing factor:
Packing factor = (number of records stored) / (total number of storage locations)

Direct File Organization: Collision Resolution
Collision resolution with links
Collision resolution without links
  Static positioning of records
  Dynamic positioning of records
Collision resolution with pseudolinks

Direct File Organization
Collision resolution with links
  If multiple synonyms occur for a particular home address, we form a chain of synonym records.
  Disadvantage: extra storage is needed.
Collision resolution without links
  We can use implied links by applying a convention, or set of rules, for deciding where to go next.
  A simple convention is to look at the next location in memory.
  Advantage: no extra storage is needed.
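The "look at the next location" convention can be sketched as follows. The hash function key mod table size, the wrap-around at the end of the table, and the sample keys are all assumptions made for this illustration.

```python
def insert_linear(table, key):
    """Insert `key`, probing consecutive locations (wrapping around) from its home address."""
    size = len(table)
    home = key % size
    for step in range(size):
        address = (home + step) % size
        if table[address] is None:
            table[address] = key
            return address
    raise RuntimeError("table is full")


def probes_linear(table, key):
    """Number of probes needed to retrieve `key`, or None if it is absent."""
    size = len(table)
    home = key % size
    for step in range(size):
        address = (home + step) % size
        if table[address] is None:
            return None          # an empty location ends the implied chain
        if table[address] == key:
            return step + 1
    return None


table = [None] * 11
for key in (27, 18, 29, 28, 39):
    insert_linear(table, key)
print(table)                       # 39 ends up three locations past its home address 6
print(probes_linear(table, 39))    # 4
```

No link fields are stored, but synonyms of different home addresses end up interleaved, which lengthens the searches.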

Direct File Organization: Coalesced Hashing
Coalescing occurs when we attempt to insert a record with a home address that is already occupied by a record from a chain with a different home address.
The two chains with records having different home addresses coalesce, or grow together.
(Figure: table after X, D, Y were inserted.)

Direct File Organization: Coalesced Hashing (E.g.)
Hash(key) = key mod 11
Keys inserted: 27, 18, 29, 28, 39, 13, 16; then 42 and 17 are added.
Average number of probes for the nine records: 1.8
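This example can be replayed with a minimal sketch of coalesced hashing with late insertion and no cellar. One assumption is not stated on the slide: a colliding record goes into the empty location with the highest address. The function names and the two parallel lists (`keys` for records, `links` for chain pointers) are hypothetical.

```python
def lisch_insert(keys, links, key):
    """Insert `key`; on a collision, append it to the end of the probe chain."""
    size = len(keys)
    address = key % size
    if keys[address] is None:
        keys[address] = key
        return address
    # Follow the chain that passes through the home address to its last record.
    while links[address] is not None:
        address = links[address]
    # Assumption: take the empty location with the highest address
    # (raises ValueError if the table is full).
    free = max(i for i in range(size) if keys[i] is None)
    keys[free] = key
    links[address] = free
    return free


def probes(keys, links, key):
    """Probes needed to retrieve `key` (assumed to be present)."""
    address = key % len(keys)
    count = 1
    while keys[address] != key:
        address = links[address]
        count += 1
    return count


keys, links = [None] * 11, [None] * 11
inserted = (27, 18, 29, 28, 39, 13, 16, 42, 17)
for key in inserted:
    lisch_insert(keys, links, key)

average = sum(probes(keys, links, key) for key in inserted) / len(inserted)
print(keys)                # [None, None, 13, 17, 42, 27, 28, 18, 16, 39, 29]
print(round(average, 1))   # 1.8
```

Key 42 hashes to address 9, which already holds 39 from the chain that starts at address 6, so the two chains coalesce; key 17 then needs four probes.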

Direct File Organization: Coalesced Hashing Discussion
Packing factor of the final table = 9/11 (82%).
One method of reducing coalescing is to reduce the packing factor.
It is advisable to place the most frequently accessed records early in the insertion process.
Deleting a record is complicated: if coalescing has occurred, a simple deletion procedure is to move a record from later in the probe chain into the position of the deleted record.
(Figure: final table after deleting 39.)

Direct File Organization: Coalesced Hashing Variants
Variants differ in:
  Table organization (whether or not a separate overflow area is used)
  The manner of linking a colliding item into a chain
  The manner of choosing unoccupied locations
Table organization
  Table = primary area + overflow area
  Address factor = (primary area size) / (total table size)
  Best performance when the address factor is 0.86.

Direct File Organization: Coalesced Hashing Variants
Late Insertion Standard Coalesced Hashing (LISCH)
  New records are inserted at the end of a probe chain.
  Has no cellar.
Late Insertion Coalesced Hashing (LICH)
  Uses a cellar.
  E.g., keys: 27, 18, 29, 28, 39, 13, 16, 42, 17; hashing function: key mod 7.
  Average number of probes: 1.3 (it was 1.8 for LISCH).
In general, for a 90 percent packing factor, using a cellar reduces the number of probes by about 6 percent compared with LISCH.
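A minimal sketch of LICH for this example. The slide gives only the hashing function key mod 7; this sketch assumes a total table of 11 locations, with addresses 0-6 as the primary area and 7-10 as the cellar, and assumes that the empty location with the highest address is used for a colliding record. All names are hypothetical.

```python
def lich_insert(keys, links, key, primary_size):
    """Late insertion with a cellar: home addresses lie in the primary area only."""
    address = key % primary_size
    if keys[address] is None:
        keys[address] = key
        return address
    while links[address] is not None:     # walk to the end of the probe chain
        address = links[address]
    # Assumption: highest empty address first, so the cellar fills before the primary area.
    free = max(i for i in range(len(keys)) if keys[i] is None)
    keys[free] = key
    links[address] = free
    return free


def probes(keys, links, key, primary_size):
    """Probes needed to retrieve `key` (assumed to be present)."""
    address = key % primary_size
    count = 1
    while keys[address] != key:
        address = links[address]
        count += 1
    return count


TABLE_SIZE, PRIMARY_SIZE = 11, 7          # assumed sizes; address factor 7/11
keys, links = [None] * TABLE_SIZE, [None] * TABLE_SIZE
inserted = (27, 18, 29, 28, 39, 13, 16, 42, 17)
for key in inserted:
    lich_insert(keys, links, key, PRIMARY_SIZE)

average = sum(probes(keys, links, k, PRIMARY_SIZE) for k in inserted) / len(inserted)
print(keys)                # [28, 29, 16, 17, 18, None, 27, None, 42, 13, 39]
print(round(average, 1))   # 1.3
```

The three colliding records (39, 13, 42) all land in the cellar, so no chain passes through another record's home address and nothing coalesces.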

Direct File Organization: Coalesced Hashing Variants
Early Insertion Standard Coalesced Hashing (EISCH)
  Inserts a new record into the position on the probe chain immediately after the record stored at its home address.
  E.g., insertion of the record with key 17 according to the EISCH algorithm, with Hash(key) = key mod 11.
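A minimal sketch of the early-insertion rule. The figure for this slide is not in the transcript, so the setup is an assumption: the same nine keys as in the earlier coalesced hashing example are all inserted with EISCH into an 11-location table, and the empty location with the highest address is used on a collision. Names are hypothetical.

```python
def eisch_insert(keys, links, key):
    """Insert `key`; on a collision, link it right after the record at its home address."""
    size = len(keys)
    home = key % size
    if keys[home] is None:
        keys[home] = key
        return home
    # Assumption: take the empty location with the highest address.
    free = max(i for i in range(size) if keys[i] is None)
    keys[free] = key
    links[free] = links[home]   # the new record points at the rest of the chain
    links[home] = free          # the home record now points at the new record
    return free


keys, links = [None] * 11, [None] * 11
for key in (27, 18, 29, 28, 39, 13, 16, 42, 17):
    eisch_insert(keys, links, key)

# Walk the chain that starts at the home address of 17 (17 mod 11 = 6).
address, chain = 6, []
while address is not None:
    chain.append(keys[address])
    address = links[address]
print(chain)   # [28, 17, 39, 42]
```

Key 17 sits second on its chain and is found in two probes, at the cost of pushing 39 one position further back.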

Direct File Organization: Coalesced Hashing Variants
Random Early Insertion Standard Coalesced Hashing (REISCH)
  Chooses a random unoccupied location for the new insertion.
  Gives only a 1% improvement over EISCH.
Random Late Insertion Standard Coalesced Hashing (RLISCH)
Bidirectional Late Insertion Standard Coalesced Hashing (BLISCH)
  Chooses the overflow location for a collision insertion by alternating the selection between the top and bottom of the table.
Bidirectional Early Insertion Standard Coalesced Hashing (BEISCH)

Direct File Organization: Coalesced Hashing Comparison