Datenbanksysteme II: Hashing. Ulf Leser

Similar documents
Chapter 13. Disk Storage, Basic File Structures, and Hashing

Data Warehousing und Data Mining

DATABASE DESIGN - 1DL400

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

Review of Hashing: Integer Keys

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1

Lecture 1: Data Storage & Index

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

INTRODUCTION The collection of data that makes up a computerized database must be stored physically on some computer storage medium.

Record Storage and Primary File Organization

CHAPTER 13: DISK STORAGE, BASIC FILE STRUCTURES, AND HASHING

- Easy to insert & delete in O(1) time - Don t need to estimate total memory needed. - Hard to search in less than O(n) time

Chapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search

Hash Tables. Computer Science E-119 Harvard Extension School Fall 2012 David G. Sullivan, Ph.D. Data Dictionary Revisited

B+ Tree Properties B+ Tree Searching B+ Tree Insertion B+ Tree Deletion Static Hashing Extendable Hashing Questions in pass papers

The Advantages and Disadvantages of Network Computing Nodes

Chapter 8: Structures for Files. Truong Quynh Chi Spring- 2013

Partitioning under the hood in MySQL 5.5

Physical Data Organization

HBase Schema Design. NoSQL Ma4ers, Cologne, April Lars George Director EMEA Services

Performance Tuning for the Teradata Database

Unit Storage Structures 1. Storage Structures. Unit 4.3

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

Datenbanksysteme II: Implementation of Database Systems Implementing Joins

Real-Time Software. Basic Scheduling and Response-Time Analysis. René Rydhof Hansen. 21. september 2010

Chapter 13: Query Processing. Basic Steps in Query Processing

Lecture 18: Applications of Dynamic Programming Steven Skiena. Department of Computer Science State University of New York Stony Brook, NY

Chapter 6: Physical Database Design and Performance. Database Development Process. Physical Design Process. Physical Database Design

Multi-dimensional index structures Part I: motivation

CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems

Binary Heap Algorithms

CS514: Intermediate Course in Computer Systems

File System Management

Krishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc()

Disks and RAID. Profs. Bracy and Van Renesse. based on slides by Prof. Sirer

What Is Recursion? Recursion. Binary search example postponed to end of lecture

Binary Search Trees. Data in each node. Larger than the data in its left child Smaller than the data in its right child

Data Storage - II: Efficient Usage & Errors

DATA STRUCTURES USING C

Physical DB design and tuning: outline

1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D.

Organization of Programming Languages CS320/520N. Lecture 05. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Fundamental Algorithms

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

Database Systems. Session 8 Main Theme. Physical Database Design, Query Execution Concepts and Database Programming Techniques

SQL Query Evaluation. Winter Lecture 23

The Tower of Hanoi. Recursion Solution. Recursive Function. Time Complexity. Recursive Thinking. Why Recursion? n! = n* (n-1)!

File Management Chapters 10, 11, 12

recursion, O(n), linked lists 6/14

Data Structures Fibonacci Heaps, Amortized Analysis

CS 464/564 Introduction to Database Management System Instructor: Abdullah Mueen

Part 5: More Data Structures for Relations

Introduction to Data Structures

New Hash Function Construction for Textual and Geometric Data Retrieval

Overview of Storage and Indexing

Storage Management for Files of Dynamic Records

10CS35: Data Structures Using C

Chapter 11 I/O Management and Disk Scheduling

B-Trees. Algorithms and data structures for external memory as opposed to the main memory B-Trees. B -trees

Raima Database Manager Version 14.0 In-memory Database Engine

6. Storage and File Structures

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

Dynamic Programming. Lecture Overview Introduction

Storage and File Structure

Storage in Database Systems. CMPSCI 445 Fall 2010

Lecture 2 February 12, 2003

COS 318: Operating Systems

14.1 Rent-or-buy problem

Introduction to Algorithms March 10, 2004 Massachusetts Institute of Technology Professors Erik Demaine and Shafi Goldwasser Quiz 1.

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Heaps & Priority Queues in the C++ STL 2-3 Trees

HP Service Manager Shared Memory Guide

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)

I/O Management. General Computer Architecture. Goals for I/O. Levels of I/O. Naming. I/O Management. COMP755 Advanced Operating Systems 1

CSE 326: Data Structures B-Trees and B+ Trees

Binary Search Trees. A Generic Tree. Binary Trees. Nodes in a binary search tree ( B-S-T) are of the form. P parent. Key. Satellite data L R

Abstract Data Type. EECS 281: Data Structures and Algorithms. The Foundation: Data Structures and Abstract Data Types

Database 2 Lecture I. Alessandro Artale

Previous Lectures. B-Trees. External storage. Two types of memory. B-trees. Main principles

Linux Process Scheduling Policy

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions. Linda Shapiro Spring 2016

Converting a Number from Decimal to Binary

CIS 631 Database Management Systems Sample Final Exam

S. Muthusundari. Research Scholar, Dept of CSE, Sathyabama University Chennai, India Dr. R. M.

Sample Questions Csci 1112 A. Bellaachia

CSE 2123 Collections: Sets and Iterators (Hash functions and Trees) Jeremy Morris

Course: Programming II - Abstract Data Types. The ADT Stack. A stack. The ADT Stack and Recursion Slide Number 1

Overview of Storage and Indexing. Data on External Storage. Alternative File Organizations. Chapter 8

Transcription:

Datenbanksysteme II: Hashing Ulf Leser

Content of this Lecture Hashing Extensible Hashing Linear Hashing Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 2

Sorting or Hashing Sorted or indexed files Typically log(n) IO for searching / deletions Overhead for keeping order in file or in index Low overhead (overflows) brings danger of degradation Multiple orders require multiple indexes multiple overhead Good support for range queries Can we do better on average? under certain circumstances? Hash files Can provide access in 1 IO Support searching for multiple attributes (with some overhead) Require notable overhead if table size changes considerably Are bad at range queries Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 3

Hash Files Set of buckets ( 1 blocks) B 0,...,B m-1, m>1 Hash function h(k) = {0,..., m-1} on a set K of values Hash table H (bucket directory) of size m with ptrs to B i s Hash files are structured according to one attribute only Hash table Buckets with overflow pages Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 4

Example Hash function on Name h (Name) = 0 if last character M 1 if last character N Why last char? Bucket 0 Bond George Victoria Bucket 1 Adams Carter Truman Wilson Washington Search Adams 1. h(adams)=1 2. Bucket 1, Block 0? Success Search Wilson 1. h(wilson)=1 2. Bucket 1, Block 0? 3. Bucket 1, Block 1? Success Search Elisabeth 1. h(elisabeth)=0 2. Bucket 0, Block 0? Failure Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 5

Efficiency of Hashing Given: n records, R records per block, m buckets Assume hash table is in main memory Average number of blocks per bucket: n / (m*r) Assuming a (perfect) uniformly distributing hash function Search n / (m*r) / 2 for successful search n / (m*r) for unsuccessful search Insert n / (m*r) if end of bucket cannot be accessed directly n / (2m*R) if free space in one of the bucket If H =m large enough and good hash function: 1 IO Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 6

Problems with Hashing Hashing may degenerate to sequential scans (heap file) If number of buckets static and too small If hash function produces large skew Extending hash range requires complete rehashing But fine-grained hashing require many buckets from the start Hash Function Examples: Modulo Function, Bit-Shifting Desirable: uniform mapping of hash keys onto m Ideal (i.e. uniform) mapping possible if data distribution and number of records are known in advance which is unusual Often, application-independent hash functions are used Incorporating knowledge on expected distribution of keys Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 7

Content of this Lecture Hashing Extensible Hashing Linear Hashing Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 8

Extensible Hashing Hashing as described is a static index structure Structure (buckets, hash function) is fixed once and never changed Better: Hash function should automatically adapt to size of data set and to distribution of values Principle idea Hash function generates (long) bitstring Should distribute values evenly on every position of bitstring Use growing prefix of this bitstring as index in hash table Size of prefix adapts to number of records Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 9

Hash functions h: K {0,1}* Size of bitstring should be long enough for mapping into as many buckets as maximally desired Though we do not use them all most of the time Example: inverse person IDs h(004) = 001000000... (4=0..0100) h(006) = 011000000... (6=0..0110) h(007) = 111000000... (7 =0..0111) h(013) = 101100000... (13 =0..01101) h(018) = 010010000... (18 =0..010010) h(032) = 000001000... (32 =0..0100000) H(048) = 000011000... (48 =0..0110000) Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 10

Example Parameters d: global depth of hash table, size of longest prefix currently used t: local depth of each bucket, size of prefix used in this bucket Example Let a bucket store two records Start with two buckets and 1 bit for identification (d=t 1 =t 2 =1) Keys as bitstring inverse h d=1 (k) 2125 100001001101 101100100001 1 2126 100001001110 011100100001 0 2127 100001001111 111100100001 1 Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 11

Example cont d k as bitstring inverse h d=1 2125 100001001101 101100100001 1 2126 100001001110 011100100001 0 2127 100001001111 111100100001 1 2129 100001010001 100010100001 1 New record with x=2129 Bucket for 1 full Needs to be split Double hash table, d++ Pointers to non-splitting blocks remain unchanged Block is split and records are distributed according to bits until new d Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 12

More Complex Example Assume reversed bit hash function on integers Currently four buckets in use Global depth d=3 Local depth t between 1 and 3 Size of global directory: 2 d =8 4 Bucket: 001 6 Bucket: 0118 Bucket: 000 32 48 Bucket: 1 7 13 Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 13

Example: Hash Table 000 001 010 011 100 101 110 111 4 Bucket: 001 6 Bucket: 0118 Bucket: 000 32 48 Bucket: 1 7 13 Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 14

Inserting Values Current content 40 = 101000 32 = 100000 18 = 010010 13 = 001101 12 = 001100 7 = 000111 6 = 000110 4 = 000100 000 001 010 011 100 101 110 111 INSERT( 28) 28 = 011100 h(28)=001110 d=t; Overflow 000: 32, 40; t=3 001: 4, 12; t=3 01: 6, 18; t=2 7, 13; t=1 Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 15

Splitting Deep Buckets 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 32, 40; t=3 12, 28; t=4 4; t=4 6, 18; t=2 h(12) = 001100 h(4) = 001000 h(28) = 001110 7, 13; t=1 Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 16

Next Insert 12, 28; t=4 32, 40; t=3 6, 18; t=2 4; t=4 0000 0001 0010 Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 17 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 INSERT( 5) 5 = 000101 h(5)=101000 7, 13; t=1 d t: Overflow but no dir duplication

Splitting Shallow Buckets Assume we split overflowed bucket B All records r B, h(r) has the same length-t prefix If we split at next position (t++) Generate new bucket and rehash records This might generate empty buckets Other bucket might still be overflowed repeat split In the example, we rehash 5=101000, 7=111000, 13=101100 Hence, one split suffices (with block prefixes 101 and 111) But, if we had 5=10100, 13=101100, 21=101010? Might even force a deep split with increase in d Suboptimal space usage Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 18

Summary Advantages Partly adapts to growing or shrinking number of records No rehashing of the entire table Essentially no limit in number of records Very fast if directory can be cached and h is well chosen Disadvantages Directory needs to be maintained requires LOG/REDO information, locking in split operations, storage, etc. No adaptation of bucket fill degree many buckets might be almost empty, few almost full Directory doubling is a unforeseen costly operation Everything smooth for a long time, suddenly one operation takes ages Values are not sorted, no range queries Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 19

Content of this Lecture Hashing Extensible Hashing Linear Hashing Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 20

Linear Hashing Similar scheme as in extensible hashing, but Don t double directory on overflow, but increase one-by-one Guaranteed lower bound on bucket fill-degree Tolerate some overflow blocks in buckets Hopefully few on average, if hash function spreads evenly Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 21

Overview h generates bitstring of length x, read right to left Parameters i: Current number of bits from x used As i grows, more bits are considered If h generates x bits, we use a 1 a 2 a i for the last i bits of h(k) n: Total number of buckets currently used Only the first n values of bitstrings of length i have their own buckets Sometimes, i must be increased - later r: Total number of records 011101010110 grows i Fix threshold t linear hashing guarantees that r/n<t As r increases, we sometimes increase n such that always r/n<t Linear hashing guarantees average fill-degree But does not prevent chaining in case of bad hash function x Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 22

Insert(k): First Action Insert k Let m by the integer value encoded by the i last bits of h(k) If m<n Store k in bucket m, potentially using overflow blocks Obviously, we store k in a bucket that does exist If m n Bucket m does not exist There exist buckets 0 n-1 We redirect k in a bucket that does exist Flip i-th bit (from the right) of m to 0 and store k in this bucket Algorithm ensures that here the i th (highest) bit must be 1 This flipping also needs to be done when searching keys Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 23

Insert( k): Second Action Check threshold; if r/n t, then If n=2 i No more room to add another bucket Set i++ This is only a conceptual increase no physical action Proceed (now n<2 i ) If n<2 i There is still/now room on our address space We add (n+1)th bucket and set n++ We need to choose which bucket to split We do not split the bucket where we just inserted (why should we?) We do not search for overflowed buckets (too costly) Instead, we follow a cyclic scheme Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 24

Which Bucket to Split We split buckets in fixed, cyclic order Split bucket with number n-2 i-1 As n increases, this pointer cycles through all buckets Let n=1a 2 a 3 a i ; then we split block a 2 a 3 a i into 0a 2 a 3 a i and 1a 2 a 3 a i Requires redistribution of bucket with key a 2 a 3 a i This is one of the buckets where we had put redirected records with h(k)>n This is not necessarily an overflowed bucket Only total fill degree is guaranteed Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 25

Buckets are Split in Fixed Order Assume we would split after every insert i n Existing buckets Bucket to split Generates 1 2=10 0,1 0 00 10 2 3=11 00,10 1 4=100 00,10 01,11 3 5=101 000,100 10, 01,11 6=110 000,100 001,101 10,11 7=111 000,100,001,101, 010,110, 11 1 01 11 00 000 100 01 001 101 10 010 110 11 011 111 Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 26

Example Assume 2 records in one block, x=4, t=1.49, i=1 Start 1a) Insert 0101 m=1<n=10 b Insert into bucket 1 But now r/n>t 0 0000 1010 1 1111 0 0000 1010 1 1111 0101 1b) Since n=2 i =2=10 b We need more address space Increase i (virtually) Add bucket number 2=10 b n=10 b =1a 1 : Split bucket 0 into 10 and 00 n++ 00 0000 01 1111 0101 10 1010 Split Yet unsplit Split Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 27

Example 2 2) Insert 0001 m=1, bucket exists Insert into m Requires overflow block (We skip the necessary split) 3a) Insert 0111 m=3=n=11 b Bucket doesn t exist Flip and redirect to 01 00 0000 01 1111 0101 10 1010 00 0000 01 1111 0101 10 1010 0001 0001 0111 3b) r/n=6/3>t We split n<4, so no need to increase i Add bucket number 3=11b Since n=3=11 b, with split 01 Delete overflow block 00 0000 01 0001 0101 10 1010 11 1111 0111 Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 28

Example 3 4a) Insert 0011 m=3=11 b < n=4=100 b Insert into 11 b 00 0000 01 0001 0101 10 1010 11 1111 0111 0011 4b) We must split again Since n=2 i, increase i Nothing to do physically ( Think a leading 0) 00 0000 01 0001 0101 10 1010 11 1111 0111 0011 Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 29

Example 4 4c) Split Add block number 4=100 b Split 000 b into 000 b and 100 b 000 0000 001 0001 0101 010 1010 011 1111 0111 100-0011 We keep the average bucket filling But we have unevenly filled buckets some empty, some overflow Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 30

Observations Due to the extension mechanism: 2 i-1 n < 2 i Whenever n reaches 2 i, i is increased => 2 i doubles and n=2 i /2 (for the new i) n as binary number always has the form 1b 1 b 2...b i-1 As defined: m<2 i But possibly: m>n If we drop the leading 1 in m, we get m new <m/2 Since n>2 i-1, m new n Thus, the chosen bucket must already exist We don t know when it will be split Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 31

Summary Advantages Adapts to varying number of records Slower growth and on average better space usage compared to extensible hashing If buckets are sequential on disk, we don t need a directory Compute m: look in m th block (possible after flipping) Disadvantages Can degrade, as buckets are split in fixed order No adaptation to skewed value distribution Ulf Leser: Implementation of Database Systems, Winter Semester 2012/2013 32