CSE332: Data Abstractions. Lecture 9: B Trees. Ruth Anderson, Winter 2011


Announcements
Project 2 posted! Partner selection due by 11pm Tues 1/25 at the latest. Homework due Friday, Jan 28 at the BEGINNING of lecture. (1/24/2011)

Today
Dictionaries: B-Trees

Our goal
Problem: a dictionary with so much data that most of it is on disk.
Desire: a balanced tree (logarithmic height) that is even shallower than an AVL tree, so that we can minimize disk accesses and exploit the disk-block size.
A key idea: increase the branching factor of our tree.

M-ary Search Tree
Build some sort of search tree with branching factor M:
- Have an array of sorted children (Node[]).
- Choose M to fit snugly into a disk block (1 access for the array).

A perfect tree of height h has (M^(h+1) - 1)/(M - 1) nodes (textbook, page 4). What is the height of this tree? What is the worst-case running time of find?

Number of hops for find: if balanced, log_M n instead of log_2 n. If M = 256, that's an 8x improvement. Example: with M = 256 and n = 2^40, that's 5 hops instead of 40.

To decide which branch to take, we divide into portions. Binary tree: less than the node value, or greater? M-ary: in range 1? In range 2? In range 3? ... In range M?

Runtime of find if balanced: O(log_2 M · log_M n). Hmmm... log_M n is the height we traverse. Why the log_2 M multiplier? Because at each step we find the correct child branch to take using binary search over the node's keys.
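The hop-count arithmetic above can be checked with a few lines. This sketch is ours, not the slides' code; the class and method names are illustrative:

```java
public class TreeHeight {
    // Hops for a balanced search tree over n items with branching factor m:
    // log_m(n), computed with the change-of-base formula log(n)/log(m).
    static long hops(double n, int m) {
        return Math.round(Math.log(n) / Math.log(m));
    }

    public static void main(String[] args) {
        double n = Math.pow(2, 40);       // n = 2^40 items
        System.out.println(hops(n, 2));   // binary tree: 40 hops
        System.out.println(hops(n, 256)); // 256-ary tree: 5 hops, an 8x improvement
    }
}
```

Note that log_256 n = (log_2 n) / (log_2 256) = (log_2 n) / 8, which is exactly the 8x improvement claimed on the slide.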

Problems with M-ary search trees
What should the order property be? How would you rebalance (ideally without more disk accesses)? Storing real data at inner nodes (like in a BST) seems kind of wasteful: to access the node, we have to load its data from disk, even though most of the time we won't use it. So let's use the branching-factor idea, but for a different kind of balanced tree: not a binary search tree, but still logarithmic height for any M > 2.

B+ Trees (we and the book say "B Trees")
Two types of nodes: internal nodes and leaves.
- Each internal node has room for up to M-1 keys and M children. No other data; all data is at the leaves!
- Order property: the subtree between keys x and y contains only data that is ≥ x and < y (notice the ≥).
- Leaf nodes have up to L sorted data items.
As usual, we'll ignore the "along for the ride" data in our examples. Remember: no data at non-leaves. For example, with keys 7 and 21 in an internal node, the three subtrees hold keys x < 7, then 7 ≤ x < 21, then 21 ≤ x.

Find
Remember: leaves store data; internal nodes are signposts. This differs from a BST in that we don't store values at internal nodes, but find is still an easy root-to-leaf recursive algorithm: at each internal node, do binary search on the (up to) M-1 keys to find the branch to take; at the leaf, do binary search on the (up to) L data items. But to get logarithmic running time, we need a balance condition.

Structure Properties
- Root (special case): if the tree has ≤ L items, the root is a leaf (occurs when starting up; otherwise unusual); else the root has between 2 and M children.
- Internal nodes: have between ⌈M/2⌉ and M children, i.e., are at least half full.
- Leaf nodes: all leaves are at the same depth and have between ⌈L/2⌉ and L data items, i.e., are at least half full.
(Any M > 2 and L will work; they are picked based on disk-block size.)

Example
Note on notation: inner nodes are drawn horizontally, leaves vertically to distinguish them, and empty cells are included. Suppose M=4 (max # pointers in an internal node) and L=5 (max # data items at a leaf). Then all internal nodes have at least 2 children, all leaves have at least 3 data items (only showing keys), and all leaves are at the same depth. [Example tree figure omitted.]

Balanced enough
It is not hard to show that the height h is logarithmic in the number of data items n. Let M > 2 (if M = 2, a linked-list-like tree is legal: no good!). Because all nodes are at least half full (except the root, which may have only 2 children) and all leaves are at the same level, the minimum number of data items n for a tree of height h > 0 is

n ≥ 2 · ⌈M/2⌉^(h-1) · ⌈L/2⌉

that is, the minimum number of leaves times the minimum data per leaf.
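The root-to-leaf descent in find hinges on picking the correct child at each signpost. A minimal sketch of that branch selection, assuming the ≥/< order property above (`childIndex` and its layout are our illustration, not the slides' code):

```java
import java.util.Arrays;

public class BTreeFind {
    // A simplified internal node: a sorted key array (signposts only).
    // Child i holds keys k with keys[i-1] <= k < keys[i].
    static int childIndex(int[] keys, int numKeys, int target) {
        // Binary search over the (up to M-1) keys to pick the branch.
        int pos = Arrays.binarySearch(keys, 0, numKeys, target);
        if (pos >= 0) return pos + 1; // exact match descends right (the >= side)
        return -(pos + 1);            // otherwise the insertion point is the child
    }

    public static void main(String[] args) {
        int[] keys = {7, 21};                         // the example's signpost node
        System.out.println(childIndex(keys, 2, 3));   // child 0: x < 7
        System.out.println(childIndex(keys, 2, 7));   // child 1: 7 <= x < 21
        System.out.println(childIndex(keys, 2, 40));  // child 2: 21 <= x
    }
}
```

Repeating this selection down the tree, then binary-searching within the leaf, is the whole find algorithm.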

Disk Friendliness
What makes B trees so disk friendly?
- Many keys are stored in one internal node, and all of them are brought into memory in one disk access IF we pick M wisely. This makes the binary search over M-1 keys totally worth it (insignificant compared to disk access times).
- Internal nodes contain only keys. Any find wants only one data item, so it would be wasteful to load unnecessary items along with the internal nodes. Instead we bring only one leaf of data items into memory, and the data-item size doesn't affect what M is.

Maintaining balance
So this seems like a great data structure (and it is). But we haven't implemented the other dictionary operations yet: insert and delete. As with AVL trees, the hard part is maintaining the structure properties. Example: for insert, there might not be room at the correct leaf.

Building a B-Tree (insertions)
With M = 3 and L = 3, the empty B-Tree's root is a leaf at the beginning, and we just need to keep the data in order as we insert. When we overflow a leaf, we split it into 2 leaves and the parent gains another child; if there is no parent (as when the root is still a leaf), we create one. How do we pick the key shown in the parent? The smallest element in the right subtree. Later insertions can split a leaf again, and eventually split an internal node (in this case, the root). [Worked example figures omitted: a sequence of insertions into a B-Tree with M = 3, L = 3, ending with "What now?" as the root itself overflows.]

Insertion Algorithm
1. Insert the data in its leaf in sorted order.
2. If the leaf now has L+1 items, overflow! Split the leaf into two nodes: the original leaf keeps the ⌈(L+1)/2⌉ smaller items, and a new leaf gets the ⌊(L+1)/2⌋ = ⌈L/2⌉ larger items. Attach the new child to the parent, adding the new key to the parent in sorted order.
3. If step (2) caused the parent to have M+1 children, overflow!

Note: given the leaves and the structure of the tree, we can always fill in the internal-node keys: each key is the smallest value in its right branch.

Insertion algorithm continued
3. If an internal node has M+1 children, split the node into two nodes: the original node keeps the ⌈(M+1)/2⌉ smaller items, and a new node gets the ⌊(M+1)/2⌋ = ⌈M/2⌉ larger items. Attach the new child to the parent, adding the new key to the parent in sorted order. Splitting at a node (step 3) could make the parent overflow too, so repeat step 3 up the tree until a node doesn't overflow. If the root overflows, make a new root with two children. This is the only case that increases the tree height.

Efficiency of insert
- Find the correct leaf: O(log_2 M · log_M n)
- Insert in leaf: O(L)
- Split leaf: O(L)
- Split parents all the way up to the root: O(M log_M n)
Total: O(L + M log_M n). But it's not that bad: splits are not that common (a split is only required when a node is FULL; M and L are likely to be large; and after a split, a node is half empty). Splitting the root is extremely rare. Remember, disk accesses were the name of the game: O(log_M n).

B-Tree Reminder: Another dictionary
Before we talk about deletion, keep the overall idea in mind: large data sets won't fit entirely in memory, and disk access is slow. So we set up the tree to do one disk access per node, and then our goal is to keep the tree as shallow as possible. A balanced binary tree is a good start, but we can do better than log_2 n height: in an M-ary tree, the height drops to log_M n. Why not set M really, really high? That would give a height-1 tree, but a single node would no longer fit in one disk block. Instead, set M so that each node fits in a disk block.

And Now for Deletion
Easy case: after removing the item, the leaf still has enough data; just remove it. [Worked example figure omitted; M = 3, L = 3.]
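Step 2 of the insertion algorithm, the leaf split, can be sketched in a few lines. This is our illustration under the stated split rule, not the slides' code; `splitLeaf` is a hypothetical helper:

```java
import java.util.Arrays;

public class LeafSplit {
    // Split an overflowing leaf of L+1 sorted items (step 2 of insertion):
    // the original leaf keeps the ceil((L+1)/2) smaller items, a new leaf the rest.
    // Returns {originalLeaf, newLeaf}; the key pushed up to the parent is
    // newLeaf[0], the smallest value in the right branch.
    static int[][] splitLeaf(int[] overflowing) {
        int keep = (overflowing.length + 1) / 2;   // ceil((L+1)/2)
        return new int[][] {
            Arrays.copyOfRange(overflowing, 0, keep),
            Arrays.copyOfRange(overflowing, keep, overflowing.length)
        };
    }

    public static void main(String[] args) {
        // With L = 3, a fourth item overflows the leaf.
        int[][] halves = splitLeaf(new int[] {2, 5, 8, 9});
        System.out.println(Arrays.toString(halves[0])); // [2, 5]
        System.out.println(Arrays.toString(halves[1])); // [8, 9]
        System.out.println(halves[1][0]);               // 8 becomes the parent's new key
    }
}
```

Splitting an internal node in step 3 follows the same pattern on its children array.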

[Worked deletion example figures omitted: a sequence of Delete operations with M = 3, L = 3, showing a leaf adopting an item from a neighbor, leaves merging when neither has an item to spare, and a merge propagating into the internal nodes.]

Deletion Algorithm
1. Remove the data from its leaf.
2. If the leaf now has ⌈L/2⌉ - 1 items, underflow!
   - If a neighbor has > ⌈L/2⌉ items, adopt one and update the parent.
   - Else, merge the node with a neighbor. This is guaranteed to give a legal number of items, but the parent now has one less child.
3. If step (2) caused the parent to have ⌈M/2⌉ - 1 children, underflow!

Deletion algorithm continued
3. If an internal node has ⌈M/2⌉ - 1 children:
   - If a neighbor has > ⌈M/2⌉ children, adopt one and update the parent.
   - Else, merge the node with a neighbor. This is guaranteed to give a legal number of children, but the parent now has one less child, so we may need to continue up the tree.
If we merge all the way up through the root, that's fine, unless the root goes from 2 children to 1. In that case, delete the root and make its child the new root. This is the only case that decreases the tree height.
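The adopt-vs-merge decision in step 2 can be sketched as follows, assuming the half-full invariant ⌈L/2⌉ from the structure properties; `fix` and its return strings are our illustrative names, not the slides' code:

```java
public class LeafUnderflow {
    // Minimum items a non-root leaf may hold: ceil(L/2).
    static int minItems(int L) { return (L + 1) / 2; }

    // Step 2 of deletion: after removal, a leaf holds 'items' entries and
    // a neighbor holds 'neighborItems'. Decide how to restore the invariant.
    static String fix(int items, int neighborItems, int L) {
        if (items >= minItems(L)) return "ok";           // no underflow; done
        if (neighborItems > minItems(L)) return "adopt"; // neighbor can spare one
        return "merge"; // combined size <= L, so the merged leaf is legal
    }

    public static void main(String[] args) {
        int L = 3;                         // leaves hold 2..3 items
        System.out.println(fix(2, 2, L));  // ok: still at least half full
        System.out.println(fix(1, 3, L));  // adopt: neighbor has a spare item
        System.out.println(fix(1, 2, L));  // merge: 1 + 2 = 3 <= L
    }
}
```

The internal-node case in step 3 is the same decision with ⌈M/2⌉ children in place of ⌈L/2⌉ items.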

Efficiency of delete
- Find the correct leaf: O(log_2 M · log_M n)
- Remove from leaf: O(L)
- Adopt from or merge with a neighbor: O(L)
- Adopt or merge all the way up to the root: O(M log_M n)
Total: O(L + M log_M n). But it's not that bad: merges are not that common, and remember, disk accesses were the name of the game: O(log_M n).

Insert vs delete comparison
Insert: find the correct leaf O(log_2 M · log_M n); insert in leaf O(L); split leaf O(L); split parents all the way up to the root O(M log_M n).
Delete: find the correct leaf O(log_2 M · log_M n); remove from leaf O(L); adopt from/merge with a neighbor leaf O(L); adopt or merge all the way up to the root O(M log_M n).

B Trees in Java?
For most of our data structures, we have encouraged writing high-level, reusable code, such as Java with generics. It is worth knowing enough about how Java works to understand why this is probably a bad idea for B trees, assuming our goal is an efficient number of disk accesses. Java has many advantages, but it wasn't designed for this. If you just want a balanced tree with worst-case logarithmic operations, no problem: if M=3, this is called a 2-3 tree; if M=4, this is called a 2-3-4 tree. The key issue is extra levels of indirection.

Naïve approaches
Even if we assume data items have int keys, you cannot get the data representation you want for really big data:

  interface Keyed<E> { int key(E e); }
  class BTreeNode<E extends Keyed<E>> {
    static final int M = 8;
    int[] keys = new int[M-1];
    BTreeNode<E>[] children = new BTreeNode[M];
    int numChildren = 0;
  }
  class BTreeLeaf<E> {
    static final int L = 2;
    E[] data = (E[]) new Object[L];
    int numItems = 0;
  }

What that looks like / The moral
All the red references in the slide's diagram indicate unnecessary indirection: the node objects carry header words, each keys/children/data array is a separate object, and the data objects themselves are not in contiguous memory. [Diagram omitted.] The whole idea behind B trees was to keep related data in contiguous memory, but that's the best you can do in Java. Again, the advantage is generic, reusable code; but for your performance-critical web index, this is not the way to implement a B-Tree for terabytes of data. C# may have better support for flattening objects into arrays; C and C++ definitely do. Levels of indirection matter!

Conclusion: Balanced Trees
Balanced trees make good dictionaries because they guarantee logarithmic-time find, insert, and delete. Essential and beautiful computer science. But only if you can maintain balance within the time bound:
- AVL trees maintain balance by tracking height and allowing all children to differ in height by at most 1.
- B trees maintain balance by keeping nodes at least half full and all leaves at the same height.
Other great balanced trees (see the text; worth knowing they exist):
- Red-black trees: all leaves have depth within a factor of 2.
- Splay trees: self-adjusting; amortized guarantees; no extra space for height information.