1, Big Data and Scripting Part 4: Memory Hierarchies
2, Model and Definitions
- memory size: M machine words
- total storage (on disk): N elements (N is very large)
- disk size: unlimited (for our considerations)
- block: B machine words
- I/O operation: reading or writing one block
- topic
  - provide data structures using external memory
  - minimize I/O operations
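To make the cost measure concrete, the following minimal sketch counts I/O operations for a sequential scan in this model. It assumes that one element occupies one machine word; the function name `scan_ios` is hypothetical and only serves the illustration.

```python
def scan_ios(n: int, b: int) -> int:
    """I/O operations needed to read n elements sequentially,
    assuming one element per machine word and blocks of b words."""
    return -(-n // b)  # ceiling of n / b, using integer arithmetic only


# Reading 10**9 elements with blocks of 1024 words:
print(scan_ios(10**9, 1024))  # 976563 block reads
# Accessing the same elements one by one in random order may instead
# cost up to one I/O operation per element, i.e. 10**9 block reads.
```

The gap between the two counts is the reason the data structures on the following slides group related keys into the same block.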
3, B-Trees: basic idea
- store elements identified by keys
- keys are sortable (e.g. from ℕ, the natural numbers)
- construct a tree with two types of nodes:
  - leaves
    - list of elements and keys
    - keys within some interval [k_1, k_n]
  - inner nodes (including root)
    - list of (sorted) keys k_1 < ... < k_n, n ≤ B
    - list of children c_0, ..., c_n
    - elements with key k, k_i ≤ k < k_{i+1}, are in a leaf below c_i
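The two node types and the ordering rule can be sketched as follows. The class names `Leaf` and `Inner` and the checker `is_valid` are assumptions made for this illustration; leaves store only keys here, and the missing borders k_0 and k_{n+1} are taken to be minus and plus infinity.

```python
from dataclasses import dataclass
from math import inf


@dataclass
class Leaf:
    keys: list  # sorted keys of the stored elements


@dataclass
class Inner:
    keys: list      # k_1 < ... < k_n
    children: list  # c_0, ..., c_n (one more child than keys)


def is_valid(node, lo=-inf, hi=inf):
    """Check that every key below node lies in [lo, hi) and that
    child c_i only holds keys k with k_i <= k < k_{i+1}."""
    if any(not (lo <= k < hi) for k in node.keys):
        return False
    if any(a >= b for a, b in zip(node.keys, node.keys[1:])):
        return False  # keys must be strictly increasing
    if isinstance(node, Leaf):
        return True
    if len(node.children) != len(node.keys) + 1:
        return False
    borders = [lo] + node.keys + [hi]
    return all(
        is_valid(child, borders[i], borders[i + 1])
        for i, child in enumerate(node.children)
    )


tree = Inner([11, 21], [Leaf([4, 7]), Leaf([11, 14, 18]), Leaf([21, 24, 30])])
print(is_valid(tree))  # True
```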
4, B-Trees: example
[Figure: example B-Tree. Inner-node key lists shown: 4,7,11 / 11,21,30 / 14,18,20 / 24,27,30. Leaf key intervals shown: 1,4 / 4,7 / 7,11 / 11,14 / 14,18 / 18,20 / 20,24 / 24,27 / 27,30. The leaves hold elements with keys between 1 and 28.]
- stored content
  - inner nodes: O(B) keys and addresses
  - leaf nodes: B / (size of content) elements
  - each node fits in one block
5, B-Trees: applications
- fast storage and retrieval of (key, value) pairs
- keep a dynamically sorted list, e.g. a priority queue
- range reporting
  - elements with keys in the range [k_1, k_2] can be extracted as a subtree
- usage in the external-memory scenario
  - priority assessment: which data blocks are most likely needed in the near future?
  - insert blocks into a B-Tree using the priority as key
  - keep only the top of the B-Tree in memory
6, B-Trees: retrieval of an element
- retrieve element with key k:
    current = root;                  // root of the tree (is inner node)
    while (current is inner node) {
        choose i such that k_i ≤ k < k_{i+1};
        current = c_i;               // switch to corresponding child
    }
    return (element with key k in current);
- choose i with binary search
- find the element in the leaf also by binary search
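The retrieval loop can be sketched in Python with the standard `bisect` module doing both binary searches. The node classes and the function name `search` are assumptions for this illustration; the nodes live in memory here, whereas in external memory every step to a child would cost one block read.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class Leaf:
    keys: list    # sorted keys
    values: list  # values[j] belongs to keys[j]


@dataclass
class Inner:
    keys: list      # k_1 < ... < k_n
    children: list  # c_0, ..., c_n


def search(root, k):
    """Return the value stored under key k, or None if k is absent."""
    current = root
    while isinstance(current, Inner):
        # Number of keys <= k is exactly the i with k_i <= k < k_{i+1}.
        i = bisect_right(current.keys, k)
        current = current.children[i]  # switch to corresponding child
    j = bisect_left(current.keys, k)   # binary search inside the leaf
    if j < len(current.keys) and current.keys[j] == k:
        return current.values[j]
    return None


tree = Inner(
    [11, 21],
    [
        Leaf([4, 7], ["a", "b"]),
        Leaf([11, 14, 18], ["c", "d", "e"]),
        Leaf([21, 24, 30], ["f", "g", "h"]),
    ],
)
print(search(tree, 14))  # d
print(search(tree, 5))   # None
```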
7, B-Trees: logarithmic access
- all leaves have the same distance to the root node
- level of a node: distance to the leaves
- weight of a node: number of leaves in its subtree
- balance invariant: every node has at least B/2 and at most B children
- with the invariant:
  - descending one level reduces the number of leaves below the current node by a factor of at least B/2, i.e. Θ(B)
  - at most O(log_B N) descents to reach a leaf
  - B is constant → O(1) time in each node
- note: larger B → smaller height of the root
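The height bound can be checked numerically. The sketch below assumes that every inner node, including the root, has at least B/2 children, as stated in the invariant; the function name `max_levels` is hypothetical.

```python
def max_levels(num_leaves: int, b: int) -> int:
    """Smallest h with (b // 2) ** h >= num_leaves.

    A tree in which every inner node has at least b // 2 children
    cannot be higher than this while having only num_leaves leaves."""
    min_fanout = b // 2
    if min_fanout < 2:
        raise ValueError("need b >= 4 so that every node has >= 2 children")
    levels, reachable = 0, 1
    while reachable < num_leaves:
        reachable *= min_fanout  # one more level multiplies the leaf count
        levels += 1
    return levels


print(max_levels(10**6, 4))     # 20
print(max_levels(10**6, 1024))  # 3
```

With a million leaves, raising B from 4 to 1024 reduces the number of block reads per lookup from at most 20 to at most 3.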
8, B-Trees: storage in external memory
- store each node (inner nodes and leaves) in one block on disk
- inner nodes:
  - 1/2 block size for key intervals
  - 1/2 block size for pointers to children
  - each inner node has (up to) 1/2 (block size) children
- leaves: e.g. list of keys pointing to the position of the values in the block
- disk usage: depends on N and the size of the values; assume k blocks for storage
- height: O(log_B k); on level i: B^i blocks → O(k) blocks for indexing
9, B-Trees: inserting elements
- B-Trees administer dynamic data structures
- consequently: data insertion and deletion
- problem: balance invariant
- insert element:
    current = leaf for element;
    insert element in current;
    while (size(current) > B) {
        current = split(current);
    }
10, B-Trees: splitting a node
- keep the balance invariant for insertions by splitting large nodes
- idea:
  - split node
  - adjust addressing in parent
- split(node):
    find median m
    create two new nodes for keys ≤ m and > m
    insert new interval border in parent
    return parent (for recursion)
- node is larger than B, so the split results in two nodes of size ≥ B/2
- new interval border in parent may lead to overrun → recursion
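Insertion with splitting can be sketched as below. This is an in-memory illustration under several assumptions: B = 4, leaves store only keys, the size of a leaf is its number of keys and the size of an inner node is its number of children, and the border passed to the parent is the smallest key of the right half (so that k_i ≤ k < k_{i+1} keeps holding). All names are hypothetical.

```python
from bisect import bisect_right, insort
from dataclasses import dataclass, field

B = 4  # assumed maximum node size


@dataclass
class Leaf:
    keys: list = field(default_factory=list)


@dataclass
class Inner:
    keys: list
    children: list


def size(node):
    return len(node.keys) if isinstance(node, Leaf) else len(node.children)


def split(node):
    """Split an overfull node; return (left, border, right)."""
    mid = len(node.keys) // 2
    if isinstance(node, Leaf):
        left, right = Leaf(node.keys[:mid]), Leaf(node.keys[mid:])
        return left, right.keys[0], right
    # Inner node: the middle key moves up and becomes the new border.
    left = Inner(node.keys[:mid], node.children[:mid + 1])
    right = Inner(node.keys[mid + 1:], node.children[mid + 1:])
    return left, node.keys[mid], right


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    path, node = [], root
    while isinstance(node, Inner):  # descend to the leaf for key
        i = bisect_right(node.keys, key)
        path.append((node, i))
        node = node.children[i]
    insort(node.keys, key)
    while size(node) > B:           # repair overruns bottom-up
        left, border, right = split(node)
        if path:
            parent, i = path.pop()
            parent.keys.insert(i, border)
            parent.children[i:i + 1] = [left, right]
            node = parent
        else:                       # the root itself was split
            root = node = Inner([border], [left, right])
    return root


def leaf_keys(node):
    """All keys in left-to-right leaf order."""
    if isinstance(node, Leaf):
        return list(node.keys)
    return [k for child in node.children for k in leaf_keys(child)]


def leaf_depths(node, depth=0):
    """Set of distances from node to the leaves below it."""
    if isinstance(node, Leaf):
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in node.children))


root = Leaf()
for k in [20, 4, 11, 7, 30, 1, 14, 18, 24, 27, 9, 5]:
    root = insert(root, k)
print(leaf_keys(root))    # [1, 4, 5, 7, 9, 11, 14, 18, 20, 24, 27, 30]
print(leaf_depths(root))  # {2}
```

Because a split only ever adds a level at the root, all leaves keep the same distance to the root.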
11, B-Trees: deleting elements
- analogous to insertion
- problem: nodes can shrink below size B/2
- repair by merging into the parent node
- if this causes an overrun in the parent node: repair analogous to insertion
12, Buffer Trees
- so far: every update causes read/write operations and possibly a reordering of the tree
- buffer trees avoid frequent updates by buffering operations
- every node has a buffer of pending operations
- when a buffer overruns, it is flushed:
  - load content into memory
  - sort and execute operations on subtrees
  - operations on subtrees are written to the corresponding buffers
- new updates are placed in the root buffer
- before balancing operations, the buffers of the involved nodes are flushed
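The buffering idea can be sketched with a toy node class. This is a simplified in-memory model, not a full buffer tree: the tree shape is fixed, there is no rebalancing, leaves keep their keys in a Python set, and the class name `BufferedNode`, its `capacity` parameter and the operation names "insert" and "delete" are assumptions for this illustration.

```python
from bisect import bisect_right


class BufferedNode:
    def __init__(self, keys=None, children=None, capacity=2):
        self.keys = keys or []          # interval borders (inner nodes)
        self.children = children or []  # empty list marks a leaf
        self.buffer = []                # pending (operation, key) pairs
        self.items = set()              # stored keys (leaves only)
        self.capacity = capacity

    def apply(self, op, key):
        """Record an operation; flush only when the buffer overruns."""
        self.buffer.append((op, key))
        if len(self.buffer) > self.capacity:
            self.flush()

    def flush(self):
        pending, self.buffer = self.buffer, []
        # Stable sort by key: operations on the same key keep their order.
        pending.sort(key=lambda pair: pair[1])
        for op, key in pending:
            if not self.children:       # leaf: execute the operation
                if op == "insert":
                    self.items.add(key)
                else:
                    self.items.discard(key)
            else:                       # inner: pass on to the child buffer
                child = self.children[bisect_right(self.keys, key)]
                child.apply(op, key)

    def flush_all(self):
        """Force all pending operations down to the leaves."""
        self.flush()
        for child in self.children:
            child.flush_all()


left, right = BufferedNode(), BufferedNode()
root = BufferedNode(keys=[10], children=[left, right])
root.apply("insert", 5)
root.apply("insert", 20)
print(left.items)   # set(): nothing has reached the leaves yet
root.apply("insert", 7)
root.flush_all()
print(sorted(left.items), sorted(right.items))  # [5, 7] [20]
```

Each flush moves a whole batch of operations one level down, so the cost of touching a node is shared by all operations in the batch.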
13, Implementing external priority queues
- priority queue
  - insert elements with keys
  - extract element with lowest key (i.e. highest priority)
- can be implemented with a dynamic structure that ensures the order of elements
- B-Trees are an example
14, Implementing external priority queues
- implementation with buffer trees
  - keep the root buffer in memory
  - keep the leftmost leaves in memory
  - all buffers from the root to the leftmost leaves are kept empty
- only the top of the queue is accessed in retrieval
  - top of queue equals leftmost leaves
  - corresponding buffers empty → top of queue is sorted
  - leftmost leaves in memory → top of queue in memory
  - rest of queue is sorted on demand
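The split into a sorted top in memory and a rest that is sorted on demand can be illustrated with a strongly simplified model. It is not a buffer tree: a sorted Python list stands in for the leftmost leaves, an unsorted list stands in for everything buffered on disk, and the class name `TwoLevelPQ` and the parameter `head_size` are assumptions for this sketch.

```python
from bisect import insort


class TwoLevelPQ:
    def __init__(self, head_size=4):
        self.head = []  # sorted top of the queue, kept "in memory"
        self.rest = []  # unsorted remainder, stands in for disk
        self.head_size = head_size

    def insert(self, key):
        # Invariant: every key in head is <= every key in rest.
        if self.head and key < self.head[-1]:
            insort(self.head, key)
            if len(self.head) > self.head_size:
                self.rest.append(self.head.pop())  # spill largest top key
        else:
            self.rest.append(key)                  # just buffer it

    def extract_min(self):
        if not self.head:
            # Top is used up: sort the rest on demand and refill the top.
            self.rest.sort()
            self.head = self.rest[:self.head_size]
            self.rest = self.rest[self.head_size:]
        if not self.head:
            raise IndexError("extract_min from empty queue")
        return self.head.pop(0)


pq = TwoLevelPQ()
for key in [5, 3, 8, 1, 9, 2, 7]:
    pq.insert(key)
print([pq.extract_min() for _ in range(7)])  # [1, 2, 3, 5, 7, 8, 9]
```

Most insertions only append to the remainder, and sorting work happens only when the in-memory top runs empty.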