Big Data & Scripting: storage networks and distributed file systems
in the remainder we use networks of computing nodes to enable computations on even larger datasets
- for a computation, each node works on the part of the dataset that is locally available to it
- computing nodes hold a partial, local copy of the whole dataset
- an optimal scenario distributes the data in advance, using the nodes in parallel for both storage and computation

general setting:
- nodes connected by a network
- each node has external memory (e.g. a hard disk) in addition to internal memory and computing capacity
- in this part we consider only the storage and distribution of data
design issues for storage networks
- Space and Access Balance: even distribution of data to machines
- Availability: implement redundancy and tolerance for data loss
- Resource Efficiency: use resources in a useful way (don't waste space)
- Access Efficiency: provide fast access to stored data
- Heterogeneity: integrate different types of hardware
- Adaptivity: support storage of growing amounts of data
- Locality: minimize the degree of communication for data access
storage networks: model
- n nodes N_1, ..., N_n
- node N_i has capacity C_i
- total capacity: S = Σ_{i=1}^n C_i, i.e. space for S blocks in total
- blocks stored on N_i: F_i (filling state)
- nodes are connected by a network: N_i can send data to N_j for arbitrary i, j
- data is accessed by users from outside:
  - retrieve a set of blocks (for now)
  - retrieve the result of an operation on a set of blocks (later)
balancing problem
consider a simplified scenario with C_i constant, i.e. all nodes have the same capacity, and distribute m blocks to n nodes subject to:
- minimize Σ_i |F_i − m/n| (close to equal distribution)
- minimize max_i F_i (minimize the maximum load)
striping
- all objects are combined into a single stream of data
- divide the data into blocks B_i
- group the blocks into striping units U_i of k blocks each
- store striping unit i on node N_{i mod n} at position i div n

example: n = 4 disks, k = 2 blocks per striping unit (block numbers shown):

             D1      D2      D3      D4
  Stripe 0:  0  1    2  3    4  5    6  7
  Stripe 1:  8  9   10 11   12 13   14 15

advantage: units in one stripe can be read in parallel
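The addressing rule above is a pure calculation; it can be sketched as a small Python function (the function name and signature are my own, not from the slides):

```python
def striping_address(block_index, k, n):
    """Map a block index to (node, position) under striping.

    k: blocks per striping unit, n: number of nodes.
    Striping unit i is stored on node i mod n at unit-position i div n.
    """
    unit = block_index // k          # which striping unit the block belongs to
    node = unit % n                  # round-robin assignment of units to nodes
    position = (unit // n) * k + (block_index % k)  # block offset on that node
    return node, position

# reproducing the example layout (n = 4, k = 2):
# block 8 is the first block of unit 4, which lands on node 0 at position 2
print(striping_address(8, k=2, n=4))  # → (0, 2)
```

Note that the mapping is stateless: any client can compute the location of any block without consulting a directory, which is what makes the scheme so simple.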
striping: size of the striping unit k?
assumptions:
- operations tend to involve adjacent blocks, e.g. one big file (a large csv table) spanning several blocks
- several data accesses happen in parallel, e.g. different users using different files

small k:
- high bandwidth (access in parallel)
- but many parallel accesses block each other

large k:
- low bandwidth (most files sit on a single node)
- but parallel accesses (to different files) are distributed among the nodes

the choice of k depends only on the access structure and the average node performance [1]

[1] Chen, Patterson: Maximizing performance in a striped disk array, 1990
striping: advantages/disadvantages
advantages:
- perfectly balanced data distribution
- simple addressing/storage scheme
disadvantages:
- modifying stored data (blocks):
  - block deletion yields holes (new data goes at the end or into holes)
  - fragmentation (requires additional indexing)
- adding and removing nodes (machines):
  - addition could be solved by re-striping
  - removal leads to (partial) redistribution
- solutions exist, but striping is best suited for static scenarios
balancing: centralized approach
idea: one central address and positioning node
- a master coordinates all data access and knows the state of all nodes
- new blocks are stored on the nodes with the lowest filling state
- adding/removing storage nodes is straightforward

data access:
- the client sends an operation to the server (read/write, add, delete)
- the server answers with the address of the node to interact with
- the operation is executed between client and node
centralized approach: advantages/disadvantages
advantages:
- optimal data distribution can be guaranteed
- operations can be synchronized
disadvantages:
- the address and positioning node is a bottleneck
- one centralized dictionary: block id → node
we return to access schemes later
balancing: distribution by hashing
- treat the nodes as bins and use a hash function h() for distribution
- write block B to node N_{h(B)}
- load factor α >> 1 (many blocks per node)

the balls-into-bins model
- usual assumption in hashing: α < 1, avoid collisions
- here: α >> 1, achieve a balanced distribution of blocks (balls) to nodes (bins)
- optimal distribution: m/n blocks (out of m) on each node (out of n)
- question: can we guarantee that the maximum number of elements in one bin is not too large?
balancing: distribution by hashing
when using the distribution of a hash function directly, the fill states of the bins tend to be unbalanced
- experiment: distribute m = 10,000 blocks to n = 100 bins
- expected fill state: 100 blocks per bin
[figure: per-bin deviation of the fill state from m/n, for m = 10,000 blocks in n = 100 bins]
balancing: distribution by hashing
the simple case: m elements, n bins, m > n·log(n), assumption: h(x) uniformly distributed
then with high probability:
- the expected number of elements in each bin is m/n
- the fullest bin holds m/n + Θ(√(m·ln(n)/n)) elements, i.e. with high probability no bin holds much more than the optimal m/n

definition: in a system with some parameter n, an event X appears with high probability if P(X) ≥ 1 − 1/n^α for some constant α > 0; similar cases are often denoted as P(X) = 1 − o(1)
balancing: greedy improvement
- the expected load of O(m/n) per node is good
- but bins with higher load can block computations and data access

improvement: greedy(d)
- for each block, choose d ≥ 2 candidate nodes N_{i_1}, ..., N_{i_d}
- find b = arg min_{k ∈ {1,...,d}} F_{i_k} (break ties arbitrarily)
- place the block in N_b
- example: consider bin h(B) and the bins to its left and right
- retrieval: recalculate the candidate addresses and test all of them (in parallel)
balancing: greedy improvement
experiment: comparing the default choice and the greedy improvement
- m = 10,000 blocks in n = 100 bins (2 alternative bins in greedy)
- each greedy insert uses the bin among h(B)−1, h(B), h(B)+1 with minimal fill state
[figure: per-bin deviation of the fill state from m/n, direct placement vs. greedy]
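An experiment of this kind is easy to reproduce. The sketch below is my own simulation, not the lecture's code: a uniform random choice stands in for the hash function, and the greedy variant inspects the neighbouring bins h(B)−1, h(B), h(B)+1 exactly as described above.

```python
import random

def distribute(m, n, greedy=False, seed=0):
    """Distribute m blocks into n bins and return the fill states.

    greedy=False: place each block directly into bin h(B).
    greedy=True:  inspect bins h(B)-1, h(B), h(B)+1 (mod n) and place
                  the block into the one with the lowest fill state.
    A seeded uniform random choice stands in for the hash function.
    """
    rng = random.Random(seed)
    fill = [0] * n
    for _ in range(m):
        h = rng.randrange(n)
        if greedy:
            candidates = [(h - 1) % n, h, (h + 1) % n]
            h = min(candidates, key=lambda b: fill[b])
        fill[h] += 1
    return fill

direct = distribute(10_000, 100, greedy=False)
greedy = distribute(10_000, 100, greedy=True)
print("max load direct:", max(direct), " max load greedy:", max(greedy))
```

With these parameters the greedy maximum load typically lands much closer to the optimum m/n = 100 than direct placement, mirroring the figure on the slide.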
analysis of greedy(d) [2]
theorem (maximal load): insert m blocks into n nodes using greedy(d); then with high probability
  max_i F_i = m/n + ln(ln(n))/ln(d) + Θ(1)
theorem (number of overloaded bins): let γ be a suitable constant; if m balls are distributed into n bins using strategy greedy(d), then with probability > 1 − 1/n at most n·exp(−d^i) bins have load > m/n + i + γ
in words:
1. the maximal load is not too extreme
2. only few bins with much more than the optimal load exist

[2] cf. Berenbrink, Czumaj, Steger, Vöcking: Balanced allocations: the heavily loaded case, 2000
heterogeneity
implicit assumption above: C_i = C_j, i.e. all nodes have equal capacities
- a useful assumption, but not realistic

heterogeneity: arbitrary hardware for the nodes
- in general C_i ≠ C_j (differing capacities)
- load balancing becomes more complicated
- but more freedom in hardware choice, e.g. upgrading with ever larger nodes
heterogeneity: virtual buckets
the hashing approach can be extended to heterogeneous settings by subdividing all node capacities into virtual buckets
- choose the largest common storage unit C as the size of a virtual bucket
- the real capacities C_i should be approximately multiples of C: C_i ≈ k_i·C with k_i ∈ ℕ
- every node N_i is split into k_i buckets, so that K = Σ_i k_i is the total number of buckets
- the hash function maps blocks to {1, ..., K} (buckets)
- a second mapping m: {1, ..., K} → {1, ..., n} with |m⁻¹(i)| = k_i maps the K buckets to the n nodes
- the number of buckets per node corresponds to the node's size
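The two-level mapping can be sketched as follows; all names are my own, and Python's built-in `hash` merely stands in for the hash function onto {1, ..., K}:

```python
def build_bucket_map(capacities, unit):
    """Build the second mapping m: bucket index -> node index.

    capacities: list of node capacities C_i
    unit:       common bucket size C
    Each node i contributes k_i = round(C_i / C) buckets, so node i
    appears k_i times in the returned list.
    """
    bucket_to_node = []
    for node, c in enumerate(capacities):
        k_i = max(1, round(c / unit))
        bucket_to_node.extend([node] * k_i)
    return bucket_to_node

def node_for_block(block_id, bucket_to_node):
    """First hash the block to a bucket, then map the bucket to a node."""
    K = len(bucket_to_node)
    bucket = hash(block_id) % K   # stand-in for the hash onto {0, ..., K-1}
    return bucket_to_node[bucket]

# node 0 has twice the capacity of nodes 1 and 2, so it gets 4 of the 8 buckets
buckets = build_bucket_map([400, 200, 200], unit=100)
print(buckets)  # → [0, 0, 0, 0, 1, 1, 2, 2]
```

Because larger nodes own proportionally more buckets, a uniform hash over buckets yields a load proportional to each node's capacity.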
availability: prevent data loss
avoid loss of data, i.e. ensure that stored data remains available

motivational example:
- storage network with N uniform nodes
- probability of node failure within one month: p
- P(node survives a month) = 1 − p
- P(all N nodes survive k months) = (1 − p)^{N·k}
- the failure probability grows exponentially in the number of nodes and in time
- failures will happen eventually and cannot be avoided with fail-safe hardware
- use redundancy to handle failures
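Plugging numbers into the formula above makes the point concrete (the parameter values are my own illustration):

```python
def all_survive(p, n_nodes, months):
    """Probability that all n_nodes survive for the given number of months,
    assuming independent node failures with probability p per node and month:
    (1 - p) ** (n_nodes * months)."""
    return (1 - p) ** (n_nodes * months)

# even a modest 1% monthly failure rate makes a year without any
# failure in a 100-node network essentially impossible:
print(all_survive(0.01, n_nodes=100, months=12))  # prints a value around 6e-06
```

So at this scale the question is not whether a node fails, but how the system copes when it does.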
availability: implementing redundancy
basic principle:
- store additional information (more than only the given data)
- use that information to recover in case of partial data loss

two basic approaches:
- mirroring: store data elements several times
- parity codes: create additional information to recover missing bits
availability: redundancy by mirroring
idea (simple version): for each block, store r duplicates on different nodes
- failure rate for one node: p
- probability of losing a block: p^r
- problem: need r·m space instead of m
- when a node fails: recreate copies of all its blocks from the duplicates
- on updates: all duplicates must be updated
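A quick calculation shows the trade-off between space overhead and loss probability (a sketch with illustrative numbers of my own):

```python
def block_loss_probability(p, r):
    """Probability that all r copies of one block are lost, assuming
    independent node failures with per-node failure probability p."""
    return p ** r

# each extra replica multiplies the space by another factor of m
# but shrinks the loss probability by a factor of p:
for r in (1, 2, 3):
    print(f"r = {r}: loss probability {block_loss_probability(0.01, r):.0e}")
```

With p = 0.01, going from one copy to three drops the loss probability from 1e-02 to 1e-06, at the cost of tripling the storage.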
availability: parity codes
- assume a string of bits s = s_1 s_2 s_3 ... s_n, e.g. 0110101001001110
- parity: p(s) = Σ_i s_i mod 2, here e.g. 0
- if one bit of s is lost, i.e. s' = s_1 x s_3 ... s_n, was x = 1 or x = 0?
- use the parity of the available part:
  x = 0 if p(s') = p(s), and x = 1 otherwise
- one additional bit allows recovering one arbitrary lost bit
- can be extended to larger numbers of missing bits; one example: Hamming code
- store additional parity bits instead of duplicates and restore them on data loss
- often implemented at the hardware level
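The single-bit recovery rule above can be written out directly (function names are my own):

```python
def parity(bits):
    """p(s) = sum of bits mod 2."""
    return sum(bits) % 2

def recover_lost_bit(available_bits, stored_parity):
    """Recover one lost bit: it is 0 if the parity of the remaining bits
    already matches the stored parity, and 1 otherwise."""
    return parity(available_bits) ^ stored_parity

# the bit string from the slide: 0110101001001110, parity 0
s = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
p = parity(s)

# lose bit s_2 (index 1) and recover it from the rest plus the parity bit
rest = s[:1] + s[2:]
print(recover_lost_bit(rest, p))  # → 1, the lost bit
```

The same idea, applied per bit position across the blocks of a stripe, is what RAID-style parity schemes use to survive the loss of one whole disk.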
adaptivity
capacity is constantly extended by adding nodes
- problem: rehashing for every new node is too expensive

idea: adaptive hash function
- a hash function with an adaptive range
- a change of range avoids total reorganization and rearranges only a (small) portion of the input values
- when new nodes are added, only a few blocks have to be rearranged
adaptivity: adaptive hashing
basic idea:
- position the nodes in a space S
- for each block, determine a position in S by a hash function
- store the block on the nearest node
- find the nearest position for an arbitrary point by binary search
- adapt to new/removed nodes by adding/removing points in the space and reassigning the neighboring blocks

problem:
- when a node is removed, all of its blocks go to its neighbor(s)
- when a node is added, it takes a huge load from its neighbors
- refinement: use multiple positions for each node
adaptivity: adaptive hashing
- use the one-dimensional ring [0, 1) as the space (distance measured modulo 1)
- assign k positions P_1^i, ..., P_k^i to each node i
- every block is mapped to a position in [0, 1) by a hash function h

block positioning:
- determine the hash value h(B) for block B
- assign block B to the nearest node by position:
  arg min_i min_j min{ |h(B) − P_j^i|, 1 − |h(B) − P_j^i| }

adding a node: create the new positions for the node, reassign blocks from the neighboring positions
removing a node: reassign its blocks, remove its positions, remove the node
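The scheme above is consistent hashing; a minimal sketch follows. All names are my own, SHA-256 stands in for the hash onto [0, 1), and for simplicity this variant assigns each block to the first node position clockwise rather than the nearest position in either direction; the key property, that adding or removing a node only moves blocks adjacent to its positions, is the same.

```python
import bisect
import hashlib

def ring_pos(key):
    """Map an arbitrary string to a position in [0, 1) (hash stand-in)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class HashRing:
    """Minimal consistent-hashing ring with k positions per node."""

    def __init__(self, k=3):
        self.k = k
        self.points = []  # sorted list of (position, node) pairs

    def add_node(self, node):
        for j in range(self.k):
            bisect.insort(self.points, (ring_pos(f"{node}#{j}"), node))

    def remove_node(self, node):
        self.points = [p for p in self.points if p[1] != node]

    def node_for(self, block):
        # first node position clockwise from h(B); the modulo wrap-around
        # realizes the ring distance
        i = bisect.bisect(self.points, (ring_pos(block),)) % len(self.points)
        return self.points[i][1]
```

Removing a node with this structure reassigns only the blocks that were mapped to its positions; every other block keeps its node, which is exactly the adaptivity the slides ask for.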
adaptivity: adaptive hashing
- the points P_j^i of node i can themselves be determined by hash functions
- for each insertion, a search for the nearest point has to be performed
- until now: homogeneous setting (C_i constant)
- heterogeneous settings: model different node sizes by additional points, i.e. reflect the capacity by a corresponding number of points
- using the virtual buckets approach this leads to a large number of points