Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014
2 Scale, Security, Schema
3 Scale
4 to scale 1 - (vt) to change the size of something
5 let's scale the cluster up to twice the original size
6 to scale 2 - (vi) to function properly at a large scale
7 Accumulo scales
8 What is Large Scale?
9 Notebook Computer 16 GB DRAM 512 GB Flash Storage 2.3 GHz quad-core i7 CPU
10 Modern Server 100s of GB DRAM 10s of TB on disk 10s of cores
11 Large Scale [chart: laptop, single server, and 10-, 100-, 1,000-, and 10,000-node clusters plotted against capacity in RAM and on disk, spanning roughly 10 GB up to 100 PB]
12 Data Composition [chart: original raw data alongside derivative data (QFDs, indexes), growing month over month from January through April]
13 Accumulo Scales From GB to PB, Accumulo keeps two things low: Administrative effort Scan latency
14 Scan Latency
15 Administrative Overhead [chart: failed machines vs. admin intervention]
16 Accumulo Scales From GB to PB, three things grow linearly: Total storage size Ingest rate Concurrent scans
17 Ingest Benchmark Millions of entries per second
18 AWB Benchmark
19 1000 machines
20 100 M entries written per second
21 408 terabytes
22 7.56 trillion total entries
23 Graph Benchmark
24 1200 machines
25 4.4 trillion vertices
26 70.4 trillion edges
27 149 M edges traversed per second
28 1 petabyte
29 Graph Analysis [chart: graph sizes in billions of edges for Twitter, Yahoo!, Facebook, and Accumulo]
30 Accumulo is designed after Google's BigTable
31 BigTable powers hundreds of applications at Google
32 BigTable serves 2+ exabytes
33 600 M queries per second organization wide
34 From 10 to 10,000
35 Starting with ten machines: 10^1
36 One rack
37 1 TB RAM
38 TB Disk
39 Hardware failures rare
40 Test Application Designs
41 Designing Applications for Scale
42 Keys to Scaling 1. Live writes go to all servers 2. User requests are satisfied by few scans 3. Turning updates into inserts
43 Keys to Scaling Writes on all servers Few Scans
44 Hash / UUID Keys: logical keys (usera:name=Bob, usera:age=43, usera:account=$30, userb:name=Annie, userb:age=32, userb:account=$25, userc:name=Joe, userc:age=59) are each stored under a hashed row ID (af362de4=Bob, b23dc4be=Annie, b98de2ff=Joe, c48e2ade=$30, c7e43fb2=$25, d938ff3d=32, e2e4dac4=59, e98f2eab=43). Uniform writes
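The hash/UUID scheme on this slide can be sketched in a few lines of Python (a toy model for illustration, not the Accumulo client API; the choice of MD5 and the 8-character truncation are assumptions):

```python
import hashlib

def hashed_row_id(key: str) -> str:
    """Derive a uniform row ID by hashing the natural key."""
    return hashlib.md5(key.encode()).hexdigest()[:8]

# Each attribute is written under its own hashed row ID, so a batch of
# writes for one user lands on many different tablet servers.
user_entries = {"usera:name": "Bob", "usera:age": "43", "usera:account": "$30"}
rows = {hashed_row_id(k): v for k, v in user_entries.items()}

# The three entries for usera now sort to three unrelated row IDs.
assert len(rows) == 3
```

Because the hash is uniform, live writes spread evenly over the sorted key space — which is exactly what makes reads for one user scatter, as the next slide shows.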
45 Monitor [screenshot: Accumulo monitor listing participating tablet servers for MyTable, with hosted tablets and ingest rate per server]
46 Hash / UUID Keys: get(usera) must fetch its name, age, and account from rows af362de4, c48e2ade, and e98f2eab — 3 x 1-entry scans on 3 servers
47 Keys to Scaling Writes on all servers Few Scans Hash / UUID Keys
48 Group for Locality: all of a user's attributes share one hashed row ID — af362de4: name=Annie, age=32, account=$25; c48e2ade: name=Joe, age=59; e2e4dac4: name=Bob, age=43, account=$30. Still fairly uniform writes
49 Group for Locality: get(usera) → af362de4: name=Annie, age=32, account=$25 — 1 x 3-entry scan on 1 server
50 Keys to Scaling Writes on all servers Few Scans Grouped Keys
51 Temporal Keys [table: user entries (usera…userd) re-keyed under timestamp row IDs]
52 Temporal Keys [table: newer entries keep arriving under higher timestamps]
53 Temporal Keys [table: every new write lands at the end of the sorted key range] Always write to one server
54 No write parallelism
55 Temporal Keys: get(start to end) — fetching contiguous time ranges uses few scans
56 Keys to Scaling Writes on all servers Few Scans Temporal Keys
57 Binned Temporal Keys [table: row IDs gain a bin prefix — 0_, 1_, 2_ — ahead of the timestamp] Uniform Writes
58 Binned Temporal Keys [table: new entries spread across the 0_, 1_, and 2_ bins] Uniform Writes
59 Binned Temporal Keys [table: each bin holds its own time-sorted run of entries] Uniform Writes
60 Binned Temporal Keys: get(start to end) [table: the same time range is read from each bin] One scan per bin
61 Keys to Scaling Writes on all servers Few Scans Binned Temporal Keys
62 Keys to Scaling Key design is critical Group data under common row IDs to reduce scans Prepend bins to row IDs to increase write parallelism
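The binning trick summarized above can be sketched as follows (a toy model; the bin count and row-ID format are assumptions, not an Accumulo convention):

```python
import hashlib

N_BINS = 3

def binned_row_id(record_id: str, timestamp: str) -> str:
    """Prepend a stable bin to the timestamp: writes arriving 'now'
    spread across N_BINS tablet ranges instead of hammering one server."""
    bin_no = int(hashlib.md5(record_id.encode()).hexdigest(), 16) % N_BINS
    return f"{bin_no}_{timestamp}"

def ranges_for_time_span(start: str, end: str):
    """A time-range query fans out into one range scan per bin."""
    return [(f"{b}_{start}", f"{b}_{end}") for b in range(N_BINS)]

rid = binned_row_id("usera", "20140601120000")
assert rid.endswith("_20140601120000")
assert len(ranges_for_time_span("20140601", "20140602")) == N_BINS
```

The trade-off is explicit: write parallelism scales with the bin count, but every time-range query now costs one scan per bin.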
63 Splits: Pre-split or organic splits. Going from dev to production, you can ingest a representative sample, obtain split points, and use them to pre-split the larger system. Hundreds or thousands of tablets per server is OK. Want at least one tablet per server.
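One simple way to derive split points from a representative sample is to take evenly spaced keys from the sorted sample (a sketch of the idea only — the function name and the even-spacing heuristic are assumptions):

```python
def split_points(sample_keys, n_tablets):
    """Pick evenly spaced keys from a sorted sample to pre-split a table
    into n_tablets roughly equal ranges."""
    keys = sorted(sample_keys)
    step = len(keys) / n_tablets
    return [keys[int(i * step)] for i in range(1, n_tablets)]

# A sample of 1000 zero-padded keys; 4 tablets need 3 split points,
# which land roughly at the quartiles of the sample.
sample = [f"{i:04d}" for i in range(1000)]
splits = split_points(sample, 4)
assert splits == ["0250", "0500", "0750"]
```

The resulting split points can then be fed to the larger table before production ingest begins.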
64 Effect of Compression Similar sorted keys compress well May need more data than you think to auto-split
65 Inserts are fast 10s of thousands per second per machine
66 Updates *can* be fast
67 Update Types Overwrite Combine Complex
68 Update - Overwrite: Performance same as insert. Ignore (don't read) the existing value. Accumulo's VersioningIterator does the overwrite
69 Update - Overwrite RowID Col Value af362de4 name Annie userb:age -> 34 af362de4 age 32 af362de4 account $25 c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
70 Update - Overwrite RowID Col Value af362de4 name Annie userb:age -> 34 af362de4 age 34 af362de4 account $25 c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
71 Update - Combine: Things like X = X + 1. Normally one would have to read the old value to do this, but Accumulo Iterators allow multiple inserts to be combined at scan time or at compaction time. Performance is the same as inserts
72 Update - Combine RowID Col Value af362de4 name Annie userb:account -> +10 af362de4 age 34 af362de4 account $25 c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
73 Update - Combine RowID Col Value af362de4 name Annie userb:account -> +10 af362de4 age 34 af362de4 account $25 af362de4 account $10 c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
74 Update - Combine RowID Col Value af362de4 name Annie af362de4 age 34 af362de4 account $25 af362de4 account $10 getaccount(userb) $35 c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
75 Update - Combine RowID Col Value af362de4 name Annie af362de4 age 34 af362de4 account $35 After compaction c48e2ade name Joe c48e2ade age 59 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
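The combine-at-read-time idea walked through in these slides can be modeled in plain Python (a toy stand-in for Accumulo's combining iterators, e.g. a summing combiner — not real iterator code; cents are used to keep the arithmetic exact):

```python
# Updates are written as plain inserts, with no read-modify-write;
# the reader (or a compaction) merges the partial values.
inserts = []

def add_to_account(row, delta_cents):
    """An 'update' is just another insert — same cost as any write."""
    inserts.append((row, "account", delta_cents))

def scan_account(row):
    """Combine all partial values for the cell at scan time."""
    return sum(v for r, c, v in inserts if r == row and c == "account")

add_to_account("af362de4", 2500)   # initial balance, $25.00
add_to_account("af362de4", 1000)   # +$10.00, written as a second insert
assert scan_account("af362de4") == 3500  # readers see $35.00
```

A compaction would apply the same combining function and rewrite the two partial entries as one $35 entry, exactly as the slide shows.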
76 Update - Complex: Some updates require looking at more data than Iterators have access to, such as multiple rows. These require reading the data out in order to write the new value. Performance will be much slower
77 Update - Complex userc:account = getbalance(usera) + getbalance(userb) RowID Col Value af362de4 name Annie af362de4 age 34 af362de4 account $35 c48e2ade name Joe = 65 c48e2ade age 59 c48e2ade account $40 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
78 Update - Complex userc:account = getbalance(usera) + getbalance(userb) RowID Col Value af362de4 name Annie af362de4 age 34 af362de4 account $35 c48e2ade name Joe = 65 c48e2ade age 59 c48e2ade account $65 e2e4dac4 name Bob e2e4dac4 age 43 e2e4dac4 account $30
79 Planning a Larger-Scale Cluster
80 Storage vs Ingest [chart: ingest rate (millions of entries per second) vs. storage (terabytes) for 1x1TB and 12x3TB disk configurations]
81 Model for Ingest Rates: N - number of machines; S - single-server throughput (entries/second); A - aggregate cluster throughput (entries/second). A = 0.85^(log2 N) * N * S. Expect an 85% increase in write rate when doubling the size of the cluster
82 Estimating Machines Required: N - number of machines; S - single-server throughput (entries/second); A - target aggregate throughput (entries/second). N = 2^(log(A/S) / log(1.7)). Expect an 85% increase in write rate when doubling the size of the cluster
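The model on the two preceding slides and its inverse are easy to sanity-check in code (a direct transcription of the formulas; the 0.85 and 1.7 constants come from the slides):

```python
import math

def aggregate_rate(n, s):
    """A = 0.85^(log2 N) * N * S: each doubling of the cluster
    multiplies throughput by 2 * 0.85 = 1.7 rather than the ideal 2."""
    return (0.85 ** math.log2(n)) * n * s

def machines_needed(a, s):
    """Invert the model: N = 2^(log(A/S) / log 1.7)."""
    return 2 ** (math.log(a / s) / math.log(1.7))

# Doubling from 100 to 200 machines yields a 1.7x rate increase,
# and the inverse recovers the original machine count.
r1 = aggregate_rate(100, 100_000)
r2 = aggregate_rate(200, 100_000)
assert abs(r2 / r1 - 1.7) < 1e-9
assert abs(machines_needed(r1, 100_000) - 100) < 1e-6
```

In other words, the 0.85 factor is a per-doubling efficiency: growth stays linear-ish but each doubling delivers 70% more throughput, not 100%.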
83 Predicted Cluster Sizes [chart: number of machines vs. predicted millions of entries per second]
84 100 Machines: 10^2
85 Multiple racks
86 10 TB RAM
87 100 TB - 1PB Disk
88 Some hardware failures in the first week (burn-in)
89 Expect 3 failed HDs in the first 3 months
90 Another 4 within the first year research.google.com/en/us/archive/disk_failures.pdf
91 Can process the 1000 Genomes data set 260 TB
92 Can store and index the Common Crawl Corpus (commoncrawl.org): 2.8 billion web pages, 541 TB
93 One year of Twitter: 182 billion tweets, 483 TB
94 Deploying an Application Users Clients Tablet Servers
95 May not see the effect of writing to disk for a while
96 1000 machines: 10^3
97 Multiple rows of racks
98 100 TB RAM
99 1-10 PB Disk
100 Hardware failure is a regular occurrence
101 Hard drive failure about every 5 days (average). Will be skewed towards the beginning of the year
102 Can traverse the brain graph 70 trillion edges, 1 PB
103 Facebook Graph 1s of PB xldb2012_wed_1105_dhrubaborthakur.pdf
104 Netflix Video Master Copies 3.14 PB
105 World of Warcraft Backend Storage 1.3 PB wows-back-end-10-data-centers cores/
106 Webpages, live on the Internet 14.3 Trillion total-number-of-websites-size-of.html
107 Things like the difference between two compression algorithms start to make a big difference
108 Use range compactions to effect changes on portions of a table
109 Lay off ZooKeeper
110 Watch Garbage Collector and NameNode ops
111 Garbage Collection > 5 minutes?
112 Start thinking about NameNode Federation
113 Accumulo 1.6
114 Multiple NameNodes [diagram: one Accumulo instance over multiple HDFS clusters, each with its own NameNode and DataNodes]
115 Multiple NameNodes [diagram: multiple NameNodes sharing one set of DataNodes] (Federation; requires Hadoop 2.0)
116 More NameNodes = higher risk of one going down. Can use HA NameNodes in conjunction with Federation
117 10,000 machines: 10^4
118 You, my friend, are here to kick a** and chew bubble gum
119 1 PB RAM
120 PB Disk
121 1 hardware failure every hour on average
122 Entire Internet Archive 15 PB internet-archive-wayback-machine-brewster-kahle
123 A year s worth of data from the Large Hadron Collider 15 PB
124 0.1% of all Internet traffic in PB total-number-of-websites-size-of.html
125 Facebook Messaging Data 10s of PB xldb2012_wed_1105_dhrubaborthakur.pdf
126 Facebook Photos 240 billion High 10s of PB xldb2012_wed_1105_dhrubaborthakur.pdf
127 Must use multiple NameNodes
128 Can tune back heartbeats, periodicity of central processes in general
129 Can combine multiple PB data sets
130 Up to 10 quadrillion entries in a single table
131 While maintaining sub-second lookup times
132 Only with Accumulo 1.6
133 Dealing with data over time
134 Data Over Time - Patterns Initial Load Increasing Velocity Focus on Recency Historical Summaries
135 Initial Load Get a pile of old data into Accumulo fast Latency not important (data is old) Throughput critical
136 Bulk Load RFiles
137 Bulk Loading MapReduce RFiles Accumulo
138 Increasing velocity
139 If your data isn t big today, wait a little while
140 Accumulo scales up dynamically, online. No downtime
141 This is "to scale" in the first sense: changing size
142 Scaling Up Clients Accumulo HDFS 3 physical servers Each running a Tablet Server process and a Data Node process
143 Scaling Up Clients Accumulo HDFS Start 3 new Tablet Server procs 3 new Data node processes
144 Scaling Up Clients Accumulo HDFS master immediately assigns tablets
145 Scaling Up Clients Accumulo HDFS Clients immediately begin querying new Tablet Servers
146 Scaling Up Clients Accumulo HDFS new Tablet Servers read data from old Data nodes
147 Scaling Up Clients Accumulo HDFS new Tablet Servers write data to new Data Nodes
148 Never really seen anyone do this
149 Except myself
150 20 machines in Amazon EC2
151 to 400 machines
152 all during the same MapReduce job reading data out of Accumulo, summarizing, and writing back
153 Scaled back down to 20 machines when done
154 Just killed Tablet Servers
155 Decommissioned Data Nodes for safe data consolidation to remaining 20 nodes
156 Other ways to go from 10^x to 10^(x+1)
157 Accumulo Table Export
158 followed by HDFS DistCp to the new cluster
159 Maybe new replication feature
160 Newer Data is Read more Often
161 Accumulo keeps newly written data in memory
162 Block Cache can keep recently queried data in memory
163 Combining Iterators make maintaining summaries of large amounts of raw events easy
164 Reduces storage burden
165 Historical Summaries [chart: unique entities stored vs. raw events processed, April through July]
166 Age-off iterator can automatically remove data over a certain age
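The age-off behavior can be modeled as a simple filter applied when data is read or compacted (a toy model, not the real AgeOffFilter API; the 90-day cutoff is an arbitrary example):

```python
import time

MAX_AGE_SECONDS = 90 * 24 * 3600  # e.g. age off anything older than 90 days

def age_off(entries, now=None):
    """Drop entries past the cutoff as they are read or compacted;
    no explicit deletes are ever issued."""
    now = now if now is not None else time.time()
    return [(k, v, ts) for (k, v, ts) in entries if now - ts <= MAX_AGE_SECONDS]

now = 1_000_000_000
entries = [("a", "new", now - 3600),                # 1 hour old: kept
           ("b", "old", now - 200 * 24 * 3600)]    # 200 days old: dropped
assert [k for k, _, _ in age_off(entries, now=now)] == ["a"]
```

Because the filter runs inside scans and compactions, old data disappears from results immediately and from disk as compactions rewrite files.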
167 IBM estimates 2.5 exabytes of data is created every day what-is-big-data.html
168 90% of available data created in last 2 years what-is-big-data.html
169 25 new 10k node Accumulo clusters per day
170 Accumulo is doing its part to get in front of the big data trend
171 Questions?
172 @aaroncordova
A Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Extending Hadoop beyond MapReduce
Extending Hadoop beyond MapReduce Mahadev Konar Co-Founder @mahadevkonar (@hortonworks) Page 1 Bio Apache Hadoop since 2006 - committer and PMC member Developed and supported Map Reduce @Yahoo! - Core
MinCopysets: Derandomizing Replication In Cloud Storage
MinCopysets: Derandomizing Replication In Cloud Storage Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum Stanford University [email protected], {stutsman,rumble,skatti,ouster,mendel}@cs.stanford.edu
EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics
BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
NextGen Infrastructure for Big DATA Analytics.
NextGen Infrastructure for Big DATA Analytics. So What is Big Data? Data that exceeds the processing capacity of conven4onal database systems. The data is too big, moves too fast, or doesn t fit the structures
NoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
Snapshots in Hadoop Distributed File System
Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any
Testing 3Vs (Volume, Variety and Velocity) of Big Data
Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used
Hadoop Scalability at Facebook. Dmytro Molkov ([email protected]) YaC, Moscow, September 19, 2011
Hadoop Scalability at Facebook Dmytro Molkov ([email protected]) YaC, Moscow, September 19, 2011 How Facebook uses Hadoop Hadoop Scalability Hadoop High Availability HDFS Raid How Facebook uses Hadoop Usages
Comparing Scalable NOSQL Databases
Comparing Scalable NOSQL Databases Functionalities and Measurements Dory Thibault UCL Contact : [email protected] Sponsor : Euranova Website : nosqlbenchmarking.com February 15, 2011 Clarications
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
CDH AND BUSINESS CONTINUITY:
WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable
Big Data Primer. 1 Why Big Data? Alex Sverdlov [email protected]
Big Data Primer Alex Sverdlov [email protected] 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.
Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect
on AWS Services Overview Bernie Nallamotu Principle Solutions Architect \ So what is it? When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze
Data Centric Computing Revisited
Piyush Chaudhary Technical Computing Solutions Data Centric Computing Revisited SPXXL/SCICOMP Summer 2013 Bottom line: It is a time of Powerful Information Data volume is on the rise Dimensions of data
