Decentralized Deduplication in SAN Cluster File Systems

Transcription

1 Decentralized Deduplication in SAN Cluster File Systems Austin T. Clements Irfan Ahmad Murali Vilayannur Jinyuan Li VMware, Inc. MIT CSAIL Transcript of talk given on Wednesday, June 17 at USENIX ATEC 09.

2 Storage Area Networks Storage area networks form the backbone of most virtualized data centers today.

3 Storage Area Networks These provide high-performance, direct, block-level, SCSI access to shared disks to multiple hosts simultaneously and this enables us to run high-performance shared file systems such as VMware s VMFS.

4 Storage Area Networks This, in turn, lets us run lots and lots of virtual machines off our shared file system.

5 Storage Area Networks This, unfortunately, causes a problems because now that we ve got lots and lots of virtual machines on our shared file system, we have lots and lots of duplicate data. This is particularly bad if you re running something like, say, a virtual desktop infrastructure where you re serving maybe hundreds of very similar desktop virtual machines out of the same file system. In fact, we measured just how bad this was by taking 113 virtual machines from a live, corporate virtual desktop infrastructure cluster and deduplicating them.

6 Storage Area Networks 1.3 TB We started with 1.3 terabytes of disk data...

7 Storage Area Networks 237 GB 1.3 TB We wound up with 237 gigabytes.

8 Storage Area Networks 237 GB 5X reduction 1.3 TB That s a 5X reduction in space. Clearly there s a problem here.

9 Deduplication So, what are we going to do about it? Well, we re going to deduplicate the file system. We re going to apply the classic trick of using content hashes to identify duplicate data.

10 Deduplication As you can imagine, we can break up our virtual machines into 4 kilobyte blocks and...

11 Deduplication whenever a virtual machine does a write to one of its blocks then the host that s running that virtual machine...

12 Deduplication is going to compute the hash of that block s new content...

13 Deduplication 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. and it s going to go out to some shared index that s stored on the shared disk. It s going to look up that hash in the index and...

14 Deduplication 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. if it finds that it s already in the index, then it s simply going to reuse the block that s already on the disk.

15 Deduplication 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Otherwise, it s going to have to go and allocate space on the disk, write the block to it...

16 Deduplication 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. and update the index to reflect that block s presence. Now, this works great if you ve got a centralized host with exclusive access to the file system, but it turns out it doesn t work at all in our decentralized setting and there are a couple of reasons for this.

17 The Problem Slow Lots of IO in the write path. Can t cache the index. One is that it s just plain slow. We ve now introduced all sorts of new IO into the write path. It used to be that when a virtual machine did a write, it went pretty directly to the underlying disk; there wasn t much translation involved. Now, we ve introduced all of this extra random access IO to this complex on-disk index structure that needs to happen on every single write. Most systems mitigate this problem by aggressively caching the index on the host, but we can t even do that because in our decentralized setting you run into all sorts of terrible cache coherence problems when you try to do that. And, actually...

18 The Problem Slow Lots of IO in the write path. Can t cache the index. Very Slow Writes require allocation and thus coordination. No hope of disk locality. It s worse than that. It s very slow. Every single write that actually makes it to disk requires allocation and with our shared disk we have to coordinate allocation to prevent hosts from allocating the same space at the same time and coordination is expensive. Furthermore, when you re writing all over your disk, you loose all hope of maintaining disk locality.

19 The Problem Slow Lots of IO in the write path. Can t cache the index. Very Slow Writes require allocation and thus coordination. No hope of disk locality. Hopelessly Slow Multi-host lock contention on shared index. And, actually, it s just hopelessly slow. We ve got this shared index structure, which means we need to coordinate all of our access to this index structure, which means that we re going to need locks, which means we re going to have lots and lots of horrible lock contention between all these hosts that are trying to update the index. In our particular environment, that s really, really bad because locking is really, really expensive, which means that all of your writes become really, really expensive. Clearly this is just not going to work.

20 Three-Stage Deduplication Decentralized Deduplication To address these problems, we developed a decentralized deduplication system for SAN cluster file systems

21 Three-Stage Deduplication DeDe Or just DeDe for short.

22 Three-Stage Deduplication DeDe DeDe breaks up the deduplication process into three separate stages: write monitoring, local deduplication, and cross-host deduplication. Breaking the problem up like this gives us all sorts of really nice properties.

23 Three-Stage Deduplication DeDe Out-of-band deduplication of live, primary storage Process duplicates efficiently, in large batches Minimize contention on the index Resilient to stale index information Unique blocks remain mutable and sequential = No overhead for blocks that don t benefit from deduplication For one, we can do this on live, primary storage. Remember, this is not an archival system; this is a live file system. We can actually do the deduplication process out-of-band, which means that we can basically do our writes now and worry about deduplication later. Because we re doing all of our deduplication out of band anyways, we can just batch it up and do it in really large, periodic batches, which means that these last two stages can happen just every once in a while, and they can take care of all of the possible duplicates that have accumulated in the mean time and process them very efficiently and in a very large batch. Because we re doing this only periodically, that means that we minimize contention on the index, which minimizes the communication between the hosts. Furthermore, it lets us be resilient to stale index information. Our index can actually be wrong. It can be out of date because we re only updating it every once in a while, which means that we can deal with things like hosts failing when they haven t reported writes to the file system and all sorts of other good things.

24 Three-Stage Deduplication DeDe Out-of-band deduplication of live, primary storage Process duplicates efficiently, in large batches Minimize contention on the index Resilient to stale index information Unique blocks remain mutable and sequential = No overhead for blocks that don t benefit from deduplication We also get the extra neat property that our unique blocks that is, blocks that just have a single reference to them can actually remain mutable we can update them in place when we want to which means that they can also remain sequential on disk. This has the nice implication that precisely the blocks that are not benefiting from deduplication anyways the unique blocks actually have no additional IO overhead because they remain mutable and sequential.

25 Three-Stage Deduplication DeDe These three stages form a pipeline.

26 Three-Stage Deduplication DeDe Every single host in the cluster runs all three of these stages independently.

27 Three-Stage Deduplication DeDe The stages communicate with each other both within the hosts and between the hosts via regular files stored in the shared file system itself.

28 Three-Stage Deduplication DeDe In the first stage, write monitoring, every host independently monitors all of its writes to the file system. It computes the hashes of the blocks as they go out to disk and buffers them up in memory and...

29 Three-Stage Deduplication DeDe f4bea9.. f2a4d2.. 0e7a ba2b.. d5e341.. bc ab ee b1f28.. 9b575b.. 94c9c bc7.. periodically flushes them out to on-disk, per-file write logs. Note that there s no coordination going on here; there s no contention between the hosts either because they re all monitoring their own writes and their all modifying separate files that they all individually have locked. Now, when a given host s write logs grow to be large enough; that is, it s done enough modifications to the file system since it last deduplicated its changes, then...

30 Three-Stage Deduplication DeDe f4bea9.. f2a4d2.. 0e7a ba2b.. d5e341.. bc ab ee b1f28.. 9b575b.. 94c9c bc7.. it fires up the local deduplication stage.

31 Three-Stage Deduplication DeDe f4bea9.. f2a4d2.. 0e7a ba2b.. d5e341.. bc ab ee b1f28.. 9b575b.. 94c9c bc7.. 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. This first goes out and locks the shared, on-disk index of all of the blocks in the file system.

32 Three-Stage Deduplication DeDe f4bea9.. f2a4d2.. 0e7a ba2b.. d5e341.. bc ab ee b1f28.. 9b575b.. 94c9c bc7.. 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Then it picks up all the information from the write logs that is, all the hashes of the recently modified blocks and...

33 Three-Stage Deduplication DeDe ab ee b1f28.. 9b575b.. 94c9c bc7.. 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. it updates the index to reflect all these recent modifications to the file system. In the process, if it finds that some recently modified block now has a hash that s already in the index, that means that it s found a duplicate...

34 Three-Stage Deduplication DeDe ab ee b1f28.. 9b575b.. 94c9c bc7.. 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. and it can eliminate that. Likewise, when this host is done performing local deduplication, it gives up its lock on the index and...

35 Three-Stage Deduplication DeDe ab ee b1f28.. 9b575b.. 94c9c bc7.. 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. some other host can come along and lock the index and consume its own write logs and remove any duplicates in files that it has exclusive access to.

36 Three-Stage Deduplication DeDe 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Of course, there s a catch to this. Local deduplication can only modify the metadata of files that the host running the stage has access to; that it has an exclusive lock on.

37 Three-Stage Deduplication DeDe 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. This means that, when we first find two duplicate blocks this is the first time that we identify two blocks with the same content they might be in files that are locked by two different hosts. This is a decentralized setting; we can t just go mucking around in the metadata of files that we don t have locks on, which means that we need some other approach.

38 Three-Stage Deduplication DeDe? 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9..??? When this does happen, the local deduplication stage produces a merge request. These are stored, like the write logs, along with each individual file in the file system. It basically says, I ve got a block, you ve got a block; I think they re the same. You deal with it.

39 Three-Stage Deduplication DeDe? 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9..??? And then, later on, each host will periodically run a cross-host deduplication stage, which picks up all the merge requests for the files that it has exclusive access to, verifies that the blocks that claim to be duplicates are actually duplicates and deduplicates them.

40 Three-Stage Deduplication DeDe? 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9..??? When we put all three of these stages together, we can deduplicate the entire shared file system with very little coordination between the hosts. In fact, the only time that they really communicate at all is with these merge requests at the very end, and every that is completely asynchronous.

41 Implementation We actually built this system. It runs on top of VMware ESX Server.

42 Implementation We use VMFS as our cluster file system. We had to make a few minor modifications to it, which I ll describe in a moment.

43 Implementation On top of VMFS, we run a write monitoring kernel module.

44 Implementation 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. We also have our shared on-disk index structure.

45 Implementation 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. And, finally, we have a deduplication daemon that runs on top of the whole thing and coordinates all of these stages.

46 VMFS Compare-and-share VMFS already supports copy-on-write blocks, but it doesn t present the interface that we need for deduplication, so we added a new compare-and-share operation to it. The basic idea here is that we provide it with two files that we have exclusively locked; that we have open. We point to the two blocks in those files and we tell the file system, We think these are duplicates; go deal with it.

47 VMFS Compare-and-share? The file system is responsible then for atomically comparing the content of these two blocks and, if they are in fact identical...

48 VMFS Compare-and-share updating the block pointers in these two files to point to one block, sharing it copy-on-write, and eliminating the block that we no longer need.

49 VMFS Compare-and-share DeDe finds duplicates. VMFS eliminates them. This operation means that it s the responsibility of DeDe to find potential duplicates across the entire shared file system and then it hands them off to VMFS. It s VMFS responsibility to actually eliminate the blocks. This divides up the process very nicely. Now we ve got a way to eliminate blocks, but we need their hashes in order to find them.

50 Write Monitor That s the goal of the write monitor.

51 Write Monitor A lightweight kernel module monitors writes, computes hashes The write monitor s just a very lightweight kernel module that sits inbetween the virtual machines and VMFS. It monitors every write to the file system and it computes the hashes of these blocks as they re going out to disk.

52 Write Monitor A lightweight kernel module monitors writes, computes hashes It buffers the write log in userspace before writing it to disk It buffers up these hashes in memory...

53 Write Monitor A lightweight kernel module monitors writes, computes hashes It buffers the write log in userspace before writing it to disk and periodically flushes them out to disk, separating them out into separate write logs for each file.

54 Write Monitor A lightweight kernel module monitors writes, computes hashes It buffers the write log in userspace before writing it to disk Safe to buffer the log because index is resilient It s safe to buffer these up precisely because our index is resilient to stale information. It s actually alright if this host just crashes and looses all of its buffer or if the host is just so overloaded that it has to shed work and so it can t compute these hashes. That s okay. We ll miss deduplication opportunities, but the system will still be correct.

55 Write Monitor A lightweight kernel module monitors writes, computes hashes It buffers the write log in userspace before writing it to disk Safe to buffer the log because index is resilient 150 MB of regular writes 1 MB sequential log write Because we re buffering, we also have the nice property that this involve very little I/O. We can actually absorb about 150 megabytes worth of writes to regular files in the file system before we have to do even one megabyte sequential write to the logs, so this actually introduces very little I/O overhead. Now we ve got a way to gather the hashes of the blocks as we modify them. We need a way to find blocks with identical hashes.

56 The Index 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Map from hashes to block locators, list sorted by hash That s what the index is for. This is just a map from content hashes to block locations, just like you d expect. It s actually stored on the shared file system as just a simple sequential list, sorted by hash. This sounds overly simple, but because we re always updating it in these really large batches and these updates are scattered all over the index, it turns out this is actually substantially more efficient that using a much more complicated random access index structure.

57 The Index 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared Map from hashes to block locators, list sorted by hash As I mentioned earlier, we actually distinguish between unique blocks and shared blocks so that we can keep unique blocks mutable and this fact is reflected in the index. Each index entry indicates whether it s a unique block or a shared block and we point to these two different types of blocks differently.

58 The Index 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared Map from hashes to block locators, list sorted by hash Unique blocks are located in files and remain mutable For unique blocks, we simply store the i-number of some existing file and the offset of the block within the file.

59 The Index 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared Map from hashes to block locators, list sorted by hash Unique blocks are located in files and remain mutable This is, of course, backed by some actual block stored somewhere on the disk, but its entirely up to the file system to resolve this block offset into a block address on the disk, just like it would for any regular file system access.

60 The Index 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared Map from hashes to block locators, list sorted by hash Unique blocks are located in files and remain mutable Shared blocks are somewhat more complicated. We can t just use the block address of the block on disk because that actually causes all sorts of really hairy problems. And we don t really want to point to all of the files that contain the block because those could change at any time without us knowing, which means that we d have to go hunting for the shared block whenever we needed it.

61 The Index 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared Map from hashes to block locators, list sorted by hash Unique blocks are located in files and remain mutable A virtual arena stores COW references to all shared blocks So, instead, we introduce a new file. We call it a virtual arena. This virtual arena s just a regular file, just like any other file in the file system, except that...

62 The Index 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared Map from hashes to block locators, list sorted by hash Unique blocks are located in files and remain mutable A virtual arena stores COW references to all shared blocks it consists exclusively of copy-on-write references to blocks that already exist in other files. Thus, in some sense, it contains all of the shared data in the file system, without actually consuming any space other than just the file system metadata that it needs.

63 The Index 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared Map from hashes to block locators, list sorted by hash Unique blocks are located in files and remain mutable A virtual arena stores COW references to all shared blocks Now, whenever we need to refer to a shared block, it s actually pretty simple. We just refer to it by its offset in the virtual arena. Again, the file system is responsible for resolving that offset into an actual block address. We never use the block addresses themselves. So, now we ve got a way to gather the hashes of the blocks that we ve modified. We ve got a way to locate identical blocks.

64 Indexing and Duplicate Elimination 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared Now we need a way to actually combine this all together and perform deduplication.

65 Indexing and Duplicate Elimination 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared Here we have some host that s running some virtual machine and that means...

66 Indexing and Duplicate Elimination 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared that it has an exclusive lock on that virtual machine s files. At some point it s going to decide that it s made enough modifications to the file system and it s time to go hence forth and deduplicate.

67 Indexing and Duplicate Elimination 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. 6cd c.. c277d6.. Unique Shared At that point it picks up the write logs for the files that it has open for this particular virtual machine which contain the hashes of all recently modified blocks.

68 Indexing and Duplicate Elimination 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea c.. 6cd412.. c277d6.. Unique Shared It sorts all those entries by hash. Now it s going to simply walk down the index structure, walk down this sorted list of updates, and merge the two structures together.

69 Indexing and Duplicate Elimination 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea c.. 6cd412.. c277d6.. Unique Shared So we start walking down and we see that our first hash occurs right at the beginning. And, actually, it s not in the index yet, which means that it s a new, unique block. There s only a single reference to it.

70 Indexing and Duplicate Elimination 0e7a ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea c.. 6cd412.. c277d6.. Unique Shared So we make some space in the index for it...

71 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. 6cd412.. c277d6.. Unique Shared slip it in...

72 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. 6cd412.. c277d6.. Unique Shared and add a pointer to the block in our file. That s all that we have to do. Just that single record append to the index. We don t have to modify any metadata or anything. Furthermore, this block remains mutable; we can update this block in place. However, that also means that our unique index entries are just hints. This block can be modified at any time. The host doesn t have to update the index to reflect the fact that it has a new hash, which means that every time that we use one of these entries, we have to verify that it s actually correct. This is how our index is resilient to stale information.

73 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. 6cd412.. c277d6.. Unique Shared Now we can continue moving down the index. We see that our next hash fits in here and it s already in the index. In fact, it s in the index as a unique block, which means that this is our first duplicate of this block.

74 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. c277d6.. Unique Shared Because it s a unique block in the index, that means that it just points to some existing file. The problem is that that file might be locked by some other host, which means that, again, we can t go mucking around with its metadata.

75 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. c277d6.. Unique Shared However, we know that these two blocks are duplicates, which means that we have to do something about it. Now, we do know that we ve got an exclusive lock on the file we re currently deduplicating, which means that we re free to muck around with its metadata.

76 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. c277d6.. Unique Shared We can merge its block into the virtual arena, sharing it copy-on-write.

77 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. c277d6..? Unique Shared Then we can update the index to point to the virtual arena. Now we can post a merge request, telling the other host, When you get around to it, I think I ve got a duplicate block for you. Check it out and deal with it yourself. At some point in the near future, presumably, that second host is going to check for any merge requests for file that it has exclusive locks on...

78 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. c277d6.. Unique Shared pick them up, verify that the block is, in fact, still a duplicate; remember that it s unique, so it could change. If it is still a duplicate, it will rewrite the pointer to point to the shared block. We re still scanning this index, so let s go back to that.

79 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. c277d6.. Unique Shared We keep moving down the index and we see that our last hash fits in here. Again, it s in the index already, but it s a shared block. This makes things a lot easier.

80 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared We ve already got it in the virtual arena. We already know that it s copy-on-write.

81 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared We just have to take the current block that we ve got and...

82 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared update the metadata of that file to point to the existing shared block.

83 Indexing and Duplicate Elimination 0e7a c.. 15ba2b.. 6cd412.. ab bc c277d6.. d5e341.. f2a4d2.. f4bea9.. Unique Shared And we re done. And look! I ve just saved a bunch of space on my storage array.

84 Phew. Okay. So what does this all get us?

85 Evaluation There are a couple of important questions we have to ask.

86 Evaluation How much space does DeDe save? One is: How much space does this actually save us?

87 Evaluation How much space does DeDe save? How much overhead does DeDe introduce? Also, how much overhead does all this complexity that we just introduced add?

88 Evaluation How much space does DeDe save? How much overhead does DeDe introduce? How fast can DeDe deduplicate? And, finally, how fast can we actually deduplicate? We ve got to be able to keep up with the file system or we re going to start missing deduplication opportunities.

89 Space Savings: VDI Cluster Corporate Virtual Desktop Infrastructure cluster Desktop XP VM s 6 12 months of active use Originally cloned from small number of base images... In order to measure the space savings of this we took, as I said, a corporate virtual desktop infrastructure. This consisted of lots of XP desktop virtual machines that have been in active use for six to twelve months, which means that they actually have quite a bit of data in them. However, they were all originally cloned from a small set of original base images, which means that there s still a lot of similarity between them. We believe that this is actually pretty representative of how people usually run clusters like this.

90 Space Savings: VDI Cluster Corporate Virtual Desktop Infrastructure cluster Desktop XP VM s 6 12 months of active use Originally cloned from small number of base images 1.9 TB 113 VM s... In total, we gathered up 113 virtual machines, which was 1.9 terabytes of disk data. We treat zero blocks specially because there are just so darn many of them...

91 Space Savings: VDI Cluster Corporate Virtual Desktop Infrastructure cluster Desktop XP VM s 6 12 months of active use Originally cloned from small number of base images 1.9 TB 1.3 TB 113 VM s... which means that we re actually interested in deduplicating just 1.3 terabytes of data.

92 Space Savings: VDI Cluster 1.3 TB I ve already kind of spoiled the punch line, but for anybody who wasn t paying attention, we started with this 1.3 terabytes of data...

93 Space Savings: VDI Cluster 237 GB 1.3 TB and reduced it to 237 gigabytes of data. But now we can look at what s actually left.

94 Space Savings: VDI Cluster 237 GB 1.3 TB 173 GB unique 61 GB shared If we zoom in on this, we see that only 61 gigabytes of this is actually in shared data. That means that all of that space we just eliminated was just repetitions of those 61 gigabytes of data. In fact, some of these blocks are repeated as many as 200,000 times. That s 800 megabytes wasted on a single 4 kilobyte block. It s ridiculous. And then we have 137 (sic, 173) gigabytes of unique data left over. Now, you re probably wondering... we ve added this index and we ve added this virtual arena and all of this extra stuff. What s the overhead? How much space did we add to this? Did we actually save any space in the end? Actually, the overhead is already on the slide, it was just a little too small for me to label.

95 Space Savings: VDI Cluster 237 GB 1.3 TB 2.7 GB 173 GB unique 61 GB shared It s this 2.7 gigabyte blip over here.

96 Space Savings: VDI Cluster 237 GB 1.3 TB 2.7 GB 173 GB unique 61 GB shared 1.3 GB index file 194 MB v. arena 1.1 GB FS metadata If we zoom in on that, we see that 1.3 gigs of that is the index file. That s all that it takes to index all of the blocks in the file system. 194 megs of that is the virtual arena because it needs the file system metadata to point to all of these blocks. And 1.1 gigs of that is additional, new file system metadata because we had to break all of our blocks into 4 kilobyte blocks. You can see the paper for more details on that. So, in terms of space savings, this is actually really great and the space overhead is very small. What about the runtime overhead?

97 Runtime Effects Write monitoring Disk array caching We very intentionally designed DeDe to do as little as possible in-line precisely to reduce the runtime overhead of it. But we still have to worry about the write monitor, which happens in-line. Also, we have other runtime effects that happen from things like disk array caching. We did a number of benchmarks to measure the runtime overheads.

98 Runtime Effects Write monitoring Disk array caching We did all of these from within a virtual machine so this is all VM I/O just like everything else running on top of a host that was...

99 Runtime Effects Write monitoring Disk array caching EMC CLARiiON CX3-40 attached to a gigantic EMC CLARiiON CX3-40 storage area network.

100 Runtime Overhead: Write Monitoring Worst-case benchmark: 100% sequential write IO, No computation To measure the overhead of the write monitor, we did an absolute worst-case benchmark. This is running inside a virtual machine, performing 100% purely sequential write I/O to the virtual disk with as little computation as we could possibly muster. Absolute worst-case for the write monitor.

101 Runtime Overhead: Write Monitoring Worst-case benchmark: 100% sequential write IO, No computation Baseline Write Monitor CPU 33% 220% And, actually, it s not so great. We do consume a lot of CPU computing hashes. The good news is that this is a totally unrealistic workload. Any realistic workload will actually be performing writes (sic, reads), which don t have any overhead. It ll be performing computation, which means that the gap will actually be much smaller. So this is really worst case.

102 Runtime Overhead: Write Monitoring Worst-case benchmark: 100% sequential write IO, No computation Baseline Write Monitor CPU 33% 220% Bandwidth (MB/s) Latency (ms) The other good news is that we actually didn t have any impact on I/O. Whether or not we re running the write monitor, we re still getting 233 megs per second bandwidth to the disk and 8.6 milliseconds of latency.

103 Runtime Overhead: Write Monitoring Worst-case benchmark: 100% sequential write IO, No computation Baseline Write Monitor CPU 33% 220% Bandwidth (MB/s) Latency (ms) (See paper for database application benchmark) And, again, the overheads that we do have here are much less for a realistic application. You can see the paper for our analysis of a database application benchmark that we ran with the write monitor.

104 Runtime Gains: Disk Array Caching Reduced storage footprint Better caching Less IO On the flip side, we can actually improve runtime performance by better disk array caching. The basic idea here is that we ve just reduced, substantially, our storage footprint, which means that we get much better cache locality, which means that we have to go out to the disk platters much less often.

105 Runtime Gains: Disk Array Caching Reduced storage footprint Better caching Less IO Average boot time (secs) VDI "boot storm" # VMs booting concurrently To measure this, we ran a VDI boot storm benchmark, which is where you run between one and twenty identical virtual machines, boot them all up at the same time, and you see how long they take to boot on average.

106 Runtime Gains: Disk Array Caching Reduced storage footprint Better caching Less IO Average boot time (secs) VDI "boot storm" Full copies # VMs booting concurrently When we do this with fully copied virtual machines, you see, more or less, what you d expect. This is a very I/O-bound workload, which means that when you re running 20 virtual machines, you re doing 20 times as much I/O, which means it takes substantially longer to boot all of this up than one virtual machine.

107 Runtime Gains: Disk Array Caching Reduced storage footprint Better caching Less IO Average boot time (secs) VDI "boot storm" Full copies Deduplicated # VMs booting concurrently If we run deduplicated virtual machines, we see that the line is virtually flat. That s because all the way across the board, whether you re booting one VM or twenty VM s, you re doing basically one VM s worth of I/O. So, for particular workloads, such as VDI boot storms, this is actually really great. We can really improve the performance.

108 Out-of-band Deduplication Rate Index scan COW sharing The other thing that we have to worry about is the out-of-band deduplication rate. We have to be able to keep up with the file system if we expect to be able to deduplicate the file system. Remember that, because we re resilient to stale index entries, we can actually shed load if we really need to, but, it s best if we don t. There are two steps to the deduplication process that we really have to worry about the overhead of. One of them is scanning the index and the other is, when we actually find a duplicate block, sharing it copy-on-write.

109 Out-of-band Deduplication Rate Index scan COW sharing 6.6 GB/sec Virtually no cost for unique blocks For the index scan process, we can process 6.6 gigabytes worth of blocks per second because this is just a purely sequential scan through the index. It s very fast. The nice thing about this is that this is the only thing we have to do for unique blocks. This means that, if you have new or modified unique blocks, we can deal with 6.6 gigabytes worth of those blocks per second. Virtually no overhead for unique blocks.

110 Out-of-band Deduplication Rate Index scan 6.6 GB/sec COW sharing 2.6 MB/sec Virtually no cost for unique blocks COW sharing, on the other hand, is not quite so fast. That we can only do at 2.6 megabytes per second, so this is clearly our bottleneck.

111 Out-of-band Deduplication Rate Index scan 6.6 GB/sec COW sharing 2.6 MB/sec (It s a prototype!) Virtually no cost for unique blocks But, I ll remind you that this is just a prototype. We can substantially improve this number and we already know a number of ways we could do that. Think of this as a lower bound.

112 Out-of-band Deduplication Rate Index scan 6.6 GB/sec COW sharing 2.6 MB/sec (It s a prototype!) Virtually no cost for unique blocks 9 GB of new shared blocks per hour (And provisioning can be special-cased) However, even if we can only share 2.6 megs per second, that still means that we can deal with 9 gigabytes of newly shared blocks per hour. In something like a virtual desktop infrastructure, which is really what we re targeting, that s probably sufficient, actually, because you don t get new shared blocks all that often. The most common source of lots and lots of duplicate blocks in very short time is provisioning operations and those we can special case and do very quickly. So, to wrap all of this up, deduplication is very effective at reducing space. It has some overheads, although for some workloads it actually improves performance. The deduplication rate is acceptable, we believe even though it s just a prototype for realistic workloads.

113 Related Work Centralized archival Venti Data Domain Foundation Centralized primary storage NetApp ASIS Microsoft Single Instance Store Distributed Farsite SAN with Coordinator DDE There s quite a bit of related work out there. Deduplication s been around for a long time. Although most of it is actually archival systems: secondary, backup systems. Other than the idea of using content hashes to identify duplicate blocks and indexes, live file systems are really a different beast. More recently, a lot of work has appeared dealing with live file systems; however, it all requires either a central host and/or exclusive access to the file system, whereas, on the other hand...

114 Conclusion Decentralized, out-of-band, live file system deduplication. DeDe is a completely decentralized system. It performs all deduplication out of band for performance, and it operates on a live, decentralized file system.

115 Conclusion Decentralized, out-of-band, live file system deduplication. Deduplication is effective. In conclusion, deduplication is really effective, especially in environments like this.

116 Conclusion Decentralized, out-of-band, live file system deduplication. Deduplication is effective. Deduplication is hard. Unfortunately, in environments like this, deduplication is also really hard.

117 Conclusion Decentralized, out-of-band, live file system deduplication. Deduplication is effective. Deduplication is hard. Three-stage deduplication has only modest performance overhead. However, using this three-stage approach where we split out write monitoring from local deduplication and from cross-host deduplication, we can actually perform deduplication in this difficult environment, with only modest performance overhead.

118 Conclusion Decentralized, out-of-band, live file system deduplication. Deduplication is effective. Deduplication is hard. Three-stage deduplication has only modest performance overhead. Thank you. Thank you, and I ll take any questions now.