Efficient File Storage Using Content-based Indexing

João Barreto (joao.barreto@inesc-id.pt)
Paulo Ferreira (paulo.ferreira@inesc-id.pt)
Distributed Systems Group, INESC-ID Lisbon, Technical University of Lisbon
http://www.gsd.inesc-id.pt/

Why Use Content-based Indexing for File Storage?

Extending content-based indexing (e.g. as used by LBFS [MCM01]) from the network transfer of file contents to local file storage is a natural step. It is particularly interesting as storage-efficient support for:
- Versioning file systems
- Resource-constrained embedded file systems

However, access performance must be acceptable and storage gains significant.

Existing Solutions: Chunk Repository Storage Model

To some extent, all existing file storage architectures based on content-based indexing share a core storage model [CN02, QD02, BF04]. File contents are divided into disjoint chunks of data, and each chunk is stored individually, together with a unique hash of its contents, in a repository of chunks. The files themselves are then stored as sequences of references to chunks in the repository; a chunk may be referenced more than once.

[Figure: example using the chunk repository model. Files 1, 2 and 3 are each stored as a sequence of chunk references r_i, and the chunk repository holds (h_i, c_i) pairs. Legend: c_i = chunk contents, h_i = chunk hash, r_i = chunk reference.]
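The chunk repository model described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: fixed-size chunking, the SHA-1 hash, the tiny CHUNK_SIZE and all names (ChunkRepository, store_file, read_file) are assumptions chosen only to keep the example short.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical and unrealistically small, for readability


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Divide data into disjoint fixed-size chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkRepository:
    """Maps each chunk hash h_i to the chunk contents c_i."""

    def __init__(self) -> None:
        self.chunks: dict[bytes, bytes] = {}

    def put(self, chunk: bytes) -> bytes:
        """Store a chunk once and return its reference r_i (here, its hash)."""
        h = hashlib.sha1(chunk).digest()
        self.chunks.setdefault(h, chunk)
        return h

    def get(self, ref: bytes) -> bytes:
        return self.chunks[ref]


def store_file(repo: ChunkRepository, data: bytes) -> list[bytes]:
    """A file is stored as a sequence of references into the repository."""
    return [repo.put(c) for c in split_chunks(data)]


def read_file(repo: ChunkRepository, refs: list[bytes]) -> bytes:
    """Reading always goes through the repository, one lookup per chunk."""
    return b"".join(repo.get(r) for r in refs)
```

In this sketch, two files that contain the same chunk hold the same reference, so the chunk contents are stored only once, but every read has to follow one reference per chunk.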

Problems of the Chunk Repository Storage Model

Storage penalties grow as the chunk size is decreased:
1. Increased chunk meta-data overhead (mainly chunk hashes);
2. Increased internal fragmentation, if the chunk repository is stored on a block-based device;
3. Lower chunk compression ratios.

This trade-off restricts the expected chunk size to relatively high values; hence, existing solutions do not fully exploit the similarity that may exist in a file system.

Sequential read performance is penalized even if the file being accessed shares no chunks with other files, since chunks are stored in a randomly organized repository.
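The first two penalties can be made concrete with a little arithmetic. The sketch below assumes a 20-byte (160-bit) hash per chunk and assumes that each chunk occupies a whole number of device blocks; both are illustrative assumptions, not figures from the cited systems, and the function names are hypothetical.

```python
HASH_BYTES = 20  # assumption: one 160-bit hash (e.g. SHA-1) stored per chunk


def hash_overhead(chunk_size: int, hash_bytes: int = HASH_BYTES) -> float:
    """Fraction of extra storage spent on the hash of one chunk."""
    return hash_bytes / chunk_size


def wasted_bytes(chunk_size: int, block_size: int) -> int:
    """Internal fragmentation if a chunk occupies whole blocks (assumption)."""
    blocks = -(-chunk_size // block_size)  # ceiling division
    return blocks * block_size - chunk_size
```

Under these assumptions, a 4096-byte chunk spends about 0.5% of its size on its hash, while a 128-byte chunk spends over 15%, which is why small chunk sizes are costly in this model.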

The Proposed Storage Model

Two distinguishing principles:
1. If a file shares no chunks with the rest of the file system, it is stored in plain form. Subsequent files that share chunks with that file reference the corresponding portions of its plain contents.
2. Hashes of chunks are not stored permanently along with the contents of the file system. Detecting content similarity for new file system data therefore requires indexing the whole contents of the file system, which is performed periodically in the background.

[Figure: the previous example under the proposed model. Files 1, 2 and 3 hold plain chunk contents c_i, with chunk references (r_3 and r_n) only where contents are shared. Legend: c_i = chunk contents, r_i = chunk reference.]
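A minimal sketch of the two principles follows. It is an illustration under stated assumptions, not the authors' implementation: chunks are fixed-size, the hash is SHA-1, the transient index is rebuilt on every write instead of periodically in the background, files cannot be overwritten, and the names Literal, Ref and Store are hypothetical.

```python
import hashlib
from dataclasses import dataclass

CHUNK_SIZE = 4  # hypothetical and unrealistically small, for readability


@dataclass(frozen=True)
class Literal:
    data: bytes  # plain contents stored inside the file itself


@dataclass(frozen=True)
class Ref:
    file: str    # file whose plain contents are referenced
    offset: int  # byte offset into that file
    length: int


class Store:
    def __init__(self) -> None:
        self.files: dict[str, list] = {}

    def _build_index(self) -> dict[bytes, tuple[str, int]]:
        """Transient hash index over plain contents; never stored."""
        index: dict[bytes, tuple[str, int]] = {}
        for name, segments in self.files.items():
            pos = 0
            for seg in segments:
                if isinstance(seg, Literal):
                    last = len(seg.data) - CHUNK_SIZE + 1
                    for i in range(0, last, CHUNK_SIZE):
                        h = hashlib.sha1(seg.data[i:i + CHUNK_SIZE]).digest()
                        index.setdefault(h, (name, pos + i))
                    pos += len(seg.data)
                else:
                    pos += seg.length
        return index

    def write(self, name: str, data: bytes) -> None:
        if name in self.files:
            raise ValueError("overwriting is not handled in this sketch")
        index = self._build_index()
        segments: list = []
        pending = b""  # unshared bytes accumulated as one plain run
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            hit = index.get(hashlib.sha1(chunk).digest())
            if hit is None:
                pending += chunk
                continue
            if pending:
                segments.append(Literal(pending))
                pending = b""
            segments.append(Ref(hit[0], hit[1], len(chunk)))
        if pending:
            segments.append(Literal(pending))
        self.files[name] = segments

    def read(self, name: str) -> bytes:
        out = bytearray()
        for seg in self.files[name]:
            if isinstance(seg, Literal):
                out += seg.data
            else:
                end = seg.offset + seg.length
                out += self.read(seg.file)[seg.offset:end]
        return bytes(out)
```

A file with no shared chunks ends up as a single Literal, so it is stored and read exactly like a plain file; only shared portions become references.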

Chunk Coalescing

Chunk coalescing optimises cases where consecutive pointers to contiguous shared chunks are detected. Such pointers are replaced by a single multiple-chunk pointer, which reduces storage overhead and allows faster access. In practice, this is comparable (though not always equivalent) to using a larger chunk size wherever a smaller size would yield no additional similarity gains.

[Figure: example of chunk coalescing. Before coalescing, a file holds two consecutive references, r_3 and r_n, to contiguous shared chunks; after coalescing, they are replaced by a single reference r_{3,n}.]
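Chunk coalescing as described above can be sketched as a single pass over a file's segment list. The representation is an assumption made for illustration: a reference is a hypothetical Ref(file, offset, length) record, and plain contents are represented as bytes objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    file: str    # file whose plain contents are referenced
    offset: int  # byte offset into that file
    length: int


def coalesce(segments: list) -> list:
    """Merge consecutive references to contiguous regions of the same file."""
    out: list = []
    for seg in segments:
        prev = out[-1] if out else None
        if (isinstance(seg, Ref) and isinstance(prev, Ref)
                and prev.file == seg.file
                and prev.offset + prev.length == seg.offset):
            # Extend the previous reference instead of storing a new one.
            out[-1] = Ref(prev.file, prev.offset, prev.length + seg.length)
        else:
            out.append(seg)
    return out
```

References are merged only when they are adjacent in the file being stored and contiguous in the file they point to; references separated by plain contents, or pointing to different files, are left unchanged.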

Advantages

When there is no similarity, no storage overhead is imposed, and read performance is identical to that of a regular file system.

The storage penalties that result from dividing files into smaller chunks are eliminated:
- If a chunk is not shared, it incurs no chunk storage overhead.
- If a chunk is shared, the storage overhead is negligible and always compensated by the gains from the increased chunk sharing:
    - hash values are not stored along with contents;
    - chunk coalescing reduces chunk reference overhead.
- With an underlying block-based file system, internal fragmentation is, on average, not affected.
- Data compression may be applied to the unshared portions of each file as a whole, thus achieving higher compression ratios than compressing smaller chunks individually.
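The compression argument can be checked with a small experiment. The sketch below uses zlib from the Python standard library as a stand-in compressor; the choice of compressor, the chunk size and the function name are assumptions, and actual ratios depend on the data.

```python
import zlib


def compressed_sizes(data: bytes, chunk_size: int) -> tuple[int, int]:
    """Return (size compressed as a whole, total size compressed per chunk)."""
    whole = len(zlib.compress(data))
    per_chunk = sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )
    return whole, per_chunk
```

For redundant data and small chunks, the per-chunk total is larger, because each chunk is compressed without the context of its neighbours and carries its own stream overhead.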

Current Status

Partially functional in a simulator. Currently being implemented as a Linux Virtual File System.

References

[BF04] J. Barreto and P. Ferreira. A replicated file system for resource constrained mobile devices. In Proceedings of IADIS International Conference on Applied Computing, 2004.

[CN02] L. Cox and B. Noble. Pastiche: Making backup cheap and easy. In Proceedings of the Fifth ACM/USENIX Symposium on Operating Systems Design and Implementation, Boston, MA, December 2002.

[MCM01] A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In Symposium on Operating Systems Principles, pages 174-187, 2001.

[QD02] S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002.