BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency


1 BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency
Gabriel Antoniu (1), Luc Bougé (2), Bogdan Nicolae (3)
KerData research team
(1) INRIA Rennes - Bretagne-Atlantique, France
(2) ENS Cachan - Brittany, France
(3) University of Rennes 1, France
Second INRIA-UIUC Joint Workshop, Urbana, December

2 New challenges for large-scale data storage
Scalable storage management for new-generation, data-oriented high-performance applications
- Massive, unstructured data objects (Terabytes)
- Many data objects (10³)
- High concurrency (10³ concurrent clients)
- Fine-grain access (Megabytes)
- Large-scale platforms: large clusters, grids, clouds, petascale machines, desktop grids
Applications: distributed, with high-throughput requirements under concurrency
- Map-Reduce-based data-mining applications
- High-resolution medical image processing
- Data-intensive HPC simulations
- Storage services for cloud infrastructures
- Checkpointing on desktop grids
A new research team at INRIA Rennes: KerData
- Recently created from the PARIS project-team

3 BlobSeer: a BLOB-based approach
Generic data-management platform for huge, unstructured data
- Huge data (TB)
- Highly concurrent, fine-grain access (MB): R/W/A
- Prototype available
- Ph.D. theses: Bogdan Nicolae, Alexandra Carpen Amarie, Diana Moise, Viet-Trung Tran
Key design features
- Decentralized metadata management
- Beyond MVCC: multiversioning exposed to the user
- Lock-free concurrent writes (enabled by versioning)
A back-end for higher-level, sophisticated data management systems
- Short term: highly scalable distributed file systems
- Middle term: storage for cloud services
- Long term: extremely large distributed databases

4 BlobSeer: key design choices
Each blob is fragmented into equally-sized pages
- Allows huge data amounts to be distributed all over the peers
- Avoids contention for simultaneous accesses to disjoint parts of the data block
Metadata: locate pages that make up a given blob
- Fine-grained and distributed
- Efficiently managed through a segment tree over a DHT
Versioning
- Update/append: generate new pages rather than overwrite
- Metadata is extended to incorporate the update
- Both the old and the new version of the blob are accessible
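The page fragmentation above can be illustrated with a minimal sketch. It assumes a fixed page size and a request expressed as a byte range; the name `pages_for_range` and the 64 KB page size are illustrative choices, not part of BlobSeer's API. It shows that accesses to disjoint, page-aligned parts of a blob touch disjoint sets of pages.

```python
PAGE_SIZE = 64 * 1024  # assumed page size in bytes (illustrative)


def pages_for_range(offset, size, page_size=PAGE_SIZE):
    """Return the indices of the pages overlapping bytes [offset, offset + size)."""
    if size <= 0:
        return []
    first = offset // page_size
    last = (offset + size - 1) // page_size
    return list(range(first, last + 1))


# Two clients accessing disjoint, page-aligned parts of the same blob.
client_a = pages_for_range(0, 2 * PAGE_SIZE)
client_b = pages_for_range(2 * PAGE_SIZE, PAGE_SIZE)
```

Since the two page sets do not intersect, the two accesses can be served by different peers without contending for the same page.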

5 BlobSeer: architecture
Clients
- Perform fine-grain blob accesses
Providers
- Store the pages of the blob
Provider manager
- Monitors the providers
- Favors data load balancing
Metadata providers
- Store information about page location
Version manager
- Ensures concurrency control
[Figure: clients, providers, provider manager, metadata providers, version manager]

6 How does a read work?
1. Optionally ask the version manager for the latest published version
2. Fetch the corresponding metadata from the metadata providers
3. Contact providers in parallel and fetch the pages into the local buffer
[Figure: steps I-III between the client, version manager, metadata providers, and providers]
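The three read steps can be sketched as follows. This is a toy model under stated assumptions: the dictionaries `latest_published`, `metadata`, and `providers` are hypothetical in-memory stand-ins for the version manager, the metadata providers, and the data providers, and a thread pool stands in for parallel network requests.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the remote services.
latest_published = 2                                   # version manager state
metadata = {                                           # version -> page index -> (provider, page key)
    1: {0: ("p1", "a"), 1: ("p2", "b")},
    2: {0: ("p1", "a"), 1: ("p3", "c")},
}
providers = {"p1": {"a": b"AAAA"}, "p2": {"b": b"BBBB"}, "p3": {"c": b"CCCC"}}


def fetch_page(location):
    """Fetch one page from the provider that stores it."""
    provider, key = location
    return providers[provider][key]


def read(page_indices, version=None):
    # Step 1: optionally ask the version manager for the latest published version.
    if version is None:
        version = latest_published
    # Step 2: fetch the metadata (page locations) for that version.
    locations = [metadata[version][i] for i in page_indices]
    # Step 3: contact the providers in parallel and fill the local buffer in page order.
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(fetch_page, locations))
```

Passing an explicit `version` reads an older snapshot, which is why step 1 is optional.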

7 How does a write work?
1. Get a list of providers that are able to store the pages, one for each page
2. Contact providers in parallel and write the pages to the corresponding providers
3. Get a version number for the update
4. Add new metadata to consolidate the new version
5. Report that the new version is ready for publication
[Figure: steps I-V between the client, provider manager, providers, version manager, and metadata providers]
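The five write steps can be sketched with a single-writer toy model. Everything here is an assumption made for illustration: `ToyBlobStore` collapses the provider manager, providers, metadata providers, and version manager into one in-memory object, provider selection is round-robin, and metadata is a flat dictionary rather than a distributed tree.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class ToyBlobStore:
    """In-memory stand-in for all BlobSeer services (single writer only)."""

    def __init__(self, provider_names):
        self.providers = {name: {} for name in provider_names}
        self._round_robin = itertools.cycle(provider_names)
        self._page_ids = itertools.count()
        self.metadata = {0: {}}   # version -> {page index: (provider, page id)}
        self.next_version = 1
        self.ready = set()

    def write(self, first_page, pages):
        # Step 1: get one provider per page (round-robin is an assumed policy).
        targets = [(next(self._round_robin), next(self._page_ids)) for _ in pages]

        # Step 2: write the pages to their providers in parallel.
        def store(item):
            (provider, page_id), data = item
            self.providers[provider][page_id] = data

        with ThreadPoolExecutor() as pool:
            list(pool.map(store, zip(targets, pages)))

        # Step 3: get a version number for the update.
        version = self.next_version
        self.next_version += 1

        # Step 4: add metadata for the new version; untouched pages keep their location.
        new_meta = dict(self.metadata[version - 1])
        for i, target in enumerate(targets):
            new_meta[first_page + i] = target
        self.metadata[version] = new_meta

        # Step 5: report that the new version is ready for publication.
        self.ready.add(version)
        return version
```

New pages are always stored under fresh page ids, so the pages of earlier versions are never overwritten and remain readable.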

8 How versioning enables efficient, heavy access concurrency
- Pages are written concurrently by the clients
- Versions are assigned in the order the clients finish writing
- Metadata is written concurrently by the clients
- Versions are published in the order they were assigned
[Figure: timeline of Client #1 and Client #2 interacting with the providers, version manager, and metadata providers, each ending with a publish step]

9 Metadata zoom (1)
Organized as a segment tree
- Each node covers a range of the blob identified by (offset, size)
- The first/second half of the range is covered by the left/right child
- Each leaf corresponds to a page and holds information about its location
[Figure: segment tree over 4 pages with nodes [0, 4], [0, 2], [2, 2], [0, 1], [1, 1], [2, 1], [3, 1]]
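The segment tree described above can be sketched in a few lines. This is an illustrative model, assuming the number of pages is a power of two; the dictionary-based node layout and the placeholder location strings are not BlobSeer's actual data structures.

```python
def build(offset, size):
    """Build a node covering `size` pages starting at page `offset` (size is a power of two)."""
    node = {"range": (offset, size)}
    if size == 1:
        node["location"] = f"provider-for-page-{offset}"   # placeholder page location
    else:
        half = size // 2
        node["left"] = build(offset, half)                 # first half of the range
        node["right"] = build(offset + half, half)         # second half of the range
    return node


def ranges(node):
    """List the (offset, size) ranges of all nodes in breadth-first order."""
    out, queue = [], [node]
    while queue:
        current = queue.pop(0)
        out.append(current["range"])
        queue.extend(current[k] for k in ("left", "right") if k in current)
    return out


def lookup(node, page):
    """Descend from the root to the leaf of `page` and return its location."""
    while "location" not in node:
        left_offset, left_size = node["left"]["range"]
        node = node["left"] if page < left_offset + left_size else node["right"]
    return node["location"]


tree = build(0, 4)
```

For a 4-page blob, `ranges(tree)` yields the same seven (offset, size) pairs as the tree on the slide, and `lookup` visits one node per level.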

10 Metadata zoom (2)
Each node holds versioning information
Write/Append
- Add leaves and build subtree up to the root
- The tree may grow one level
Read: descend from the root towards the leaves
Tree nodes are distributed among metadata providers
Full access concurrency: R/R, R/W, W/W
[Figure: versioned segment trees with node labels [0, 8], [0, 4], [0, 4], [0, 2], [0, 2], [2, 2], [2, 2], [0, 1], [1, 1], [1, 1], [2, 1], [2, 1], [3, 1], [4, 4], [4, 2], [4, 1]]
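A minimal sketch of how a write or append can add leaves and rebuild only the path up to the root, while sharing untouched subtrees with older versions. It is an assumption-laden model: nodes are immutable in-memory tuples linked by object references, whereas the slides state that tree nodes are distributed among metadata providers; the names `Node`, `update`, `write`, and `read` are illustrative.

```python
from collections import namedtuple

Node = namedtuple("Node", "offset size version left right location")


def update(old, offset, size, pages, version):
    """Return the node covering (offset, size) in `version`.

    `pages` maps each written page index to its location. Subtrees that the
    write does not touch are reused from the older tree instead of being copied.
    """
    if not any(offset <= p < offset + size for p in pages):
        return old                                    # untouched: share the old subtree (may be None)
    if size == 1:
        return Node(offset, 1, version, None, None, pages[offset])
    half = size // 2
    left = update(old.left if old else None, offset, half, pages, version)
    right = update(old.right if old else None, offset + half, half, pages, version)
    return Node(offset, size, version, left, right, None)


def write(root, pages, version):
    """Build the tree of a new version, growing the covered range if the write goes past it."""
    size = root.size if root else 1
    while max(pages) >= size:
        size *= 2
        if root:
            # Temporary wrapper: the old root becomes the left child of a larger range.
            root = Node(0, size, root.version, root, None, None)
    return update(root, 0, size, pages, version)


def read(root, page):
    """Descend from the root towards the leaf of `page`; None if the page was never written."""
    node = root
    while node is not None and node.size > 1:
        node = node.left if page < node.offset + node.size // 2 else node.right
    return node.location if node else None


v1 = write(None, {0: "a", 1: "b", 2: "c", 3: "d"}, 1)   # initial 4-page blob
v2 = write(v1, {1: "B", 2: "C"}, 2)                     # overwrite pages 1 and 2
v3 = write(v2, {4: "e"}, 3)                             # append page 4: the tree grows one level
```

Each root is a complete snapshot, so a reader holding `v1` is unaffected by the writers that produce `v2` and `v3`.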

11 How concurrent writes work by example
- Initial version: v = 1
- 2 concurrent writers: gray and black
- Both write their pages independently
- Gray is first: it is enqueued on the version manager and assigned version v2; black follows and gets v3
- Both write the metadata tree nodes independently: black is faster and links to the not-yet-created node B2
- Black finishes first and is marked ready
- Gray finishes next; since it is first in the queue, its root gets published and it is dequeued
- Finally, black becomes first in the queue and is published
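The ordering rule in this example can be sketched with a toy version manager. This is a single-threaded illustration, assuming an in-memory queue; `ToyVersionManager` and its method names are hypothetical, and a real service would need synchronization between concurrent callers.

```python
class ToyVersionManager:
    """Assigns versions in arrival order and publishes them in the same order."""

    def __init__(self, initial_version=1):
        self.last_assigned = initial_version
        self.last_published = initial_version
        self.ready = set()

    def assign(self):
        """Enqueue a writer and hand out the next version number."""
        self.last_assigned += 1
        return self.last_assigned

    def report_ready(self, version):
        """Mark a version ready, then publish every ready version at the head of the queue."""
        self.ready.add(version)
        published = []
        while self.last_published + 1 in self.ready:
            self.last_published += 1
            self.ready.discard(self.last_published)
            published.append(self.last_published)
        return published


vm = ToyVersionManager(initial_version=1)
gray = vm.assign()                        # gray is first in the queue
black = vm.assign()                       # black follows
after_black = vm.report_ready(black)      # black finishes first but must wait for gray
after_gray = vm.report_ready(gray)        # gray is published, then black
```

Readers only ever see `last_published`, so a version becomes visible only after all versions assigned before it.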

14 Evaluation: experimental platform
Implementation
- Custom RPC layer based on Boost ASIO
- Metadata providers rely on a custom simplified DHT
Testbed: Grid'5000
- Used the nodes of two sites: Rennes and Orsay
- Each node: x86_64 architecture, 4GB RAM
- Internode parameters within the same cluster:
  - Bandwidth: 117MB/s with MTU=1500B
  - Latency: 0.1ms

15 Benefits of data decentralization
Presented at Europar

16 Impact of metadata decentralization under heavy pressure
90 storage machines, on each:
- 1 data provider
- 1 metadata provider
90 client machines, on each:
- 4 writers
- Each writer writes 128 consecutive pages of 64KB, 50 times
Represented: total aggregated bandwidth for all writers
Presented at Europar
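The size of this workload follows directly from the parameters on the slide. The sketch below works through the arithmetic, assuming 1 KB = 1024 bytes; the helper `aggregated_bandwidth` is one plausible way to compute the plotted metric from a measured elapsed time and is not taken from the source.

```python
KB = 1024
MB = 1024 * KB

clients, writers_per_client = 90, 4
pages_per_write, page_size, repetitions = 128, 64 * KB, 50

writers = clients * writers_per_client               # concurrent writers in total
bytes_per_write = pages_per_write * page_size        # data written by one write call
bytes_per_writer = bytes_per_write * repetitions     # data written by one writer overall
total_bytes = writers * bytes_per_writer             # data written by the whole experiment


def aggregated_bandwidth(total, elapsed_seconds):
    """Aggregated bandwidth in MB/s, given a measured elapsed time (assumed definition)."""
    return total / MB / elapsed_seconds
```

With these values there are 360 writers, each write call transfers 8 MB, each writer transfers 400 MB, and the experiment writes 144,000 MB in total.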

17 Towards a BLOB-based file system
Goal: build a BLOB-based file system able to cope with huge data and heavy access concurrency in a large-scale environment
Hierarchical approach
- High-level file system metadata management: the Gfarm grid file system
- Low-level object management: the BlobSeer BLOB management system
[Figure: Gfarm and BlobSeer layers]

18 The Gfarm grid file system
The Gfarm file system [University of Tsukuba, Japan]
- A distributed file system designed for working at the Grid scale
- Files can be shared among all nodes and clients
Main components
- Gfarm's metadata server (gfmd: Gfarm management daemon)
- File system nodes (gfsd: Gfarm storage daemon)
- Gfarm clients

19 Why combine Gfarm and BlobSeer?
Gfarm
- POSIX interface
- User management
- GSI support
- File sizes are limited
- Not suitable for concurrent access
- No versioning
BlobSeer
- Lack of POSIX file system interface
- Access concurrency
- Fine-grain access
- Versioning
Gfarm/BlobSeer
- POSIX interface
- User management
- GSI support
- Access concurrency
- Huge file sizes
- Fine-grain access
- Versioning
General idea: Gfarm handles file metadata, BlobSeer handles file data

20 Coupling Gfarm and BlobSeer [1]
The first approach
- Each storage node (gfsd) connects to BlobSeer to store/get Gfarm file data
- The gfsd manages the mapping from Gfarm files to BLOBs
- The gfsd always acts as an intermediary for data transfer
[Figure: steps 1-4 between Gfarm and BlobSeer]

21 Coupling Gfarm and BlobSeer [1]
The first approach
- Each storage node (gfsd) connects to BlobSeer to store/get Gfarm file data
- The gfsd manages the mapping from Gfarm files to BLOBs
- The gfsd always acts as an intermediary for data transfer
Bottleneck!
[Figure: Gfarm and BlobSeer, step 4]

22 Coupling Gfarm and BlobSeer [2]
Second approach
- The gfsd maps Gfarm files to BLOBs, and provides the client with the BLOB ID
- Then, the client directly accesses data in BlobSeer
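The difference between the two coupling approaches can be sketched with toy classes. This is an illustration under assumptions: `ToyBlobSeer`, `ToyGfsd`, the file path, and the BLOB ID are hypothetical, and neither class reflects the real Gfarm or BlobSeer interfaces.

```python
class ToyBlobSeer:
    """Stand-in for BlobSeer: maps a BLOB ID to its bytes."""

    def __init__(self):
        self.blobs = {}

    def read(self, blob_id, offset, size):
        return self.blobs[blob_id][offset:offset + size]


class ToyGfsd:
    """Stand-in for a Gfarm storage daemon that maps files to BLOB IDs."""

    def __init__(self, blobseer, file_to_blob):
        self.blobseer = blobseer
        self.file_to_blob = file_to_blob
        self.bytes_relayed = 0

    def read(self, path, offset, size):
        # First approach: the gfsd fetches the data and relays it to the client.
        data = self.blobseer.read(self.file_to_blob[path], offset, size)
        self.bytes_relayed += len(data)
        return data

    def blob_id(self, path):
        # Second approach: the gfsd only hands out the BLOB ID.
        return self.file_to_blob[path]


blobseer = ToyBlobSeer()
blobseer.blobs[7] = b"0123456789"
gfsd = ToyGfsd(blobseer, {"/data/file": 7})

via_gfsd = gfsd.read("/data/file", 2, 4)                     # data passes through the gfsd
direct = blobseer.read(gfsd.blob_id("/data/file"), 2, 4)     # data bypasses the gfsd
```

Both calls return the same bytes, but only the first one adds to `bytes_relayed`, which is the load that makes the gfsd a bottleneck in the first approach.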

23 Experimental evaluation on Grid'5000 [1]
Access throughput under concurrency
Configuration
- 1 gfmd
- 1 gfsd
- 24 data providers
- Each client accesses 1GB of a 10GB file
- Page size 8MB
Gfarm sequentializes concurrent accesses
Presented at CoreGrid ERCIM Group Workshop,

24 Experimental evaluation on Grid'5000 [2]
Access throughput under heavy concurrency
Configuration (deployed on 157 nodes)
- 1 gfmd
- 1 gfsd
- Each client accesses 1GB of a 64GB file
- Page size 8MB
- Up to 64 concurrent clients
- 64 data providers
- 24 metadata providers
- 1 version manager
- 1 page manager
Presented at CoreGrid ERCIM Group Workshop,

25 Work in progress: introducing versioning in Gfarm/BlobSeer
Clients may access data in a specified file version
- Not only roll back data when desired, but also access different file versions within the same computation
- Favors efficient access concurrency
Approach
- Delegate versioning management to BlobSeer
- A Gfarm file is mapped to a single BLOB
- A file version is mapped to the corresponding version of the BLOB

26 Versioning interface
Versioning capability was fully implemented
At Gfarm API level
    gfs_get_current_version(gfs_file gf, size_t *version)
    gfs_get_latest_version(gfs_file gf, size_t *version)
    gfs_set_version(gfs_file gf, size_t version)
    gfs_pio_vread(size_t nversion, gfs_file gf, void *buffer, int size, int *np)
At POSIX file system level
Defined some ioctl commands
    fd = open(argv[1], O_RDWR);
    np = pwrite(fd, buffer_w, BUFFER_SIZE, 0);        /* write at offset 0 */
    ioctl(fd, BLOB_GET_LATEST_VERSION, &nversion);    /* query the latest version */
    ioctl(fd, BLOB_SET_ACTIVE_VERSION, &nversion);    /* select the version to read */
    np = pread(fd, buffer_r, BUFFER_SIZE, 0);         /* read at offset 0 */
    ioctl(fd, BLOB_GET_ACTIVE_VERSION, &nversion);    /* query the active version */
    close(fd);

27 Work in progress: support for MapReduce
Integrating BlobSeer with Yahoo!'s Hadoop MapReduce framework
- Use BlobSeer instead of HDFS
- Implemented a Java API for BlobSeer
- Basic file system operations: create, read, write...
BlobSeer File System (BSFS)
- File system namespace: keeps file metadata, maps files to BLOBs
- Data prefetching
- Exposing data distribution

28 BSFS vs. HDFS: concurrent reads from a shared file

29 BSFS vs. HDFS: distributed grep

30 Open issues and opportunities for collaboration
BSFS/BlobSeer on Petascale architectures: open issues
- Impact of topology-awareness: multi-level hierarchy
- Impact of data access patterns in Petascale applications
- Coupling topology-aware storage resource management with job scheduling
- Which fault-tolerance mechanisms to use to ensure high availability for data and metadata?
- Which strategy to use for metadata distribution?
BSFS/BlobSeer vs. GPFS?
- BSFS is highly optimized for heavy access concurrency
- Leverage the versioning support?
- An in-depth comparison with data-intensive applications with highly concurrent accesses may prove interesting
- Imagine some cooperation scheme?