Ivy: A Read/Write Peer-to-Peer File System


Ivy: A Read/Write Peer-to-Peer File System. Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, and Benjie Chen. OSDI 2002.

Overview: Motivation, Introduction, Challenges, Design, Implementation, Performance, Discussion.

Motivation: Existing distributed file systems (e.g. NFS, xFS, Harp) centralize data or metadata; the authors' answer is a P2P storage system. Previous P2P systems (e.g. CFS, PAST) are mostly read-only or allow only the publisher to modify data; the authors' answer is to support multiple writers.

Introduction: Ivy is a multi-user read/write peer-to-peer file system. It uses DHash for storage, does NOT place full trust in the peers that store its data, does NOT require users to trust one another, and has NO centralized component.

Distributed Hashing DHash is a distributed peer-to-peer hash table mapping keys to arbitrary values. DHash stores each key/value pair on a set of Internet hosts determined by hashing the key.
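To make the key/value mapping concrete, here is a minimal Python sketch (hypothetical names, not DHash's actual API) of deriving a content-hash key and choosing storage hosts by hashing the key; the real DHash places blocks on the Chord successors of the key.

```python
import hashlib

def content_hash_key(value: bytes) -> bytes:
    # Content-hash blocks are keyed by the SHA-1 hash of their contents.
    return hashlib.sha1(value).digest()

def storage_hosts(key: bytes, hosts: list, replicas: int = 3) -> list:
    # Illustrative placement only: hash the key to pick a starting host,
    # then take the next few hosts around the "ring" of hosts.
    start = int.from_bytes(hashlib.sha1(key).digest(), "big") % len(hosts)
    return [hosts[(start + i) % len(hosts)] for i in range(replicas)]

block = b"an immutable Ivy log record"
key = content_hash_key(block)
print(key.hex(), storage_hosts(key, ["a.example", "b.example", "c.example", "d.example"]))
```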

Challenges: Consistency when there are multiple writers. P2P participants are unreliable, so Ivy cannot rely on locking. Untrusted participants pose data-integrity challenges: can you UNDO or IGNORE undesired modifications? Ivy also has to deal with network partitions and conflicting updates.

Design: Ivy uses a set of logs. A log is a linked list of immutable records and contains all changes to the file system, both metadata and data. There is one log per participant: records are appended only to the local log but are read from all logs.

Design: Logs are arranged in views; a view represents the state of one file system. Logs are stored as blocks, one block per record. The log-head block points to the most recent record, and the view block points to the log-heads of all participants. (Figure 1: Example Ivy view and logs. White boxes are DHash content-hash blocks; gray boxes are public-key blocks.)

Design: Log blocks are stored in the DHT, whose interface is put(key, value) and get(key). The log-head is a mutable block identified by the public key of its participant. Immutable blocks are content-hashed and can be cached by participants.
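A hedged sketch of the two DHash block types Ivy relies on, using a plain dictionary as a stand-in for the DHT; put_log_head and the verification details are illustrative assumptions, not the real interface.

```python
import hashlib

dht = {}  # stand-in for DHash: key -> value

def put_immutable(data: bytes) -> bytes:
    # Immutable block: key is the content hash, so fetched data is self-verifying.
    key = hashlib.sha1(data).digest()
    dht[key] = data
    return key

def put_log_head(public_key: bytes, pointer: bytes, signature: bytes) -> None:
    # Mutable log-head: stored under the participant's public key together
    # with a signature; readers verify the signature before trusting the pointer.
    dht[public_key] = (pointer, signature)

def get(key: bytes):
    return dht.get(key)
```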

Design: Logs are stored in the DHT with per-record replication and authentication; each record is cryptographically signed so it can be verified.

Fields present in all Ivy log records:
  prev      - DHash key of the next oldest log record
  head      - DHash key of the log-head
  seq       - per-log sequence number
  timestamp - time at which the record was created
  version   - version vector

Each participant stores the DHash keys of its log-head and of the previous record. Ivy uses version vectors to order records across logs.
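The common fields can be pictured as a small structure; the following Python sketch (illustrative names, with a dictionary standing in for DHash) shows how appending a record links it to the previous one and advances the per-log sequence number.

```python
from dataclasses import dataclass
import hashlib, pickle, time

@dataclass
class LogRecord:
    prev: bytes | None   # DHash key of the next oldest record (None for the first)
    head: bytes          # DHash key of this log's log-head block
    seq: int             # per-log sequence number
    timestamp: float     # time at which the record was created
    version: dict        # version vector: participant id -> latest seq seen
    payload: dict        # operation-specific fields (e.g. a Write or Link record)

dhash = {}               # stand-in for the real DHT

def append(log: dict, payload: dict, my_id: str) -> bytes:
    """Append one record to a participant's own log and update its head pointer."""
    rec = LogRecord(prev=log["latest"], head=log["head_key"], seq=log["seq"] + 1,
                    timestamp=time.time(),
                    version={**log["version"], my_id: log["seq"] + 1},
                    payload=payload)
    blob = pickle.dumps(rec)
    key = hashlib.sha1(blob).digest()   # the record becomes an immutable content-hash block
    dhash[key] = blob
    log.update(latest=key, seq=rec.seq, version=rec.version)
    return key

log = {"head_key": b"participant-public-key", "latest": None, "seq": 0, "version": {}}
append(log, {"type": "Write", "inum": "a1b2", "offset": 0, "data": b"hello"}, my_id="alice")
```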

Design: Each log record represents the updates from a single file system operation. Files and directories are identified by 160-bit i-numbers. Attributes and permissions are stored in log records but are not used to enforce access control; Ivy relies on encryption instead. Logs are kept indefinitely, for security reasons (so past modifications can be audited or undone).

Design: Summary of Ivy log record types:

  Type      Fields                                                             Meaning
  Inode     type (file, directory, or symlink), i-number, mode, owner          create a new inode
  Write     i-number, offset, data                                             write data to a file
  Link      i-number, i-number of directory, name                              create a directory entry
  Unlink    i-number of directory, name                                        remove a file
  Rename    i-number of directory, name, i-number of new directory, new name   rename a file
  Prepare   i-number of directory, file name                                   for exclusive operations
  Cancel    i-number of directory, file name                                   for exclusive operations
  SetAttrs  i-number, changed attributes                                       change file attributes
  End       none                                                               end of log

How operations use the log (see the sketch below): file system creation (mounted via NFS) writes an Inode record for the root directory; file creation writes Inode and Link records; a lookup checks for Unlink records; a file read checks Write records and ignores SetAttrs records; file attributes (mtime, size, ctime, link count) are maintained from the log; a directory listing checks Link records and ignores Unlink/Rename.
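As an illustration of the mapping from operations to record types, the sketch below (not Ivy's code; it reuses the append helper idea from the earlier sketch) shows file creation appending Inode and Link records, and a read reconstructing file contents by replaying Write records.

```python
import os

def create_file(log, append, dir_inum: str, name: str, my_id: str) -> str:
    inum = os.urandom(20).hex()                       # random 160-bit i-number
    append(log, {"type": "Inode", "kind": "file", "inum": inum,
                 "mode": 0o644, "owner": my_id}, my_id)
    append(log, {"type": "Link", "inum": inum, "dir": dir_inum, "name": name}, my_id)
    return inum

def read_file(records, inum: str) -> bytes:
    """Replay Write records (oldest first) for one i-number to rebuild the file."""
    data = bytearray()
    for rec in records:                               # records: merged, ordered log payloads
        if rec["type"] == "Write" and rec["inum"] == inum:
            off, chunk = rec["offset"], rec["data"]
            if len(data) < off + len(chunk):
                data.extend(b"\0" * (off + len(chunk) - len(data)))
            data[off:off + len(chunk)] = chunk
    return bytes(data)
```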

Design: Users create periodic snapshots of the file system state. Snapshots are incremental and persistent, are stored in the DHT, and obviate the need to read the entire log. Only the participants in the view are trusted. (Figure 2: Snapshot data structure; a file map in the snapshot block maps i-numbers to content-hashed directory and file inode blocks, which in turn point to directory-entry and data blocks. H(A) is the DHash content-hash of A.) The system can recover from malicious modifications by using the logs.
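A rough sketch, under simplifying assumptions, of the snapshot idea: a participant periodically writes a file map (i-number to content hash of an inode block) into the DHT so that readers need not scan the entire logs.

```python
import hashlib, pickle

def take_snapshot(dhash: dict, file_states: dict, prev_snapshot_key=None) -> bytes:
    # file_states: i-number -> inode dict (attributes plus keys of data blocks),
    # derived by replaying the logs since the previous snapshot.
    file_map = {}
    for inum, inode in file_states.items():
        blob = pickle.dumps(inode)
        key = hashlib.sha1(blob).digest()      # content-hash block for the inode
        dhash[key] = blob
        file_map[inum] = key
    snap = pickle.dumps({"file_map": file_map, "prev": prev_snapshot_key})
    snap_key = hashlib.sha1(snap).digest()
    dhash[snap_key] = snap
    return snap_key
```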

Design: Ivy provides NFS-like, close-to-open consistency: writes are deferred until close(). The DHT allows replication. While the network is partitioned, participants can modify files independently, so concurrent updates are allowed. Exclusive operations use a protocol similar to two-phase commit (the Prepare record). Ivy uses the versioned logs to handle conflicts automatically by looking for concurrent version vectors.
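The conflict check reduces to comparing version vectors; the short sketch below shows the standard dominance test, where two vectors that do not dominate each other indicate concurrent (potentially conflicting) updates.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if vector a has seen at least everything vector b has seen."""
    return all(a.get(k, 0) >= v for k, v in b.items())

def concurrent(a: dict, b: dict) -> bool:
    return not dominates(a, b) and not dominates(b, a)

# Two participants wrote without seeing each other's latest records -> conflict.
print(concurrent({"alice": 3, "bob": 1}, {"alice": 2, "bob": 2}))   # True
print(concurrent({"alice": 3, "bob": 2}, {"alice": 2, "bob": 2}))   # False: first dominates
```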

Architecture: Ivy appears to applications as an ordinary file system (similar to NFS). The Ivy server acts as a loop-back NFS server paired with the in-kernel NFS 3 client, and is built as a distributed application on top of DHash + Chord; an Ivy agent holds the participant's private key. (Figure: application system calls enter the kernel NFS 3 client, which speaks NFS 3 to the local Ivy server; the Ivy server and agent talk to multiple DHash servers.)

Chord interface: lookup(key) returns the IP address of the node responsible for the key, using O(log N) messages per lookup.
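For intuition, a toy Chord-style lookup over a small identifier ring is sketched below; it is a simplification (no stabilization, failures, or successor lists) meant only to show finger-table routing converging in roughly O(log N) hops.

```python
M = 8                                           # toy 8-bit identifier ring

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.fingers = []                       # fingers[i] = successor of id + 2^i

def build_ring(ids):
    ids = sorted(ids)
    nodes = {i: Node(i) for i in ids}
    def succ(k):
        k %= 2 ** M
        for i in ids:
            if i >= k:
                return nodes[i]
        return nodes[ids[0]]
    for n in nodes.values():
        n.fingers = [succ(n.id + 2 ** i) for i in range(M)]
    return nodes

def lookup(start, key):
    """Hop via finger tables until the key falls between a node and its successor."""
    n, hops = start, 0
    while not in_interval(key, n.id, n.fingers[0].id):
        # forward to the closest preceding finger, giving ~O(log N) hops
        nxt = next((f for f in reversed(n.fingers) if in_interval(f.id, n.id, key)),
                   n.fingers[0])
        n, hops = nxt, hops + 1
    return n.fingers[0], hops

nodes = build_ring([5, 20, 60, 130, 200])
owner, hops = lookup(nodes[5], 75)
print(owner.id, hops)                           # the node responsible for key 75
```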

Evaluation workload: the Modified Andrew Benchmark (MAB): create a directory tree, copy files into the tree, walk the tree and stat each file, read the files, and compile them.

Evaluation setup: block cache size of 512 blocks. The MAB and the Ivy server run on a GHz-class AMD Athlon running FreeBSD 4.5; DHash nodes are PlanetLab/RON nodes running FreeBSD 4.5 on 733 MHz Pentium IIIs. No replication (the replication implementation was incomplete).

Evaluation configurations: local area vs. WAN, and run-time as a function of the number of users, DHash nodes, and concurrent writers.

Evaluation: Single-user MAB on a LAN.

Table 3: Real time in seconds to run the MAB with a single Ivy log and all software running on a single machine. The NFS column shows MAB run-time for NFS over a LAN.
  Phase         Ivy (s)  NFS (s)
  Mkdir           0.6      0.5
  Create/Write    6.6      0.8
  Stat            0.6      0.2
  Read            1.0      0.8
  Compile        10.0      5.3
  Total          18.8      7.6

The benchmark issues 386 NFS RPCs and causes 508 log updates; Ivy inserts 8.8 MB of DHash data for 1.6 MB of file data. SHA-1 hashing is expensive.

Single-user MAB on a WAN (DHash RTTs of 9, 16, and 82 ms).

Table 4: MAB run-time with four DHash servers on a WAN. The file system contains four logs.
  Phase         Ivy (s)  NFS (s)
  Mkdir          11.2      4.8
  Create/Write   89.2     42.0
  Stat           65.6     47.8
  Read           65.8     55.6
  Compile       144.2    130.2
  Total         376.0    280.4

One NFS request can cause three log-head fetches (3,346 fetches in total, each bounded by the 82 ms RTT). Ivy is slower because it needs more network round trips.

Evaluation (Figures 5-7):

Figure 5: MAB run-time as a function of the number of logs, with only one participant active. Many logs with a single writer have little impact, because logs are fetched in parallel.

Figure 6: Average MAB run-time as the number of DHash servers increases (host-to-server average RTT 32 ms). The error bars indicate the standard deviation over different choices of PlanetLab hosts and different mappings of blocks to DHash servers. Run-time grows because of the round trips to the servers.

Figure 7: Average run-time of the MAB when several MABs run concurrently on different hosts on the Internet. The error bars indicate the standard deviation over all the MAB runs. Run-time grows because participants must fetch each other's log-heads.

Discussion: Ivy does not reclaim log storage space. Because Ivy relies on the logs to make and verify updates, discarding logs to reclaim space could hurt data security. With storage getting cheaper, this design decision may not turn out to be too expensive.

Discussion: Ivy provides automatic, application-specific conflict resolution when a partition heals, relying on application-level tools for resolution. This may not work for all applications.

Discussion: 160-bit i-numbers are generated randomly and independently at each participant to minimize the probability of collision. What happens if the same i-number is allocated to different files?
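For a rough sense of the collision risk this question raises, here is a hedged sketch of random i-number generation plus a birthday-bound estimate; the numbers are illustrative, not from the paper.

```python
import os

def new_inumber() -> int:
    return int.from_bytes(os.urandom(20), "big")      # 160 random bits

def collision_probability(n_files: int, bits: int = 160) -> float:
    # Birthday bound: p is roughly n^2 / 2^(bits+1) when n is far below 2^(bits/2).
    return n_files * n_files / 2 ** (bits + 1)

print(hex(new_inumber()))
print(collision_probability(10 ** 9))                 # ~3.4e-31 even with a billion files
```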