National Data Storage: data replication in the network
Maciej Brzeźniak, Michał Jankowski, Norbert Meyer, PSNC, Supercomputing Dept.
1st Technical meeting in Munich, December 5-6th, 2011
Project funded by NCBiR for 2011-2013 under the KMD2 project (no. NR02-0025-10/2011)
Full Polish name of the project: System bezpiecznego przechowywania i współdzielenia danych oraz składowania kopii zapasowych i archiwalnych w Krajowym Magazynie Danych (a system for secure data storage and sharing, and for keeping backup and archival copies, in the National Data Storage)
Project partners: 10 Polish universities and supercomputing centres
Agenda
- NDS overview
- NDS architecture: design assumptions, overall architecture
- Data replication in NDS: data replication modes, replication protocol usage, user profiles vs data replication settings, rule-based replication?
- NDS vs the external world vs EUDAT
- NDS future: NDS2 - secure data storage and exchange
NDS - design assumptions
Overall assumptions:
- Avoid single points of failure - distributed: data & meta-data replication
- Standard access protocols (tools) usable
- Abstraction of system internals: logical namespace visible to the user; separate namespaces for different user groups
- Robust implementation (C/C++) within 2 years
- Tape systems (HSMs) on the back-end (for cost-efficiency)
Main applications:
- Archival and backup data storage
- Effective storage and access of large files
- Multiple small files not welcome
NDS project status
- National Data Storage (R&D project: 2007-2009): system architecture & concept; software stack (RPMs for CentOS/RHEL)
- Current NDS deployment: Backup and Archive Services for Science - BADSS (Service Platform for e-Science); capacity: 12.5 PB of tape in 5 sites; performance: 2 PB of disk in 5 sites
- National Data Storage 2 (R&D project: 2011-2013): secure storage and data sharing (user-side encryption + integrity control)
NDS highlights
- Automated, TRANSPARENT data replication: users do not see the details (unless they want to); they speak to a remote virtual filesystem
- Abstract data access interfaces: a file-system view of the data (remote virtual filesystem); NDS is implemented as user-level code (FUSE library)
- User access via standard methods: SFTP, WebDAV, GridFTP
- Storage access: NFS / GridFTP-NFS (each SN exposes at least NFS)
- Meta-data replication: automated, transparent; PostgreSQL with Slony-I + semi-synchronous replication; DR, not full HA (no recovery automation)
NDS architecture (1) - Overall picture
[Diagram: Access Node (access method servers: SSH, HTTPS, WebDAV...; VFS for data and meta-data; NDS system logic; metadata DB, users DB, accounting & limits DB) reaching Storage Nodes through replica access method servers (NFS, GridFTP); on the storage side: an HSM system with a data-migrating filesystem (exposed over NFS) and a NAS appliance, with replication between storage sites]
NDS architecture (2) - Data replication & presentation
[Diagram: the same components, with the data replication and presentation path highlighted]
NDS architecture (3) - Data replication & presentation: Data Daemon
- Implements the core NDS system logic (together with the MC): I/O serving, filesystem presentation, data operations with replication, meta-data-related operations
- Emulates a Virtual File System; supports most POSIX functions: open, close, read, write, opendir, readdir, getattr, setattr, rename, link, unlink...
- Based on FUSE (Filesystem in Userspace)
- Additionally: enforces security policies (access control), optimizes replica access and creation, implements limits and accounting
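The Data Daemon's role - routing POSIX calls through access control and the meta-catalog - can be sketched as follows. This is an illustrative Python sketch only (the real daemon is C/C++ on FUSE), and all class, method and field names here are hypothetical, not the actual NDS API:

```python
# Illustrative sketch: how a FUSE-backed data daemon might route POSIX
# calls through access-control enforcement and the meta-catalog (MC).
# All names are hypothetical stand-ins, not the real NDS code.

class MetaCatalog:
    """Toy stand-in for the MC: logical namespace + replica registry."""
    def __init__(self):
        self.files = {}  # logical path -> {"replicas": [...], "size": int}

    def register_file(self, path):
        self.files[path] = {"replicas": [], "size": 0}

class DataDaemon:
    def __init__(self, mc, acl):
        self.mc = mc
        self.acl = acl   # (user, path) -> set of permissions, e.g. {"read"}

    def _check(self, user, path, perm):
        # Security-policy enforcement: deny unless explicitly allowed.
        if perm not in self.acl.get((user, path), set()):
            raise PermissionError(f"{user} may not {perm} {path}")

    def open(self, user, path, create=False):
        self._check(user, path, "write" if create else "read")
        if create:
            self.mc.register_file(path)  # register logical file in the MC
        return path                      # toy file handle

    def getattr(self, user, path):
        self._check(user, path, "read")
        return self.mc.files[path]
```

The point of the sketch is the layering: every filesystem call hits the policy check first, then the meta-catalog, before any replica I/O happens.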
NDS architecture (4) - Async. vs sync. replication from the VFS perspective

Writing to the system (async mode):
- VFS: OPEN (new file, O_RDWR | O_CREAT) - register a new logical file in the MC (lock for writing); create one physical replica; register the replica in the MC
- VFS: WRITE... - write to the local replica (async.) (QUICK - local, single-replica write); update meta-data (size, last access, etc.)
- VFS: CLOSE (on an opened file) - flush buffers and close the replica (async.) (QUICK); update meta-data incl. releasing write locks; return to the user
- Asynchronous action - make replicas: enqueue replication tasks to the replication daemon; update meta-data; the replication daemon creates the replicas in the background (typically 3rd-party: SN1 -> SN2)

Writing to the system (sync mode):
- VFS: OPEN (new file, O_RDWR | O_CREAT) - register a new logical file in the MC (lock for writing); create multiple physical replicas; register the replicas in the MC
- VFS: WRITE... - write to all replicas (sync.) (TAKES TIME - remote, multiple sites to write); update meta-data (size, last access, etc.)
- VFS: CLOSE (on an opened file) - flush buffers (also to remote replicas) (TAKES TIME); update meta-data incl. releasing write locks; return to the user - all replicas are already done
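The two write paths above can be sketched in a few lines of Python. This is a toy model of the behaviour, not NDS code; the class and site names are invented for illustration:

```python
# Sketch of the async vs sync write paths described above.
# Sites are modelled as dicts (path -> bytes); names are illustrative.

from queue import Queue

class NDSWriteSketch:
    def __init__(self, mode, sites):
        assert mode in ("async", "sync")
        self.mode = mode
        self.sites = sites                # site name -> {path: data}
        self.replication_queue = Queue()  # consumed by a background daemon

    def open_write_close(self, path, data):
        if self.mode == "sync":
            # OPEN registers ALL replicas up front; WRITE/CLOSE push to
            # every site before returning (slow over the WAN).
            for store in self.sites.values():
                store[path] = data
        else:
            # Async: one quick local replica, then return to the user...
            local = next(iter(self.sites.values()))
            local[path] = data
            # ...and enqueue a task for the replication daemon, which
            # later performs the (typically 3rd-party) SN1 -> SN2 copy.
            self.replication_queue.put(path)

    def run_replication_daemon(self):
        # Background step of the async mode: materialize the remaining replicas.
        while not self.replication_queue.empty():
            path = self.replication_queue.get()
            src = next(iter(self.sites.values()))
            for store in self.sites.values():
                store.setdefault(path, src[path])
```

The key trade-off is visible in the control flow: the async path returns after one local write, while the sync path blocks until every site has the data.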
NDS architecture (5) - Replica access methods (AN-SN) (1)
[Diagram: the Access Node's VFS uses replica access clients (NFS, GridFTP) to reach storage; LOCAL storage over the LAN (low latency, high bandwidth, e.g. 10 GbE) is accessed via NFS; REMOTE storage over the WAN (high latency, high bandwidth, e.g. 1 GbE) via GridFTP, or NFS when needed; each Storage Node exposes both NFS and GridFTP replica access methods]
NDS architecture (6) - Replica access methods (AN-SN) (2)
NFS and GridFTP are used where they fit best: static protocol selection (currently); dynamic protocol selection, e.g. based on file size (planned)

NFS:
- Stateless, IOPS-friendly: low overhead on IOPS operations (small-file access, meta-data-related operations)
- Low performance (MB/s) over long distances; no parallelism (to/from a single file)
- NFS 4.1 (pNFS) on the horizon, but still not there
- Stable, standardised
- Usage in NDS: meta-data-related operations; accessing replicas on local SNs; access to small files on remote SNs (future)

GridFTP:
- Stateful, can exploit the available bandwidth
- High overhead on IOPS operations: even small-file access and meta-data ops require a session
- High performance (MB/s) despite distance; parallelism (up to 256 streams); 64+ streams can sustain a 1 GbE link (1000 km long)
- Stability issues, even if standard in Grids
- Usage in NDS: 3rd-party replication (async. mode); transferring replicas to/from remote SNs
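The planned dynamic protocol selection could look like the sketch below. The 100 MB cut-off and the function name are assumptions made for illustration, not actual NDS settings:

```python
# Sketch of the *planned* dynamic protocol selection: NFS for meta-data
# ops, local access and small files; GridFTP for bulk WAN transfers.
# The threshold value is a hypothetical example, not an NDS constant.

SMALL_FILE_BYTES = 100 * 1024 * 1024  # hypothetical cut-off

def choose_protocol(file_size, storage_is_local, is_metadata_op=False):
    """Return 'nfs' or 'gridftp' for a replica access operation."""
    if is_metadata_op:
        return "nfs"       # stateless, low per-operation overhead
    if storage_is_local:
        return "nfs"       # LAN: latency is low anyway, no need for streams
    if file_size < SMALL_FILE_BYTES:
        return "nfs"       # GridFTP session setup not worth it for small files
    return "gridftp"       # WAN bulk transfer: parallel streams win
```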
NDS architecture (7) - GridFTP 3rd-party replication (SN-SN) (3)
3rd-party transmission (SN -> SN) is used in the async. replication mode: the replication daemon on the Access Node acts as a GridFTP client holding control connections to both Storage Nodes, while the data flows directly between the local and remote GridFTP servers over the WAN (high latency, high bandwidth, e.g. 1 GbE).
[Diagram: Access Node (access method servers: SSH, HTTPS, WebDAV...; VFS for data and meta-data; NDS system logic; replication daemon / GridFTP client) with GridFTP control connections to the LOCAL and REMOTE storage; data transferred SN-to-SN]
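The 3rd-party transfer pattern can be modelled minimally as below. This is a toy model: in this sketch the payload passes through the daemon's call, whereas in real GridFTP the daemon only issues control-channel commands and the bytes move directly SN-to-SN. All names are invented:

```python
# Toy model of a GridFTP 3rd-party copy: the replication daemon holds
# only control connections; it commands the source SN to send and the
# destination SN to receive. Names are illustrative, not a real API.

class StorageNode:
    def __init__(self, name):
        self.name = name
        self.files = {}

    # Control-channel commands the daemon can issue:
    def retr(self, path):
        return self.files[path]      # "send this file out"

    def stor(self, path, payload):
        self.files[path] = payload   # "receive this file"

class ReplicationDaemon:
    def third_party_copy(self, src, dst, path):
        # Real GridFTP would wire retr/stor together over a direct
        # SN-to-SN data channel; here the payload is relayed in-process.
        dst.stor(path, src.retr(path))
```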
NDS architecture in PLATON - Replica access: GridFTP-NFS access to SNs
[Diagram: the Access Node's replica access clients (GridFTP, NFS) reach 'virtual storage' nodes that front the actual back-ends: a GridFTP replica access method backed by an NFS client, which in turn mounts either an HSM system (FS with data migration, exposed over NFS) or a NAS appliance (NFS)]
NDS architecture (8) - Replication-related settings (1)
Replication-related parameters of a profile (parameter: possible values and meaning):
- Replication mode: asynchronous (default) or synchronous
- Number of replicas: typically 2 (max. 3); can be set to any value
- Allowed storage sites and nodes: default replica locations; additional replica locations
- Storage media type: disk vs tape (HSM)
Notes:
- Replicas are typically created in the default locations; the additional replica locations are used in case of failure of the default ones
- Combining the allowed storage sites & nodes with knowledge of the deployment infrastructure, the media type can be determined
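The profile parameters listed above could be represented roughly as follows. Field names, defaults and the site names in the usage example are illustrative assumptions, not the actual NDS schema:

```python
# Sketch of a per-profile replication settings record, mirroring the
# parameter table above. Names and defaults are illustrative only.

from dataclasses import dataclass, field

@dataclass
class ReplicationProfile:
    mode: str = "async"                # "async" (default) or "sync"
    num_replicas: int = 2              # typically 2, max. 3
    default_locations: list = field(default_factory=list)
    additional_locations: list = field(default_factory=list)

    def pick_locations(self, failed=()):
        """Prefer default locations; fall back to additional ones on failure."""
        alive = [s for s in self.default_locations if s not in failed]
        spares = [s for s in self.additional_locations if s not in failed]
        return (alive + spares)[: self.num_replicas]
```

A usage example with made-up site names: a sync profile with three default sites falls back to a spare site when one default site is down.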
NDS architecture (9) - Replication-related settings (2)
- Replication is configured per-profile, NOT per data object (e.g. directory / file)
- Policies are static - they cannot be changed dynamically
- Users can use one or many profiles: users are assigned to profiles via the DNs of their certificates => multiple certificates have to be used in order to access different profiles
Example profiles:
- Fast, HA & FT space for backups - replication: SYNC, 3 replicas (1 local + 2 distant), on disks only
- FT space for archives - replication: ASYNC, 2 replicas on distant nodes, both copies on tapes
- Safe storage space for collaboration - replication: ASYNC, 2 replicas: 1 local on disk + 1 remote in an HSM
NDS features vs EUDAT (1)
- Automated, TRANSPARENT data replication: the safe, transparent replication service case
- Abstract data interfaces above and below NDS: SFTP, WebDAV, GridFTP for users; possible to interface with NDS from other systems, e.g. 3rd-party transfer to/from NDS
- Data available through the VFS layer on ANs: possible to add new access methods; some work needed to extend the authentication mechanisms
- Storage access: NFS / GridFTP - any kind of storage can be used as the back-end... as long as it provides an NFS service
[Diagram: NDS access method servers (SSH, HTTPS, WebDAV...), VFS and NDS logic on top; GridFTP front-ends to storage below, backed by an HSM system (NFS), a NAS appliance (NFS), or any other NFS storage - EUDAT data centres?]
NDS features vs EUDAT (2) - Persistent IDs?
The user always sees the same logical structure, regardless of:
- the replication process: the physical location is transparent; replication does not affect the logical namespace
- which Access Node he uses: the logical structure of the VFS is the same everywhere
- what access method he uses: the logical structure of the VFS is presented similarly through different access methods
- failures: as long as at least one replica is OK
The path to a file or directory is constant => is this a PID-like feature?
NDS features vs EUDAT (3)
User-level metadata:
- Users can assign free-form text files to data objects; these can include metadata
- This is done through a Web GUI or a procfs-like mechanism
- Metadata search is possible but not yet implemented (on the roadmap)
- Can the above be somehow re-used in EUDAT?
Extendability:
- Functionality can be easily extended, as the architecture & interfaces are open (PostgreSQL, NFS/GridFTP...)
- A micro-services-like approach is possible, but requires effort on the NDS consortium side; for instance, some basic interfaces to meta-data can be defined (e.g. for searching data meeting some criteria)
- Example: we are currently designing and developing a mechanism for periodic data integrity checking (data scrubbing)
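The data-scrubbing mechanism mentioned above is still being designed; a minimal sketch of the idea is to record a digest at write time and periodically re-hash each replica against it. Function names and the use of SHA-1 here are illustrative assumptions:

```python
# Sketch of periodic integrity checking ("data scrubbing"): compare a
# fresh hash of every replica against the digest recorded at write time.
# Names are illustrative; a real scrubber would walk replicas on the SNs.

import hashlib

def digest(data):
    return hashlib.sha1(data).hexdigest()

def scrub(replicas, known_digests):
    """Return (path, site) pairs whose replica no longer matches its digest."""
    bad = []
    for (path, site), data in replicas.items():
        if digest(data) != known_digests[path]:
            bad.append((path, site))
    return bad
```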
NDS2 - features
Secure data storage:
- Data encrypted on the user side; the symmetric keys to files are stored in the system, protected by the user's asymmetric key
- Integrity control on the user side: MD5/SHA-1 digests stored in the system, encrypted
Secure data exchange:
- 2-level access control: ACLs on the virtual filesystem level
- User-side encryption and key exchange make sharing safe (e.g. if we don't trust the provider)
Secure data publication:
- 2 kinds of storage space: private (for internal users) and public ('sandboxed')
- Multiple web servers (load balancing, HA, data synchronisation) to serve data effectively
Specialised user-side tools needed:
- Java GUI for managing file sharing, ACLs, publication and versioning
- Virtual encrypted (!) filesystem for end users, both for Linux and Windows
Status: R&D project (2011-2013); prototype expected in 2Q2013
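The user-side integrity control can be illustrated with a small sketch: the client computes a digest before upload and re-checks it after download, so a misbehaving provider cannot silently alter the data. This omits the client-side encryption and the encrypted storage of the digest itself; function names are hypothetical:

```python
# Minimal sketch of NDS2-style user-side integrity control: the digest
# is computed on the client before upload and verified after download.
# In NDS2 the digest would itself be stored encrypted in the system;
# that step, and data encryption, are omitted here for brevity.

import hashlib

def digest_before_upload(plaintext):
    return hashlib.md5(plaintext).hexdigest()

def verify_after_download(plaintext, stored_digest):
    return hashlib.md5(plaintext).hexdigest() == stored_digest
```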
Backup slides
NDS1: Summary
Data storage & replication:
- Implemented at the VFS level: portability and security; robust and lightweight
Data replication:
- Automatic, transparent to users; sync. and async. modes
- NFS, GridFTP, or GridFTP 3rd-party transfer used to access / make replicas
Meta-data handling & replication:
- Handles filesystem-level metadata and user-level metadata
- Logically centralized, but DR solutions in place for quick recovery
Logical filesystem structure persistency:
- Physical-location-agnostic access
'Pluggable':
- Open interfaces, standard interfaces to the external world (both user- and storage-side)
- We can provide custom interfaces to meta-data if needed
NDS architecture (10) - Meta-catalog (1)
Functionality:
- System-level meta-data storage and handling: filesystem structure, data replicas
- User-level meta-data storage
Implementation:
- C++ library used by the Data Daemon
- PostgreSQL database with Slony-I replication at the back-end
Separation of namespaces:
- No sharing among user groups assumed; security by isolation
- Scalability: multiple instances of the MC for multiple user groups / institutions
NDS architecture (11) - Meta-catalog (2)
Meta-data redundancy - the problem: reliability and performance?
(1) PostgreSQL database with Slony-I replication:
- Each meta-catalog is replicated asynchronously in master-slaves mode (Slony-I)
- In case of failure of the master MC, a slave MC is manually promoted to master (DR, not full HA; human intervention needed)
(2) Semi-synchronous data replication:
- All operations on metadata are synchronously logged to distributed logs
- In case of failure of the master MC, the logged operations not yet replicated are replayed on the new master (human intervention needed)
Comment: reliability similar to synchronous DBMS replication, but the mechanism is lighter!
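The semi-synchronous scheme can be sketched as a synchronous operation log plus replay-on-promotion. This is a toy model of the idea, not the NDS implementation; all names are invented:

```python
# Sketch of semi-synchronous meta-data replication: every mutation is
# appended to a (conceptually distributed) operation log before being
# acknowledged; after master failure, the ops the promoted slave missed
# are replayed from the log. Names are illustrative only.

class MetaCatalogReplica:
    def __init__(self):
        self.state = {}
        self.applied = 0           # index into the operation log

    def apply(self, op):
        key, value = op
        self.state[key] = value
        self.applied += 1

def write(master, op_log, op):
    op_log.append(op)              # synchronous log append first...
    master.apply(op)               # ...then apply on the master

def promote(slave, op_log):
    # Human-triggered failover: replay the ops the slave has not yet seen.
    for op in op_log[slave.applied:]:
        slave.apply(op)
    return slave
```

This is why the slide can claim reliability similar to synchronous DBMS replication at lower cost: only the log append is synchronous, while the slave catches up lazily (or at promotion time).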
PLATON's B/A service access: SFTP
- A well-known, secure data upload/download method
- WinSCP example (screenshot)
PLATON's B/A service access: WebDAV
- Web-browser-based WebDAV access (read-only)
PLATON's B/A service access: WebDAV
- The Windows built-in WebDAV (Web Folders) client supports mapping the NDS filesystem as a network drive, and drag & drop
PLATON's B/A service access: NDS web application
PLATON's B/A service access: NDS web application - filesystem navigation
PLATON's B/A service access: NDS web application - meta-data view
PLATON's B/A service access: NDS MDFS - a filesystem for meta-data access