LONG TERM RETENTION OF BIG DATA




David Pease, Simona Rabinovici-Cohen
IBM Research

Outline
- Introduction and Challenges
- LTFS: Linear Tape File System
- SIRF: Self-contained Information Retention Format
- EU ENSURE: Enabling knowledge, Sustainability, Usability and Recovery for Economic Value
- LTFS and SIRF
- Summary

Big Data for Long Term
Generating and collecting very large data sets is becoming a necessity in many domains that also need to keep that data for long periods. Examples:
- Healthcare
  - X-rays are often stored for periods of 75 years
  - Records of minors are needed until 20 to 43 years of age
- Scientific and Cultural
  - Satellite data is kept forever
  - We would like to keep digital art forever
- M&E (media and entertainment)
  - Film masters, outtakes, related artifacts (e.g., games)
  - 100 years or more

The Long Term Retention Challenge
- These documents were created by pre-digital societies. The media and information content are still interpretable.
  - Dead Sea Scroll, ~70 AD. Media: copper. Language: Hebrew.
  - Mayan glyph, Palenque, ~630 AD.
- This information was created a few years ago.
  - Will the media last for 20 years?
  - Will it be possible to access, interpret and present the data in 20 years? 50? 100?

Challenges in the Long Term Retention of Big Data
- Provide economically scalable storage systems that
  - Efficiently store and preserve big data archives
  - Enable far-future search, access, and analytics
- There are many financial and practical reasons to prefer tape storage for such big data archives
  - Longer media life expectancy
  - Lower cost per PB over time
  - Greater bandwidth for transfers and migrations


Linear Tape Open (LTO) Tape
- LTO Consortium
  - Defines an industry standard for tape drives and media
  - IBM, HP, Quantum, many media manufacturers
- LTO Tape
  - Serpentine recording, shingled writing
  - Block-addressable
  - Essentially an append-only medium
- LTO Generation 5 (LTO-5)
  - Released April 2010
  - 1.5 terabytes per cartridge (uncompressed)
  - 140 MB/sec streaming data rate
  - Dual-partition capability
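As a rough illustration of what the LTO-5 figures imply, the arithmetic below estimates how long one full streaming pass over a cartridge takes. It is only a sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly sustained streaming rate, neither of which the slide states. The function name is illustrative.

```python
# Figures quoted on the slide for LTO-5.
CAPACITY_BYTES = 1.5e12    # 1.5 TB per cartridge, uncompressed (decimal TB assumed)
STREAM_RATE_BPS = 140e6    # 140 MB/sec streaming rate (decimal MB assumed)


def full_pass_seconds(capacity_bytes: float, rate_bytes_per_sec: float) -> float:
    """Time to stream the whole cartridge once at a sustained rate."""
    return capacity_bytes / rate_bytes_per_sec


seconds = full_pass_seconds(CAPACITY_BYTES, STREAM_RATE_BPS)
hours = seconds / 3600
print(f"{seconds:.0f} s, about {hours:.1f} hours")
```

Under these assumptions a full read or write of one cartridge takes roughly three hours, which is the kind of figure that matters when planning migrations of a large archive.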

Dual-Partition Tape logical view

What is LTFS?
- A file system implemented on dual-partition linear tape
- Makes tape look and work like any removable media (e.g., USB drive, removable disk)
  - Files and directories show up on desktop, directory listings, etc.
  - Drag-and-drop files to/from tape, double-click to open
  - Run any application written to use disk files
- Includes file- and directory-level user metadata
  - Unlimited extended attributes
- Supports automated libraries as well as stand-alone drives
  - In library mode, allows listing contents and searching of all volumes in library without mounting tapes
- IBM single-drive implementations released as open source
  - Linux and Mac OS X versions released as open source
  - Windows version freely downloadable
  - www-03.ibm.com/systems/storage/tape/ltfs/index.html

What is LTFS?
- A file system implemented on dual-partition linear tape: Index Partition and Data Partition
  - Index Partition is small (2 wraps, 37.5 GB out of 1.5 TB on LTO-5)
  - Data Partition is the remainder of the tape
- File system module that implements a set of standard file system interfaces
  - Implemented using FUSE on Linux and Mac OS X
  - Windows implementation uses a FUSE-like framework
- Includes an on-tape structure used to track tape contents
  - XML Index Schema
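The partition sizes quoted above can be checked with a few lines of arithmetic. This sketch uses only the slide's numbers; the implied total wrap count additionally assumes that every wrap holds the same amount of data and ignores guard wraps, which is an assumption and not something the slide states.

```python
# Numbers quoted on the slide for LTO-5.
INDEX_PARTITION_GB = 37.5
CARTRIDGE_GB = 1500.0      # 1.5 TB expressed in GB (decimal units assumed)
INDEX_WRAPS = 2

# Share of the cartridge given up to the index partition.
index_fraction = INDEX_PARTITION_GB / CARTRIDGE_GB

# Capacity per wrap, and the wrap count this would imply for the whole tape
# (assumes uniform wraps; guard wraps are not accounted for).
gb_per_wrap = INDEX_PARTITION_GB / INDEX_WRAPS
implied_total_wraps = CARTRIDGE_GB / gb_per_wrap

print(f"index partition: {index_fraction:.1%} of the cartridge")
print(f"{gb_per_wrap} GB per wrap, about {implied_total_wraps:.0f} wraps in total")
```

The index partition costs about 2.5% of the raw capacity, which is the price paid for keeping the index quickly reachable at the start of the tape.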

Logical View of LTFS Volume
[Figure labels: BOT (beginning of tape), EOT (end of tape), Index Partition, LTFS XML Index, Guard Wraps, Data Partition, Files]

XML Index Schema
- Similar to information in disk-based file system
  - Files
  - Directories
  - Name, dates, extent pointers, extended attributes, etc.
- Designed to be simple, cross-platform
  - Tags and values in an easy-to-read, human-readable format
  - No platform-specific data
  - Supports Unix/Linux, Mac OS, and Windows
- Format becoming the standard for linear tape
  - Formal standardization through SNIA
  - Format specification on LTO Consortium site: www.trustlto.com/ltfs_format_to Print.pdf

Sample XML Schema

    <?xml version="1.0" encoding="utf-8"?>
    <index version="0.9">
      <creator>IBM LTFS 0.20 - Linux - ltfs</creator>
      <volumeuuid>9710d610-5598-442a-8129-48d87824584b</volumeuuid>
      <generationnumber>3</generationnumber>
      <directory>
        <name>LTFS Volume Name</name>
        <creationtime>2010-01-28 19:39:50.715656751 UTC</creationtime>
        <modifytime>2010-01-28 19:39:55.231540960 UTC</modifytime>
        <accesstime>2010-01-28 19:39:50.715656751 UTC</accesstime>
        <contents>
          <directory>
            <name>directory1</name>
            <contents>
              <file>
                <name>binary_file.bin</name>
                <length>10485760</length>
                <extentinfo>
                  <extent>
                    <partition>b</partition>
                    <startblock>8</startblock>
                    <byteoffset>0</byteoffset>
                    <bytecount>720000</bytecount>
                  </extent>
                  <extent>
                    <partition>b</partition>
                    <startblock>18</startblock>
                    <byteoffset>0</byteoffset>
                    <bytecount>9765760</bytecount>
                  </extent>
                </extentinfo>
              </file>
            </contents>
          </directory>
          <file>
            <name>read_only_file</name>
            <length>0</length>
            <readonly/>
            <extendedattributes>
              <xattr>
                <key>uservalue</key>
                <value>fred</value>
              </xattr>
            </extendedattributes>
          </file>
        </contents>
      </directory>
    </index>
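Because the index is plain XML, any standard XML parser can walk it. The sketch below is not part of LTFS itself: it uses Python's standard library on a trimmed copy of the sample (XML declaration and timestamps omitted), and the function name `list_files` is illustrative. It lists each file with its extents, so that the extent byte counts can be compared with the recorded file length.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample index; element names follow the slide.
SAMPLE_INDEX = """<index version="0.9">
  <directory>
    <name>LTFS Volume Name</name>
    <contents>
      <directory>
        <name>directory1</name>
        <contents>
          <file>
            <name>binary_file.bin</name>
            <length>10485760</length>
            <extentinfo>
              <extent><partition>b</partition><startblock>8</startblock>
                <byteoffset>0</byteoffset><bytecount>720000</bytecount></extent>
              <extent><partition>b</partition><startblock>18</startblock>
                <byteoffset>0</byteoffset><bytecount>9765760</bytecount></extent>
            </extentinfo>
          </file>
        </contents>
      </directory>
      <file>
        <name>read_only_file</name>
        <length>0</length>
        <readonly/>
      </file>
    </contents>
  </directory>
</index>"""


def list_files(index_xml: str):
    """Return (path, length, extents) for every file in the index.

    Each extent is a (partition, startblock, bytecount) tuple.
    """
    root = ET.fromstring(index_xml)
    results = []

    def walk(directory, prefix):
        contents = directory.find("contents")
        if contents is None:
            return
        for child in contents:
            path = f"{prefix}/{child.findtext('name')}"
            if child.tag == "directory":
                walk(child, path)
            elif child.tag == "file":
                extents = [
                    (e.findtext("partition"),
                     int(e.findtext("startblock")),
                     int(e.findtext("bytecount")))
                    for e in child.iterfind("extentinfo/extent")
                ]
                results.append((path, int(child.findtext("length")), extents))

    # The top-level directory is the volume root, so paths start at "/".
    walk(root.find("directory"), "")
    return results


for path, length, extents in list_files(SAMPLE_INDEX):
    extent_bytes = sum(count for _, _, count in extents)
    print(path, length, extent_bytes == length)
```

In the sample, the two extents of binary_file.bin (720000 and 9765760 bytes) add up to the recorded length of 10485760 bytes.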

Index Arrangement on LTFS Tape

LTFS in Single Drive Mode
- Shows up like any standard (i.e., disk) file system
  - Directories
  - Files
- Tape contains File/Directory Index in Index Partition
  - XML schema
  - Keeps multiple generations (older versions of XML schema)
- Data files written to Data Partition (usually)
  - Small data files can optionally be written to Index Partition
  - Quick access, can be cached at mount time
- Tape index forgotten at unmount

LTFS in Library Mode (LTFS LE)
- Mount the library, not the drive
- LTFS caches the index of each tape read/written
  - Each volume shows as a separate file system folder/directory
- After mount, all tape directories are viewable and searchable
  - Without mounting any tape
- LTFS drives automation to mount tape on file read/write
- LTFS can recognize when a tape leaves or re-enters the library
  - Performs consistency check to see if tape index has changed
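The idea of searching a library without mounting tapes can be pictured as a lookup in cached indexes. The sketch below is purely illustrative and is not how LTFS LE is implemented: the cache layout, volume names, paths and the `search` function are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory cache: volume name -> file paths taken from that
# tape's index the last time the tape was read or written.
index_cache = {
    "VOL001": ["/projects/a/raw.bin", "/projects/a/notes.txt"],
    "VOL002": ["/scans/xray_0001.dcm", "/scans/xray_0002.dcm"],
}


def search(cache, pattern):
    """Return (volume, path) pairs whose path matches a glob pattern.

    Only the cached indexes are consulted, so no tape has to be mounted.
    """
    return [
        (volume, path)
        for volume, paths in sorted(cache.items())
        for path in paths
        if fnmatchcase(path, pattern)
    ]


print(search(index_cache, "*.dcm"))
```

Only when a matching file is actually opened would the library need to load the corresponding cartridge into a drive.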


Self-contained Information Retention Format (SIRF)
- Being developed by SNIA Long Term Retention (LTR) TWG
- An analogy: the standard physical archival box (photo courtesy Oregon State Archives)
  - Archivists gather together a group of related items and place them in a physical box container
  - The box is labeled with information about its content, e.g., name and reference number, date, contents description, destroy date
- SIRF is the digital equivalent
  - Logical container for a set of (digital) preservation objects and a catalog
  - The SIRF catalog contains metadata related to the entire contents of the container as well as to the individual objects
  - SIRF standardizes the information in the catalog

SIRF Properties
- SIRF is a logical data format of a storage container appropriate for long term storage of digital information
- A storage container may comprise a logical or physical storage area considered as a unit
  - Examples: a file system, a tape, a block device, a stream device, an object store, a data bucket in cloud storage
- Required properties
  - Self-describing: can be interpreted by different systems
  - Self-contained: all data needed for the interpretation is in the container
  - Extensible: so it can meet future needs

SIRF Components
A SIRF container includes:
- A magic object: identifies the SIRF container and its version
- Numerous preservation objects that are immutable
- A catalog that
  - Is updatable
  - Contains metadata to make the container and preservation objects portable into the future without external functions

OAIS Archival Information Package (AIP)
[Figure: OAIS AIP structure. Content Information (Content Data Object, Representation Information) and Preservation Descriptive Information (Reference, Context, Provenance, Fixity, Access Rights); the diagram also shows cardinalities 0-1 and 1-*.]
- An OAIS AIP is an example of a preservation object
- Domain-specific packaging formats for preservation objects:
  - XML Formatted Data Unit (XFDU)
  - VERS Encapsulated Object (VEO)
  - Metadata Encoding and Transmission Standard (METS)
  - Preservation Metadata: Implementation Strategies (PREMIS)

Layered Approach
- A SIRF container assumes an underlying object interface
- Example object layers
  - Advanced: OSD, Cloud, XAM
  - Lower level: UDF, CDFS, FAT, LTFS
- SIRF metadata is defined at two levels
  - Level 1 catalog (L1): unique metadata, not in the preservation objects, that is mandatory to make preservation objects portable into the future
  - Level 2 catalog (L2): information that is probably also in the preservation objects, that is needed for fast access to the preservation objects

SIRF Level 1 (Work in Progress)
The SIRF catalog includes metadata such as:
- Container information:
  - Spec ID and version
  - SIRF level
  - Container ID
  - State
  - Container provenance
  - Audit log object ID
- For each Preservation Object (PO):
  - IDs
  - Children's ID
  - Dates
  - Packaging format
  - Fixity
  - Retention
  - Preservation profile
  - Audit log object ID
  - Extension

Preservation Object (PO) IDs Category
Fields:
- PO name: a non-unique identifier, e.g., a file name
- PO version ID: a unique identifier that identifies the specific version of the PO
- PO logical ID: a unique identifier that identifies the various versions that originate from the same ancestor
- PO parent ID: a unique identifier that identifies the parent PO from which this PO version was created. The parent PO shares the same logical ID as the current PO, but has a different version ID.
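The relationship between the four ID fields can be sketched as a small data structure. This is an illustration only, not the SIRF specification: the class and function names are hypothetical, and random UUIDs are used as identifiers purely as an assumption, since the slide does not say how IDs are generated.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class POIds:
    """The IDs category of one preservation object version (illustrative)."""
    name: str                        # non-unique, e.g. a file name
    version_id: str                  # unique per PO version
    logical_id: str                  # shared by all versions with the same ancestor
    parent_id: Optional[str] = None  # version ID of the PO this one was created from


def first_version(name: str) -> POIds:
    """Create the IDs for a brand-new PO with no parent."""
    return POIds(name, str(uuid.uuid4()), str(uuid.uuid4()))


def derive_version(parent: POIds) -> POIds:
    """Create the IDs for a new version: same logical ID, new version ID."""
    return POIds(parent.name, str(uuid.uuid4()), parent.logical_id, parent.version_id)


v1 = first_version("scan_0001.dcm")
v2 = derive_version(v1)
print(v2.logical_id == v1.logical_id, v2.parent_id == v1.version_id)
```

Since preservation objects are immutable, a change produces a new version that points back to its parent, and the logical ID ties the whole chain together.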

ENSURE
- ENSURE is an FP7 EU project in the area of preservation
  - Three-year Integrated Project (IP), started Feb. 1, 2011
  - Consortium of 13 partners (industry and academic)
- ENSURE has a business/industry-oriented focus
  - Drivers for preservation are both regulatory and business value
  - Demonstrated with three use cases: Health Care, Clinical Trials and Finance
- Contributing to standards is a goal of the project

PDS Cloud and SIRF in ENSURE
- Preservation DataStores (PDS) in the Cloud provides preservation-aware storage services for ENSURE, based on OAIS
- The SIRF Handler component in PDS Cloud is for future implementation


SIRF and LTFS: Volume
[Figure: index partition (IP) with Label Construct, File Mark, SIRF catalog and LTFS index; data partition (DP) with Label Construct and Preservation Objects]
- The SIRF catalog resides in the index partition
  - The LTFS application has rules that indicate what to store in the index partition; these rules are used to place the SIRF catalog there
- A preservation object (PO) is mapped to an LTFS file
- We don't maintain generations of the SIRF catalog; there are versions of the POs

SIRF and LTFS: Label Construct
[Figure: VOL1 Label (fixed-size, 80 bytes), LTFS Label (XML), SIRF Label (XML)]
- The VOL1 Label includes, for example, volume identifier (6 bytes), implementation identifier (13 bytes), owner identifier (14 bytes)
- The LTFS Label includes, for example, creator, volume UUID, blocksize, compression, partition IDs
- The SIRF Label is the magic object and includes, for example, specification ID and version, SIRF level
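A fixed-size label like VOL1 is typically read by slicing fields at known byte offsets. The sketch below shows the idea only. The three field widths (6, 13 and 14 bytes) and the 80-byte total come from the slide; the byte offsets, the accessibility byte, the reserved areas and the trailing version byte are assumptions modeled on the common ANSI-style VOL1 layout and should be checked against the LTFS format specification. The function name and sample values are hypothetical.

```python
def parse_vol1(label: bytes) -> dict:
    """Extract the three VOL1 fields named on the slide.

    Offsets are assumed (ANSI-style layout), not taken from the slide.
    """
    if len(label) != 80:
        raise ValueError("VOL1 label must be exactly 80 bytes")
    text = label.decode("ascii")
    if text[0:4] != "VOL1":
        raise ValueError("not a VOL1 label")
    return {
        "volume_identifier": text[4:10],                    # 6 bytes
        "implementation_identifier": text[24:37].rstrip(),  # 13 bytes
        "owner_identifier": text[37:51].rstrip(),           # 14 bytes
    }


# Build a hypothetical label with the assumed layout:
# "VOL1" + volume id (6) + accessibility (1) + reserved (13)
# + implementation id (13) + owner id (14) + reserved (28) + version (1)
sample = (
    "VOL1" + "A00001" + "L" + " " * 13
    + "LTFS".ljust(13) + "".ljust(14) + " " * 28 + "4"
).encode("ascii")

parsed = parse_vol1(sample)
print(parsed)
```

The LTFS and SIRF labels that follow VOL1 are XML, so they would be handled with an XML parser instead of fixed offsets.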

SIRF and LTFS: IDs Category
[Figure: index partition (IP) with Label Construct, SIRF catalog and LTFS index]
- The SIRF catalog is an XML document including:
  - Container information
  - PO information for each PO
- The IDs category in each PO information includes a unique identifier, the PO Version ID
  - The PO Version ID will be LTO + Tape UUID + LTFS <fileuid>
  - The LTFS fileuid is a 64-bit sequence number generated for each LTFS file
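The composition of the PO Version ID can be sketched as a simple string join. The slide gives only the three components ("LTO", the tape UUID and the LTFS fileuid); the separator, the decimal rendering of the fileuid and the function name below are assumptions made for illustration.

```python
def po_version_id(tape_uuid: str, fileuid: int, sep: str = ":") -> str:
    """Compose a PO Version ID from 'LTO', the tape UUID and the LTFS fileuid.

    The separator and the decimal formatting of fileuid are assumptions;
    the slide specifies only which components are combined.
    """
    if not 0 <= fileuid < 2**64:
        raise ValueError("fileuid must fit in 64 bits")
    return sep.join(["LTO", tape_uuid, str(fileuid)])


# Volume UUID taken from the sample index; the fileuid value is made up.
print(po_version_id("9710d610-5598-442a-8129-48d87824584b", 7))
```

Because the tape UUID is unique per volume and the fileuid is unique within a volume, the combination identifies one PO version across cartridges.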

Summary
- Generating and collecting very large data sets is becoming a necessity in many domains that also need to keep that data for long periods
- There are many financial and practical reasons to prefer tape storage for such big data archives
- LTFS provides a file system implemented on dual-partition linear tape; it is being standardized by the SNIA LTFS TWG
- SIRF is a logical data format of a storage container appropriate for long term information retention; it is being developed by the SNIA LTR TWG and will be used in ENSURE
- SIRF serialization for LTFS provides economically scalable storage containers for long term retention of big data
- More information on LTFS: www.trustlto.com/ltfs_format_to Print.pdf
- More information on LTR and SIRF: http://www.snia.org/ltr