Strategies and New Technology for Long Term Preservation of Big Data



Similar documents
LONG TERM RETENTION OF BIG DATA

Long Term Record Retention and XAM

Product Brief: XenData X2500 LTO-6 Digital Video Archive System

WHITE PAPER WHY ORGANIZATIONS NEED LTO-6 TECHNOLOGY TODAY

Best Practices for Long-Term Retention & Preservation. Michael Peterson, Strategic Research Corp. Gary Zasman, Network Appliance

Archive exchange Format AXF

In ediscovery and Litigation Support Repositories MPeterson, June 2009

<Insert Picture Here> Cloud Archive Trends and Challenges PASIG Winter 2012

Interoperable Cloud Storage with the CDMI Standard

Cloud Storage Standards Overview and Research Ideas Brainstorm

The Key Elements of Digital Asset Management

Taming Big Data Storage with Crossroads Systems StrongBox

LTFS Takes Tape to the Next Level

Document Management and Records Management in SharePoint Scott Jamison

RECORDS RETENTION AND SECURITY REGULATIONS THINK ABOUT IT!

Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop

Big Data and Big Data Governance

ForgetIT. Grant Agreement No Deliverable D7.1

Media Cloud Building Practicalities

Applications of LTFS for Cloud Storage Use Cases

Object Storage: Out of the Shadows and into the Spotlight

Integrated archiving: streamlining compliance and discovery through content and business process management

September 2009 Cloud Storage for Cloud Computing

Introduction to Engineering Using Robotics Experiments Lecture 17 Big Data

Whitepaper Data Governance Roadmap for IT Executives Valeh Nazemoff

How To Manage Your Digital Assets On A Computer Or Tablet Device

Open Standards Driven Cloud Storage and Preservation

TERRITORY RECORDS OFFICE BUSINESS SYSTEMS AND DIGITAL RECORDKEEPING FUNCTIONALITY ASSESSMENT TOOL

Archiving A Dell Point of View

Johns Hopkins University Data Management Services

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera

Object Oriented Storage and the End of File-Level Restores

The Archival Upheaval Petabyte Pandemonium Developing Your Game Plan Fred Moore President

Archiving Whitepaper. Why Archiving is Essential (and Not the Same as Backup)

Mining for Insight: Rediscovering the Data Archive

CLOUD STORAGE SECURITY INTRODUCTION. Gordon Arnold, IBM

A grant number provides unique identification for the grant.

CORPORATE RECORD RETENTION IN AN ELECTRONIC AGE (Outline)

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE

Teamcenter s Records Management Application

A 5 Year Total Cost of Ownership Study on the Economics of Cloud Storage

Cloud Data Management Interface (CDMI) The Cloud Storage Standard. Mark Carlson, SNIA TC and Oracle Chair, SNIA Cloud Storage TWG

File Management : Guidelines & Policies

VNA Vendor Neutral Archive, The How and Why

We are Big Data A Sonian Whitepaper

AUTOMATED DATA RETENTION WITH EMC ISILON SMARTLOCK

Storage Design for High Capacity and Long Term Storage. DLF Spring Forum, Raleigh, NC May 6, Balancing Cost, Complexity, and Fault Tolerance

Why long time storage does not equate to archive

SNIA Cloud Storage PRESENTATION TITLE GOES HERE

Consolidated Clinical Document Architecture and its Meaningful Use Nick Mahurin, chief executive officer, InfraWare, Terre Haute, Ind.

Search and Real-Time Analytics on Big Data

IBM Solution Framework for Lifecycle Management of Research Data IBM Corporation

HIPAA CRITICAL AREAS TECHNICAL SECURITY FOCUS FOR CLOUD DEPLOYMENT

Privacy and Security Policies for Healthcare Solutions on the Cloud

Interagency Science Working Group. National Archives and Records Administration

DEFINING THE RIGH DATA PROTECTION STRATEGY

Databases in Organizations

08/10/2013. Data protection and compliance. Agenda. Data protection life cycle and goals. Introduction. Data protection overview

LTFS for Microsoft Windows User Guide

Considerations for Management of Laboratory Data

Mobile Protection. Driving Productivity Without Compromising Protection. Brian Duckering. Mobile Trend Marketing

1. Understanding Big Data

Storage Clouds. Enterprise Architecture and the Cloud. Author and Presenter: Marty Stogsdill, Oracle

Achieving Cost-Effective, Vendor-Neutral Archiving For Your Enterprise

DIRECTOR GENERAL OF THE LITHUANIAN ARCHIVES DEPARTMENT UNDER THE GOVERNMENT OF THE REPUBLIC OF LITHUANIA

Magnetic tape storage advances and the growth of archival data

medical devices begin to drift into cloud melisa bockrath

New LTO Linear Tape File System Helps Make Video Storage Easier

THE PRESIDENT S NATIONAL SECURITY TELECOMMUNICATIONS ADVISORY COMMITTEE

C. All responses should reflect an inquiry into actual employee practices, and not just the organization s policies.

How To Understand Data Theory

Transcription:

Strategies and New Technology for Long Term Preservation of Big Data Presenter: Sam Fineberg, HP Storage Co-Authors: Simona Rabinovici-Cohen, IBM Research Haifa Roger Cummings, Antesignanus Mary Baker, HP Labs

Outline Introduction What makes digital preservation of big data so hard? SNIA Long Term Retention technology for big data Self-contained Information Retention Format (SIRF) SIRF serializations for cloud and tape SIRF status and summary 2

Big Data.. Really is BIG.. 2.5 quintillion (10 18 ) bytes of new data created per day in 2012 (source IBM) And the move to the Internet of Things is only going to increase this volume 19.8 Billion connected devices by 2020 (source McKinsey) Only 4.2 billion smartphones and tablets, 3.4 billion PCs Data analytics is improving all the time Therefore historical information has significant value Apply new techniques and algorithms to gain new insights Need to ensure ALL necessary information is captured to extract full value If storing big data is difficult today, how do we preserve it? Over the short term? Over the long term? 3

SNIA Survey from 2007 Top External Factors Driving Long-Term Retention Requirements: Legal Risk, Compliance Regulations, Business Risk, Security Risk Preservation of business history Protection of customer privacy Protection of business or intellectual assets Retaining history for competitiveness or protection Protection from compliance or legal fines Meeting regulatory requirements Meeting regulatory requirements Concern with ligitation protection Business Risk Legal Risk Other Business Risk Security Risk Compliance Requirements Compliance Requirements Legal Risk Security Risk 0% 10% 20% 30% 40% 50% 60% Percent of Respondents Source: SNIA-100 Year Archive Requirements Survey, January 2007. >100 Years >50-100 Years >21-50 Years >11-20 Years >7-10 Years 18.3% 38.8% 13.1% 15.7% 12.3% What does Long-Term Mean? Retention of 20 years or more is required by 70% of responses. >3-6 Years 1.9% 0.0% 5.0% 10.0% 15.0% 20.0% 25.0% 30.0% 35.0% 40.0% 4

The Need for Digital Preservation of Big Data Domains that have Big Data require preservation Regulatory compliance and legal issues Sarbanes-Oxley, HIPAA, FRCP, intellectual property litigation Emerging web services and applications Email, photo sharing, web site archives, social networks, blogs Many other fixed-content repositories Scientific data, intelligence, libraries, movies, music Satellite data is kept for ever We would like to keep digital art for ever Scientific and Cultural Healthcare X-rays are often stored for periods of 75 years Records of minors are needed until 20 to 43 years of age Film Masters, Out takes. Related artifacts (e.g., games). 100 Years or more M&E 5

Outline Introduction What makes digital preservation of big data so hard? SNIA Long Term Retention technology for big data Self-contained Information Retention Format (SIRF) SIRF serializations for cloud and tape SIRF status and summary 6

What makes digital preservation so hard? Data volume Technology issues Human issues Digital content is generated at a much faster rate than analog Big Data (volume, velocity, variety) makes this even harder Media obsolescence Media degradation/failure Format obsolescence Loss of context/metadata Large-scale disaster Human error Economic faults Attack Organizational faults Large scale & long time periods Even improbable events will have an effect 7

Real Life Example Problem 2003 2007 To: roger.cummings@veritas.com From: fred@nowhere.com Subject: Something or other To: roger_cummings@symantec.com From: sue@somewhere.com Subject: Something else Same people?? Could you PROVE it 20 years on? To: gary.phillips@veritas.com From: fred@nowhere.com Subject: Something or other To: gary_phillips@symantec.com From: sue@somewhere.com Subject: Something else 8

Goals of Digital Preservation Digital assets stored now should remain Accessible Undamaged Usable For as long as desired beyond the lifetime of Any particular storage system Any particular storage technology And at an affordable cost 9

Preservation Practices must vary by time Can t predict what will change only know it will This means processes are key Must be evolvable Current processes get us to the next step At that point we will likely need new processes to take over Must not destroy what we are trying to protect Standards make evolution easier A good archive is almost always in motion Digital preservation is not a static activity! 10

Practices must vary by context What do we preserve? Raw data? Analytical results? Context? etc. Depends on customer needs and economics Can we afford to keep the raw data? Is it ever practical to re-analyze? Can t always predict the eventual use What insights will the data provide in 1 year, 5 years, 20 years, etc? Trade off, based on predicted business value Many possible preservation techniques and strategies, which could vary over time 11

Outline Introduction What makes digital preservation of big data so hard? SNIA Long Term Retention technology for big data Self-contained Information Retention Format (SIRF) SIRF serializations for cloud and tape SIRF status and summary 12

SIRF: Self-contained Information Retention Format Being developed by SNIA Long Term Retention (LTR) TWG Photo courtesy Oregon State Archives An Analogy Standard physical archival box Archivists gather together a group of related items and place them in a physical box container The box is labeled with information about its content e.g., name and reference number, date, contents description, destroy date SIRF is the digital equivalent Logical container for a set of (digital) preservation objects and a catalog The SIRF catalog contains metadata related to the entire contents of the container as well as to the individual objects SIRF standardizes the information in the catalog 13

SIRF Properties SIRF is a logical data format of a storage container appropriate for long term storage of digital information A storage container may comprise a logical or physical storage area considered as a unit. Examples: a file system, a tape, a block device, a stream device, an object store, a data bucket in cloud storage Required Properties Self-describing can be interpreted by different systems Self-contained all data needed for the interpretation is in the container Extensible so it can meet future needs 14

SIRF Components A SIRF container includes: Magic object: identifies SIRF container and its version Preservation objects that are immutable Catalog that is Updatable Contains metadata to make container and preservation objects portable into the future without external functions 15

Outline Introduction What makes digital preservation of big data so hard? SNIA Long Term Retention technology for big data Self-contained Information Retention Format (SIRF) SIRF serializations for cloud and tape SIRF status and summary 16

SIRF Serializations The TWG has developed serializations for CDMI and LTFS specifies how a CDMI container or LTFS Tape also becomes SIRF-compliant XML and JSON schemas for the SIRF catalog A SIRF-compliant CDMI container or LTFS Tape enables a future CDMI/LTFS client to understand containers created by today s CDMI/LTFS client The properties of the future client is unknown to us today understand means identify the preservation objects in the container, the packaging format of each object, its fixities values, etc. (as defined in the SIRF catalog) 17

SIRF Serialization for CDMI: Interface CDMI API can be used to access the various preservation objects and the catalog object in a SIRF-compliant CDMI container Example Assume we have a cloud container named "PatientContainer" that is SIRF-compliant the container has a catalog object each encounter is a preservation object each image is a preservation object cat PO Patient Container PO PO We can read the various preservation objects and the catalog object via CDMI REST API as follows: GET <root URI>/ PatientContainer>/sirfCatalog GET <root URI>/<PatientContainer>/encounterJan2001 GET <root URI>/<PatientContainer>/chestImage 18

SIRF Serialization for CDMI PatientContainer SIRF magic object: specification=1111 SIRF level = 1 Catalog object=sirfcatalog sirfcatalog { "encounterjan2001":[ "IDs": [{...},] "Fixity": [{...},] ] "chestimage":[ "IDs": [{...},] "Fixity": [{...},] ] } Simple Simple PO Simple PO PO Encounter Jan2001 Simple PO cestimage manifest cestimage manifest cestimage manifest chestimage cestimage manifest cestimage dicom1 cestimage dicom1 cestimage dicom1 cestimage dicom1 cestimage dicom1 chestimage dicom1 chestimage dicom1 dicom2 Composite PO 19

SIRF Serialization for LTFS File Mark IP Label Construct. SIRF catalog LTFS index SIRF magic object: specification=1111 SIRF level = 1 Catalog object=sirfcatalog DP Label Construct. Preservation Object. Preservation Object The index partition of the tape is 2 wraps which is 37.5 GB in LTO-5 and probably larger in LTO-6. The tape index partition is large enough to hold the LTFS index, the SIRF catalog, and even additional information e.g. thumbnails of images 20

SIRF Serialization for LTFS: General A LTFS Tape can also be a SIRF Container when: The SIRF magic object is mapped to extended attributes of the LTFS index root directory The magic object includes, for example, specification ID and version, SIRF level, reference to SIRF catalog The SIRF catalog resides in the index partition and formatted in XML A SIRF preservation object (PO) that is a simple object (contains one element) is mapped to a LTFS file A SIRF PO that is a composite object (contains several elements) is mapped to: a set of LTFS files (one for each element) and a manifest file that its content includes the IDs and fixities of the element data objects 21

Outline Introduction What makes digital preservation of big data so hard? SNIA Long Term Retention technology for big data Self-contained Information Retention Format (SIRF) SIRF serializations for cloud and tape SIRF status and summary 22

SIRF Status More information is available at snia.org/ltr A SIRF draft is in development, and will be available for public review soon A reference implementation of SIRF is in development by the TWG 23

Summary Long Term Retention is hard Volume, Velocity and Variety make it even harder Need to retain not only information of immediate interest but ALL other information needed to make it fully usable in future SIRF Helps, providing a digital box for data and metadata Includes self describing metadata to help understand the contents of the container in the future No single technology will be usable over the timespans mandated by current digital preservation needs SNIA CDMI and LTFS technologies are among best current choices Are good for perhaps 10-20 years SIRF provides a vehicle for collecting all of the information that will be needed to transition to new technologies in the future 24