How to Cost Effectively Retain Reference Data for Analytics and Big Data Molly Rector, EVP Product Management & WW Marketing, Spectra Logic
SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations and literature under the following conditions: Any slide or slides used must be reproduced in their entirety without modification The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations. This presentation is a project of the SNIA Education Committee. Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney. The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK. 2
Abstract Check out SNIA Tutorial: Enter Tutorial Name when finalized How to Cost Effectively Retain Reference Data for Analytics and Big Data The need to store critical digital content at the 100's of terabytes and into the petabyte level is quickly moving outside the capabilities of traditional storage solutions. In addition, the complexity of many of today's storage solutions can prohibit companies from migrating to a solution that best fits their needs. Many companies are turning to active archives, an approach that is quickly gaining traction for companies that regularly manage high-volume reference data or face exponential data growth associated with Data Analytics and Big Data. Software, tape and disk vendors alike are combining the best of these technologies to provide more efficient and functional solutions. And, according to InterSect360, 35% of Big Data users are already using tape to help access and retain the data necessary for long-term analytics. 3
Two Primary Active Archive Principles There are exceptions to every rule Agentless except for.. There is no black or white There is only gray and lot s of overlap between technologies
Active Archive Out of Band Conceptual Design Primary Storage Z: DRIVE NFS / CIFS File Archive Mgmt Software Data Management Software NFS / CIFS Data Migrator Software Vaulting Software Data Mover &/or Archive Products Tape Secondary Heterogeneous Storage Remote Data Center Data Management Framework Remote Data Center/ Cloud Provider
Active Archive In Band Conceptual Design Distributed File Systems Application Specific Workflow Manager NFS / CIFS (NAS) FCP/iSCSI (SAN) Server Clients Typically Required Data Management Software Primary Storage Secondary Storage Tape Data Management Framework
Active Archive File Based In Band Conceptual Design NFS / CIFS Virtual Abstraction Global Name Space Policy Driven Primary Storage Secondary Heterogeneous Storage Tape Remote Data Center Data Management Framework Remote Data Center/ Cloud Provider
Active Archive File Based Hybrid Conceptual Design Primary Storage NFS / CIFS Virtual Abstraction Global Name Space Policy Driven Primary Storage Secondary Heterogeneous Storage Tape Remote Data Center Data Management Framework Remote Data Center/ Cloud Provider
Active Archive File Based Hybrid Conceptual Design Primary Storage NFS / CIFS Vol 1 NFS / CIFS Virtual Abstraction Global Name Space Policy Driven Primary Storage Vol 2 Vol 3 Secondary Heterogeneous Storage Tape Remote Data Center Data Management Framework Remote Data Center/Publi c Cloud
What do we mean by Data Mover? P Data Source Application Server S Data Mover/s
What is Big Data? Applications Requiring Intensive Data Mining and Analytics Financial Institutions Tic Data Analysis Mortgage Data Trend Analysis Check Images Risk Assessment Cross Domain Correlations Brokerage Data Health Care Drug Efficacy Disease Pattern recognition & progression Fraud detection Claim automation Medical Records Digital Imaging Genomic Government (FBI,CIA,DOD,DOE,HLS,IRS) Internet threat detection Pattern recognition Image analysis Fraud and waste mitigation Consumer / Commercial Space Social Media and Product Sentiment correlation Consumer Analysis Advertising Affinity correlation Telemetry and Quality analysis Video Surveillance Biometrics Research Data 0 Digital Media Videos, Music, Books, Photos Seismic/Oil & Gas
Data Storage Initiatives Big Data Storage: Focus is on Performance 1 st Accessibility 2 nd Data Collection Data Processing Data Storage Data Movement Data Sharing, Manipulation, Mining and Analytics Aligns with Workflow and Application Specific Active Archive Technologies Just Long Term Storage : Focus on Scalability 1 st Accessibility 2 nd Low Cost Repository Multi-Purpose Regulatory or Business Driver Requirements Little to NO manipulation of the data required Data must be accessible if ever needed Metadata is helpful for search Aligns with General Purpose Active Archive Technologies
Technology Categories General Purpose - Agentless Out of Band Approach General Purpose Agents/Clients Required Out of Band Approach Work Flow and Application Specific - Agents/Clients Required In Band Approach
General Purpose - Agentless Centralized Archive Repository - for General Unstructured Data Agentless approach - CIFS/NFS Gateway Out of Band approach (Primary Tier Not Managed) Not tied to specific data movers platform independent Use in Legacy environments with existing storage challenges Use in New environments that are diverse in nature not tied to performance intensive applications or workflows Use in multi-tenant / multi-need/application environments
General Purpose - Agentless Example: Use when the environment just wants a dumping ground to store inactive unstructured data for various periods of time and is looking for some confidence that it will be accessible Example: Use when the environment wants selfservice transparent access to the archived data repository Example: Use when servers and hosts are diverse and well established with an inability to install agents Example: Use when the environment is trying to support a multi-data mover
General Purpose Agents/Clients Required Centralized Archive Repository - Focus on Unstructured Data and also specific Agent Supported Applications (Structured Data) Agents must be installed on each host servers and/or workstations Out of Band approach (Primary Tier Not Managed) Data Mover is incorporated into the technology platform dependent Archive Access must be through STUB or GUI no direct CIFS/NFS access Use in Legacy environments with existing data protection and/ or storage challenges if you can install agents Use in New environments that will require data protection and/ or are diverse in nature and not tied to performance intensive applications or workflows Use in multi-tenant / multi-need/application environments
Workflow and Application Specific - Agents/Clients Required High Performance - File System Evolution SAN Focused Shareable File Systems which - Enable Shared Access to Block Based Storage LUNS by multiple application servers Similar to what Network Attached Storage provide natively Provide Shared Storage/Data Access for Applications supporting Block Based protocols (FCP, iscsi etc.) Minimized and/or eliminates native OS File System limitations Must own Primary Storage Tier apply proprietary file system Must own Secondary Storage Tiers/Tape apply proprietary file system Application Server agent/clients required to be able to access proprietary file system independent of OS Proprietary file system enables data movement between storage tiers and/or tape Agents/Clients - NAS Gateway capabilities now added to allow NAS protocol applications to also leverage proprietary file system enhanced functionality
Work Flow and Application Specific Agents/Clients Required Archive Repository Focus for Specialized Unstructured Data associated with a specific workflow or application In Band approach (Primary Tier Is Managed) Present a complete virtualized storage pool across ALL storage tiers Host Agents required to facilitate integration with all storage tiers and proprietary file system Installed on specific workstations and application servers Optimizes performance to and from virtualized storage pool Facilitates integration with specific applications and workflows Optimize data movement and copies between all storage tiers Proprietary Data Mover Removes the Need for Stubs or Pointers Not recommended for use with Legacy environments due to need to incorporate primary storage and agent requirements Use with NEW environments that are designed for a specific application or workflow and where performance is key
Work Flow and Application Specific Agents/Clients Required Examples: Use for New Applications when the Entire Workflow must be managed including: Data Collection Data Processing Data Storage Data Manipulation, Mining, Analytics Examples: Use for Performance/Data Flow Applications Broadcast, Cable & CCTV Movie and Television Production, Post-Production & Graphics Science and Engineering (Oil & Gas research, Satellite imaging) High Performance Computing (Life Sciences- genome sequencing) Audio and Video Surveillance and Mining Government Data Mining projects, Internet Streaming, Pre- Press, Digital Printing, CAD/CAM
Active Archive Check Box Differentiators Unstructured Structured Out of Band In Band Agents Required Agentless Data Mover Included Tape Supported LTFS Supported Copy Management
Active Archive Check Box Differentiators Version Management Varied Retention Across Tiers Spinning Disk Storage Tiering Multi-Site Support with Replication Cloud Integration Single Name Space H/A Strategy Multi-Tenancy Data Profiling Content Classification
Active Archive Check Box Differentiators Custom Metadata Search Legal Hold Single Instancing Deduplication Compression Auto Data Deletion (tape consolidation?) Data Shredding Scale Out Architecture
Active Archive Check Box Differentiators Digital Fingerprint/Self Healing Catalog Rebuild Capability Partial File Retrieval Reporting Capabilities 3 rd Party Application Integration
General Purpose Out of Band / Agentless Comparison Attribute File Manager Archive Manager Data Mover LTFS Appliance LTFS YES YES NO YES Active/Active H/A YES NO - Active /Standby NO - Active/Standby or Need MS Cluster NO - Active/Standby Data Mover Included NO NO NO NO Spinning Disk Storage Tiering YES Only for short retention YES Cache only Auto Data Deletion YES NO YES NO Copy/Version Mgmt. YES user plug-in s (optional) YES 3 Copies on any Tier Can t modify Files must save as new Files YES - Dual Copy on Tape Custom Metadata YES NO NO NO Search YES GUI NO NO NO Scale Out YES NO NO NO
Active Archive Landscape Just Long Term Storage General Purpose Self Service Focus Agentless Easier to Deploy Open Data Mover Platform New or Existing Environment Big Data Workflow and Application Specific Performance Focus Data Sharing Data Mining/Analysis New Application Just Long Term Storage General Purpose All In One Data Mover Included Agents Required Closed Platform Some WorkFlow/App Integration New Better Existing OK Shades of Gray
Approach to Selecting the Right Technology Identify the goals of the environment: What type of data will be archived? Is this a long term general purpose storage repository? Is this for a specific application or work flow? Is this for data mining and analytics? Is this for a specific business unit or department? Is it a new or legacy environment?
Data Storage Initiatives Big Data Storage: Focus is on Performance 1 st Accessibility 2 nd Data Collection Data Processing Data Storage Data Movement Data Sharing, Manipulation, Mining and Analytics Aligns with Workflow and Application Specific Active Archive Technologies Just Long Term Storage : Focus on Scalability 1 st Accessibility 2 nd Low Cost Repository Multi-Purpose Regulatory or Business Driver Requirements Little to NO manipulation of the data required Data must be accessible if ever needed Metadata is helpful for search Aligns with General Purpose Active Archive Technologies
Attribution & Feedback The SNIA Education Committee would like to thank the following individuals for their contributions to this Tutorial. Authorship History Name/Date of Original Author here: Molly Rector 08/15/2012 Additional Contributors Stacy Schwartz-Gardner Please send any questions or comments regarding this SNIA Tutorial to tracktutorials@snia.org 28
Refer to Other Tutorials Use this icon to refer to other SNIA Tutorials where appropriate Check out SNIA Tutorial: Enter Tutorial Title Here Use the Hands-On Lab icon only if you ve been notified that your tutorial is a cross-referenced match to the Hands-On Lab 29