EMC Data de-duplication not ONLY for IBM i
Maciej Mianowski, EMC BRS Advisory TC
May 2011
EMC is a TECHNOLOGY company
EMC's focus is IT Infrastructure
EMC Portfolio
[Timeline of acquisitions, 2003 to 2009/2010, grouped by area]
- Information Security: Authentica, Network Intelligence, RSA, Valyd, Tablus, Verid, Archer
- Content Management: Documentum, Ask Once, Acartus, Captiva, ProActivity, Document Sciences, X-Hive, Kazeon
- Virtualization / Data Mobility: VMware, Rainfinity, Akimbi, FastScale
- Services: Dolphin, Interlink, Internosis, BusinessEdge, Geniant, Conchango
- Resource Management: Astrum, Smarts, nlayers, Voyence, Infra, ConfigureSoft
- Availability / Archiving: Legato, Avamar, Kashya, Illuminator, Indigo Stone, WysDM, Data Domain, Bus-Tech
- Cloud Infrastructure and Services: Pi, Mozy
- Consumer / Small Business: Dantz, Iomega
- Data Warehouse / Big Data: Greenplum, Isilon
Having Great Technology is Not Enough
[Diagram labels: Customer, EMC TC]
Backup System Infrastructure
Every backup environment has a bottleneck. It may be a VERY FAST bottleneck, but it will determine the maximum throughput obtainable with your system. Your backup system will be as fast as the SLOWEST link in the backup chain.
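The slowest-link rule can be written as a one-line minimum. The sketch below is only an illustration: the stage names and MB/s figures are hypothetical and do not come from any measured system.

```python
# Hypothetical backup chain: stage name -> sustained throughput in MB/s.
# None of these figures are measurements; they only illustrate the rule.
def bottleneck(stages):
    """Return (name, throughput) of the slowest stage in the chain."""
    name = min(stages, key=stages.get)
    return name, stages[name]

chain = {
    "client disk read": 400.0,
    "LAN": 110.0,
    "backup server": 600.0,
    "backup target": 250.0,
}

slowest, rate = bottleneck(chain)
print(f"Bottleneck: {slowest} at {rate} MB/s")
```

Upgrading any stage other than the slowest one leaves the end-to-end throughput unchanged.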
What is deduplication?
Data deduplication (often called "intelligent compression") is a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is actually retained on storage media, such as disk or tape. Redundant data is replaced with a pointer to the unique data copy. Data deduplication can generally operate at the file, block, and even the bit level.
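A minimal sketch of block-level deduplication follows, assuming fixed-size blocks and SHA-256 fingerprints. Both choices, and the `DedupStore` class itself, are illustrative assumptions and not a description of any vendor's implementation. Each unique block is kept once, and every backup is stored as a list of pointers (fingerprints) to those blocks.

```python
import hashlib

class DedupStore:
    """Toy block-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> the single retained copy of a block

    def write(self, data):
        """Store data; return its recipe, the ordered list of fingerprints."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            fp = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fp, block)  # redundant blocks are not stored again
            recipe.append(fp)                  # the pointer replaces the data
        return recipe

    def read(self, recipe):
        """Rebuild the original data by following the pointers."""
        return b"".join(self.blocks[fp] for fp in recipe)

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())

store = DedupStore(block_size=4)     # tiny blocks so the effect is visible
monday = b"AAAABBBBCCCC"
tuesday = b"AAAABBBBDDDD"            # only the last block changed
r1 = store.write(monday)
r2 = store.write(tuesday)
print(len(monday) + len(tuesday), "bytes written,", store.stored_bytes(), "bytes stored")
```

Two 12-byte backups are written, but only four unique 4-byte blocks are retained, and both backups can still be read back in full.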
Do you still need a tape?
EMC Data Domain Mission
Make this [image not preserved] look like this [image not preserved]
Customer Example: 20x Footprint Reduction
- One DD System
- 180TB stored
- 8TB of disk used
- 20x Reduction
- Replicated off-site
Chart legend:
- Red Line = Amount of data written to Data Domain (virtual storage)
- Green Line = Disk Space Consumed (physical storage)
- Blue Line = Cumulative Compression Effect
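The reduction factor is simply data written divided by disk consumed. The short sketch below applies that to the two figures on the slide; the function name is an illustrative choice. With 180TB and 8TB the quotient is 22.5, so the headline 20x is presumably a rounded or earlier figure.

```python
def dedup_ratio(logical_tb, physical_tb):
    """Data written (virtual storage) divided by disk consumed (physical storage)."""
    if physical_tb <= 0:
        raise ValueError("physical_tb must be positive")
    return logical_tb / physical_tb

ratio = dedup_ratio(180, 8)  # figures from the customer example
print(f"{ratio:.1f}x reduction")
```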
What are the reasons you do not have deduplication yet?
- Cost (disk is more expensive than tape)
- Flexibility (only SAN support)
- Performance (disk I/O bottleneck)
- Data Safety & Reliability (only one copy on disk)
Is it true?
Many vendors offer the same. Is it true?
What needs to be compared?
- Speed (backup/restore)
- Flexibility (various protocols)
- Scalability (upgrade options)
- Supportability (IBM i, Open Systems, Mainframe)
- Simplicity (management, maintenance)
- Size (space, de-dupe ratio)
- Efficient replication (bandwidth reduction)
- Data Safety (the most important), Encryption
- Support (good people, local people)
- Cost (compare all costs)
Please prove it. References!!!
Why EMC Data Domain for IBM i?
Each feature is followed by its benefits.
Feature: Integrate with Ease and Flexibility
- Data Domain presents IBM 3584-L32 tape library/libraries and LTO3 (3580-TD3) drives via Fibre Channel to IBM i hosts
Feature: Supportability
- BRMS and IBM i native commands support
Feature: Retain More Backups with De-Duplication
- As IBM i data is compressible, it is in the de-duplication wheelhouse
- Store weeks of full backups on disk in a minimal footprint for rapid database restores
Feature: Recover Data Reliably and Efficiently
- De-duplication with replication drives WAN-efficient disaster recovery
- EMC's Data Domain Data Invulnerability Architecture (DIA) ensures reliable recovery
- DIA should resonate well with IBM i customers
Feature: Improve Performance
- Unrivaled 8+ TB/hr aggregate and 1+ TB/hr single-stream, inline deduplication
- Single-stream throughput capabilities are important to understand when considering DB2 backups
- Data Domain allows for greater parallelization of backups and restores
Feature: Simplify Infrastructure
- Dedicate a virtual resource to each application
- Each LPAR can have dedicated virtual drives
- Simplicity and ease of use are something IBM i customers demand
Deduplication Statistics for IBM i
Outliers apply.
In general, the same data de-duplication concepts apply to IBM i environments as to any other environment. The following de-dupe ratios were observed during the test process:
- Banking: 21
- Retail: 24
- Shipping: 52
- Manufacturing: 22
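A de-dupe ratio of N means the data occupies 1/N of its logical size, so the disk saved is 1 - 1/N. The sketch below applies that to the four ratios from the slide. The 100 TB logical volume is a hypothetical figure chosen only to make the output easy to read.

```python
# De-dupe ratios as listed on the slide.
RATIOS = {"Banking": 21, "Retail": 24, "Shipping": 52, "Manufacturing": 22}

def space_savings(ratio):
    """Fraction of logical data that consumes no physical disk."""
    return 1.0 - 1.0 / ratio

def physical_tb(logical_tb, ratio):
    """Physical disk needed to hold logical_tb at the given ratio."""
    return logical_tb / ratio

LOGICAL_TB = 100  # hypothetical amount of backup data, not from the slide
for industry, ratio in RATIOS.items():
    print(f"{industry}: {ratio}x -> {space_savings(ratio):.1%} saved, "
          f"{physical_tb(LOGICAL_TB, ratio):.1f} TB on disk per {LOGICAL_TB} TB written")
```

Because savings grow as 1 - 1/N, going from 21x to 52x moves the saving only from about 95% to about 98%; most of the benefit is already captured at the lower ratios.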
EMC Data Domain Technical Architecture
Performance: CPU-Centric versus Spindle-Bound
[Chart: throughput (MB/s) against number of disk spindles; labels: Data Domain, Fibre Channel, SATA, most deduplication vendors]
Price / Performance: CPU-centric wins over time
Source: http://seagate.com/docs/pdf/whitepaper/economies_capacity_spd_tp.pdf
- Improve price / performance along with CPUs
- Keep price competitive with tape automation
Alternative: speed through spindle count
- Huge amounts of wasted disk space
- More to manage, more to buy
Data Domain Data Flow
[Diagram: data arrives over IP/FC from mainframe gateways, appliance-based disk systems, LAN-based clients, NDMP storage, and SAN-attached clients. The incoming stream is shown as segments of varying size (8KB, 12KB, 6KB, 10KB, 8KB, 11KB, ...), each with a generated pattern/hash. SISL recognizes new blocks in memory against the hash table of previously stored versions. The output is compressed, de-duplicated data.]
Data Domain Core Focus
- Deduplication Storage: SISL (Stream Informed Segment Layout), for speed
- Data Integrity: DIA (Data Invulnerability Architecture), for data safety & reliability
Stream Informed Segment Layout (SISL)
SISL is a collection of techniques that speeds up the identification of duplicate segments inline, in real time, as the data is being received.
- SISL Summary Vector: memory-based structure to help quickly identify new segments
- Segment Locality: data layout to maximize the probability of locating duplicates
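A Bloom filter is one well-known memory-based structure that can say "this segment is definitely new" without touching disk. The slide does not describe the Summary Vector's internals, so the sketch below is an assumption-laden illustration of that general idea, with a hypothetical class name, bit count, and hash scheme.

```python
import hashlib

class SummaryVector:
    """Toy Bloom filter (illustrative assumption, not the product's design)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits          # must be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint):
        # Derive several bit positions by salting one hash with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint):
        """False means definitely new; True means possibly stored already."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))

vector = SummaryVector()
segment_fp = hashlib.sha256(b"segment payload").digest()
print(vector.might_contain(segment_fp))   # nothing added yet
vector.add(segment_fp)
print(vector.might_contain(segment_fp))   # now reported as possibly seen
```

A "possibly seen" answer can be a false positive, so it still has to be confirmed against the real index; a "definitely new" answer never needs that lookup, which is where the speed-up comes from.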
Data Invulnerability Architecture (DIA)
Four key elements of the Data Domain Data Invulnerability Architecture:
- End-to-end verification
- Fault avoidance and containment
- Continuous fault detection and healing
- File system recoverability
Other: RAID 6, NVRAM, Snapshots
Data Domain Basics: Easy Integration with Existing Environments
[Diagram: Control Tier (backup & archive applications), Target Tier (DD890 Appliance), and DR Tier (DD890 Appliance), with replication over the WAN]
Backup over LAN:
1. CIFS
2. NFS
3. NDMP
4. OST
5. DD Boost
Backup over SAN:
6. VTL
- 10 and 1 Gb Ethernet; 4 and 8 Gb Fibre Channel
- Up to 285 TB usable capacity with disk shelves
- Deduplicating file system
- Deduplicated IP-based encrypted replication
Industry's Most Scalable Inline Deduplication Systems
Product lines: DD800 Appliance Series, Global Deduplication Array, DD Archiver, DD600 Appliance Series, DD140 Remote Office Appliance
Software options: DD Boost, DD Virtual Tape Library, DD Replicator, DD Retention Lock, and DD Encryption

Model                       Speed (DD Boost)  Speed (other)  Logical capacity  Raw capacity   Usable capacity
DD140                       490 GB/hr         450 GB/hr      9-43 TB           1.5 TB         0.86 TB
DD610                       1.3 TB/hr         675 GB/hr      40-195 TB         Up to 6 TB     Up to 3.98 TB
DD630                       2.1 TB/hr         1.1 TB/hr      84-420 TB         Up to 12 TB    Up to 8.4 TB
DD670                       5.4 TB/hr         3.6 TB/hr      0.6-2.7 PB        Up to 76 TB    Up to 55.9 TB
DD860                       9.8 TB/hr         5.1 TB/hr      1.4-7.1 PB        Up to 192 TB   Up to 142 TB
DD890                       14.7 TB/hr        8.1 TB/hr      2.9-14.2 PB       Up to 384 TB   Up to 285 TB
Global Deduplication Array  26.3 TB/hr        10.7 TB/hr     5.7-28.5 PB       Up to 768 TB   Up to 570 TB
DD Archiver                 9.8 TB/hr         4.3 TB/hr      5.7-28.5 PB       Up to 768 TB   Up to 570 TB
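The rated speeds translate into a best-case backup window by simple division. The sketch below uses the DD670 and DD890 figures from this slide; the 20 TB full backup is a hypothetical size, and the function name is an illustrative choice. Real windows will be longer whenever another link in the backup chain is slower than the target.

```python
# Rated maximum throughput in TB/hr, copied from the slide for two models.
MAX_TB_PER_HR = {
    "DD670": {"boost": 5.4, "other": 3.6},
    "DD890": {"boost": 14.7, "other": 8.1},
}

def backup_hours(data_tb, model, mode="other"):
    """Best-case hours to ingest data_tb at the model's rated maximum."""
    return data_tb / MAX_TB_PER_HR[model][mode]

FULL_BACKUP_TB = 20  # hypothetical full backup size, not from the slide
for model in MAX_TB_PER_HR:
    for mode in ("boost", "other"):
        hours = backup_hours(FULL_BACKUP_TB, model, mode)
        print(f"{model} ({mode}): {hours:.1f} h for {FULL_BACKUP_TB} TB")
```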
Snapshot Data Protection
- Up to 750 snaps
- 30 Day Retention - Daily
- 1 Year Retention - Weekly
- Fastcopy Recovery
[Diagram: replication over WAN]
Replication Topologies
[Diagram: source and destination systems, by data access method and replication type]
- Collection Replication: entire collection
- Directory Replication: CIFS/NFS and VTL access; directories and VTL pools
- Optimized Duplication: BOOST access; BOOST backup images
Multi-Site Protection for Remote Office
[Diagram: Data Domain systems at remote sites replicate over the WAN to a Data Domain system at the data center hub; each remote link is labeled 1-5%. Data labels: Home, DIR, DB, Archive Data, Backup Data]
- Low Bandwidth Optimization using Delta Compression
- Enhanced Data Reduction for Small Sites
Multi-Site Data Protection
[Diagram: cascaded replication from remote sites (London, Tokyo) over the WAN to Protection Site #1 and on to Protection Site #2; link labels: Directory, Directory, Collection]
DD Encryption Software
Industry's first encryption of deduplicated data at rest
- Inline: deduplication and encryption before storing (Deduplication + Encryption)
- Protects against loss of disk or system
- Inline encryption provides immediate protection while preserving deduplication
- Works with all protocols and applications
- Uses RSA BSAFE FIPS 140-2 validated cryptographic libraries
- Replicate encrypted data
- Security officer role for dual authentication: requires one admin user and one security officer role user for lock, passphrase, and disable functions
DD Boost Software
- DD Boost distributes parts of the deduplication process to the backup server
- Supports the majority of the backup software market: Symantec NetBackup and Backup Exec, EMC NetWorker
- Speeds backups by up to 50%
- Process more backups with existing resources: 20-40% less overall impact to backup server, 80-99% less LAN bandwidth
- Enables Data Domain replication management from the backup application
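The LAN saving follows from doing the fingerprinting on the backup server and sending only the segments the target does not already hold. The sketch below shows that general idea. It is not the DD Boost API: the `Target` class, the `backup` function, the fixed 4-byte segments, and SHA-256 are all hypothetical choices made for illustration.

```python
import hashlib

SEGMENT = 4  # tiny fixed segment size so the example is easy to follow

class Target:
    """Stand-in for the deduplication target (hypothetical interface)."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes

    def missing(self, fingerprints):
        """Return the fingerprints the target has not stored yet."""
        return [fp for fp in fingerprints if fp not in self.segments]

    def store(self, new_segments):
        self.segments.update(new_segments)

def backup(data, target):
    """Run on the backup server: send only new segments; return payload bytes sent."""
    chunks = [data[i:i + SEGMENT] for i in range(0, len(data), SEGMENT)]
    by_fp = {hashlib.sha256(c).hexdigest(): c for c in chunks}
    needed = target.missing(list(by_fp))          # small fingerprint exchange
    target.store({fp: by_fp[fp] for fp in needed})
    return sum(len(by_fp[fp]) for fp in needed)   # segment payload sent over the LAN

target = Target()
first = backup(b"AAAABBBBCCCC", target)    # everything is new
second = backup(b"AAAABBBBDDDD", target)   # only the last segment changed
print(first, "bytes sent, then", second, "bytes sent")
```

In this toy run the second backup sends 4 of its 12 bytes; the fingerprint exchange itself also costs some bandwidth, which the sketch does not count.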
Data Domain Archiver
Long-Term Retention of Backups, Emails, Projects, Files
[Diagram: backups flow to the Data Domain Controller; Active Tier (90 days), Archive Tier (7 years +)]
EMC Deduplication Makes it Better
- Faster
- Greater Scalability
- More Efficient
- Proven Reliability