A Best Practice Guide to Archiving Persistent Data: How archiving is a vital tool as part of a data centre cost savings exercise

NOTICE
This White Paper may contain proprietary information protected by copyright. Information in this White Paper is subject to change without notice and does not represent a commitment on the part of Quantum. Although sources deemed reliable were used, Quantum assumes no liability for any inaccuracies that may be contained in this White Paper. Quantum makes no commitment to update or keep current the information in this White Paper, and reserves the right to make changes to or discontinue this White Paper and/or products without notice. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or information storage and retrieval systems, for any purpose other than the purchaser's personal use, without the express written permission of Quantum.
CONTENTS
Introduction
Data and data types
Value of data
Tiered storage
Operational cost
Data management software
The role of tape
Disaster recovery
Conclusion
Today's enterprises are experiencing greater storage growth than ever before. The growth comes from structured data in enterprise databases and unstructured data from a variety of applications. Wherever it comes from, it must be preserved for business continuity, to satisfy data retention laws, and to meet compliance requirements. The data centre needs to reduce the total cost of ownership (TCO) of its backup and archiving infrastructure: it must contain costs, manage data growth, and make the backup/archive process more efficient. Unstructured data is often the core data asset in an organisation's workflow and is inherent in revenue-generating operations. This white paper discusses how to manage unstructured data growth in the most cost-effective manner. It also discusses the distinct differences between backup and archive, and the best policy for each.

INTRODUCTION
When developing a strategy for managing unstructured data growth, there are many considerations to take into account, including the type of data and its value, the cost of data growth, the correct platform for the data to reside on, and data security. But, to start with, what is archiving, and what is the difference between archive and backup? To (over)simplify:

An archive is a copy of data retained in a safe and economical location for long periods of time but reused from time to time. It is a method of storing data in the most cost-effective location, with the performance of the systems the archive is stored on matched to the requirements of the application using the data.

A backup is a copy of the data that is only recovered when there is a failure or some form of corruption.
[Figure: asset value, cost of storage and frequency of access for high-performance primary storage versus tier 2 or 3 secondary storage; data on the wrong tier means paying too much.]

DATA TYPES
Not all data is created equal. Consider the difference between an online trading system processing multiple transactions in milliseconds and the company payroll that only has to issue funds once per month: data should be prioritised in terms of its currency, life cycle and value. Later in the document we will discuss value and give some real-world examples of how data is dynamic. In general, data can be categorised as live, persistent or backup, meaning that it is either in use, not in use but recallable at any point, or part of a copy of the primary data to be used only when the system needs to be restored after a failure.

One of the challenges is to understand and manage the data. Lots of extra copies of data are being made: a snapshot is taken of it, some internal applications back it up, and an enterprise backup application will probably make a copy of its own. This doesn't even count all the replication going on: applications are replicating, storage is replicating and backup devices are replicating. What is needed is an automated system that can remove duplicate data and archive just the data you need; the sketch at the end of this section illustrates the duplicate-detection step.

VALUE OF DATA
Your data is the lifeblood of your business and is probably your organisation's most critical asset. Independent IT analysts agree that companies that experience a complete data loss have only a 10% chance of surviving the following two years; 50% never trade again. Whether you are running a web-based retail site, an SQL database or even the staff payroll, you will benefit from the peace of mind that comes with knowing your business's important data is backed up and available for immediate access whenever you need it.

At a bare minimum you need a copy of all your data in a different place from the original. This gives you the ability to recover from data loss only if it is a real-time copy, or a near-real-time copy such as a snapshot; if you are attacked by a virus, both copies will be infected. What is needed is a series of point-in-time copies, so that in the event of data corruption you can roll back to a previous instance. You will need to store this historic data so it is accessible when needed, yet in most cases data more than 30 days old will not be touched, and it will take up valuable space on storage systems that are kept powered up in case you need them. Surely there is a better and more cost-effective solution?
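To make the duplicate-copy problem concrete, below is a minimal sketch of duplicate detection by content hash, the simplest form of the deduplication step mentioned above. It is illustrative only: the /data/projects path is a hypothetical example, and production deduplication products work at the block or sub-file level rather than on whole files as this does.

    # dedup_report.py: minimal sketch, finds duplicate files by content hash.
    # The scan root is an illustrative assumption, not part of any product.
    import hashlib
    import os
    from collections import defaultdict

    def file_digest(path, chunk_size=1 << 20):
        """Return the SHA-256 digest of a file, read in 1 MB chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(root):
        """Group files under 'root' by digest; return groups with more than one member."""
        groups = defaultdict(list)
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    groups[file_digest(path)].append(path)
                except OSError:
                    pass  # unreadable file: skip rather than abort the scan
        return {d: paths for d, paths in groups.items() if len(paths) > 1}

    if __name__ == "__main__":
        for digest, paths in find_duplicates("/data/projects").items():
            print(digest[:12], "->", len(paths), "copies:", ", ".join(paths))

A report like this shows how many redundant copies a snapshot-plus-backup-plus-replication regime leaves behind, which is the waste an automated archive policy is meant to eliminate.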
The solution is a combination of backup and archive. Backup is an inert copy that can be called upon to restore data to the primary system. Archive is a live file that can be accessed when needed but doesn't need to be on the expensive front-line system. Most data is passive, meaning that it is rarely accessed 30 to 90 days after its creation and even less after that. By taking frequent snapshots of your new data, archiving data older than 30 days, and backing that archive up, you get the most cost-effective use of resources.

IT projects are fast moving and dynamic, but in most cases they rely on reusable assets, e.g. historic data. For years, the way to handle data growth was simply to throw raw storage capacity at the problem. That approach no longer works, as organisations must deal not only with capacity challenges but also with the performance, management and running cost of the systems. Some examples of fast-moving, dynamic projects with reusable assets:

Media and Entertainment
In the film industry it is common practice to store raw and edited content on high-performance arrays while work is in progress. Once the project is completed, the content is placed in a working archive or a long-term archive, depending on the time it takes to create intermediaries, certain special effects and other content to develop the final cut. Preserving the source media containing the original content is extremely common, since it is difficult, if not impossible, to recreate; but this alone is insufficient, since raw material does not capture any edits or metadata generated during the processing of raw content into a finished product. As a result, raw content as well as final cuts must be archived. Further reading: http://bit.ly/bnlsdo

Life Sciences
DNA sequencing and the use of imaging technology are producing new volumes of data that must be analysed, stored and managed. Research centres need to access, share and manage hundreds of terabytes of DNA sequencing data for analysis at any time. Each new generation of sequencers, mass spectrometers, microscopes and other lab equipment produces a richer, more detailed set of data. While the data is part of a workflow it must be on the highest-performance systems, accessible to researchers for analysis and discovery. The data should then be archived on more cost-effective systems for additional review and retrieval, and backed up off site. Further reading: http://bit.ly/arqap4

Utilities/Oil & Gas
To increase oil and gas exploration, speeding up the processing of seismic data is vital. This involves massively powerful 3D processing software, fast high-capacity Ethernet networks and SAN-based storage. Daqing Oil Field Petroleum Exploration and Development Research Institute (EDRI) performs seismic data archival, retrieval, data protection and vaulting through a high-performance tape library. Based on parameters such as schedules, work areas, users and key processing criteria, its Geophysics Service Centre can migrate data from online RAID systems to tape, thereby releasing disk space for other jobs. When archived files are needed, they can be retrieved automatically from tape back to disk. Additionally, a clone of the final version of processed data can be replicated to the tape library to allow offsite vaulting and data protection for final data. Further reading: http://bit.ly/cj90mm
CERN: a Government Research Project
CERN, the European centre for nuclear research, recently built the Large Hadron Collider to allow scientists to analyse the structure of matter. The system generates approximately one gigabyte of new data per second, a rate that must be sustained day and night for at least one month of an experiment. This is the equivalent of more than a petabyte of data accumulated during the month. All these billions of bits of data generated every second are acquired by the A Large Ion Collider Experiment (ALICE) data acquisition system before being selected, transferred and stored in the main computer centre three kilometres away. This requires high-speed, shared workflow operations and large-scale, multi-tier archiving. Further reading: http://bit.ly/c2cls1

TIERED STORAGE
To be more energy efficient you need to match your various business requirements with the right data storage technology. In most cases this results in a multi-tier storage architecture that includes a mix of disk and tape hardware together with replication, deduplication, data management and archive software. As mentioned in the data types section at the start of the document, data can generally be categorised as live, persistent or backup: in use, not in use but recallable at any point, or part of a copy of the primary data to be used only when the system needs to be restored after a failure.

With this in mind you need to prioritise where the data resides, ensuring that your live data is on fast, high-performance systems, your persistent data is archived but easily accessible, and your backup data is not only on lower-cost systems but ideally powered down unless called upon and, if part of a disaster recovery strategy, copied to a different location. For example, you might set a policy that moves data that has not been accessed for 30 days to a secondary storage array and then archives it after 90 days (a sketch of such a policy follows this section). In this case fast primary storage is used for the live data, clustered SAN or NAS disk arrays for the secondary data, and tape libraries for the archive. The reason for this structure is to maintain the most cost-effective system: you could put all data on primary storage, but the capital expenditure for the hardware, the management time needed and the power usage would be excessive.

There is often a misconception that disk-based arrays are always faster than tape. If you want fast access to an individual file then disk is the correct choice, but if you need sustained access to multiple files, or need to restore files from a backup, then tape, used in conjunction with intelligent file management and archive software, will be your best choice.
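As referenced above, here is a minimal sketch of the 30/90-day tiering policy. The mount points for the three tiers are hypothetical, and the sketch uses plain file moves driven by last-access time; a real deployment would rely on hierarchical storage management software, such as the data management products discussed in this paper, rather than a script like this.

    # tier_by_age.py: minimal sketch of an age-based tiering policy.
    # Tier paths and thresholds are illustrative assumptions only.
    import os
    import shutil
    import time

    DAY = 86400
    PRIMARY = "/mnt/primary"      # fast disk (hypothetical mount points)
    SECONDARY = "/mnt/secondary"  # SATA array
    ARCHIVE = "/mnt/archive"      # staging area for the tape archive

    def days_since_access(path):
        return (time.time() - os.stat(path).st_atime) / DAY

    def apply_policy(root, threshold_days, destination):
        """Move files under 'root' not accessed for 'threshold_days' to 'destination'."""
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                src = os.path.join(dirpath, name)
                if days_since_access(src) >= threshold_days:
                    dst = os.path.join(destination, os.path.relpath(src, root))
                    os.makedirs(os.path.dirname(dst), exist_ok=True)
                    shutil.move(src, dst)  # preserves the directory layout

    if __name__ == "__main__":
        apply_policy(PRIMARY, 30, SECONDARY)  # untouched for 30 days: demote
        apply_policy(SECONDARY, 90, ARCHIVE)  # untouched for 90 days: archive

The point of the sketch is the shape of the policy: a simple last-access threshold per tier is enough to keep live data on fast storage and push passive data down to cheaper platforms automatically.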
Quantum uses HP LTO tape drives in its storage libraries. Both companies have a long history of tape- and disk-based storage and can be impartial when advising on which technology suits which data set. A file being streamed from LTO tape using Quantum StorNext technology is much faster and more efficient than a disk-based remote backup being restored over a wide area network. Further reading: Taneja Group Technology Analysis: http://www.quantum.com/pdf/quantum_Goes_Beyond_Backup.pdf

Disk arrays used for archiving typically use SATA hard drives, since they provide high storage capacity for a given price and are reliable when accessed infrequently. Data movement between tiers in an archive can be a manual process, but this is cumbersome and susceptible to error, potentially resulting in data loss. Automation software products can be used to simplify this task. These products should include the ability to protect content by copying files and placing them on archive media. They should also work hand in hand with content asset managers and provide other efficiency features, such as replication and deduplication across storage tiers. These features will greatly reduce storage requirements while enabling data to be retained longer.

Archiving should not be regarded as a static process. Data volumes will always grow, and when an archive load becomes too large, decisions will have to be made about which content to transfer and preserve on new media. The choice of media format should always be made with a consideration towards backwards compatibility; otherwise data transfer could become an almost constant process. LTO tape is considered by many the best choice of archive media because of its speed, capacity, 30-year shelf life and the fact that it is backed by the LTO consortium, guaranteeing easy future access. The LTO consortium's road map shows the intention to provide read/write capability one generation back and read capability two generations back.

The hardware should check the integrity of the data (a simple software-level verification sketch follows this section). The software automation tools should provide the ability to stream archive data to tape, as this speeds the write and recovery processes. Automated policies that refresh the media over time, transparently to the user, also improve efficiency. In practice, a combination of enterprise data management and protection software and a high-performance LTO tape library will give the most cost-effective archive performance. Further reading: Computer Technology Review LTO article: http://bit.ly/d29a14; Quantum LTO: http://www.quantum.com/products/tapedrives/ltoultrium/lto-5/index.aspx
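As referenced above, here is a minimal software-level sketch of archive integrity checking: record a checksum manifest when files are archived, then re-verify during a scheduled media-refresh cycle. The manifest location is a hypothetical example; tape hardware and archive software provide their own, far more thorough, integrity mechanisms.

    # verify_archive.py: minimal sketch of checksum-based archive verification.
    # The manifest path is an illustrative assumption.
    import hashlib
    import json
    import os

    MANIFEST = "/mnt/archive/manifest.json"  # hypothetical location

    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def record(paths):
        """Write a digest manifest for the files being archived."""
        manifest = {p: digest(p) for p in paths}
        with open(MANIFEST, "w") as f:
            json.dump(manifest, f, indent=2)

    def verify():
        """Re-read every archived file and compare against the stored digest."""
        with open(MANIFEST) as f:
            manifest = json.load(f)
        return [p for p, d in manifest.items()
                if not os.path.exists(p) or digest(p) != d]

    if __name__ == "__main__":
        failures = verify()
        print("verified OK" if not failures else f"{len(failures)} file(s) failed")

Any file that fails verification should be restored from a second archive copy, which is one reason the off-site replication discussed later matters.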
[Figure: relative cost of acquisition and power used (%) for high-performance primary storage, SATA secondary storage, and tier 2 or 3 storage.]

OPERATIONAL COST
Data centre power, cooling and space requirements are becoming a challenge, and the demands for data protection, improved restore performance, longer data retention times and technology integration such as deduplication are growing at a vast rate. Only the original data needs to be backed up and retained for long periods of time; keeping it on spinning media for years on end will eat away at the energy portion of your IT infrastructure budget. Moving long-term data retention to tape largely removes the electricity cost of storing that data, and enables the enterprise to demonstrate sustainability via green initiatives that seek to reduce energy consumption.

The diagram above shows the acquisition and running costs of an LTO tape library compared with primary and secondary disk storage. If you are only accessing data occasionally, it makes sense to ensure it is stored on the most cost-effective and efficient platform. As you can see, power consumption is a vital consideration for a cost-effective system (a back-of-envelope sketch appears below). With primary data continuing to grow, doubling every 12 to 18 months, powering and managing that growth has moved into the top five of CIO concerns. Overall, 15% of office electricity use is attributable to IT, according to the UK-based Carbon Trust, which forecasts this will rise to 30% by 2020.

THE ROLE OF DATA MANAGEMENT SOFTWARE IN THE DATA CENTRE
Good data management software should give you high-speed content sharing combined with cost-effective data archiving. It's all about helping you build an infrastructure that consolidates your resources, so workflow runs faster and operations cost less. Data sharing and retention should be combined in a single solution, so you don't have to piece together multiple products that may not integrate well. Even in heterogeneous environments, all data should be easily accessible to all hosts. Further reading: Quantum data management: http://bit.ly/azzpk8
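Returning to the operational cost discussion above, the following back-of-envelope sketch compares the electricity cost of keeping archive data on powered disk versus tape. All wattage and tariff figures are assumptions chosen purely for illustration, not measurements of any particular product; substitute your own data centre's numbers.

    # power_cost.py: back-of-envelope sketch of disk vs tape energy cost.
    # All figures below are illustrative assumptions, not vendor data.
    DISK_W_PER_TB = 10.0   # assumed: powered SATA array, watts per usable TB
    TAPE_W_PER_TB = 0.2    # assumed: library power amortised; idle cartridges draw nothing
    PRICE_PER_KWH = 0.12   # assumed electricity tariff
    HOURS_PER_YEAR = 24 * 365

    def annual_cost(capacity_tb, watts_per_tb):
        kwh = capacity_tb * watts_per_tb * HOURS_PER_YEAR / 1000.0
        return kwh * PRICE_PER_KWH

    if __name__ == "__main__":
        for tb in (100, 500, 1000):
            disk = annual_cost(tb, DISK_W_PER_TB)
            tape = annual_cost(tb, TAPE_W_PER_TB)
            print(f"{tb:>5} TB: disk ~{disk:,.0f}/yr, tape ~{tape:,.0f}/yr")

Even under generous assumptions for disk, the recurring energy cost of spinning media scales linearly with retained capacity, while tape's is close to flat, which is the core of the operational cost argument.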
THE ROLE OF TAPE IN THE DATA CENTRE
Tape has historically been the primary medium for backup and archive in the data centre, and it remains pervasive in data centres of all sizes. According to the Clipper Group, 20% of all enterprises use only tape for backup, while another 65% use both tape and disk, with tape usually sitting behind disk. This means that 85% of all enterprises use tape in some capacity for their data protection needs. The primary role of tape is evolving towards long-term archive and data retention, with many enterprises using disk systems for short-term backup and recovery to take advantage of disk's quick access to individual files. Tape continues to be the primary storage medium for most disaster recovery plans. Further reading: Clipper Group, Benefits of tape: http://bit.ly/ae47el

Quantum's StorNext data management software provides a solution that allows you to load tapes into a library and have the data set immediately available; this can save many hours, sometimes even months, compared with a conventional recovery. It does this by storing the file directory data to provide full access as soon as the tape is loaded. Quantum's Scalar i6000 storage library with LTO-5 tape drives includes innovative new features, such as iLayer MeDIA for analysing the integrity of media and a bulk-load capability for the mass import and export of tape cartridges. iLayer intelligent software simplifies management and helps contain costs by reducing administrative time. Need a long-term archiving solution?: http://bit.ly/9p6nya

DISASTER RECOVERY
As with backup and recovery, disaster recovery is a vital part of your data protection strategy. A disaster recovery policy basically means that you have a copy of your data, in a non-corruptible form, in a different location from the primary data. What causes problems is that there is usually too much data to deal with. A good archive plan removes a major portion of the problem. It also eliminates some of the complexity built into the backup process, which exists because people use backups for long-term retention of data; long-term retention should be the sole domain of the archive. These copies should then be replicated off site in case something goes wrong at the original site (a minimal sketch follows this section). Ideally this should be accomplished by one process; if not, it should be managed as part of an overall backup workflow.
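As referenced above, here is a minimal sketch of the off-site replication step, assuming a POSIX host with rsync installed and hypothetical source and destination paths. Enterprise replication features built into backup or archive software would normally handle this as part of a single managed process.

    # offsite_copy.py: minimal sketch of pushing archive copies to a second site.
    # Host and paths are illustrative assumptions.
    import subprocess

    SOURCE = "/mnt/archive/"                     # trailing slash: copy contents
    DESTINATION = "dr-site:/mnt/archive-mirror"  # hypothetical remote host

    def replicate():
        # -a preserves attributes, -v is verbose; rsync sends only changed
        # files, so routine runs move little data after the initial copy.
        subprocess.run(["rsync", "-av", SOURCE, DESTINATION], check=True)

    if __name__ == "__main__":
        replicate()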
CONCLUSION
Archiving is a vital part of your corporate IT policy. The key consideration is to ensure that all initial data is backed up in some way but not replicated multiple times. Backup essentially parks the data in case it is needed for restore purposes; archive is a long-term store held on cost-efficient media that can be accessed easily when needed. When a massive amount of data is persistent, the cost savings and speed efficiencies can be equally massive. The combination of intelligent archiving and data preservation software, coupled with the latest high-speed tape libraries, will give you the best value, protection, operational cost savings and disaster recovery plan available.

About Quantum StorNext
With StorNext data management software, you get high-speed content sharing combined with cost-effective data archiving and content protection. It's all about helping you build an infrastructure that consolidates your resources, so your workflow runs faster and operations cost less. StorNext offers data sharing and retention in a single solution, so you don't have to piece together multiple products that may not integrate well. Even in heterogeneous environments, all data is easily accessible to all hosts.

Key Features and Benefits
- File System Deduplication optimizes the capacity and cost of primary storage.
- Distributed Data Movers (DDMs) increase the performance and scalability of storage tiers.
- Replication enables powerful data protection and data distribution solutions.
- Management Console greatly simplifies data management complexities.
- Virtualisation of storage tiers greatly reduces future storage requirements while enabling data to be retained longer.
- Self-Protecting Architecture leverages integrated data protection, and integrity checks safeguard data both on site and off site.

Further reading: Quantum StorNext: http://www.quantum.com/products/software/index.aspx

About Quantum Scalar tape libraries
Designed to grow with your needs, Scalar tape libraries provide best-in-class management, monitoring and data security capabilities with embedded software called the Quantum iLayer. This software uses detailed information to automatically evaluate the integrity of drives and media within the library, so you can increase backup reliability while decreasing the total cost of ownership. The Scalar family of tape libraries easily integrates into your existing infrastructure and works seamlessly with disk for a complete data protection solution. Further reading: Quantum Scalar tape libraries: http://www.quantum.com/products/tapelibraries/index.aspx
Preserving the World's Most Important Data. Yours.
www.quantum.com/stornext, email: softwareinfo@quantum.com

Quantum Corporation, Northern & Eastern Europe, Middle East and Africa: Quantum House, 3 Bracknell Beeches, Old Bracknell Lane West, Bracknell, RG12 7BW, United Kingdom. Tel: +44 (0) 1344 353500
Quantum Corporation, Central Europe: Willy-Brandt-Allee 4, 81829 München, Germany. Tel: +49 89 94303-0
Quantum Corporation, Southern Europe: 8 rue des Graviers, 92200 Neuilly-Sur-Seine, France. Tel: +33 1 41 43 49 00
For contact and product information, visit quantum.com or call 800-677-6268

© 2010 Quantum Corporation. All rights reserved. Quantum, the Quantum logo, and all other logos are registered trademarks of Quantum Corporation or of their respective owners. Protected by pending and issued U.S. and foreign patents, including U.S. Patent No. 5,990,810.

About Quantum
Quantum Corp. (NYSE:QTM) is the leading global storage company specializing in backup, recovery and archive. Combining focused expertise, customer-driven innovation, and platform independence, Quantum provides a comprehensive range of disk, tape, media and software solutions supported by a world-class sales and service organization. This includes the DXi-Series, the first disk backup solutions to extend the power of data deduplication and replication across the distributed enterprise. As a long-standing and trusted partner, the company works closely with a broad network of resellers, OEMs and other suppliers to meet customers' evolving data protection needs.

WP00148B-v01, Oct 2010