StorReduce Technical White Paper Cloud-based Data Deduplication

See also at storreduce.com/docs: StorReduce Quick Start Guide, StorReduce FAQ, StorReduce Solution Brief, and the StorReduce Blog at storreduce.com/blog.

Published March 2015

INTRODUCTION

StorReduce is a specialized cloud deduplication solution, designed to meet the requirements of companies using cloud storage for large volumes of data. StorReduce sits between your applications and cloud storage, transparently deduplicating data inline at speeds of up to 600 MB/s, reducing storage costs and freeing your data for use in the cloud.

[Diagram 1: StorReduce Architecture. Labels: StorReduce Dashboard (admin interface); clients (backup software, file system gateway, cloud services) sending raw data over the S3 interface; StorReduce Server with SSD; deduplicated data in Cloud Storage (Amazon S3).]

STORREDUCE KEY CHARACTERISTICS

- Reduces cloud storage costs, typically by 50-95%.
- Fast: Up to 600 megabytes per second for both reads and writes, adding under 50 ms of latency.
- Always On: Capable of sustained throughput 24 hours a day, 365 days a year.
- Scalable: Up to 10 petabytes (10,000,000 gigabytes) of data per virtual server.
- Cloud-native: Deduplicated data is immediately accessible to cloud services via StorReduce's S3 REST API.
- Software-only solution: No hardware required, so there is no hardware cost or lock-in.
- Public, Private or Hybrid Cloud: Works on any public, hybrid or private cloud that has object storage with an S3 or Swift interface.

- Data Management / Backup Software Integration: Works with existing data management or backup software that is compatible with Amazon S3.
- Secure User Account and Key Management: Users or servers can be given individual user accounts within StorReduce, allowing data access to be restricted. Multiple access keys can be created and managed as needed for each user account.
- Secure Policy-based Access Control: Enterprise security policies can be expressed using the StorReduce policy engine, which uses Amazon's IAM policy language.

STORREDUCE ARCHITECTURE

The StorReduce server sits between client programs (which store and retrieve data) and the Cloud Storage service they are using. It transparently provides best-in-class data deduplication and throughput.

The StorReduce server provides similar functionality to Amazon S3, including object storage, user accounts, access keys, access control policies and a Web-based management interface (the StorReduce Dashboard).

CLIENT SOFTWARE

StorReduce works with client software that supports Amazon's S3 REST interface for storage. This includes on-premises backup and data analysis software, as well as custom software written to use the S3 REST interface. Cloud-based services designed to work with Amazon S3 can also be used with StorReduce; these also act as clients.

Client software is configured to talk to the StorReduce server instead of directly to Cloud Storage, using access keys provided by the StorReduce server.

Other interfaces can be supported via gateway software that exposes these interfaces and converts requests into calls to Amazon's S3 interface. StorReduce may natively support file-system based interfaces like CIFS and NFS in the future.
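Pointing a client at the StorReduce server rather than directly at Cloud Storage usually comes down to changing the endpoint the client uses. The Python sketch below illustrates this for path-style object URLs. The endpoint host names, the bucket and key, and the assumption that the server accepts path-style addressing are illustrative and not taken from StorReduce documentation.

```python
from urllib.parse import quote, urlsplit


def object_url(endpoint: str, bucket: str, key: str) -> str:
    """Build a path-style object URL of the form <endpoint>/<bucket>/<key>."""
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a valid endpoint URL: {endpoint!r}")
    # Keep "/" inside keys (commonly used as a folder separator) and
    # percent-encode everything else that is unsafe in a URL path.
    return f"{parts.scheme}://{parts.netloc}/{quote(bucket)}/{quote(key, safe='/')}"


# Hypothetical endpoints: only the endpoint differs between the two setups.
DIRECT_ENDPOINT = "https://s3.amazonaws.com"
STORREDUCE_ENDPOINT = "https://storreduce.example.internal"

if __name__ == "__main__":
    for endpoint in (DIRECT_ENDPOINT, STORREDUCE_ENDPOINT):
        print(object_url(endpoint, "backups", "2015/03/full backup.tar"))
```

In a real deployment the same switch would normally be made in the client's own endpoint setting, together with the access keys issued by the StorReduce server.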

STORREDUCE SERVER

The StorReduce server runs on its own physical or virtual machine, with local SSD storage recommended. Each StorReduce server is able to handle up to 10 petabytes (10,000,000 gigabytes) of raw data, depending on the deduplication ratio achieved and the amount of SSD storage available for index information. StorReduce supports the creation of multiple storage buckets, with global deduplication performed across all buckets.

For public cloud, the StorReduce server runs very effectively on an Amazon EC2 instance. An Amazon Machine Image (AMI) is available with the server pre-installed, allowing quick and easy setup. StorReduce can also easily be installed on any cloud-based Linux virtual machine, using rpm.

For migration of on-premises data to the cloud, or for private cloud deployments, the StorReduce server can be run on-premises on a physical or virtual machine.

The architecture is designed to allow multiple StorReduce servers to be run against the same back-end Cloud Storage service, for redundancy, load-sharing and increased storage volume. For example, an on-premises StorReduce server might be used to deduplicate and upload backup data, with a second in-cloud StorReduce server providing immediate access to this data for cloud services as the data is uploaded.

S3 Interface: The StorReduce server exposes an S3-compatible REST interface for object storage. This interface is designed to be highly scalable, and supports most S3 interface calls including:

- Object GET/PUT/POST/DELETE (including multiple-object delete)
- Multipart uploads (including listing and deleting uploads)
- Digital signature verification
- Bucket create/delete/rename
- Setting/reading bucket policies for access control

Admin Interface: A separate REST interface is exposed for use by the Web-based dashboard. This admin API is served on a separate port to allow firewalls to restrict network-level access, and can optionally also be served over HTTPS on port 443.
The admin API is available to other client applications as well as the StorReduce dashboard. It supports manipulation of user accounts, access policies and index snapshots, and also provides a replica of the S3 API for use by management tools.
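The S3 interface described above verifies a digital signature on each request. The following Python sketch shows the general idea of signing a request with a secret access key and verifying it on the server using HMAC-SHA256. It is a simplified illustration only: the signed fields (`method`, `path`, `date`) and the function names are assumptions, and the real S3 signature schemes canonicalize considerably more of the request.

```python
import hashlib
import hmac


def sign(secret_key: str, method: str, path: str, date: str) -> str:
    """Return a hex HMAC-SHA256 signature over a simplified request description."""
    message = "\n".join([method, path, date]).encode("utf-8")
    return hmac.new(secret_key.encode("utf-8"), message, hashlib.sha256).hexdigest()


def verify(secret_key: str, method: str, path: str, date: str, signature: str) -> bool:
    """Recompute the signature on the server side and compare in constant time."""
    expected = sign(secret_key, method, path, date)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    secret = "example-secret-key"  # hypothetical key tied to a user account
    sig = sign(secret, "GET", "/backups/2015/03/full.tar", "20150301T000000Z")
    print(verify(secret, "GET", "/backups/2015/03/full.tar", "20150301T000000Z", sig))
    # A request whose path was altered after signing no longer verifies.
    print(verify(secret, "GET", "/backups/other.tar", "20150301T000000Z", sig))
```

Because the secret key never travels with the request, revoking a key on the server is enough to invalidate every signature made with it.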

LOCAL SSD STORAGE

Each StorReduce server requires fast local storage to maintain index information. The amount of raw data a StorReduce server can handle depends on the amount of SSD storage available and the deduplication ratio achieved for the data.

Local SSD storage is treated as ephemeral by the StorReduce server. All information stored in local SSD storage can be recovered from Cloud Storage if required (see INDEX DATA below).

CLOUD STORAGE

The StorReduce server uses Cloud Storage for all persistent data. It acts as an S3 client, making use of the Amazon S3 REST API to store all its data in a single S3 bucket. This means StorReduce may be used with any S3-compatible Cloud Storage solution, including Amazon S3 itself. For private cloud deployments StorReduce can talk to any S3-compatible data store. StorReduce may natively support other Cloud Storage solutions in the future.

The StorReduce server makes use of Cloud Storage to store the following types of data:

- Deduplicated user data: Raw data is deduplicated using state-of-the-art algorithms and then compressed. Typically this requires only 5% to 40% of the Cloud Storage space the raw data would have required, depending on the type of data being stored.
- System data: Information about buckets, users, access control policies and access keys is also stored in back-end cloud storage, making it available to all StorReduce servers in a given deployment.
- Index snapshots: Data for rapidly reconstructing index information on local SSD storage can also be stored in the back-end cloud storage (see INDEX DATA below).

PERFORMANCE

The StorReduce server is optimized for scalability, high throughput and low latency. The internal architecture and code are highly optimized for data deduplication, and to ensure that performance is maintained even when running in a public cloud environment.
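To make the "deduplicate, then compress" step described under CLOUD STORAGE more concrete, here is a minimal Python sketch of content-addressed deduplication. This paper does not describe StorReduce's actual algorithms, so the fixed 4 KB chunk size, the SHA-256 fingerprints and the in-memory dictionaries standing in for Cloud Storage and the index are all assumptions made purely for illustration.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # illustrative fixed chunk size, not a StorReduce parameter


class DedupStore:
    """Toy content-addressed store: each unique chunk is kept once, compressed."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}        # fingerprint -> compressed chunk
        self.objects: dict[str, list[str]] = {}   # object key -> ordered fingerprints

    def put(self, key: str, data: bytes) -> None:
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Only chunks that have not been seen before consume new storage.
            if fingerprint not in self.chunks:
                self.chunks[fingerprint] = zlib.compress(chunk)
            recipe.append(fingerprint)
        self.objects[key] = recipe

    def get(self, key: str) -> bytes:
        # Reconstitute the original object from its chunk recipe.
        return b"".join(zlib.decompress(self.chunks[f]) for f in self.objects[key])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


if __name__ == "__main__":
    store = DedupStore()
    backup = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
    store.put("monday.bak", backup)
    store.put("tuesday.bak", backup)  # an identical second backup adds no chunks
    print(len(store.chunks), store.stored_bytes(), store.get("tuesday.bak") == backup)
```

Production systems typically use variable-size, content-defined chunking so that an insertion early in a file does not shift every later chunk boundary; fixed-size chunks are used here only to keep the sketch short.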

A single StorReduce server is capable of sustained speeds of 600 megabytes per second, for both reads and writes. This is close to saturating a 10Gb/s network connection. StorReduce tested this using a single server on an AWS EC2 instance, loading up the server by feeding data from multiple other EC2 instances. Details on how to achieve 600 MB/s throughput are set out at storreduce.com/blog.

Running a StorReduce server on-premises can significantly speed up throughput and decrease transfer bandwidth to cloud-based storage by deduplicating data prior to sending it into the cloud, and by reading deduplicated data from the cloud and reconstituting it locally.

StorReduce's fast throughput reduces migration times from years to weeks and greatly reduces the cost. For a direct independent comparison of StorReduce with a well-known migration vendor, see http://www.storreduce.com/case-studies/apn-spectrumdata/.

Latency is kept to a minimum, typically only around 50 ms of additional latency even when StorReduce is running in the cloud. For most situations this makes no difference to end users and does not affect throughput.

Diagrams 2 and 3 below show that the initial migration of tapes or disk-based backups to the cloud (also known as "ingestion") can be done using an on-premises StorReduce server together with existing backup software:

[Diagram 2: On-premises tape to cloud migration. Labels: ON PREMISES (Tapes, Backup Software, StorReduce Virtual Machine); AWS VPC (Glacier).]

Subsequent access to the data can be managed entirely in the cloud, and the on-premises StorReduce server can be removed. The data's index remains in the cloud with an S3 interface that enables any cloud service to natively access all of the migrated data:

[Diagram 3: After migration, reinstate the StorReduce server in the cloud. Labels: AWS VPC (StorReduce Virtual Machine on EC2, Glacier). Annotation: The deduplication index is now in the AWS Cloud, not locked in on-premises software or hardware. The data is easily accessible via an S3-compatible API.]

INDEX DATA

StorReduce maintains an index of user data on fast local storage. Each StorReduce server keeps its own independent index. All index data can be rebuilt from the log of transactions stored in Cloud Storage.

For large data sets it can take a long time to rebuild the index from scratch. To speed this up, the server periodically takes a snapshot of the index and stores this in Cloud Storage. When a StorReduce server starts up, if an index needs to be rebuilt then the server will:

1. Load the last index snapshot from Cloud Storage.
2. Replay subsequent transactions to bring the index up to date.

Note: When a StorReduce server running on Amazon EC2 is stopped, Amazon deletes all data on that machine's SSD instance storage. When the machine is started again the index must be rebuilt as described above. For this reason it is recommended to leave production StorReduce servers running rather than stopping and starting them. Rebooting an EC2 instance (restarting it without stopping it), by contrast, preserves the instance storage.
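The two rebuild steps above (load the last snapshot, then replay later transactions) can be sketched in Python as follows. The record layout (`seq`, `op`, `digest`, `location`, `last_seq`, `entries`) is hypothetical; the paper does not describe StorReduce's actual snapshot or transaction log format.

```python
def rebuild_index(snapshot: dict, transactions: list[dict]) -> dict:
    """Rebuild an index from a snapshot plus the transactions logged after it.

    snapshot:     {"last_seq": int, "entries": {digest: location}}
    transactions: records ordered by "seq", each with "op" of "put" or "delete"
    """
    # Step 1: start from the state captured in the last snapshot.
    index = dict(snapshot["entries"])
    # Step 2: replay only the transactions the snapshot has not already seen.
    for txn in transactions:
        if txn["seq"] <= snapshot["last_seq"]:
            continue
        if txn["op"] == "put":
            index[txn["digest"]] = txn["location"]
        elif txn["op"] == "delete":
            index.pop(txn["digest"], None)
    return index


if __name__ == "__main__":
    log = [
        {"seq": 1, "op": "put", "digest": "aa", "location": "blob-1"},
        {"seq": 2, "op": "put", "digest": "bb", "location": "blob-1"},
        {"seq": 3, "op": "put", "digest": "cc", "location": "blob-2"},
        {"seq": 4, "op": "delete", "digest": "aa"},
    ]
    snapshot = {"last_seq": 2, "entries": {"aa": "blob-1", "bb": "blob-1"}}
    empty = {"last_seq": 0, "entries": {}}
    # Replaying from the snapshot gives the same index as replaying the whole log.
    print(rebuild_index(snapshot, log) == rebuild_index(empty, log))
```

The snapshot only shortens the replay; the transaction log remains the source of truth, which is why the local index can safely be treated as ephemeral.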

MULTIPLE STORREDUCE SERVERS

Because StorReduce maintains a log of all transactions on Cloud Storage, multiple servers will be able to watch this transaction log to keep their independent indices up to date. This will allow the same content to be fetched from multiple StorReduce servers in different locations, in particular both on-premises and in the cloud.

StorReduce high-availability clustering will also make use of this property to provide load-sharing and failover within a cluster of StorReduce servers. For any given object there will be one primary write server, but other servers will be able to read and serve up the data. By sharding data among multiple StorReduce primary servers it will be possible to support very large data sets (larger than 10 PB) with failover and redundancy.

In the current version of StorReduce, new servers can be set up to talk to an existing Cloud Storage service, and they automatically populate their local index data from Cloud Storage.

SECURITY

Security is extremely important for any cloud storage solution. As well as leveraging the security of your underlying Cloud Storage service, StorReduce provides the following specific security-related capabilities:

- User Account Management: StorReduce maintains a set of user accounts for each StorReduce deployment. User accounts can be used to provide people with limited access to the StorReduce dashboard, or to provide people or programs with limited access to the S3 API. Individual user accounts can be revoked to cut off access immediately.
- Access Keys: Each user account can have multiple access keys, used for accessing the S3 API. These work in the same way as access keys managed by Amazon's IAM service. Individual access keys can be revoked using the StorReduce dashboard.
- Digital Signatures: All requests from clients must be digitally signed using a secret access key tied to a StorReduce user account.
- Data Segregation: StorReduce supports the creation of multiple storage buckets for data segregation, with different access rights for each bucket.
- Policy-Based Access Control: Enterprise security policies can be expressed using the StorReduce policy engine, which supports Amazon's IAM policy language. Access control can be applied to buckets using bucket policies (compatible with Amazon IAM bucket policies).

- HTTPS Encryption and Certificates: The StorReduce server by default requires HTTPS encryption for dashboard requests, and can be configured to require HTTPS encryption for all S3 API requests. Server certificates can be uploaded and set through the StorReduce dashboard.
- Amazon AWS Role Credentials: A StorReduce server running in an EC2 instance can make use of AWS roles to securely obtain credentials for accessing its underlying Cloud Storage, enabling automatic key rotation.

USE CASES

BACKUP / TAPE MIGRATION

StorReduce makes cloud a more affordable storage medium than tape when it achieves a deduplication ratio of 50% or more. This is the case even when migration costs are factored in, because deduplication greatly reduces the cost of transmission bandwidth and cloud storage. StorReduce typically achieves 95% deduplication on backup data.

StorReduce software can be installed on-premises for a CAPEX-free, very fast migration of an enterprise's large tape archive data to the cloud. Installing StorReduce on-premises minimizes bandwidth during the transfer (see Diagrams 2 and 3 above).

After the transfer is completed, the on-premises StorReduce software can be removed and reinstated in the cloud. The index is now in the cloud and, combined with StorReduce's S3 interface, this enables the migrated data to be seamlessly accessed by cloud services such as data mining and search. The data is now out of the on-premises silo and is affordable and accessible in a pooled Data Lake in the cloud.

LARGE DATA SETS ON-CLOUD: BIG DATA

Companies storing large data sets in Cloud Storage typically have a data growth rate of hundreds of percent per annum. Despite AWS's repeated storage price drops, this translates to rapidly increasing storage costs. Many of these companies use Hadoop or AWS Elastic MapReduce to mine this data, so they need to keep it on Cloud Storage's hottest, most expensive tier.
Compounding the escalating data volume, this big data is often copied and stored on Cloud Storage multiple times to enable analysis, QA and testing by multiple employees. This can become prohibitively expensive, forcing choices to be made about which data to store and what subsets to test on.

Where multiple sets of the same data feeds are stored in Cloud Storage, StorReduce can remove redundant blocks of data inline as the data pours in. This greatly reduces the volume and cost of storage, often by over 50%.

StorReduce's scalability for large data sets and its very fast throughput make it an ideal solution for on-cloud big data companies that want to reduce their cloud storage usage and also use data mining services like EMR or Hadoop in real time.

[Diagram 4: StorReduce inline on-cloud deduplication. Labels: AWS VPC / client's virtual private cloud; client applications (data from mobiles, wearables, big data, etc.); on-premises backups, tape archives and unstructured data; StorReduce Virtual Machine on EC2; Glacier (optional); S3 API, so data is seamlessly accessible by EC2, Elastic MapReduce and CloudSearch.]

UNSTRUCTURED DATA

StorReduce can be used to move general unstructured data to the cloud, such as data from corporate file servers, in the same way that it migrates tape and backup data to the cloud. Most data contains duplicate information, and StorReduce should deliver cost savings of 50-80% on such data. (See Diagram 4 above.)

PRIVATE CLOUD

StorReduce can be inserted into any private cloud with an object store and an S3 or OpenStack Swift interface. It is inserted in the same manner as shown in Diagram 4 for a public cloud. It reduces the amount of infrastructure required for storage of unstructured data in proportion to the deduplication benefit it provides.
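The savings and throughput figures quoted in the use cases above can be turned into rough planning estimates. The Python sketch below is back-of-the-envelope arithmetic under stated assumptions: decimal units (1 TB = 1,000,000 MB, matching the paper's 10 PB = 10,000,000 GB), a constant ingest rate equal to the quoted maximum of 600 MB/s, and a reduction ratio chosen by the reader. The function names are illustrative, and real results depend on the data.

```python
def stored_size_tb(raw_tb: float, reduction: float) -> float:
    """Cloud storage needed after deduplication, for a reduction ratio in [0, 1)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be at least 0 and less than 1")
    return raw_tb * (1.0 - reduction)


def ingest_days(raw_tb: float, throughput_mb_s: float = 600.0) -> float:
    """Best-case days needed to push raw_tb through one server at a constant rate."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = raw_tb * 1_000_000 / throughput_mb_s  # 1 TB = 1,000,000 MB (decimal)
    return seconds / 86_400


if __name__ == "__main__":
    raw_tb = 1000.0  # 1 PB of raw backup data
    print(f"stored at 95% reduction: {stored_size_tb(raw_tb, 0.95):.1f} TB")
    print(f"stored at 50% reduction: {stored_size_tb(raw_tb, 0.50):.1f} TB")
    print(f"ingest time at 600 MB/s: {ingest_days(raw_tb):.1f} days")
```

With these inputs, 1 PB of raw data shrinks to about 50 TB at a 95% reduction and takes roughly 19 days to ingest at the full 600 MB/s. Since 600 MB/s is quoted as an upper bound, the ingest figure should be read as a lower bound on migration time.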

YOUR CLOUD DATA LAKE

Deduplicating your on-cloud data and migrating your tape and backup archive data with StorReduce moves your data into an extremely affordable and scalable Data Lake that sits in the cloud. Your data is now unlocked by StorReduce's native cloud interface, so that any cloud service can seamlessly work across all of your data to maximize its business benefit.

[Diagram 5: StorReduce as an Enabler for Your Cloud Data Lake]

Text recovered from Diagram 5:

- On-premises data silos (locked data): backup hardware, tape archives, structured big data, unstructured corporate data. These are moved by data migration via StorReduce.
- From Internet: mobile, social and IoT data.
- StorReduce S3 API for cloud access to data.
- DATA LAKE, powered by StorReduce:
  - Scalable: PBs of data, up to 600 MB/s throughput.
  - Affordable: Storage reduced by 50-95%.
  - Agile: No silos. One API. Natively use any cloud service.
- Move Data Easily: Reduce bandwidth by up to 95% when moving data to other regions or hybrid cloud.
- Scalable DR: Recover thousands of VMs instantly.
- Index & Search: Free data that was not previously visible.
- Data Analysis: Hadoop, EMR, other cloud services.