Hardware Configuration Guide

Contents
- Annotation
- Factors to consider
  - Machine Count
  - Data Size
  - Data Size Total
  - Daily Backup Data Size
  - Unique Data Percentage
  - Backup Type
  - Backup Scheme
  - Backup Window
  - Retention Period
- Hardware
  - Storage
    - Backup data
    - Deduplication database
    - Catalog database
    - Other Storage Node databases
  - CPU
  - RAM
- Network Data Transfer Rate
- Number of Storage Nodes
  - Time to process backups
  - Time to reindex backups

Annotation

This document describes the main factors affecting backup and deduplication speed. It can be used to estimate the amount of backup data accumulated after some period of time and the hardware configuration (primarily of the Storage Node) needed to manage this data.
Factors to consider

The appropriate Storage Node configuration depends on many factors. A number of questions must be answered to define the exact hardware component parameters for storing and effectively processing backups. The most important factors are represented below as parameters, and the appropriate formulas are provided.

Machine Count

The Machine Count parameter roughly shows the amount of work to be done by the Storage Node. 1-4 machines can be handled by an Acronis Storage Node (ASN) with almost any configuration. Backing up 500 machines, even with only 5-10 GB of data each, will most probably require a high-performance server for the Storage Node. The Machine Count parameter helps to decide how many Storage Nodes you need and what network bandwidth is most appropriate. In conjunction with the average data size, this number can be used to estimate the capacity of the backup storage. Note that the Machine Count parameter should account for future growth.

Data Size

The Data Size parameter, the average amount of data on a machine, is used for estimating the backup storage size. Additionally, it is the basis for calculating the Daily Backup Data Size.

Data Size Total

Sometimes it is more convenient to use the Data Size Total parameter, the total amount of data on all machines in the environment to be backed up. This parameter is calculated by the formula:

Data Size Total = Machine Count * Data Size

Daily Backup Data Size

The Daily Backup Data Size parameter shows how much new backup data to be processed appears every day. This data needs to be backed up, so it is taken into account in capacity calculations. Additionally, this amount is used to calculate the needed backup window. It is convenient to use the Daily Backup Data Percentage parameter, which is specified as a percentage.
The Daily Backup Data Size can be calculated by the formula:

Daily Backup Data Size = Data Size * Daily Backup Data Percentage / 100

Unique Data Percentage

The Unique Data Percentage parameter indicates how much of the data on a machine is unique. User data is usually unique; operating system and program files are usually duplicated. This parameter depends on the purpose of the backed-up machine: it can be an office machine with a low percentage of unique data, a file server with a high percentage of unique data, or something else. Usually this value is taken as 10-20%. If you already have a Storage Node, one way to calculate this amount is to perform a full backup of several machines and use the resultant Deduplication Ratio in the following formula:
Unique Data Percentage (%) = (Deduplication Ratio * Machine Count / 100 - 1) / (Number of Backups - 1)

Unique Data Size can be calculated by the formula:

Unique Data Size = Data Size * Unique Data Percentage / 100

This parameter affects the deduplication ratio and, as a result, the backup storage savings. Additionally, the less unique data there is on a machine, the less backup traffic goes to the deduplicating vault.

Backup Type

Disk-level backups are used several times more often than file-level backups, especially for servers, as this type of backup can be used for system recovery. File-level backups are convenient for storing user data when the data is important but the system itself does not need to be recovered. Disk-level backups are performed faster than file-level ones.

Backup Scheme

The backup scheme defines the method and frequency of backups. Here is a list of available backup schemes:
- The Simple scheme is designed for quickly setting up daily backups. Backups generally depend on the previous ones, up to the very first one.
- The Grandfather-Father-Son (GFS) scheme allows you to set the days of week when the daily backup will be performed and select from these days the day of the weekly/monthly backup.
- The Tower of Hanoi (TOH) backup scheme allows you to schedule when and how often to back up and select the number of backup levels. By setting up the backup schedule and selecting backup levels, you automatically obtain the rollback period: the guaranteed number of sessions that you can go back at any time. TOH provides the most effective distribution of backups on the timeline.
- The Custom scheme provides the most flexibility in defining backup schedules and retention rules.

Backup Window

The backup window defines the time when backups are allowed. Backups are usually scheduled for night time to avoid affecting the performance of working machines in business hours. The backup window affects the number of Storage Nodes to be used.
If one Storage Node cannot process backups of all machines, an additional one should be added. That is determined by comparing the length of the backup window with the time needed for backups.

Retention Period

The retention period defines how long the backups should be stored. Backup schemes provide the ability to adjust the retention period. The retention period affects the capacity of the backup storage.
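The parameter definitions above reduce to a few one-line formulas. Here is a minimal Python sketch (the variable names are ours and the values are illustrative, not from the guide):

```python
# Derived backup parameters from the basic inputs (illustrative values).
machine_count = 30
data_size = 50                    # GB, average amount of data per machine
daily_backup_data_percentage = 1  # % of data changing per day
unique_data_percentage = 20       # % of data that is unique to the machine

data_size_total = machine_count * data_size                              # 1500 GB
daily_backup_data_size = data_size * daily_backup_data_percentage / 100  # 0.5 GB/day
unique_data_size = data_size * unique_data_percentage / 100              # 10 GB

print(data_size_total, daily_backup_data_size, unique_data_size)
```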
Hardware

The backup requirements are the basis for finding a suitable configuration for a Storage Node. Each configuration parameter depends on one or more requirements. This section describes how to define the configuration parameters of a Storage Node: storage size and type, RAM size, CPU speed.

Storage

The amount of space occupied by the backups is one of the most important parameters of a Storage Node configuration. The backups and their metadata are stored in several places: the deduplication data store, the deduplication database, and the catalog database. The following sections show how to estimate the size of each of them.

Backup data

Capacity

The capacity taken by the backups mainly depends on the backup data size and the backup schemes/schedules used. Here are the details specific to the backup schemes:
- Simple scheme: daily incremental backups are performed. The first full backup can be performed in an implementation phase and so is not taken into account. The size of a daily backup is the size of the daily incremental data.
- Grandfather-Father-Son scheme: supposes daily backups. Full backups are made on a repetitive basis. The backup window should fit the time of full backup creation (or several backup windows can be defined, one for each type of backup). The largest differential backup is 15 times the size of the daily incremental data, but as it is made on the same day as the full backup, the backup window for this day must fit the full backup time.
- Tower of Hanoi: supposes the creation of incremental backups every second day. Differential backups are created every other second day. Once per period a full backup is created. The frequency of full backups depends on the level of the scheme. With the 6th level (the default), a full backup is created every 16th day. If the creation of full backups is rare, a special backup window can be scheduled for each of them. Otherwise the full backup must fit into the standard backup window.
The size of the largest differential backup is the size of the daily incremental backups multiplied by the period length.
- Custom scheme: the most flexible one, so it can be configured with regard to the specified backup windows.

The following tables show the numbers of backups for different backup schedules/schemes for the specified periods of time and retention periods (f is for full, i is for incremental, d is for differential, w is for weeks):

GFS (keep monthly backups indefinitely, no backups on weekends)

Retention\Due date   1 month       3 months     6 months     1 year        2 years       5 years
Indefinitely         2f, 3d, 15i   4f, 5d, 4i   7f, 5d, 4i   14f, 5d, 4i   26f, 5d, 4i   65f, 5d, 4i

GFS (keep monthly backups for 1 year, no backups on weekends)

Retention\Due date   1 month       3 months     6 months     1 year        2 years       5 years
1 year (104w)        2f, 3d, 15i   4f, 5d, 4i   7f, 5d, 4i   14f, 5d, 4i   14f, 5d, 4i   14f, 5d, 4i

Daily full (make a full backup every day, no backups on weekends)

Retention\Due date   1 month   3 months   6 months   1 year   2 years   5 years
1 week (1w)          5f        5f         5f         5f       5f        5f
1 month (4w)         20f       20f        20f        20f      20f       20f
3 months (13w)       20f       65f        65f        65f      65f       65f
6 months (26w)       20f       65f        130f       130f     130f      130f
1 year (52w)         20f       65f        130f       260f     260f      260f
2 years (104w)       20f       65f        130f       260f     520f      520f
5 years (260w)       20f       65f        130f       260f     520f      1300f

Daily incremental (make incremental backups every day, no backups on weekends)

Retention\Due date   1 month   3 months   6 months   1 year     2 years    5 years
1 week (1w)          1f, 4i    1f, 4i     1f, 4i     1f, 4i     1f, 4i     1f, 4i
1 month (4w)         1f, 19i   1f, 19i    1f, 19i    1f, 19i    1f, 19i    1f, 19i
3 months (13w)       1f, 19i   1f, 64i    1f, 64i    1f, 64i    1f, 64i    1f, 64i
6 months (26w)       1f, 19i   1f, 64i    1f, 129i   1f, 129i   1f, 129i   1f, 129i
1 year (52w)         1f, 19i   1f, 64i    1f, 129i   1f, 259i   1f, 259i   1f, 259i
2 years (104w)       1f, 19i   1f, 64i    1f, 129i   1f, 259i   1f, 519i   1f, 519i
5 years (260w)       1f, 19i   1f, 64i    1f, 129i   1f, 259i   1f, 519i   1f, 1299i

Weekly full, daily incremental (make a full backup once a week and incremental backups on all other days, no backups on weekends)

Retention\Due date   1 month    3 months    6 months    1 year      2 years      5 years
1 week (1w)          1f, 4i     1f, 4i      1f, 4i      1f, 4i      1f, 4i       1f, 4i
1 month (4w)         4f, 16i    4f, 16i     4f, 16i     4f, 16i     4f, 16i      4f, 16i
3 months (13w)       4f, 16i    13f, 52i    13f, 52i    13f, 52i    13f, 52i     13f, 52i
6 months (26w)       4f, 16i    13f, 52i    26f, 104i   26f, 104i   26f, 104i    26f, 104i
1 year (52w)         4f, 16i    13f, 52i    26f, 104i   52f, 208i   52f, 208i    52f, 208i
2 years (104w)       4f, 16i    13f, 52i    26f, 104i   52f, 208i   104f, 416i   104f, 416i
5 years (260w)       4f, 16i    13f, 52i    26f, 104i   52f, 208i   104f, 416i   260f, 1040i

The size of a full backup changes because of changed data. For example, if the daily change percentage is 1% and backups are performed on workdays only, the full backup size after 52 weeks will be 3.6 times the original data size.
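The 3.6x figure in the last example is simple linear growth over workdays. A quick check (a sketch; the function name is ours):

```python
def full_backup_growth(initial_size_gb, daily_change_pct, weeks, workdays_per_week=5):
    """Linear growth of the full backup size when data changes on workdays only."""
    workdays = weeks * workdays_per_week
    return initial_size_gb * (1 + workdays * daily_change_pct / 100)

# 1% daily change, 52 weeks of workday-only backups: 1 + 260 * 0.01 = 3.6x
print(round(full_backup_growth(100, 1, 52), 1))   # -> 360.0, i.e. 3.6x the original 100 GB
```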
For rough capacity estimations it is recommended to use the backup size at the end of the retention period. More accurate results can be obtained by taking the average of the initial and the last backup sizes. The incremental backup size depends on the frequency of backups. The daily changed data size is one of the initially defined parameters, but backups can be configured to be performed weekly, monthly or on any chosen days, so the real incremental backup size must be calculated. The differential backup size is based on the changed data percentage and the number of days since the last full backup. To estimate the largest differential backup size, take the number of days between a full backup and the last of its differential backups. For example, in the case of GFS the longest distance between a full backup and a differential backup is 15 days, so the largest differential backup size will be the size of the daily incremental data multiplied
by 15. A more accurate estimation uses the average of the first and the last differential backups of the same full backup. The data in all types of backups can be compressed. The Normal level of compression, which is used by default, usually makes the backup data about 1.5 times smaller. Deduplication significantly affects the amount of space occupied by backups. Here is the formula for calculating the initial storage space:

Storage Space (GB) = (Data Size Total * Unique Data Percentage / 100 + Data Size Total * (100 - Unique Data Percentage) / 100 / Machine Count) / Compression Ratio

To calculate the storage space after some period of time, the backup scheme and retention period should be considered (the table above can be used; see the example at the end of this section for more details).

Storage type

The main backup storage requirement is having enough capacity to store all backup data. Here are the recommendations regarding the storage type:
1. Deduplicated vault data can be organized:
   a. on the Storage Node's local hard drives (recommended for higher performance)
   b. on a network share
   c. on a Storage Area Network (SAN)
   d. on a Network Attached Storage (NAS)
2. The storage device can be relatively slow compared to the vault database disk.
3. Use RAID for redundancy.
4. Place the vault data and the vault database on drives controlled by different controllers.
5. Vault data should not be placed on the same drive as the operating system.
6. There should be plenty of free space to store all the backups and perform service operations.

The storage size depends on the amount and type of data you are going to back up and the retention rules. To estimate the storage size, take the full amount of backups of all your machines, divide it by the deduplication factor and multiply it by the number of full backups from each machine you are going to retain. Add the daily incremental data size multiplied by the number of working days the data is retained.
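The initial storage space formula can be written as a small helper (a sketch; the function name is ours, and the default Normal compression ratio of 1.5 is assumed):

```python
def initial_storage_space_gb(data_size_total, unique_pct, machine_count,
                             compression_ratio=1.5):
    """Storage Space = (all unique data + one copy of the duplicated data) / compression."""
    unique = data_size_total * unique_pct / 100
    duplicated = data_size_total * (100 - unique_pct) / 100 / machine_count
    return (unique + duplicated) / compression_ratio

# 30 machines, 1500 GB in total, 20% unique: (300 GB + 40 GB) / 1.5
print(round(initial_storage_space_gb(1500, 20, 30)))   # -> 227
```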
Flat recommendation: use 7200 RPM disks in a RAID.

Example: 30 workstations with about 1500 GB of data in total are backed up with the GFS scheme (full backups are stored indefinitely). The Daily Backup Data Percentage is 1% a day. The Unique Data Percentage is 20%. We need to calculate the capacity of the storage needed for backups after one year (52 weeks). With the GFS scheme, in a year there will be 14 full backups, 5 differential and 4 incremental backups for each machine. To calculate the full backup size we need to know the initial and the final ASN backup data sizes. During the calculations we take into account deduplication and compression.
The initial ASN backup data size is:

Initial Backup Data Size = (Data Size Total * Unique Data Percentage / 100 + Data Size Total * (100 - Unique Data Percentage) / 100 / Machine Count) / Compression Ratio = (1500 GB * 20 / 100 + 1500 GB * (100 - 20) / 100 / 30) / 1.5 = (300 GB + 40 GB) / 1.5 = 227 GB

Now we have to count the final backup data size. We count it as the initial backup size plus half of the daily change for each day (this is specific to the GFS scheme).

Final Backup Data Size = Initial Backup Data Size + (Initial Backup Data Size * Daily Backup Data Percentage / 100) * (Due Date * 5 - 1) / 2 = 227 GB + (227 GB * 1 / 100) * (52 * 5 - 1) / 2 = 227 GB + 294 GB = 521 GB

We take the average full backup size:

Average Full Backup Size = (Initial Backup Data Size + Final Backup Data Size) / 2 = (227 GB + 521 GB) / 2 = 374 GB

The Average Incremental Backup Size is taken as a daily change of the average full backup:

Average Incremental Backup Size = Average Full Backup Size * Daily Backup Data Percentage / 100 = 374 GB * 1 / 100 = 3.74 GB

For the differential backup we take the following estimation: the first differential backup in the GFS scheme contains the data for 5 days, so it is 5 * 1% = 5%. The last differential in the chain is 15 * 1% = 15%. The average is 10% of the average backup data. In gigabytes that is:

Average Differential Backup Size = Average Full Backup Size * 10 / 100 = 374 GB * 10 / 100 = 37.4 GB

The total data size for all types of backups for the specified period (we take the upper storage limit estimation, so the Final Backup Data Size is used for full backups):

Total Backup Data Size = Final Backup Data Size * 14 + Average Differential Backup Size * 5 + Average Incremental Backup Size * 4 = 521 GB * 14 + 37.4 GB * 5 + 3.74 GB * 4 = 7294 GB + 187 GB + 15 GB = 7496 GB

Additional space is occupied by the deduplication database, which stores information about where to find deduplicated data blocks, and the Catalog database, which allows fast browsing and search inside archive content.
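The whole GFS capacity walkthrough can be reproduced in a few lines of Python (a sketch; intermediate results are rounded to whole gigabytes, as in the guide):

```python
# GFS capacity example: 30 workstations, 1500 GB total, 1% daily change, 20% unique.
initial = round((1500 * 20 / 100 + 1500 * (100 - 20) / 100 / 30) / 1.5)  # 227 GB
final = round(initial + initial * 1 / 100 * (52 * 5 - 1) / 2)            # 521 GB
avg_full = (initial + final) / 2                                         # 374 GB
avg_incremental = avg_full * 1 / 100                                     # 3.74 GB
avg_differential = avg_full * 10 / 100                                   # 37.4 GB

# Upper estimate: 14 full, 5 differential, 4 incremental backups per machine per year.
total = final * 14 + avg_differential * 5 + avg_incremental * 4
print(round(total))   # -> 7496 (GB)
```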
Deduplication database

The Deduplication database stores hash values and offsets for each 256 KB file block (file-level backups) or 4 KB disk block (disk-level backups) stored in the vault. The Deduplication database size can be roughly estimated based on the fact that, in the case of disk-level backups, the Deduplication database for 500 GB of unique data takes 8 GB. For file-level backups, the Deduplication database for 500 GB takes 64 times less space (0.13 GB) because of the difference in the deduplicated block size. Although the Deduplication database is much smaller than the backup data, it is recommended to store it on reliable drives with minimal access time. Such drives are more expensive, which is why the size of the Deduplication database should be considered. The Deduplication database can be placed on a separate drive for higher performance.
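The rule-of-thumb sizes above can be turned into a quick estimator (a rough sketch based only on the 8 GB per 500 GB figure quoted above; the function name is ours):

```python
def dedup_db_size_gb(unique_data_gb, file_level=False):
    """Rough estimate: ~8 GB of database per 500 GB of unique data for disk-level
    backups (4 KB blocks); file-level backups (256 KB blocks) take ~64x less."""
    size = unique_data_gb * 8 / 500
    return size / 64 if file_level else size

print(dedup_db_size_gb(500))                    # -> 8.0
print(dedup_db_size_gb(500, file_level=True))   # -> 0.125 (about 0.13 GB)
```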
Here is the formula for the Deduplication database for disk-level backups:

Deduplication Database Size (GB) = 0.017 * Data Size - 2.12 GB

Because file-level blocks are 64 times larger, the file-level database is roughly 64 times smaller:

Deduplication Database Size (GB) = 0.00027 * Data Size

Storage type

For higher Storage Node performance, follow these recommendations:
1. The folder must reside on a fixed drive.
2. The folder size may become large: the estimation is 40 GB per 2.5 TB of used space, or about 1.6 percent.
3. The folder should not be placed on the same drive as the operating system.
4. Minimal access time is extremely important. If you back up large amounts of data a day, an enterprise-grade SSD device is highly recommended. If SSDs are not available, you can use locally attached 10000 RPM or 7200 RPM drives in a RAID 10. Processing speeds will, however, be about 5-7% slower than with enterprise-grade SSD devices.

Flat recommendation: use 7200 RPM disks in a RAID 10.

Example: there are 30 servers with 50 GB of data on each of them. The Unique Data Percentage is 20%. We need to calculate the size of the Deduplication database. As the first step, the data size is calculated:

Data Size = (Data Size Total * Unique Data Percentage / 100 + Data Size Total * (100 - Unique Data Percentage) / 100 / Machine Count) / Compression Ratio = (1500 GB * 20 / 100 + 1500 GB * (100 - 20) / 100 / 30) / 1.5 = (300 GB + 40 GB) / 1.5 = 227 GB

Now calculate the Deduplication database size:

Deduplication Database Size = 0.017 * Data Size - 2.12 GB = 0.017 * 227 GB - 2.12 GB = 3.86 GB - 2.12 GB = 1.74 GB

Catalog database

The Catalog database contains an index with information about all the files in the vault. Its size can be roughly estimated based on the fact that a Catalog database with 1 million items (file information blocks) takes about 250 MB. This is correct for both disk-level and file-level backups. So here is the formula:

Catalog Database Size (GB) = 0.00000025 GB * Number of Files

It is more convenient to operate with the same parameters in all the formulas.
For such purposes the Data Size Total can be converted to the Number of Files based on the average file size. The average file size varies depending on the machine type: for office workstations it is about 0.5 MB. If the machine stores big files such as music or video, the average size is higher.
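Combining the 250 MB per million files figure with an average file size gives a one-line estimator (a sketch; the names are ours, and the 0.5 MB office-workstation average is assumed as the default):

```python
def catalog_db_size_gb(data_size_total_gb, avg_file_size_gb=0.0005):
    """~250 MB of catalog per 1 million files, i.e. 0.00000025 GB per file."""
    number_of_files = data_size_total_gb / avg_file_size_gb
    return 0.00000025 * number_of_files

# 5 servers x 100 GB, average file size 0.5 MB -> 1,000,000 files
print(round(catalog_db_size_gb(500), 2))   # -> 0.25 (GB)
```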
Storage type

The Catalog database processing speed does not affect backup performance. This database is usually placed on the same disk where the operating system resides.

Example: there are 5 servers with 100 GB of data on each of them. We need to calculate the size of the Catalog database. Suppose the average file size is about 0.5 MB = 0.0005 GB (actually, for such a small number of machines it is possible to count the number of files on each of them).

Number of Files = 100 GB * 5 / 0.0005 GB = 1,000,000 files

Catalog Database Size = 0.00000025 GB * 1,000,000 = 0.25 GB

Other Storage Node databases

The Storage Node maintains several additional databases where it stores information about logs, tasks and other things. The size of these databases is small, does not depend on the backup parameters, and can be skipped in calculations.

CPU

CPU speed is generally not a bottleneck, so it is recommended that the Storage Node has a CPU with 2 cores x 3 GHz or 4 cores x 2.5 GHz. This is true regardless of the number of client machines that use the Storage Node.

RAM

RAM becomes a vital configuration parameter of a Storage Node in Acronis Backup & Recovery 11. Thanks to widely used caching algorithms and the 64-bit architecture, the Storage Node has become much more scalable. In most cases, adding more RAM to the Storage Node server increases the amount of data it can handle effectively. The minimal recommended amount of RAM is based on the amount of unique data to be processed by the ASN. In any case there should be no less than 8 GB of RAM. Having 8 GB of RAM (the minimal recommended RAM size) allows effective processing of 800 GB of unique data, which corresponds to 12 GB of Deduplication database size. Having 32 GB of RAM allows processing of 3700 GB of unique data (61 GB of Deduplication database). The formula is the following:

Minimal RAM Size (GB) = (4000 GB + 24 * Data Size) / 2900

Note that incremental data is mostly unique, so the amount of unique data on the machines will grow.
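The RAM formula can be checked against the two reference points above (8 GB for 800 GB of unique data, 32 GB for 3700 GB). A minimal sketch:

```python
def minimal_ram_gb(unique_data_gb):
    """Minimal RAM Size (GB) = (4000 GB + 24 * Data Size) / 2900."""
    return (4000 + 24 * unique_data_gb) / 2900

print(minimal_ram_gb(800))    # -> 8.0  (the minimal recommended configuration)
print(minimal_ram_gb(3700))   # -> 32.0
```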
Example: there are 10 servers with 2000 GB of total compressed, deduplicated backup data. We need to calculate the minimal RAM size. The calculations are simple:
Minimal RAM Size = (4000 GB + 24 * Data Size) / 2900 = (4000 GB + 24 * 2000 GB) / 2900 = 18 GB

Network Data Transfer Rate

Client-side deduplication is always turned on in Storage Node 11, so the agent does not send duplicate data to deduplicated vaults. This minimizes network traffic during backup, making it several times lower, depending on the amount of unique data and the availability of duplicated data in the backup storage. The first backup to an empty backup storage produces high traffic, as all the backup data is transferred to the server. When the indexing of this data completes, the deduplication ratio of subsequent backup data rises. That is why it is recommended, as an initial phase, to back up one or several machines first and wait until their indexing completes. The formula for disk-level backups is the following:

Average Network Traffic (Mbit/sec) = Unique Data Percentage / 100 * 28 Mbit/sec + 2 Mbit/sec

The formula for file-level backups is the following:

Average Network Traffic (Mbit/sec) = Unique Data Percentage / 100 * 17.5 Mbit/sec

In the case of simultaneous backups of several clients, the traffic is multiplied by the number of these clients.

Example: there are 10 clients with a unique data percentage of 10%. We need to define the average network traffic for the case of simultaneous disk-level backups of these 10 clients. First, calculate the traffic for the initial phase: as there is no data on the storage yet, all the backup data will be transferred, so the traffic will be maximal.
Average Network Traffic = 100 / 100 * 28 Mbit/sec + 2 Mbit/sec = 30 Mbit/sec

Second, define the network traffic for one machine:

Average Network Traffic = 10 / 100 * 28 Mbit/sec + 2 Mbit/sec = 2.8 Mbit/sec + 2 Mbit/sec = 4.8 Mbit/sec

Then, multiply this value by the number of simultaneous backups:

Average Total Network Traffic = 4.8 Mbit/sec * 10 = 48 Mbit/sec

Number of Storage Nodes

The number of Storage Nodes to be used can be based on the general backup processing speeds of a server with the recommended configuration described above:
- Disk-level backup speed: 135 GB/hour
- Disk-level indexing speed: 85 GB/hour
- File-level backup speed: 45 GB/hour
- File-level indexing speed: 204 GB/hour

To estimate how many Storage Nodes are needed, calculate the time needed to process all the backups and then compare it with the backup window. The backup window is specified for the client software, so only the backup time is compared with it. The indexing speed should be taken into account to calculate whether the Storage Node can reindex all the backups (in addition to the time spent accepting backups) within the time provided for its work (for dedicated Storage Node servers, 24 hours a day). In short, a Storage Node should accept all the backups within the specified backup window, and accept and reindex all the backups within the time provided for its work. If the time is not enough, an additional Storage Node should be added or more time provided. One more possible solution is to use the Custom backup scheme and configure longer backups to be performed in special long backup windows, for example on holidays.

Time to process backups

The amount of time depends on the amount of data to be processed and the selected backup scheme/schedule. Full backups take much longer to perform than incremental and differential ones. If all full backups do not fit into the backup window, the protected machines can be grouped to perform full backups of each group on a separate day. For example, if full backups of all 10 machines cannot be performed in one backup window, split the machines into several groups and configure the appropriate number of backup plans so that full backups are performed on different days. As backups can be accepted by the Storage Node in parallel, the time for backups can be shorter; the time needed to accept all the backups is usually divided by 2 because of parallelism. Here is a common formula:

Backup Time (hours) = Data Size / Backup Speed / 2

Based on the data above, take the backup sizes, calculate the amount of time for them and divide this calculated value by the appropriate backup window time. Do this for all backup types.
The biggest calculated value will show how many Storage Nodes are needed:

Number of Storage Nodes = max (Backup Time / Backup Window)

(all backup types and the appropriate backup windows are taken)

Example: 5 servers (500 GB of data in total) are being backed up with the GFS scheme. Disk-level backups are performed. The daily incremental data size is 2%. The backup window is 8 hours a day for any type of backup. We need to calculate how many Storage Nodes are needed to accept all the backups. Let's calculate how much time is needed to process full backups:

Full Backup Time = 500 GB / 135 GB/hour / 2 = 1.85 hours

Now count how many Storage Nodes are needed:

Number of Storage Nodes = 1.85 hours / 8 hours = 0.23

This is less than 1, so one Storage Node is enough to accept all the backups.
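The same backup-window check can be scripted (a sketch; the constants are the disk-level figures quoted above, and the function name is ours):

```python
DISK_BACKUP_SPEED = 135   # GB/hour, recommended-configuration figure
PARALLELISM = 2           # backups are accepted in parallel

def backup_window_load(data_size_gb, backup_window_hours):
    """Fraction of one Storage Node needed to accept backups within the window."""
    backup_time = data_size_gb / DISK_BACKUP_SPEED / PARALLELISM
    return backup_time / backup_window_hours

# 5 servers, 500 GB in total, 8-hour backup window:
print(round(backup_window_load(500, 8), 2))   # -> 0.23, so one Storage Node is enough
```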
We do not calculate the number for incremental backups here, as they are performed in a similar backup window but have a much smaller size.

Time to reindex backups

The backups are reindexed one by one, so the average speed of backup indexing is lower than the backup speed. On the other hand, the time available for reindexing is usually 24 hours a day (for dedicated Storage Node servers). To calculate how many Storage Nodes are needed to reindex backups from all the machines, the amount of backup data should be estimated based on the backup scheme/schedule. The reindex time is based on the amount of backup data and the reindex speed:

Reindex Time (hours) = Data Size / Reindex Speed

Based on the data above, take the backup sizes, calculate the amount of time for them and divide this by the appropriate Storage Node work hours:

Number of Storage Nodes = Reindex Time / Work Hours

Example: 10 servers and 100 workstations contain 6000 GB of data to back up. Disk-level backups are performed. The Storage Nodes are dedicated servers (working 24 hours a day). We need to calculate how many Storage Nodes are needed to reindex all the backups. The reindex time is:

Reindex Time = 6000 GB / 85 GB/hour = 71 hours

The Storage Node works full time, so the amount of time available for reindexing is 24 hours, and the number of Storage Nodes will be:

Number of Storage Nodes = 71 hours / 24 hours = 2.96

This shows that 3 Storage Nodes are needed to reindex all the backups in the available time. Alternatively, the machines can be divided into three groups with a separate backup plan for each of them; in this case one Storage Node will be able to reindex all the backup data.
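The reindex sizing can be sketched the same way; rounding up gives the whole number of Storage Nodes (the function name is ours):

```python
import math

DISK_REINDEX_SPEED = 85   # GB/hour, disk-level indexing speed quoted above

def storage_nodes_for_reindex(data_size_gb, work_hours=24):
    """Whole number of Storage Nodes needed to reindex the data in work_hours."""
    reindex_time = data_size_gb / DISK_REINDEX_SPEED   # ~71 hours for 6000 GB
    return math.ceil(reindex_time / work_hours)

print(storage_nodes_for_reindex(6000))   # -> 3
```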