Availability Modeling and Analysis for Data Backup and Restore Operations

Keywords: Backup and restore strategies, online backup, metrics, modelling methods, hourly backup.

ISSN Vol. 04, Issue 25, July 2015, Pages:

Performance and Availability Modeling of IT Systems with Data Backup and Restore

Load Balancing in Fault Tolerant Video Server

How To Improve Availability In Local Disaster Recovery

SOLUTION BRIEF KEY CONSIDERATIONS FOR BACKUP AND RECOVERY

Storage Backup and Disaster Recovery: Using New Technology to Develop Best Practices

Response time behavior of distributed voting algorithms for managing replicated data

Using HP StoreOnce Backup Systems for NDMP backups with Symantec NetBackup

Backup and Recovery 1

WHITE PAPER: DATA PROTECTION. Veritas NetBackup for Microsoft Exchange Server Solution Guide. Bill Roth January 2008

Programma della seconda parte del corso

Protecting SQL Server Databases Software Pursuits, Inc.

Protect Microsoft Exchange databases, achieve long-term data retention

High Availability and Disaster Recovery Solutions for Perforce

Eliminating End User and Application Downtime:

About Backing Up a Cisco Unity System

Symantec OpenStorage Date: February 2010 Author: Tony Palmer, Senior ESG Lab Engineer

Veritas Storage Foundation High Availability for Windows by Symantec

Cloud, Appliance, or Software? How to Decide Which Backup Solution Is Best for Your Small or Midsize Organization.

Administering and Managing Log Shipping

Composite Performance and Availability Analysis of Wireless Communication Networks

DISASTER RECOVERY BUSINESS CONTINUITY DISASTER AVOIDANCE STRATEGIES

WHITE PAPER PPAPER. Symantec Backup Exec Quick Recovery & Off-Host Backup Solutions. for Microsoft Exchange Server 2003 & Microsoft SQL Server

EonStor DS remote replication feature guide

Redefining Oracle Database Management

Fault Tolerant Servers: The Choice for Continuous Availability on Microsoft Windows Server Platform

CA ARCserve and CA XOsoft r12.5 Best Practices for protecting Microsoft SQL Server

Perforce Backup Strategy & Disaster Recovery at National Instruments

Feature Comparison. Windows Server 2008 R2 Hyper-V and Windows Server 2012 Hyper-V

Backup and Redundancy

Three-Dimensional Redundancy Codes for Archival Storage

Technical Considerations in a Windows Server Environment

EnduraData Cross Platform File Replication and Content Distribution (November 2010) A. A. El Haddi, Member IEEE, Zack Baani, MSU University

Protecting enterprise servers with StoreOnce and CommVault Simpana

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) PERCEIVING AND RECOVERING DEGRADED DATA ON SECURE CLOUD

Real-time Protection for Hyper-V

Business Continuity: Choosing the Right Technology Solution

Disaster Recovery for Oracle Database

Remote Copy Technology of ETERNUS6000 and ETERNUS3000 Disk Arrays

Yiwo Tech Development Co., Ltd. EaseUS Todo Backup. Reliable Backup & Recovery Solution. EaseUS Todo Backup Solution Guide. All Rights Reserved Page 1

EMC Backup and Recovery for Microsoft SQL Server 2008 Enabled by EMC Celerra Unified Storage

Using HP StoreOnce Backup systems for Oracle database backups

The Microsoft Large Mailbox Vision

Achieving High Availability & Rapid Disaster Recovery in a Microsoft Exchange IP SAN April 2006

CA ARCserve Family r15

A SWOT ANALYSIS ON CISCO HIGH AVAILABILITY VIRTUALIZATION CLUSTERS DISASTER RECOVERY PLAN

Technical White Paper. Symantec Backup Exec 10d System Sizing. Best Practices For Optimizing Performance of the Continuous Protection Server

Business white paper. environments. The top 5 challenges and solutions for backup and recovery

Symantec Storage Foundation High Availability for Windows

Using HP StoreOnce D2D systems for Microsoft SQL Server backups

Backup Exec 2014: Protecting Microsoft SQL

The Benefits of Continuous Data Protection (CDP) for IBM i and AIX Environments

Surround SCM Backup and Disaster Recovery Solutions

Backup and Recovery Solutions for Exadata. Ľubomír Vaňo Principal Sales Consultant

Virtual Infrastructure Security

Backup Exec 2014: Protecting Microsoft SharePoint

Cover sheet. How do you create a backup of the OS systems during operation? SIMATIC PCS 7. FAQ November Service & Support. Answers for industry.

Mayur Dewaikar Sr. Product Manager Information Management Group Symantec Corporation

TIBCO StreamBase High Availability Deploy Mission-Critical TIBCO StreamBase Applications in a Fault Tolerant Configuration

Disaster Recovery Solutions for Oracle Database Standard Edition RAC. A Dbvisit White Paper

Connectivity. Alliance Access 7.0. Database Recovery. Information Paper

What You Should Know About Cloud- Based Data Backup

BUSINESS CONTINUITY AND DISASTER RECOVERY FOR ORACLE 11g

Global Headquarters: 5 Speen Street Framingham, MA USA P F

AN EFFICIENT DISTRIBUTED CONTROL LAW FOR LOAD BALANCING IN CONTENT DELIVERY NETWORKS

EMC Data Domain Boost for Oracle Recovery Manager (RMAN)

Enterprise IT is complex. Today, IT infrastructure spans the physical, the virtual and applications, and crosses public, private and hybrid clouds.

Fault-Tolerant Framework for Load Balancing System

How To Create A Multi Disk Raid

SAN Conceptual and Design Basics

CROSS PLATFORM AUTOMATIC FILE REPLICATION AND SERVER TO SERVER FILE SYNCHRONIZATION

Backup and Recovery Solutions for Exadata. Cor Beumer Storage Sales Specialist Oracle Nederland

Whitepaper: Back Up SAP HANA and SUSE Linux Enterprise Server with SEP sesam. Copyright 2014 SEP

A Fast Path Recovery Mechanism for MPLS Networks

Lunch and Learn: Modernize Your Data Protection Architecture with Multiple Tiers of Storage Session 17174, 12:30pm, Cedar

VERITAS Backup Exec 9.1 for Windows Servers. Intelligent Disaster Recovery Option

Designing a Cloud Storage System

Keys to Successfully Architecting your DSI9000 Virtual Tape Library. By Chris Johnson Dynamic Solutions International

MySQL Enterprise Backup

EMC NETWORKER AND DATADOMAIN

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE

Veritas Backup Exec : Protecting Microsoft SharePoint

Protect SAP HANA Based on SUSE Linux Enterprise Server with SEP sesam

Pre-scheduled Control of Online Device-Based Backup and Incremental Data Backup in VMware Virtual System Environment

Module: Business Continuity

Computer-Aided Disaster Recovery Planning Tools (CADRP)

STUDY AND SIMULATION OF A DISTRIBUTED REAL-TIME FAULT-TOLERANCE WEB MONITORING SYSTEM

DDOS WALL: AN INTERNET SERVICE PROVIDER PROTECTOR

How Routine Data Center Operations Put Your HA/DR Plans at Risk

Networking. Cloud and Virtual. Data Storage. Greg Schulz. Your journey. effective information services. to efficient and.

Introduction. Silverton Consulting, Inc. StorInt Briefing

Technology Insight Series

Efficient Scheduling Of On-line Services in Cloud Computing Based on Task Migration

Backup Exec 15: Protecting Microsoft SQL

Delivering Fat-Free CDP with Delphix. Using Database Virtualization for Continuous Data Protection without Storage Bloat.

Connectivity. Alliance Access 7.0. Database Recovery. Information Paper

Protecting Microsoft SQL Server with an Integrated Dell / CommVault Solution. Database Solutions Engineering

Big data management with IBM General Parallel File System

Backup and Recovery for SAP Environments using EMC Avamar 7

Transcription:

Availability Modeling and Analysis for Data Backup and Restore Operations

Xiaoyan Yin 1, Javier Alonso 1, Fumio Machida 2, Ermeson C. Andrade 3, Kishor S. Trivedi 1

1 Department of Electrical and Computer Engineering, Duke University, Durham, USA. {xiaoyan.yin, javier.alonso}@duke.edu, kst@ee.duke.edu
2 Knowledge Discovery Research Laboratories, NEC Corporation, Kawasaki, Japan. f-machida@ab.jp.nec.com
3 Informatics Center, Federal University of Pernambuco (UFPE), Recife, PE, Brazil. ecda@cin.ufpe.br

Abstract: Data backup is an essential part of common IT system administration, protecting against data loss caused by storage failures, human errors, or disasters. Lost data can be recovered from the backed-up data if it exists. Since backup and restore operations accrue downtime overhead or performance degradation, they have to be designed to ensure data reliability while minimizing the performance and availability overhead. In this paper, we study the impact of different backup policies on availability measures such as storage availability, system availability, and user-perceived availability. Backup and restore operations are designed using SysML Activity diagrams that are automatically translated into Stochastic Reward Nets (SRNs) to compute the availability measures. Our numerical results show the effectiveness of the combination of full backup and partial backup in terms of user-perceived data availability and data loss rate. Furthermore, the sensitivity ranking can help improve the availability measures.

Keywords: availability; data backup; data restore; Stochastic Reward Net (SRN); storage system

I. INTRODUCTION

Data backup is a common system administrative operation to protect data from the risk of data loss due to hardware and software failures, human errors, thefts, computer viruses, and natural disasters.
Even though data replication and mirroring techniques are widely adopted in modern storage systems, data backup still plays an important role for data protection at the infrastructure level of system administration. While data replication is used to provide high availability of data, data backup is complementary, especially for archiving data or for disaster recovery purposes. A backup system allows recovering data on the storage system in case of an unexpected data loss. No matter what sort of storage media is used, the restore operation is indispensable for data recovery. There are a number of strategies for data backup, covering online or offline operation, full or partial backup, backup frequencies, and the choice of storage media [1]. The storage media typically dominates the cost and the speed of backup and restore operations. System owners choose the appropriate storage media, from tape drives to Fibre Channel SAN (FC-SAN), according to their budget limitations. A backup carried out without interrupting service access to the data is called an online backup, while a backup that necessitates data downtime is called an offline backup. In an online backup, data operations may be carried out during the backup, and hence maintaining data consistency is important. Another aspect of backup types is the scope of data to be backed up: full or partial. A full backup takes a complete copy of the data at once, while a partial backup copies only the data that differs from an earlier backup. The effectiveness of data backup is also affected by the frequency of backups. The backup frequency impacts the recovery point objective (RPO), a metric for disaster recovery design. The RPO represents the maximum allowable time period in which data might be lost due to a system failure. A higher backup frequency achieves a lower RPO, but may decrease the availability or performance of the system.
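As a rough illustration of the frequency/RPO relationship described above, the sketch below bounds the worst-case data-loss window for strictly periodic backups. It is not part of the paper's model: it ignores restore time, and the two-hour backup duration is an invented example.

```python
# Illustrative sketch (not from the paper's model): a crude worst-case
# bound on the data-loss window for strictly periodic backups. A failure
# just before a backup completes loses everything written since the
# previous completed backup started: one interval plus one backup run.
from datetime import timedelta

def worst_case_rpo(backup_interval: timedelta,
                   backup_duration: timedelta = timedelta(0)) -> timedelta:
    return backup_interval + backup_duration

daily = worst_case_rpo(timedelta(days=1), timedelta(hours=2))
weekly = worst_case_rpo(timedelta(weeks=1), timedelta(hours=2))
assert daily < weekly   # more frequent backups give a tighter RPO bound
```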
Even though data backup and restore operations aim at achieving high availability, the operations often accrue downtime overhead, risk of operational errors, and performance degradation. Accordingly, backup and restore operations have to be designed carefully under the trade-off between the availability improvements and the downtime/performance overheads they cause. Such operation designs need to be continuously revised according to changes in business, technologies, and the amount of data. Thus, the efficient and accurate assessment of storage system availability is a fundamental challenge of operation design in practice. In this paper, we present a framework to evaluate the impact of backup and restore operations on system availability, to assist system engineers in designing effective backup and restore operations. In our previous work [2][3], the Candy framework was presented for composing analytic models semi-automatically from system specification models described in SysML [4]. Candy is extended in this paper by introducing new notation methods and associated translation rules, so as to enable system engineers to describe their backup and restore operations efficiently in SysML diagrams. The SysML diagrams are translated into stochastic reward nets (SRNs), which are an extension of stochastic Petri nets [2], and several availability measures are computed by solving the SRN. For the purpose of data availability analysis, we introduce five quantitative measures: three of them characterize the availability of systems from different perspectives (storage level, system level, and user-perceived level); the other two characterize the system performance degradation in terms of data loss rate and data loss ratio. These measures are used to evaluate the backup strategies from different perspectives. The numerical results show the effectiveness of a judicious combination of full backup and partial backup.
Moreover, the sensitivity ranking of parameter values provides valuable information for system engineers to improve their operations toward a highly available system. The rest of this paper is organized as follows: Section II reviews related work. Section III introduces the Candy framework, the new specifications, and the corresponding translation rules. Section IV describes the backup configurations and policies under analysis, specified in SysML and translated into SRNs by Candy. Section V defines the availability and performance measures. Section VI presents the numerical results for five backup policies, and Section VII concludes the paper.

II. RELATED WORK

The importance of backup systems is reflected by the wide range of commercial solutions offered by industry [5][6][7][8]. However, the backup process causes performance degradation of the storage system in online backups, or even service interruption in offline backups. In order to minimize the backup time, several approaches have been proposed [9][10][11]. In [9], the authors present a backup scheduling approach, based on integer programming, to minimize the overall backup time. From past backup experience, a job duration and a job throughput are assigned to each job to calculate the optimal backup job schedule. In [10], the authors present an analytical study to determine the optimal number of incremental backups between two full backups, minimizing the overall cost of the backups. The volume of data is a key factor in backup time. Data de-duplication techniques have been proposed to reduce the amount of replicated data in storage systems [11]. Reducing the volume of data by de-duplication can help shorten the total backup time. A reliability discrete-event simulation of storage systems with and without de-duplication is conducted in [12]; the paper indicates that de-duplication has a negative impact on reliability, but a positive impact on backup data volume. Data de-duplication is also used in a commercial storage appliance for data backup [6]. Another way to reduce the backup time is to use distributed backup storage systems. In [13], the authors present a distributed backup strategy which backs up data efficiently considering the network bandwidth, while faster data access and recovery is guaranteed.
FOBSM [14] provides an architecture for developing a fault-tolerant backup system using Object-Z. Following the Object-Z reasoning rules, the fault-tolerance properties are analyzed. In the case of partial backup, the backup time can also be reduced by using a delta compression technique [15]. For disaster recovery design, a dependability model considering disk arrays, data backup, and remote mirroring is presented in [16]. In contrast to these works, we focus on the composition of analytic models for evaluating the efficiency of backup operations from their design specifications. There are many design choices of backup strategies according to system configurations; therefore, we need a general framework to compose analytic models for analyzing the availability of the systems and their operations. Model translation techniques for composing analytic models from system specification models have been studied in the literature. However, most of them focus on model translation aimed at evaluating performance [17][18], and few address the evaluation of availability [19][20]. Furthermore, these studies have not addressed the impact of maintenance operations such as backup and restore. In this paper, we use Candy [2][3] to incorporate the specification of backup procedures described in SysML into the analytic model for availability analysis of storage systems.

[Figure 1. Availability modeling steps using Candy: (1) model translation of SysML IBD, STM, and AD diagrams into model components; (2) model assembly into the System SRN; (3) model synchronization of Activity SRNs by assigning guard functions.]

III. CANDY FRAMEWORK

This section overviews the Candy framework and presents the new extensions required for modeling data backup scenarios. Candy is a modeling framework to compose SRNs for availability analysis from the specifications of IT systems described in SysML.
The overview of the model composition procedure in Candy is depicted in Fig. 1. Candy supports three types of SysML diagrams as input models: Internal Block Diagrams (IBD), State Machine Diagrams (STM), and Activity Diagrams (AD). These SysML diagrams are translated into availability model components, which are assembled and subsequently synchronized together according to the dependencies specified in SysML. The procedure can be divided into three steps: 1) model translation, 2) model assembly, and 3) model synchronization. In the first step, all the input SysML diagrams are translated into model components according to predetermined translation rules. In the second step, the model components translated from IBDs and STMs are assembled into an SRN subnet called the System SRN. SysML Allocation [4] represents the dependencies between SysML elements and is used to identify the dependencies among the model components. In the third step, the model components translated from ADs (called Activity SRNs) are synchronized with the System SRN by defining additional guard functions that implement the dependencies among the SRNs. The dependency information is supplied in the stereotypes of SysML elements (such as action, block, and allocation) or is obtained through interactions with system engineers, who are the users of Candy. The obtained SRN is solved to compute availability measures using software packages supporting SRNs, such as SHARPE [22] and SPNP [23].

A. Extensions

To simplify the system specification and to help system engineers describe the dependencies inherent in their systems correctly with minimal effort, we introduce new notation methods in compliance with the SysML specification.

1) Composite state

A composite state is a type of state used in an STM which may contain one region or be decomposed into two or more regions [24]. A region is defined as an orthogonal part of either a composite state or a state machine. Any state enclosed within a

region of a composite state is called a substate of that composite state. Using composite states, designers can simplify the description of complex state transitions by aggregating common internal transitions into a composite state. To support the notation of composite states in STMs, we introduce a new translation rule into Candy. First, an STM containing composite states and the corresponding STMs representing the internal state transitions within those composite states are translated into SRNs separately. Fig. 2 shows an example of the model translation for the composite state UP. Fig. 2(b) presents the internal state transitions of the composite state UP in Fig. 2(a). According to the basic translation rules for STMs, these models are translated into the corresponding SRNs presented in Fig. 2(c) and (d), respectively. The state transitions among the substates (S1, S2, and S3) of the composite state UP occur only when the system is in the composite state. In other words, if the system is not in the composite state UP, all the transitions inside it are disabled. An SRN can capture this relationship through guard functions that disable the transitions. In Fig. 2(d), three guard functions G1, G2, and G3 are assigned to the transitions T1, T2, and T3. The transitions are thus enabled only when a token is deposited in P_up.

[Figure 2. Model translation for a composite state: (a) STM for a system operation with composite state UP; (b) lower-level STM for the composite state UP; (c) SRN obtained from the STM in (a); (d) SRN obtained from the STM in (b).]

[Table 1. Guard functions for Fig. 2(d): G1, G2, G3 = if (#(P_up) == 1) 1 else 0.]
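The guard logic of Table 1 can be made concrete with a minimal sketch. The marking representation and helper names below are ours, not Candy's: a marking is a dict from place name to token count, and each guard fires only while P_up holds a token, which is exactly how the UP-internal transitions are frozen when the system leaves the composite state.

```python
# Minimal sketch of the composite-state guards of Table 1. A marking is
# modeled as a dict from place name to token count; helper names are
# illustrative, not Candy's API. G1, G2, and G3 all test for a token in
# P_up, so substate transitions fire only while the system is UP.
def make_guard(place):
    """Guard of the form: if #(place) == 1 then 1 else 0."""
    return lambda marking: 1 if marking.get(place, 0) == 1 else 0

g1 = g2 = g3 = make_guard("P_up")   # identical conditions, per Table 1

def enabled(guard, marking):
    return guard(marking) == 1

marking_up = {"P_up": 1, "P_S1": 1}      # token in P_up: system is UP
marking_down = {"P_up": 0, "P_down": 1}  # no token: internal net frozen
assert enabled(g1, marking_up) and not enabled(g1, marking_down)
```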
The definitions of the guard functions are listed in Table 1 and can be generated automatically.

2) Accept event action

An AcceptEventAction is generally used in an AD to represent an action that waits for the occurrence of an event meeting a specified condition [24]. Such an event occurrence is often associated with a state change described in an STM. The dependency between an action node and state transitions in an STM needs to be made explicit in the SRN to compute availability measures correctly. To make system engineers explicitly specify the condition of event acceptance, we define a new stereotype. The stereotype indicates the existence of a dependency from the AcceptEventAction node to an STM. The state which satisfies the condition of the AcceptEventAction is specified in the text on the node. An example of an action node with this stereotype is illustrated in Fig. 3. As denoted by S2 in the AcceptEventAction node in Fig. 3(b), the condition of the accept event action is satisfied when the state changes to S2 in the STM of Fig. 3(a). Figs. 3(c) and (d) show the SRNs translated from the STM and the AcceptEventAction, respectively. The guard function G_S2_waiting is defined as shown in Table 2 to enable the transition T_S2_waiting when a token is deposited in P_S2. Notice that this is the counterpart of an action node with the stereotype <<control>>, which represents the scenario in which the execution of the action node triggers a state transition in an STM [2].

[Figure 3. Model translation for an AcceptEventAction: (a) STM for a system operation; (b) AcceptEventAction node; (c) SRN obtained from the STM in (a); (d) translation rule for the action node in (b).]

[Table 2. Guard function generated from an AcceptEventAction waiting for the state change to S2: G_S2_waiting = if (#(P_S2) == 1) 1 else 0.]

3) Decision node

A decision node is a control node in an AD which routes a flow to one of its outgoing edges. The decision condition is described in a comment with the stereotype <<decisionInput>> [4]. The basic translation rule for the decision node has been described in [2][3].
If the outgoing control flows of the decision node depend on the system states, the guard functions of the translated Activity SRN can be defined using markings of the System SRN in the model synchronization step. To help system engineers conduct model synchronization for time-based decision conditions (e.g., whether or not it is Sunday), we define an additional synchronization rule for time-based decision conditions, as shown in Fig. 4. Fig. 4(a) shows an example of a decision node with a time condition, and Fig. 4(b) presents the corresponding parts of the SRN. The number of tokens in the place P_days represents the day of the week. Since there are seven days in a week, we initialize P_days with 6 tokens, where token counts 6 down to 0 represent Monday through Sunday, respectively. Suppose there is a token in P_out_decn. If there are n (1 <= n <= 6) tokens in P_days (i.e., it is not Sunday), the transition T_out2_dec is enabled, whereas the transition T_out_dec is disabled by an inhibitor arc (an inhibitor arc from a place to a transition means that the transition is enabled only when the corresponding place contains no tokens). If there is no token in P_days (i.e., it is Sunday), T_out_dec is enabled whereas T_out2_dec is disabled. Since T_out_dec is an immediate transition, it fires in zero time once it is enabled, and hence six tokens are immediately deposited back into P_days through an output arc of cardinality six (i.e., the next week is initialized). Such synchronization rules for time-based decision conditions can easily be generalized to other types of periods, such as hours, months, and years.

[Figure 4. Model translation for a time-based decision node: (a) decision node in the AD; (b) SRN modified in the synchronization process.]

IV. DATA BACKUP AND RESTORE MODEL TRANSLATION

A. System description

To focus on the evaluation of general backup and restore operations on a storage system, we consider a simple direct-attached storage (DAS) system which is directly connected to a server. The backup is assumed to be performed over a dedicated network, and the backup data is stored in a backup storage system attached to another server in the same network. Data loss can occur in the primary storage system for various reasons, such as hardware failure, human error, software corruption, and virus infection. When data is lost and cannot be recovered by normal means, a data restore operation is required to recover the data from the backup storage. In this paper, we focus on the offline backup scenario; however, our model composition and evaluation techniques are not limited to it. As mentioned in Section I, there are two main backup approaches based on the data to be backed up: full backup and partial backup. Partial backup can be further divided into two types: incremental and differential.
Incremental backup copies the files added or modified since the last full or partial backup, whereas differential backup copies the files added or modified since the last full backup. Since the time to take a backup and the time to restore data depend on the backup type, it is important to choose the type of backup and its frequency in consideration of system availability and performance. Considering the full and partial backup options together with their frequencies, we study five backup policies in this paper: p1: full backup every month; p2: full backup every week; p3: full backup every day; p4: full backup every Sunday and 3 partial differential backups in between; and p5: full backup every Sunday and 6 partial differential backups in between (one partial backup for each day except Sunday). To better present the model translations based on the Candy framework, we use the last backup policy, p5, to demonstrate the Candy translation process in the following subsections.

B. STM for the storage system and its translation

Fig. 5(a) shows the STM representing the failure and recovery behavior of the storage system. The STM contains a composite state SYSTEM UP, which represents that the storage system is neither in a failure state nor under recovery. The storage system remains in SYSTEM UP as long as no failure occurs. When a failure occurs without resulting in data loss, the system enters the state Partial failure. Such a failure is subsequently detected by a monitoring mechanism, and the system returns to SYSTEM UP after the storage system is restarted. When a storage failure results in data loss (entering the state Data Loss), it is detected by the monitoring mechanism (state Detected) and a system administrator is summoned for manual recovery. The responsible system administrator analyzes the causes and impacts of the failure and performs data restoration from the backup data. The state changes to Restored when the data restore finishes.
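The incremental/differential distinction defined above can be sketched as timestamp-based file selection. The file names, timestamps, and helper below are illustrative, not part of the paper's model: an incremental backup selects files changed since the most recent backup of any kind, a differential backup selects files changed since the most recent full backup.

```python
# Sketch of the incremental vs. differential selection rule. File names
# and epoch-second timestamps are invented examples.
def files_to_back_up(mtimes, last_full, last_partial, mode):
    """mode='incremental': changed since the last backup of any kind.
       mode='differential': changed since the last *full* backup."""
    baseline = max(last_full, last_partial) if mode == "incremental" else last_full
    return {f for f, t in mtimes.items() if t > baseline}

mtimes = {"a.db": 100.0, "b.db": 250.0, "c.db": 400.0}
# Full backup at t=200, a partial backup at t=300:
inc = files_to_back_up(mtimes, last_full=200.0, last_partial=300.0, mode="incremental")
dif = files_to_back_up(mtimes, last_full=200.0, last_partial=300.0, mode="differential")
assert inc == {"c.db"}          # only changes since the partial at t=300
assert dif == {"b.db", "c.db"}  # all changes since the full at t=200
```

A differential backup therefore grows until the next full backup, while an incremental backup stays small but requires the whole chain of partial backups at restore time.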
The system returns to the state SYSTEM UP after the storage system is restarted. When the storage system is in the composite state SYSTEM UP, the backup procedure can be performed. The state transitions associated with data backup operations within SYSTEM UP are described in another STM, shown in Fig. 5(b). The backup procedure starts from the state Internal UP, which represents that the storage system is up and no backup is in progress. When either a full or a partial backup is triggered by an action in an AD (described later in Sections IV.C and IV.D), the storage system enters the state Full or Partial, respectively.

[Figure 5. STM for the storage system: (a) failure and recovery behavior; (b) backup behavior within the composite state SYSTEM UP.]

During this backup process, the backup operation

may fail (entering the state F. failure or P. failure) and subsequently be recovered before finishing successfully (entering the state Successful Full or Successful Partial). Failures of backup operations are caused by errors in backup jobs, human errors, or lack of storage space. The storage system is restarted and put back online after a successful backup, i.e., it returns to Internal UP. The state transitions in Fig. 5(b) are enabled only when the storage system is up, i.e., when it is in the state SYSTEM UP of Fig. 5(a). Figs. 6(a) and (b) present the SRN translations of the two STMs in Figs. 5(a) and (b), respectively. Since the internal STM in Fig. 5(b) depends on the corresponding composite state SYSTEM UP, the translated SRN in Fig. 6(b) depends on whether there is a token in the place P_UP of the SRN in Fig. 6(a). As introduced in Section III.A.1, such a composite-state dependency is incorporated by defining guard functions that disable all transitions in the SRN translated from the internal STM when no token is deposited in P_UP. The definitions of the guard functions are revised in the model synchronization step discussed in Section IV.E.

C. AD for the full backup procedure and its translation

As mentioned earlier, we use policy p5 to show the model translation of the backup procedures. For backup policy p5, one full backup is performed every Sunday and six partial backups are performed, one each day except Sunday. We specify the full backup procedure and the partial backup procedure in separate ADs. The AD for the full backup procedure is presented in Fig. 7.
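The control flow of the full backup AD in Fig. 7 can be sketched as a scheduled script. The callback names below are hypothetical stand-ins (not Candy's or any backup tool's API); only the flow (status check, alert-or-backup decision, wait for success, confirmation signal) comes from the diagram description.

```python
# Control-flow sketch of the full backup AD (Fig. 7); the callbacks are
# hypothetical stand-ins injected by the caller.
def run_full_backup(storage_is_up, start_full_backup,
                    wait_for_success, issue_alert, notify_partial):
    """At 1 am every Sunday: check status, then back up or alert."""
    if not storage_is_up():          # action node "Check storage status"
        issue_alert()                # storage alert to administrators
        return False
    start_full_backup()              # <<control>>: Internal UP -> Full
    wait_for_success()               # accept event: Successful Full
    notify_partial()                 # signal "Full B. Confirmed"
    return True

log = []
ok = run_full_backup(lambda: True,
                     lambda: log.append("full"),
                     lambda: log.append("done"),
                     lambda: log.append("alert"),
                     lambda: log.append("signal"))
assert ok and log == ["full", "done", "signal"]
```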
The procedure is assumed to be executed automatically by a scheduled command script. At 1 am every Sunday, the storage status is first checked by the action node Check storage status. If the storage system is in failure or under recovery (i.e., not in SYSTEM UP of Fig. 5(a)), a storage alert message is issued to the system administrators, who are responsible for further maintenance operations. If the storage system is working properly (i.e., it is in SYSTEM UP), a full backup of the storage system is triggered by the action node Startup Full backup, and the storage system status changes from Internal UP to Full in Fig. 5(b). The AcceptEventAction node Full Success, with the stereotype defined in Section III.A.2, denotes the process waiting for the successful completion of the full backup (i.e., the storage system's status reaches Successful Full in Fig. 5(b)). When the full backup is done, a signal is generated and transmitted to the next activity (the AcceptEventAction Full B. Confirmed in Fig. 9 for the partial backup activities). The AD for the full backup procedure is translated into the Activity SRN presented in Fig. 8, based on the AD translation rules [2] and the additional rules (see Section III.A.2). Since the AD for the full backup procedure has dependencies on the AD for the partial backup procedure and on the STM, its SRN likewise depends on the corresponding SRNs. These dependencies are implemented by introducing additional guard functions in the model synchronization step discussed later in Section IV.E.

D. AD for the partial backup procedure and its translation

Under backup policy p5, partial backups are performed every day except Sunday. The AD for the partial backup procedure is presented in Fig. 9. If the full backup is done successfully, a signal is received by the accept event action
(b) Storage system SRN for backup behavior of Fig. 5(b)

Figure 6. Storage system SRN

Figure 7. AD for full backup procedure

Figure 8. Full backup activity SRN

Figure 9. AD for partial backup procedure

Figure 10. Partial backup activity SRN

The accept event action Full B. Confirmed triggers the subsequent partial backup actions. At the scheduled time every day, if it is Sunday, no partial backup activities are performed. Otherwise, similarly to the full backup activities, the storage status is first checked, and one partial backup is triggered if the storage system is available (i.e., in SYSTEM UP). After the completion of the partial backup, the procedure waits for the next partial backup trigger. The AD for the partial backup procedure is translated into the Activity SRN presented in Fig. 10.
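The dependencies described next (Section IV.E) are realized as guard functions: Boolean predicates over the marking of the other SRN that enable or disable a transition. A minimal sketch of this mechanism, using the paper's place and guard names but an assumed dictionary representation of markings:

```python
# Marking-dependent guard functions: transitions of an Activity SRN are
# enabled only under certain markings of the System SRN.
marking = {"P_UP": 1, "P_Internal_UP": 1, "P_in_csfbkp": 0}

def G_out_decn(m):
    # "[Yes]" branch of the decision node: storage system is up.
    return 1 if m["P_UP"] == 1 else 0

def G_out2_decn(m):
    # "[No]" branch: storage system is failed or under recovery.
    return 0 if m["P_UP"] == 1 else 1

def enabled(guard, m):
    # A guarded transition may fire only when its guard evaluates to 1.
    return guard(m) == 1

print(enabled(G_out_decn, marking))    # True while a token is in P_UP
print(enabled(G_out2_decn, marking))   # False in the same marking
```

Tools such as SPNP evaluate guards of this shape on every marking reached during state-space generation, which is how the decision-node branching of the ADs is reproduced in the SRN.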
Given the dependencies between the storage system behavior, the full backup activities, and the partial backup activities, model synchronization is discussed in the next section.

E. Model synchronization

As a result of the model translation steps, a System SRN for the storage system (Fig. 6) and two Activity SRNs representing the full and partial backup procedures are obtained. There are four types of dependencies between the System SRN and the Activity SRNs, inherited from the dependencies in the original STMs and ADs. In the model synchronization step, those dependencies are embodied in the SRNs by assigning additional guard functions.

1) Decision node

In the Activity SRNs, there are two decision nodes that branch a flow depending on the status of the storage system. Since the status of the storage system is represented in the System SRN, the markings of the System SRN are used to define the guard functions G out_decn, G out2_decn, G out_decn4, and G out2_decn4, which represent the decision conditions. The definitions of these guard functions are summarized in Table 4 and Table 5.

2) Action node stereotyped <<control>>

The action nodes performing backup are denoted with the stereotype <<control>>, which indicates that the action induces state changes in the System SRN. For instance, the action node Startup Full backup in Fig. 7 causes the state transition from Internal UP to Full in Fig. 5(b). To implement those dependencies in the SRN, we synchronize the Activity SRN and the System SRN by assigning guard functions to the

associated transitions, using the method presented in [2]. G in_sfbkp, G out_sfbkp, G in_spbkp, and G out_spbkp in Table 3, G in_csfbkp and G out_csfbkp in Table 4, and G in_cspbkp and G out_cspbkp in Table 5 are obtained by this model synchronization.

3) Stereotyped accept event action

As mentioned in Section III.A.2, this stereotype is associated with an AcceptEventAction node that waits for the occurrence of a state change represented in an STM. In the Activity SRNs, those nodes are used for the waiting actions for full backup success (Fig. 7) and partial backup success (Fig. 9). It is straightforward to correlate those AcceptEventAction nodes with the states Successful Full and Successful Partial in the STM in Fig. 5(b). When the correlation is specified, the guard functions for T waiting and T waiting2 can be generated to represent those dependencies in the Activity SRNs. The definitions of G waiting and G waiting2 are shown in Table 4 and Table 5.

4) Send signal action and accept event action

The connection between the full backup AD and the partial backup AD is represented by the pair of a SendSignalAction node and an AcceptEventAction node. When the full backup is successfully performed on Sunday, the activity for the partial backup operation is triggered for the subsequent weekdays. When the pair is specified, the relationship can be represented in the SRN by the guard function G out_fb_conf assigned to the transition T out_fb_conf. The definition of G out_fb_conf is shown in Table 5. Assigning these guard functions, we finally obtain the SRN that integrates all the System SRNs and Activity SRNs.

Table 3.
Storage system SRN guard functions

G in_sfbkp : if (#(P UP)==1 and #(P in_csfbkp)==1) 1 else 0
G out_sfbkp : if (#(P UP)==1 and #(P out_csfbkp)==1) 1 else 0
G in_spbkp : if (#(P UP)==1 and #(P in_cspbkp)==1) 1 else 0
G out_spbkp : if (#(P UP)==1 and #(P out_cspbkp)==1) 1 else 0
G FBkp : if (#(P UP)==1) 1 else 0
G PBkp : if (#(P UP)==1) 1 else 0
G STRstart : if (#(P UP)==1) 1 else 0
G STRstart2 : if (#(P UP)==1) 1 else 0
G FBFail : if (#(P UP)==1) 1 else 0
G PBFail : if (#(P UP)==1) 1 else 0
G FBRec : if (#(P UP)==1) 1 else 0
G PBRec : if (#(P UP)==1) 1 else 0

Table 4. Full backup activity SRN guard functions

G out_decn : if (#(P UP)==1) 1 else 0
G out2_decn : if (#(P UP)==1) 0 else 1
G in_csfbkp : if (#(P in_sfbkp)==1) 1 else 0
G out_csfbkp : if (#(P in_sfbkp)==0) 1 else 0
G waiting : if (#(P FBSucc)==1) 1 else 0
G out_fbdone : if (#(P FB_conf)==0) 1 else 0

Table 5. Partial backup activity SRN guard functions

G out_fb_conf : if (#(P in_fbdone)==1) 1 else 0
G out_decn4 : if (#(P UP)==1) 1 else 0
G out2_decn4 : if (#(P UP)==1) 0 else 1
G in_cspbkp : if (#(P in_spbkp)==1) 1 else 0
G out_cspbkp : if (#(P in_spbkp)==0) 1 else 0
G waiting2 : if (#(P PBSucc)==1) 1 else 0

V. OUTPUT MEASURES

To compute the availability and performance impacts of data backup and restore operations, we introduce five metrics of interest on the model. Three of them characterize the availability of the system from different perspectives, while the other two characterize the system performance degradation in terms of data loss rate and data loss ratio.

A. Storage System Availability

Storage system availability A s is defined as the steady-state probability that the storage system is working properly without any storage failures. Fig. 6 represents the state transitions of the storage system. The storage system is considered to be working properly if a token is in P UP. Hence the reward function for computing A s is defined as R 1, shown in Table 7.

B.
System Data Availability

System data availability A d is defined as the steady-state probability that the data in the storage system is accessible. Even though the storage system is working properly, the data is not accessible during a backup operation in the case of offline backup. As shown in the SRN in Fig. 6, the storage system data is accessible only when a token is in P Internal_UP. The reward function for A d is defined as R 2, shown in Table 7.

C. User-perceived Data Availability

Consider an online application service using the storage system, where users access the data in the storage system via the application service. Such an application service usually has a maintenance period in which users cannot access the service (i.e., the service time for users is limited). User-perceived data availability A u is defined as the steady-state probability that data is accessible to users of the application during the service time. In common system administration, offline backup is scheduled outside the service time so as not to interfere with user accesses. However, delays of backup operations or unexpected storage failures might affect the user-perceived data availability during the service time. A separate SRN model interacting with the backup activity SRN model is constructed to compute the steady-state user-perceived data availability, as shown in Fig. 11. If there is a token in place P in_service, users are allowed to access services. The transition T BKstarted is enabled if either a full backup or a partial backup starts. If there is a token in place P out_service, no service is available for users. A clock for a period of out-of-service time (e.g., seven hours) is triggered as soon as there is a token in place P out_service (i.e., the transition T clocksc is enabled).

Figure 11. SRN for user-perceived data availability computation

Table 6. Guard functions for the SRN in Fig. 11

G BKstarted : if (#(P out_clock)==1 or #(P out_clock2)==1) then 1 else 0
G clocksc : if (#(P out_service)==1) then 1 else 0

Table 7. Reward functions for availability measures

R 1 : if (#(P UP)==1) then 1 else 0
R 2 : if (#(P UP)==1 and #(P Internal_UP)==1) then 1 else 0
R 3 : if (#(P UP)==1 and #(P Internal_UP)==1 and #(P in_service)==1) then 1 else 0
R 4 : if (#(P in_service)==1) then 1 else 0

After the out-of-service time expires, the transition T ServiceStart is enabled. The guard functions for the transitions in the SRN in Fig. 11 are presented in Table 6. The user-perceived steady-state availability A u is given by

A u = P{data is accessible | service is available} = P{data is accessible and service is available} / P{service is available}.

Therefore, two reward functions are computed first in order to calculate the user-perceived data availability, as shown in Table 7. Reward function R 3 is used to compute the steady-state probability that data is accessible and service is available, whereas reward function R 4 computes the proportion of the service time over the whole time period. The user-perceived steady-state data availability A u is then calculated as R 3 / R 4.

D. Data Loss Rate

When a storage system fails with data loss and no alternative data replicas or transaction logs are available, all the data altered (i.e., transactions processed) since the last backup is lost. The data loss rate refers to the number of transactions lost per hour. Assume transaction arrivals follow a Poisson process with rate λ a. As shown in Fig. 12, the storage system can encounter a failure at any time instant t between two backups.

Figure 12. Data loss due to a storage system failure
If such a failure results in data loss (with probability 1-c), the number of transactions lost is the number of transactions processed between the last backup and the time that the storage system fails. Therefore, for each storage system failure resulting in data loss, the number of transactions lost is, on average, the number processed during half of the backup period. Thus the data loss rate γ (#transactions/hour) is:

γ = A d · λ f · (1-c) · λ a · T / 2

where λ f is the storage system failure rate, (1-c) is the uncoverage factor of the storage system failures (the fraction of failures that result in data loss), and T is the time between two backups.

E. Data Loss Ratio

The data loss ratio is defined as the percentage of transactions lost over the total number of transactions processed. Since the total number of transactions processed per hour is A d · λ a, the data loss ratio θ is given by:

θ = γ / (A d · λ a) = λ f · (1-c) · T / 2

VI. NUMERICAL RESULTS

A. Policies Comparison

In this section, we compare the output measures for the different backup policies represented by SRNs, solved numerically with the software package SPNP [22]. The input parameter values are shown in Table 8. For this initial case study, the parameter values are defined arbitrarily yet reasonably, in consideration of practical storage system configurations and operations; the values vary from system to system depending on the type of storage, applications, and administrative organization. For the differential backup in policy p 5 (one full backup and six partial backups), the amount of data changed since the last full backup increases every day during the week. Hence, taking policy p 5 as an example, the partial backup rate and data restore rate are set to different values on different days of the week, as shown in Table 9. Similarly, different partial backup rates and data restore rates are assigned for policy p 4.
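The data loss formulas above can be checked numerically. The sketch below plugs in the Table 8 values (1-year MTTF, coverage 0.95, 1000 transactions/h) for the daily full backup policy p 3; the availability value for p 3 is taken from Table 10.

```python
# Numerical check of Section V.D-E:
#   gamma = A_d * lambda_f * (1-c) * lambda_a * T / 2   (data loss rate)
#   theta = lambda_f * (1-c) * T / 2                    (data loss ratio)
lam_a = 1000.0        # transaction arrival rate, 1/h
lam_f = 1 / 8760.0    # storage failure rate, 1/h (MTTF = 1 year)
c = 0.95              # coverage factor of storage failures
T = 24.0              # backup interval for daily full backup (policy p3)
A_d = 0.76898905      # system data availability for p3 (Table 10)

theta = lam_f * (1 - c) * T / 2   # data loss ratio
gamma = A_d * lam_a * theta       # data loss rate, transactions/h
print(theta)   # ~6.8493e-05, matching the p3 row of Table 10
print(gamma)   # ~0.05267 transactions/h
```

The same two-line computation with T = 168 h (weekly) or T = 24 h reproduces the data loss ratio column of Table 10 for the pure full backup policies.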
Both the average partial backup time and the average data restore time are assumed to increase by 10 minutes every day within a week. For the user-perceived data availability computation, the non-service period is set to seven hours immediately after the backup starts. Table 10 shows the numerical results for the different backup policies, where F refers to full backup and P refers to partial (i.e., differential) backup. For policies p 1, p 2 and p 3, we observe that if full backups are performed more frequently, the system data availability and user-perceived data availability decrease. However, the frequent backup strategies have the advantage of a smaller data loss rate and data loss ratio. The storage system availability is constant across backup policies p 1, p 2 and p 3 because the storage system follows exactly the same failure and recovery behavior with the same parameters (e.g., the data restore rate is the same). For backup policies p 4 and p 5, similar trends are observed: more frequent backups lead to lower data availability but less data loss. The storage system availability under policy p 5 is slightly better than under policy p 4 due to its higher data restore rate.
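The 10-minutes-per-day assumption generates the day-dependent rates of Table 9. The sketch below assumes a 2-hour partial backup and a 3-hour restore on the first day (consistent with the first row of Table 9) and adds 10 minutes (1/6 h) per day:

```python
# Day-dependent differential backup parameters (Table 9 assumption):
# partial backup and restore times grow by 10 minutes (1/6 hour) per day.
base_backup_h = 2.0    # first-day partial backup time -> rate 0.5/h
base_restore_h = 3.0   # first-day restore time -> rate ~0.3333/h

backup_rates = [round(1 / (base_backup_h + d / 6.0), 4) for d in range(6)]
restore_rates = [round(1 / (base_restore_h + d / 6.0), 4) for d in range(6)]
print(backup_rates)    # [0.5, 0.4615, 0.4286, 0.4, 0.375, 0.3529]
print(restore_rates)   # [0.3333, 0.3158, 0.3, 0.2857, 0.2727, 0.2609]
```

These values match the two rate columns of Table 9 up to rounding of the last digit.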

Table 8. Input parameter values

Parameter : Symbol or transitions : Mean : Rate [1/h]
Transaction arrival rate : λ a : 3.6 sec : 1000
Storage failure rate : λ f : 1 year : 0.000114155
Coverage of storage failure : c : 95% : 0.95
Storage failure detection rate : T detect, T detect2 : 5 minutes : 12
Service person arrival rate : T arrival : 30 minutes : 2
Storage restart rate : T restart, T restart2 : 2 minutes : 30
Full backup start rate : T out_sfbkp : 1 sec : 3600
Full backup rate : T FBkp : 6 hours : 0.16666667
Full backup failure rate : T FBFail : 6 months : 0.000231481
Full backup recovery rate : T FBRec : 5 minutes : 12
Start-up storage rate : T STRstart, T STRstart2 : 1 minute : 60
Partial backup start rate : T out_spbkp : 1 sec : 3600
Partial backup failure rate : T PBFail : 3 months : 0.000462963
Partial backup recovery rate : T PBRec : 5 minutes : 12
Check status : T CKsta, T CKsta2, T CKsta3, T CKsta4 : 1 sec : 3600
Decision node : T decn, T decn2, T decn3, T decn4, T decn5 : 1 sec : 3600
Server alert : T SDalt, T SDalt2, T SDalt3, T SDalt4 : 1 sec : 3600
Full backup interval : T FBclock : 1 week : 0.00595238
Partial backup interval : T PBclock : 24 hrs : 0.04166667
Control start full backup : T in_csfbkp : 1 sec : 3600
Control start partial backup : T in_cspbkp : 1 sec : 3600

Table 9. Parameters for differential backups of policy p 5

Marking : Partial backup rate (T PBkp) : Data restore rate (T restore)
#(P days)==5 : 0.5 : 0.33333
#(P days)==4 : 0.4615 : 0.3157
#(P days)==3 : 0.4286 : 0.3
#(P days)==2 : 0.4 : 0.2857
#(P days)==1 : 0.375 : 0.2727
#(P days)==0 : 0.3529 : 0.2608

For policy p 3 and policy p 5, which have the same time period between two backups, the system data availability and user-perceived data availability are higher for policy p 5 because a partial backup takes less time than a full backup. The storage system availability is slightly higher for policy p 5 since data restore consumes less time than in the full backup case. In addition, the data loss ratio is proportional to the time between two backups; hence, policy p 3 and policy p 5 have exactly the same data loss ratio.

Table 10.
Policies comparison

Policy : System Data Availability : User-perceived Data Availability : Storage System Availability : Data Loss Rate (#transactions/hour) : Data Loss Ratio
p 1 (1 F per month) : 0.9960503 : 0.996875 : 0.99996944 : 2.03754025 : 0.002123283
p 2 (1 F per week) : 0.96478576 : 0.986036943 : 0.99996944 : 0.462276382 : 0.00047945
p 3 (1 F per day) : 0.76898905 : 0.889302759 : 0.99996944 : 0.052670365 : 0.000068493
p 4 (1 F + 3 P per week) : 0.9279523 : 0.97653686 : 0.99996256 : 0.27096 : 0.000119863
p 5 (1 F + 6 P per week) : 0.91104660 : 0.969969395 : 0.999963206 : 0.06240035 : 0.000068493

The data loss rate is higher for policy p 5 than for policy p 3 because its system data availability is larger (i.e., more transactions are processed under policy p 5). In summary, policy p 5 is considered the best option among these five backup policies since it ensures relatively high data availability and small data loss.

B. Sensitivity Analysis

In practice, detecting the bottlenecks of a system is critical to effectively improving system availability. To find availability bottlenecks based on the models, sensitivity analysis is conducted with respect to various model parameters for policy p 5. The model parameters chosen for the sensitivity analysis are listed in Table 11. System data availability A d and storage system availability A s are evaluated in the sensitivity analysis, as shown in Fig. 13 and Fig. 14, respectively. A positive sensitivity indicates that an increase in a parameter value leads to an increase in the output measure, whereas a negative sensitivity indicates that an increase in a parameter results in a decrease in the output measure. As observed in Fig. 13, 12 parameters have positive sensitivity on A d and the other 6 have negative sensitivity. In addition, the full backup rate has the most significant positive sensitivity and the uncovered storage failure rate has the most significant negative sensitivity on A d. The other parameters have little or no influence on A d.
Therefore, to improve A d, the most efficient way is to increase the full backup rate and decrease the uncovered storage failure rate. Fig. 14 shows the results for A s, where 6 model parameters have positive sensitivity and the other 12 have negative sensitivity. The service person arrival rate, the storage failure detection rate, and the storage restart rate are the top three significant positive factors. The uncovered storage failure rate has the most significant negative sensitivity. The other parameters have little or no influence on A s. Therefore, in order to improve A s, the most efficient method is to increase the service person arrival rate and decrease the uncovered storage failure rate.

Table 11. Model parameters for sensitivity analysis

Parameter ID : Parameter : Assigned transitions
1 : Uncovered storage failure rate : T dataloss
2 : Covered storage failure rate : T pfail
3 : Storage failure detection rate : T detect, T detect2
4 : Service person arrival rate : T arrival
5 : Storage restart rate : T restart, T restart2
6 : Full backup start rate : T out_sfbkp
7 : Full backup failure rate : T FBFail
8 : Full backup recovery rate : T FBRec
9 : Full backup rate : T FBkp
10 : Partial backup start rate : T out_spbkp
11 : Partial backup failure rate : T PBFail
12 : Partial backup recovery rate : T PBRec
13 : Start-up storage rate : T STRstart, T STRstart2
14 : Check status : T CKsta, T CKsta3
15 : Decision node : T decn, T decn3, T decn4
16 : Server alert : T SDalt, T SDalt2
17 : Control start full backup : T in_csfbkp
18 : Control start partial backup : T in_cspbkp
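A parametric sensitivity of this kind can be sketched with finite differences. The example below is an assumption-laden stand-in: it uses the closed-form availability A = μ/(λ+μ) of a simple failure/repair pair instead of the full SRN solution, but illustrates how the sign of the sensitivity identifies improvement directions.

```python
# Finite-difference sensitivity sketch (Section VI.B style). The closed-form
# A = mu / (lam + mu) stands in for the SRN-computed availability; the paper
# computes these sensitivities on the full SRN model instead.
def availability(lam, mu):
    return mu / (lam + mu)

def sensitivity(f, x, eps=1e-8):
    # Central finite difference: dA/dx around the nominal value x.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

lam, mu = 1 / 8760.0, 0.25   # failure and repair rates, 1/h (Table 8 style)
s_lam = sensitivity(lambda x: availability(x, mu), lam)  # w.r.t. failure rate
s_mu = sensitivity(lambda x: availability(lam, x), mu)   # w.r.t. repair rate
print(s_lam < 0)   # True: a higher failure rate lowers availability
print(s_mu > 0)    # True: a higher repair rate raises availability
```

The signs reproduce the qualitative pattern of Fig. 13 and Fig. 14: failure-type parameters carry negative sensitivity, recovery-type parameters positive sensitivity.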

Figure 13. Sensitivity analysis for system data availability A d

Figure 14. Sensitivity analysis for storage system availability A s

VII. CONCLUSION & FUTURE WORK

In this paper, we have studied the availability and performance of different backup policies for a storage system, using semi-automatic model translation from the SysML models representing the backup procedures. The numerical results provide insights for system engineers to choose the best backup policy, according to their needs, under the tradeoff between availability and the overheads of the backup operations. In addition, sensitivity analysis with respect to the model parameters was conducted to guide decisions for improving system data availability and storage system availability. As future work, we plan to extend the current model in two main directions: i) incorporate other backup strategies, such as online backup or partial incremental backup, for further policy comparisons, and ii) develop a general clock model in SRN to help system engineers determine the best time to conduct the backup process (full and partial). The implementation of the proposed framework will be achieved in the NEC modeling environment CASSI (Computer Aided System model-based System Integration environment) [25].

REFERENCES

[1] A. Chervenak, V. Vellanki, and Z. Kurmas, "Protecting file systems: A survey of backup techniques," in Proc. Joint NASA and IEEE Mass Storage Conference, 1998.
[2] F. Machida, E. Andrade, D. Kim, and K.
Trivedi, "Candy: Component-based availability modeling framework for cloud service management using SysML," in Proc. Int. Symp. on Reliable Distributed Systems (SRDS), 2011.
[3] E. Andrade, F. Machida, D. Kim, and K. Trivedi, "Modeling and analyzing server system with rejuvenation through SysML and stochastic reward nets," in Proc. 6th Int. Conf. on Availability, Reliability and Security (ARES), 2011.
[4] OMG Systems Modeling Language (OMG SysML), Version 1.2, http://www.omg.org/spec/sysml/1.2
[5] EMC Backup Advisor, http://www.emc.com/products/detail/software/backupadvisor.htm
[6] NEC HYDRAstor, http://www.necam.com/hydrastor/
[7] CA ARCserve, http://www.arcserve.com/us/default.aspx
[8] Symantec Backup Exec, http://www.symantec.com/backup-exec
[9] L. Cherkasova, A. Zhang, and X. Li, "DP+IP = design of efficient backup scheduling," in Proc. Int. Conf. on Network and Service Management (CNSM), pp. 8-25, 2010.
[10] S. Nakamura, K. Nakayama, and T. Nakagawa, "Optimal backup interval of database by incremental backup method," in Proc. Int. Conf. on Industrial Engineering and Engineering Management, pp. 218-222, 2009.
[11] D. Geer, "Reducing the storage burden via data de-duplication," Computer, vol. 41, no. 12, pp. 15-17, Dec. 2008.
[12] E. Rozier, W. Sanders, P. Zhou, N. Mandagere, S. Uttamchandani, and M. Yakushev, "Modeling the fault tolerance consequences of deduplication," in Proc. Int. Symp. on Reliable Distributed Systems, 2011.
[13] K. Renuga, S. Tan, Y. Zhu, T. Low, and Y. Wang, "Balanced and efficient data placement and replication strategy for distributed backup storage systems," in Proc. Int. Conf. on Computational Science and Engineering (CSE '09), pp. 87-94, 2009.
[14] H. Wang, K. Zhou, and L. Yuan, "Fault-tolerant online service: Formal modeling and reasoning," in Proc. Int. Conf. on Networking, Architecture, and Storage (NAS), pp. 452-460, 2009.
[15] R. Burns and D. Long, "Efficient distributed backup with delta compression," in Proc. Workshop on I/O in Parallel and Distributed Systems, pp. 26-36, 1997.
[16] K.
Keeton, C. Santos, D. Beyer, J. Chase, and J. Wilkes, "Designing for disasters," in Proc. Conf. on File and Storage Technologies (FAST '04), 2004.
[17] J. P. López-Grao, J. Merseguer, and J. Campos, "From UML activity diagrams to stochastic Petri nets," in Proc. 4th Int. Workshop on Software and Performance (WOSP), pp. 25-36, 2004.
[18] S. Distefano, M. Scarpa, and A. Puliafito, "From UML to Petri nets: The PCM-based methodology," IEEE Trans. on Software Engineering, Jan. 2010.
[19] A. Bondavalli, I. Majzik, and I. Mura, "Automated dependability analysis of UML designs," in Proc. 2nd Int. Symp. on Object-Oriented Real-Time Distributed Computing (ISORC), 1999.
[20] G. J. Pai and J. Dugan, "Automatic synthesis of dynamic fault trees from UML system models," in Proc. 13th Int. Symp. on Software Reliability Engineering (ISSRE), 2002.
[21] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, John Wiley, New York, 2001.
[22] K. S. Trivedi and R. Sahner, "SHARPE at the age of twenty two," SIGMETRICS Perform. Eval. Rev., vol. 36, no. 4, 2009.
[23] G. Ciardo, A. Blakemore, P. Chimento, J. Muppala, and K. Trivedi, "Automated generation and analysis of Markov reward models using stochastic reward nets," in C. Meyer and R. Plemmons (Eds.), Linear Algebra, Markov Chains and Queuing Models, vol. 48, Springer, 1993.
[24] OMG Unified Modeling Language (OMG UML), Superstructure version 2.3, http://www.omg.org/spec/uml/2.3
[25] S. Izukura et al., "Applying a model-based approach to IT systems development using SysML extension," in Proc. Int. Conf. on Model Driven Engineering Languages and Systems (MODELS), pp. 563-577, 2011.