Virtual Full Replication for Scalable Distributed Real-Time Databases


Virtual Full Replication for Scalable Distributed Real-Time Databases

Thesis Proposal
Technical Report HS-IKI-TR
Gunnar Mathiason
University of Skövde
June,

Abstract

Distributed real-time systems increase in size and complexity, and the nodes in such systems become difficult to implement and test. In particular, communication for synchronization of shared information in groups of nodes becomes complex to manage. Several authors have proposed using a distributed database as a communication subsystem, to off-load database applications from explicit communication. This delegates the task of information dissemination to the replication mechanisms of the database. With increasingly larger systems, however, there is a need to manage the scalability of such a database approach. Furthermore, timeliness for database clients requires predictable resource usage, and scalability requires bounded resource usage in the database system. Thus, predictable resource management is an essential function for realizing timeliness in a large scale setting. We discuss scalability problems and methods for distributed real-time databases in the context of the DeeDS database prototype. Here, all transactions can be executed in a timely manner at the local node due to main memory residence, full replication, and detached replication of updates. Full replication contributes to timeliness and availability, but has a high cost in excessive usage of bandwidth, storage, and processing, since all updates are sent to all nodes regardless of whether they will be used there. In particular, unbounded resource usage is an obstacle for building large scale distributed databases. For many application scenarios it can be assumed that most of the database is shared by only a limited number of nodes. Under this assumption it is reasonable to believe that the degree of replication can be bounded, so that a bound can also be set on resource usage. This thesis proposal identifies and elaborates research problems for bounding resource usage in large scale distributed real-time databases. 
One objective is to bound resource usage by taking advantage of pre-specified data needs, but also by detecting unspecified data needs and adapting resource management accordingly. We elaborate and evaluate the concept of virtual full replication, which provides an image of a fully replicated database to database clients. It makes data objects available where needed, while fulfilling timeliness and consistency requirements on the data. In the first part of our work, virtual full replication makes data available where needed by taking advantage of pre-specified data accesses to the distributed database. For hard real-time systems, the required data accesses are usually known, since such systems need to be well specified to guarantee timeliness. However, there are many applications where a specification of data accesses cannot be done before execution. The second part of our work extends virtual full replication to such applications. By detecting

new and changed data accesses during execution and adapting database replication, virtual full replication can continuously provide the image of full replication while preserving scalability. One of the objectives of the thesis work is to quantify scalability in the database context, so that actual benefits and achievements can be evaluated. Further, we determine the conditions for setting bounds on resource usage for scalability, under both static and dynamic data requirements.

Contents

1 Introduction
  1.1 Document layout
2 Background
  2.1 A distributed real-time database architecture
  2.2 The concept of scalability
  2.3 Scalability in a fully replicated real-time main memory database
  2.4 Virtual full replication
    2.4.1 Basic notation
    2.4.2 Definition
  2.5 Incremental recovery
3 Problem formulation
  3.1 Motivation
  3.2 Problem statement
  3.3 Problem decomposition
  3.4 Aims
  3.5 Objectives
4 Methodology
  4.1 Virtual full replication for scalability
  4.2 Static segmentation
    4.2.1 Static segmentation algorithm
    4.2.2 Properties, dependencies and rules
    4.2.3 Analysis model and assumptions
    4.2.4 Analysis
    4.2.5 Implementation
    4.2.6 Discussion
  4.3 Simulation for scalability evaluation
    4.3.1 Motivation for a simulation study
    4.3.2 Simulation objectives
    4.3.3 Validation of software simulations
    4.3.4 Modeling detail
    4.3.5 Approaches for validity and credibility
    4.3.6 Experimental process and experiment design
    4.3.7 Simulator implementation

    4.3.8 Experiment execution
    4.3.9 Discussion and results
  4.4 Adaptive segmentation
    4.4.1 Adaptive segmentation with pre-fetch
    4.4.2 A framework of properties
    4.4.3 Case study and implementation
5 Related work
  5.1 Strict consistency replication
  5.2 Replication with relaxed consistency
  5.3 Partial replication
  5.4 Asynchronous replication as a scalability approach
  5.5 Adaptive replication and local replicas
  5.6 Usage of data properties
6 Conclusions
  6.1 Summary
  6.2 Initial results
  6.3 Expected contributions
  6.4 Expected future work
A Project plan and publications
  A.1 Time plan and milestones
  A.2 Publications
    A.2.1 Current
    A.2.2 Planned

List of papers

This Thesis Proposal is based on the work in the following papers.

G. Mathiason and S. F. Andler. Virtual full replication: Achieving scalability in distributed real-time main-memory systems. In Proc. of the Work-in-Progress Session of the 15th Euromicro Conf. on Real-Time Systems, pages 33-36, July. (ISBN ) [52]

G. Mathiason, S. F. Andler, and D. Jagszent. Virtual full replication by static segmentation for multiple properties of data objects. In Proceedings of Real-time in Sweden (RTiS 05), pages 11-18, Aug. (ISBN , ISSN ) [53]

G. Mathiason. A simulation approach for evaluating scalability of a virtually fully replicated real-time database. Technical Report HS-IKI-TR , University of Skövde, Sweden, Mar. [50]

1 Introduction

In a distributed database, availability of data can be improved by allocating data at the local nodes where it is used. For real-time databases, transaction timeliness is a major concern. Data can be allocated to the local node to avoid remote data access, with its risk of unpredictable delays on the network. Further, multiple replicas of the same data at different nodes provide data redundancy that can be used for recovery from failures, avoiding corruption or loss of data when nodes fail. In a fully replicated database the entire database is available at all nodes. This offers full local availability of all data objects at each node. The DeeDS [6] database prototype stores the fully replicated database in main memory to make transaction timeliness independent of disk accesses. Full replication of data together with detached replication of updates allows transactions to execute entirely on the local node, independent of network delays. Detached replication sends updates to other nodes independently of the execution of the transaction. Database clients get the perception of a single local database and need not consider specific data locations or how to synchronize concurrent updates of different replicas of the same data. Full replication uses excessive system resources, since the system must replicate all updates to all the nodes in such a system. This causes a scalability problem for resource usage, such as bandwidth for replication of updates, storage for data replicas, and processing for replicating updates and resolving conflicts between concurrent updates of the same data object at different nodes. With existing approaches, real-time databases do not scale well, while the need for increasingly larger real-time databases grows. With this Thesis we aim to show how to bound resource usage in a distributed real-time database such as DeeDS, by using virtual full replication to make it scalable. 
We also quantify scalability for the domain, for an evaluation of the benefits. We elaborate virtual full replication [6] as an approach to manage resources for scalability: to replicate and store updates only for those data objects that are used at a node, and to maintain scalability over time. This avoids irrelevant replication and maintains real-time properties, while providing an image of full replication for local availability of data. Virtual full replication reduces resource usage to meet the actual need, rather than using resources to replicate all updates blindly to all nodes, resulting in a scalable distributed real-time database. Virtual full replication uses knowledge about the needs of the application, and uses application requirements for properties of the data, such as location, consistency model, and storage media, to support timeliness requirements of the clients in a scalable system. Furthermore, properties have relations that can be used to control

resource usage in more detail. This Thesis Proposal claims that a replicated database with virtual full replication and eventual consistency enables large scale distributed real-time databases, where each database client has an image of full availability of a local database.

1.1 Document layout

Section 2 contains a background on distributed real-time databases, and in particular how full replication may support timeliness. A generic approach for scalability evaluation is introduced, and we indicate how such an approach can be used for a scalability evaluation of a distributed database. In section 3 the problem of scalability in a fully replicated database is detailed: how we decompose the problem, and how it can be defined in terms of assessable objectives. Section 4 describes our methodology in detail, for how to reach the objectives. Some of the steps in the methodology have been performed already and their results are presented in conjunction, while other steps remain to be done. Section 5 describes related work, both for distributed real-time databases and for other areas that relate to our approach. Finally, section 6 summarizes the conclusions in this proposal, and highlights the expected contributions and expected consequences for subsequent future work.

2 Background

2.1 A distributed real-time database architecture

The main property of real-time systems is timeliness, which can only be achieved with predictable resource usage and with sufficiently efficient execution. Predictability is essential: for a hard real-time system the consequence of a missed deadline may be fatal, while for soft real-time systems a missed deadline lowers the value of the provided service. Thus, for real-time systems, predictable resource usage is the primary design concern that enables timeliness. To improve predictability and efficiency, the database of the distributed real-time database system DeeDS [6] resides entirely in main memory, removing dependence on disk I/O delays caused by unpredictable access times for hard drives. Also, accesses to main memory are many times faster. To further improve predictability of database transactions, the database is fully replicated to all nodes, to make transaction execution independent of network delays or network partitioning. With full replication there is no need for remote data access during transactions. Replication also improves fault tolerance, since there are redundant copies of the data. Full replication allows transactions to have all of their operations running at the local node. A fully replicated database with detached replication [32], where replication is done after transaction commit, allows independent updates [18], that is, concurrent and unsynchronized updates of replicas of the same data objects. Independent updates may cause database replicas to become inconsistent, and inconsistencies must be resolved in the replication process by a conflict detection and resolution mechanism. In DeeDS, updates are replicated to all nodes detached from transaction execution, by propagation from the node executing an updating transaction after transaction commit, and integration of replicated updates at all the other nodes. 
Conflicting updates that result from independent updates are resolved at integration time. Temporary inconsistencies are allowed and guaranteed to be resolved at some point in time, giving the database the property of eventual consistency. Applications that use eventually consistent databases need to be tolerant of temporarily inconsistent replicas, and many distributed applications are. Further, in an eventually consistent database that supports a bounded time for such temporary inconsistencies (which can be achieved by using bounded time replication), applications that have requirements on timely replication and consistency can use a temporarily inconsistent database as well. In a fully replicated database using detached replication, a number of predictability problems associated with synchronization of concurrent updates at different nodes can be avoided, such as agreement protocols or distributed locking of replicas of objects,

and reliance on stable communication to access data. Also, application programming is easier, since the application programmer may assume that the entire database is available and that the application program has exclusive access to it.

2.2 The concept of scalability

Scalability is a concept that intuitively may appear obvious. It is used in many areas of research, but the availability of a generic definition and a theoretical framework with metrics is limited. A system is scalable if the growth function for the required amount of resources, req(p), does not exceed the function for the available amount of resources, res(p), when the system is scaled up for some system parameter p (also called the scale factor). The resource usage may follow some function of the scale factor, and the upper bound for this function, O(req(p)), must not exceed the function of available resources. For a system with linear scalability this relation must be valid for all sizes of p, while for other systems scalability may be related to only certain sizes of p. In a few research areas, scalability concepts are well developed and related metrics for scalability are available, namely in the areas of parallel computing systems [86] [57], in particular for resource management [55], and for shared virtual memory [74], for design of system architectures for distributed systems [16], and for network resource management [3]. Frölund and Garg define a list of generic terms for scalability analysis in distributed application design [29]:

Scalability: A distributed (software) design D is scalable if its performance model predicts that there are possible deployment and implementation configurations of D that would meet the Quality of Service (QoS) expected by the end user, within the scalability tolerance, over a range of scale factor variations, within the scalability limits.

Scale factor: A variable that captures a dimension of the input vector that defines the usage of a system. 
Scaling enabler: Entities of design, implementation or deployment that can be changed to enable scalability of the design.

Further, the authors also define Scalability point, Scalability tolerance, Scalability limits and Scaling parameters, which we exclude here. We can map the terms above into our context. QoS can be seen as the timeliness of local transactions (deadline miss ratio) and the level of consistency (consistency model properties and the bound on replication delay). Scale factors may be the database size, the number of nodes, the ratio and frequency of update transactions, or others.
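The scalability condition introduced above, that req(p) must not exceed res(p) over a range of the scale factor, can be sketched as a simple check. This is our own illustration; the growth functions below are hypothetical placeholders, not measurements from DeeDS.

```python
def is_scalable(req, res, scale_range):
    """True if required resources stay within available resources
    for every value of the scale factor p in scale_range."""
    return all(req(p) <= res(p) for p in scale_range)

# Hypothetical growth functions for one resource (e.g. bandwidth).
linear_req = lambda p: 3 * p         # requirement grows linearly with p
capacity = lambda p: 5 * p + 10      # available capacity also grows linearly
print(is_scalable(linear_req, capacity, range(1, 1000)))     # True

# An all-to-all update pattern, as in full replication, tends to grow
# quadratically with the number of nodes and eventually exceeds a
# linearly growing capacity.
quadratic_req = lambda p: p * p
print(is_scalable(quadratic_req, capacity, range(1, 1000)))  # False
```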

Jogalekar and Woodside [40] have presented a generic metric for scalability in distributed systems based on productivity, and a system is regarded as scalable if productivity is maintained as the scale changes. Given the quantities:

λ(k) = throughput in responses/sec, at scale k
f(k) = average value of each response, calculated from its quality of service at scale k
C(k) = cost at scale k, expressed as running cost per second to be uniform with λ

The value function f(k) includes appropriate system measures, such as response delay, availability, timeouts or probability of data loss. The productivity F(k) is the value delivered per second, divided by the cost per second:

F(k) = λ(k) f(k) / C(k)    (1)

The scalability metric ψ(k1, k2) relates the productivity at two different scales, such that

ψ(k1, k2) = F(k2) / F(k1)    (2)

With ψ(kx, ky), design alternatives x and y can be compared, such that a higher ratio indicates better scalability for ky. It is not possible to compare scalability between different systems, since the ratio is based on system-specific value functions, but scalability can be evaluated for finding alternative scaling enablers and settings for scale factors for a particular system. The authors use the scalability metric for evaluating actual systems, and the metric has also been used by other authors with other value functions [17] [42]. Similarly, we need to define an appropriate value function for a distributed real-time database, which would be connected to the real-time properties of the system. Typically this would include the timeliness of transactions for different classes of transactions. The cost function would relate to the amount of resources used, in terms of bandwidth, storage and processing time. 
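Equations (1) and (2) can be computed directly. In the sketch below, the throughput, value, and cost figures are hypothetical, chosen only to illustrate the arithmetic of the metric.

```python
def productivity(throughput, value, cost):
    """F(k) = lambda(k) * f(k) / C(k): value delivered per unit cost."""
    return throughput * value / cost

def scalability(F_k1, F_k2):
    """psi(k1, k2) = F(k2) / F(k1); a ratio at or above 1 means
    productivity is maintained at the larger scale."""
    return F_k2 / F_k1

# Hypothetical measurements at scales k1 (10 nodes) and k2 (100 nodes).
F_k1 = productivity(throughput=200.0, value=0.9, cost=10.0)    # 18.0
F_k2 = productivity(throughput=1800.0, value=0.8, cost=90.0)   # 16.0
print(round(scalability(F_k1, F_k2), 3))  # 0.889: productivity drops slightly
```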
2.3 Scalability in a fully replicated real-time main memory database

For a fully replicated database with immediate consistency, updates are replicated to all data replicas during the execution of the updating transaction, using a distributed agreement algorithm to update all replicas at one instant, so that there is no state in which replicas can differ during the update. Such updates must lock a majority of replicas of the

updated object during the transaction. With detached replication, transaction timeliness does not depend on the timeliness of locking replicas, on network delays, or on the waiting time for release of locks set by transactions at other nodes. Thus, detached replication is a scaling enabler for such a database. However, a fully replicated database with detached replication has another scalability problem in that all updates need to be sent to all other nodes, regardless of whether the data will be used there or not. Full replication also requires that replicas of all data objects be stored at all the nodes, independent of whether the data will ever be used there. Also, updates in fully replicated databases must be integrated at all nodes, requiring integration processing of updates at all nodes. Under the assumption that replicas of data objects will only be used at a bounded subset of the nodes, the required degree of replication becomes lower than in a fully replicated database. The resource usage of bandwidth, storage and processing depends on the degree of replication, and these resources are wasted in a fully replicated database, compared to a database with a lower degree of replication. With replication of only those data objects that are actually used, resources can be saved and thereby scalability can be improved.

2.4 Virtual full replication

With virtual full replication [6] the database clients have an image of a fully replicated database. The database system manages knowledge of what is needed for database clients to perceive such an image. This knowledge includes a specification of the data accesses required by database clients. Virtual full replication uses this knowledge to replicate data objects to the subset of the nodes where the data is needed, and also to replicate with the least resource usage needed to maintain the image. 
With virtual full replication, several consistency models for data may coexist in the same database, and the knowledge is also used to ensure the consistency model for each data object. Scalable usage of resources depends on the number of replicas to update. A virtually fully replicated database has fewer replicas, so resource consumption is lower, while in a fully replicated database there are as many replicas as there are nodes. The overall degree of replication is lower and replication processing that serves no purpose is avoided. Only those data objects that are used at a node are replicated there, which reduces resource usage of bandwidth, storage and processing. With such resource management, scalability is improved without changing the application's assumption of having a complete database replica available at the local node. However, virtual full replication that considers only specified data accesses does not make data available for accesses to arbitrary data at an arbitrarily selected node, which

makes such a system less flexible than a fully replicated database. For unspecified data accesses and for changes in data requirements, virtual full replication adapts by detecting changes and reconfiguring replica allocation and replication. This preserves scalability by managing the degree of replication over time. Segmentation of the database is an approach for limiting the degree of replication in a virtually fully replicated database. A segment is a subset of the objects in the database, and each segment has an individual degree of replication over the nodes. The degree of replication is a result of allocating segments only to the nodes where their data objects are accessed. This is typically far fewer nodes than in a fully replicated database. Also, a database may have multiple segmentations, each for a different purpose. Segmenting for data availability strives to minimize the degree of replication, while a segmentation for consistency determines the method for replicating updates in the database.

2.4.1 Basic notation

In this thesis proposal the following notation is used. A database maintains a finite set of logical data objects O = {o0, o1, ...}, representing database values. Object replicas are physical manifestations of logical objects. A distributed database is stored at a finite set of nodes N = {N0, N1, ...}. A replicated database contains a set of object replicas R = {r0, r1, ...}. The function R : O × N → R identifies the replica r ∈ R of a logical object o ∈ O on a node N ∈ N, if such a replica exists. R(o, N) = r if r is the replica of o on node N. If no such replica exists, R(o, N) = null. node(r) is the node where replica r is located, i.e. node(R(o, N)) = N. object(r) is the logical object that r represents, i.e. object(R(o, N)) = o. 
A distributed global database (or simply database) D is a tuple <O, R, N>, where O is the set of objects in D, R is the set of replicas of objects in O, and N is the set of nodes, such that each node N ∈ N hosts at least one replica in R, i.e. N = {N | ∃r ∈ R : node(r) = N}. We model transaction programs, T, using four parameters, including the set of objects read by the transaction, READ_T (the read set), the set of objects written by the transaction, WRITE_T (the write set), and the conflict set, CONFLICT_T, the set of objects that conflict with updates at other nodes. The transaction program T can thus be defined as T = {READ_T, WRITE_T, CONFLICT_T}. Also, we refer to the size of the read set as r_T = |READ_T|, the size of the write set as w_T = |WRITE_T|, and the size of the conflict set as c_T = |CONFLICT_T|. Also, the working set WS_T is the union of the read and write sets of the transaction program, WS_T = READ_T ∪ WRITE_T.
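As an illustration, the notation above can be rendered as a small data model. This is our own sketch; the names are ours and are not taken from the DeeDS implementation.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Replica:
    obj: str   # object(r): the logical object this replica represents
    node: str  # node(r): the node hosting this replica

@dataclass
class Database:
    objects: set                                 # O, logical objects
    nodes: set                                   # N, nodes
    replicas: set = field(default_factory=set)   # R, object replicas

    def R(self, o, n):
        """R(o, N): the replica of logical object o on node n,
        or None if no such replica exists."""
        for r in self.replicas:
            if r.obj == o and r.node == n:
                return r
        return None

db = Database(objects={"o0", "o1"}, nodes={"N0", "N1"})
db.replicas.add(Replica(obj="o0", node="N0"))
print(db.R("o0", "N0") is not None)  # True: o0 is replicated on N0
print(db.R("o1", "N0") is None)      # True: o1 has no replica on N0
```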

A transaction instance T_j of a transaction program executes at a given node n with a certain maximal frequency f_j. We define such a transaction instance by a tuple T_j = <f_j, n, T>. When the node for execution is implicit and we only need the sizes of the read, write, and conflict sets, we simplify the notation to T_j = <f_j, r_j, w_j, c_j>. Further, node(T_j) is the node where transaction T_j is executed, and begin(T_j) is the time at which transaction T_j began its execution. commit(T_j) is the commit time of T_j and abort(T_j) is the abort time of T_j (if T_j is aborted, commit(T_j) is undefined; if T_j is committed, abort(T_j) is undefined). end(T_j) is the completion time of T_j.

2.4.2 Definition

Virtual full replication ensures that for each transaction that reads or writes a database object at a node, there exists a replica of the object at that node. Formally,

∀o ∈ O, ∀T : (o ∈ READ_T ∪ WRITE_T → ∃r ∈ R (r = R(o, node(T))))

2.5 Incremental recovery

With incremental recovery, a node in a fully replicated distributed database can be recovered into a consistent database replica without any of the other working replicas needing to be stopped or even locked [45]. Incremental recovery has been proven to give a consistent database copy at the recovered node. In a main memory database, data is stored in memory pages, the smallest memory entity that is accessed during read or write operations. It is assumed that memory management circuitry ensures that a read or write operation has exclusive access to a certain memory page during the operation. Fuzzy checkpointing uses this mechanism for sequentially copying all memory pages at a node (the recovery source), and such a copy can be sent over the network to recover a failed node (the recovery target). Selecting a recovery source is done by a negotiation process, to select the most appropriate source node. 
Each page that is copied is logged at the recovery source, and pages in the log that are updated after the memory page was sent to the target node need to be sent again. For such updates, updated memory pages are forwarded to the recovery target as long as fuzzy checkpointing proceeds. Once the entire database has been copied and all subsequent updates have been forwarded, the fuzzy checkpoint finishes atomically with a consistent replica at the recovered node.
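The copy-and-forward behavior of fuzzy checkpointing can be sketched as below. This is a deliberate simplification under our own naming: the real mechanism relies on memory-management hardware, runs concurrently with updates, and uses a negotiation protocol for source selection, none of which is modeled here.

```python
def fuzzy_checkpoint(pages, send, updated_during_copy):
    """Copy every page to the recovery target, then re-send each page
    that was updated after its copy was sent. Returns the number of
    page transmissions performed."""
    sent_log = []                 # pages already sent, in copy order
    for page_id in pages:         # sequential copy of all memory pages
        send(page_id)
        sent_log.append(page_id)
    # Pages updated after being copied must be forwarded again.
    forwarded = [p for p in updated_during_copy if p in sent_log]
    for page_id in forwarded:
        send(page_id)
    # The checkpoint finishes once all pending updates are forwarded,
    # leaving a consistent replica at the recovery target.
    return len(sent_log) + len(forwarded)

sent = []
total = fuzzy_checkpoint(
    pages=["p0", "p1", "p2"],
    send=sent.append,
    updated_during_copy=["p1"],   # p1 changed while the copy proceeded
)
print(total)  # 4: three initial copies plus one forwarded update
print(sent)   # ['p0', 'p1', 'p2', 'p1']
```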

3 Problem formulation

3.1 Motivation

A fully replicated distributed database scales badly, since it replicates all updates to all nodes, using excess resources. Scalability is an increasingly important issue for distributed real-time databases, since the number of nodes, the database size, the number of users and the workload involved in typical distributed database applications are increasing. For many such systems it can be assumed that only a fraction of the replicated data is used at the local node, which is motivated by hot-spot and locality behavior of accesses in distributed databases. This thesis argues that a fully replicated distributed main memory real-time database can be made scalable by effective resource management using virtual full replication, and that degrees of scalability can be quantified by metrics. Different scale factors and scaling enablers, which influence resource usage differently, are varied to evaluate the scalability of the database. Also, resource usage is compared to that of alternative approaches for timeliness in large scale distributed databases, such as fully replicated databases, or approaches that use partial replication and remote transactions. For many applications, we believe that a distributed real-time database can be a suitable infrastructure for communication between nodes. Publishing and exchanging data through a distributed real-time database facilitates structured storage and access, implicit consistency management between replicas, fault tolerance, and higher independence through lower coupling between applications. With a distributed database as an infrastructure there is no need to explicitly coordinate communication in a distributed application, which reduces complexity for the communicating application, in particular where the groups that communicate often change. Consider a wildfire fighting mission. In such a dynamic situation, hundreds of different actors need to coordinate and distribute information in real time. 
In such a scenario, actors and information are added and removed dynamically as the mission situation suddenly changes. A distributed database has been pointed out as a suitable infrastructure for emergency management [75], and could be used as a white-board (also called a black-board [56], in particular as a software architecture approach [30], or for loosely coupled agents [54]) for reading and publishing current information about the state of a mission, supporting situation awareness from information assembled from all the actors. Using a database, implying the usage of transactions, ensures consistency of the information and also avoids the need for specific addressing within the communication infrastructure. Ethnographic field studies [43] show that such an infrastructure supports sense-making by enabling humans to interact, and it also gives support for the chain of command and strengthens organizational awareness, which is essential for the success of the mission [58]. In such an infrastructure actors may have access to the complete information when needed, but each actor will most of the time use only parts of the information, for their local actions and for collaboration with nearby peers. By using virtual full replication, the scalability of such a distributed database can be preserved. Virtual full replication uses known properties of the data objects, the applications, and the database system to reduce resource usage by resource management based on the data needs. We have shown a database segmentation approach that considers multiple properties for data, and relations between the data properties. So far, only a few selected properties have been used, but an extended set of useful properties and their relations will need to be defined, to set up segments that meet data requirements from database applications. Since properties are related, it can be expected that groups of properties can be structured into profiles of data properties, to reduce the resulting number of segments for our approach.

3.2 Problem statement

Predictable execution of transactions is inherent to distributed real-time databases. Predictable transactions can be achieved by predictable resource usage and sufficient efficiency of execution, and by excluding sources of unpredictability from the execution of transactions. With main memory residence, access times become independent of disk access times. With full replication, the entire database becomes locally available and local accesses become independent of network delays. With detached replication, the network delays for updating replicas are separated from the execution time of the transaction. Combining main memory residence, full replication and detached replication of updates gives predictable transactions in a distributed real-time database. 
Such a database does not scale well, since all updates need to be replicated to all other nodes. By using knowledge about data needs, irrelevant replication can be avoided and the database becomes scalable, but such replication makes data available only for the data needs that are known prior to execution. Thereby, the database loses the flexibility to execute arbitrary transactions at arbitrary nodes. To maintain scalability while making data available, the virtually fully replicated database also needs to detect and adapt to unspecified data needs that arise during execution. Our initial work shows that segmentation of the database, based on a priori known data needs and database and application properties, improves scalability by scalable usage of three key resources. Therefore, reconfigurable segmentation of the database seems to be a viable extension, to be able to adapt to changed data needs and to regain the flexibility

lost by fixed segmentation. Further, the scalability of such an approach needs to be evaluated. A quantification of scalability is needed to evaluate the influence of different scale factors and alternative scaling enablers. This thesis explores, by detailed simulation, the scalability of a large scale distributed real-time database that manages resources by using virtual full replication. Such a database is also capable of adapting to changed data needs while preserving the real-time properties of the application. The hypothesis of this thesis is that such a system is scalable, and that scalability can be preserved by adaptation during execution.

3.3 Problem decomposition

The problem has several aspects and we choose to divide it into the following components:

Concept elaboration. The concept of virtual full replication has been pointed out as an approach for providing an image of full replication, while replicating only what is needed to create such an image [6]. An approach needs to be elaborated, with algorithms and an architecture that can support it. Also, the conditions need to be found under which virtual full replication will manage to provide such an image.

Formation and allocation of units of replication. Segmentation of the database, and replication of segments only to the nodes where the data is used, seems to be a useful approach for bounding resource requirements. The cost of segmentation needs to be elaborated: both the processing cost of running a segmentation algorithm, in particular during system run time, and the storage cost of the data structures that maintain segments. A part of this subproblem is to define how segments are appropriately formed and adapted, and what data properties to consider for saving resources by using segmentation. Also, architectural support needs to be identified for segment management and replication of updates.

Adaptation of database configuration. 
To maintain scalability of the database throughout execution, the virtually fully replicated database will need to adapt to changed data needs of the database clients. For this purpose, adaptation algorithms and architectures need to be developed and evaluated, as well as the types of changes to consider for adaptation.

Properties and their relations. Virtual full replication is based on knowledge about data properties and application requirements. In our present work, we have used only a few of the data properties that could be considered for segmentation and resource management. Further, we have shown that there exist relations between some of the usable properties, and such relations may be used to improve management of resources. However, we have not yet defined a full framework of properties and their relations. In such a framework it may be useful to define application profiles of related properties that match typical groups of applications, to reduce the amount of information used for resource management.

Scalability analysis and evaluation. From our initial studies we see that different scale factors influence scalability differently. So far, we have only evaluated a few scale factors, such as the number of nodes and the bound on the replication degree, and we will add others for a more extensive evaluation. The concept of virtual full replication does not put limitations on how the image of full replication is built. Therefore, alternative scaling enablers may be explored and combined for resource management, including approaches for replication of updates, segmentation that considers differentiated costs of communication links, or different propagation topologies.

3.4 Aims

Our aims with this thesis are:

A1 To bound resource usage to achieve scalability in a distributed main memory real-time database, by exploring virtual full replication as a scaling enabler.

A2 To show how scalability can be quantified for such a system.

A3 To show how properties and their relations can be used in resource management for virtual full replication.

3.5 Objectives

O1 Elaborate the concept of virtual full replication.

O2 Bound the resources used by virtual full replication for expected data requirements.

O3 Bound the resources used by virtual full replication for unexpected data requirements.

O4 Define conditions for scalability, for expected and unexpected requirements.

O5 Quantify scalability in the domain, and evaluate the scalability achieved relative to the conditions.

4 Methodology

We develop and evaluate an approach to resource management for scalability by using virtual full replication. We assess scalability using analysis, simulation and implementation, while evaluating the proposed scalability enablers. We start by elaborating the problem with full replication and presenting possible approaches to virtual full replication. As the next steps, we refine the approach by using static and adaptive segmentation, and we elaborate segmentation algorithms for multiple properties of data and multiple requirements from database applications. For an initial evaluation of segmentation, we implement static segmentation in the DeeDS database prototype. We develop a simulation for evaluating adaptive segmentation and for studying large scale systems. Adaptive segmentation is more complex to analyze, and also to implement in the actual database system, so we study scalability in an accurate simulation of the system. As a sanity check of the simulation, we replicate the experiments done with the existing DeeDS implementation. Further, the approach for adaptive segmentation is extended with pre-fetching of data for availability. Finally, we consolidate the findings about data properties and their relations, as used in our segmentation approach, into a documented framework of useful properties that can be used to improve the scalability of distributed real-time databases. Some of the steps in our methodology have already been completed, while other steps remain. Some of the steps described can optionally be excluded, which is indicated at each step.

4.1 Virtual full replication for scalability

M1: In the first research step, virtual full replication was introduced and segmentation was proposed as a means for achieving it [52]. The concept of virtual full replication was elaborated.
By using knowledge about the actual data needs, and by recognizing differences in the properties of the data, segments can be allocated only to the nodes where the data is used. This bounds resource requirements at a lower level, while maintaining the same degree of (perceived) availability of data for the application. It reduces flexibility compared to a fully replicated database, since not all objects are available at all nodes. Consequently, availability for unspecified accesses cannot be guaranteed. For hard real-time systems, access patterns and resource requirements are usually known prior to execution and the data needs can easily be specified. For such systems the reduced flexibility is not a problem, since virtual full replication then guarantees availability for all data needs. The aspect of cohesive nodes is emphasized, where some nodes share information more frequently than other nodes, or some nodes use data with the same consistency model. The paper describes work in progress and points out essential research problems in the area, such as the need for an analysis model of how system parameters influence critical resources. A first definition of scalability for a fully replicated database ("replication effort") was presented and an algorithm for segmentation was introduced, although with high complexity and a combinatorial problem when using multiple data properties. Multiple consistency models may coexist in the same database, since different segments can use different consistency models. To evaluate an approach for this, support for the coexistence of both bounded and unbounded replication times for updates was implemented in the DeeDS prototype [48], following the proposed architecture for virtual full replication. In this implementation there are two segments with different types of eventual consistency: one with unbounded replication delays and another with a bound on replication delays.

4.2 Static segmentation

M2: In the second research step, we pursued deeper knowledge about static segmentation. The segmentation approach for virtual full replication was refined [53]. We explore how virtual full replication can be supported by grouping data objects into segments. The database is segmented to group data objects that need to be accessed at the same set of nodes, and each segment is allocated only at the nodes where its data objects need to be accessed. Segmentation is an approach for managing resources by limiting the degree of replication of data objects, to avoid excessive resource usage. The data objects in a segment share key properties, where node allocation is one such property. Segmenting a database for known data references is a trivial problem, since a partitioning and replication schema can be derived directly from the list of required accesses. This can be done by a database designer, or automatically from a list of the operations in transactions.
However, when adding other data properties, there is a combinatorial increase in the resulting number of segments with unique combinations of properties, as described in [49]. With a segmented database, each segmentation enables support for a specific property, and by allowing multiple segmentations of the same database, it is possible to support several properties. A segmentation on consistency allows multiple concurrent consistency models in the same replicated database. This enables new types of applications for the DeeDS database system, since it allows data objects that cannot use eventual consistency. With a segmentation on storage medium, some segments may be allocated to disk instead of main memory, when the requirement is to prioritize durability over timeliness or even efficiency. By segmenting the database on combinations of properties we can find the largest possible physical segments, where all data objects of a segment can be managed uniformly.

We can recover segments faster by block-reading an entire physical segment from a backup node, and we can also prioritize recovery of critical segments. When combinations of properties of the replicated data are introduced, the number of segments multiplies with the number of data properties used and the possible values that the properties can take. Consider a segmentation that allows two consistency models combined with segmentation for allocation. The resulting number of possible segments may double compared to a segmentation for allocation only. To find the segments in a database where multiple properties are combined, a naive algorithm needs to loop through the combinations of values of each property for all data objects. Also, the more properties that are used for segmenting the database, the smaller the segments become. In the case where each object has its own unique combination of properties, each segment contains a single object. This is a bound on the number of possible segments, and it is unlikely in a database where the usage of objects clusters into cohesive sets of data objects. Thus, there is a need for an approach to segmenting data on multiple properties that is efficient, that considers clustering, and that captures and uses knowledge about how properties can be combined. In this step we have presented an algorithm that handles the combinatorial problem of segmenting a database on multiple properties, where multiple and overlapping segmentations of the database are allowed, and where dependencies between properties are considered [53]. We introduce logical segments that allow overlapping segmentations, where each segmentation represents a property or a set of properties of interest. From logical segments, we can derive physical segments, such as allocation and recovery units. With the refined segmentation algorithm presented, we can segment a database by multiple properties without a combinatorial problem.
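The doubling effect and the object bound discussed above can be made concrete with a small calculation. The numbers below are chosen to match the example database of six objects on five nodes used later in the text; the arithmetic itself is our own illustration, not from the thesis.

```python
# Illustrative bound on the number of possible segments.
# A segment is a unique combination of property values, but a database
# of o objects can never have more than o non-empty segments.

allocations = 2 ** 5 - 1   # 31 non-empty allocations over five nodes
consistency = 2            # two consistency models, as in the example
objects = 6

combos_alloc = allocations                 # 31 possible segments
combos_both = allocations * consistency    # 62: doubled by the new property
bound = min(combos_both, objects)          # but never more than 6 segments
print(combos_alloc, combos_both, bound)    # 31 62 6
```

Adding the second property doubles the number of possible property combinations, but the actual segment count remains capped by the number of objects.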
This algorithm also recognizes relations between data properties, which can be dependent on or unrelated to each other, or even mutually dependent. The dependency relation implies that a dependent property cannot be supported if the property it depends upon is not supported. Relations between properties reduce the number of combinations of segmentations for multiple properties, and this is how profiles may be created. Our approach allows control of relations, through rules that can be applied to the set of combined properties. The approach also allows specification of data clustering, which sets the common properties equal for clustered objects, typically shared by cohesive nodes. In current work, we use our algorithm for automatic segmentation and for allocation of units of distribution (physical segments), based on application knowledge about the data used by transactions and on other knowledge about data properties originating in application semantics.

For our future work, we plan to extend the algorithm to maintain scalability at execution time, supporting mode changes and unspecified data requests.

Static segmentation algorithm

Our approach to static segmentation is based on associating a set of properties with each object; objects with the same or similar property sets can then be combined into segments. We introduce an example here by first considering a single property: where the data is accessed. Consider a scenario with the following five transaction programs, executing as seven transaction instances, in a database with at least six objects replicated to at least five nodes:

T1.1 = <r: o1, o6; w: o6; N1>, T1.2 = <r: o1, o6; w: o6; N4>, T2 = <r: o3; w: o5; N3>, T3 = <r: o5; w: o3; N2>, T4.1 = <r: o2; w: o2; N2>, T4.2 = <r: o2; w: o2; N5>, T5 = <r: o4; N3>.

Based on these accesses, the objects have the following access sets:

o1 = {N1, N4}, o2 = {N2, N5}, o3 = {N2, N3}, o4 = {N3}, o5 = {N2, N3}, o6 = {N1, N4}.

These particular access sets give a segmentation with an optimal placement of data:

s1 = <{o4}, N3>, s2 = <{o3, o5}, N2, N3>, s3 = <{o1, o6}, N1, N4>, s4 = <{o2}, N2, N5>.

For an implementation, we use a table that collects all properties of interest. For the algorithm used with the table, the data accesses of all transactions are marked in a table with data objects in rows and nodes in columns. See Figure 1 (left) for a database of 6 objects replicated to 5 nodes. Note that the rows in the table could equally well represent groups of data objects, such as object classes or user-defined object clusters that share the same properties. By assigning a binary value to each column, each row can be interpreted as a binary number that forms an object key, identifying the nodes where the object is accessed. By sorting the table on the object key, we get the table in Figure 1 (right).
By passing through this table once, we can collect rows with the same key value into unique segments and allocate each segment at the nodes marked. See Algorithm 1 for pseudo code.
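The table-sort-and-scan principle can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the DeeDS implementation; the function and variable names are our own.

```python
# Sketch of the static segmentation principle for a single property
# (access locations). Names are illustrative, not from the DeeDS code.

def segment_by_access(access_sets):
    """access_sets: object name -> set of node indices (0-based).
    Returns a list of (objects, nodes) segments."""
    def object_key(obj):
        # Interpret the object's row of the access table as a binary
        # number: node Ni contributes column value 2**i.
        return sum(1 << n for n in access_sets[obj])

    # Sort on the object key (the dominating O(o log o) step), then
    # collect rows with the same key into segments in a single pass.
    segments = []
    prev_key = None
    for obj in sorted(access_sets, key=object_key):
        if object_key(obj) != prev_key:
            segments.append(([], sorted(access_sets[obj])))
            prev_key = object_key(obj)
        segments[-1][0].append(obj)
    # Each segment is allocated only at the nodes in its access set.
    return segments

# The running example: six objects, five nodes (N1..N5 as 0..4).
access = {"o1": {0, 3}, "o2": {1, 4}, "o3": {1, 2},
          "o4": {2},    "o5": {1, 2}, "o6": {0, 3}}
for objs, nodes in segment_by_access(access):
    print(objs, "->", ["N%d" % (n + 1) for n in nodes])
```

Run on the example access sets, the sketch yields the four segments s1 through s4 given above: {o4} at N3, {o3, o5} at N2 and N3, {o1, o6} at N1 and N4, and {o2} at N2 and N5.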

Figure 1 (left: the access table; right: the same table sorted on the object key):

         N1  N2  N3  N4  N5   object key
  value   1   2   4   8  16
  o1      x           x            9
  o2          x               x   18
  o3          x   x                6
  o4              x                4
  o5          x   x                6
  o6      x           x            9

  sorted: o4 (key 4) -> Seg 1; o3, o5 (key 6) -> Seg 2; o1, o6 (key 9) -> Seg 3; o2 (key 18) -> Seg 4

Figure 1: Segmentation principle, used for segment allocation

The sort operation in this algorithm contributes the highest computational complexity. Thus the algorithm segments the database in O(o log o), where o is the number of objects in the database. Such a property set can be extended with additional knowledge: we can uniformly handle more properties of objects by extending the property sets. Assume that the database clients require that o2 must have a bound on the replication time, and that o1 and o6 are allowed to be temporarily inconsistent. Also, assume that objects o3, o4, o5 need to be immediately consistent. Further, assume that objects o1, o3, o5, o6 are specified to be stored in main memory, while objects o2, o4 are stored on disk. To control multiple properties, we extend the property sets into (resulting in the table shown in Figure 2):

o1 = <{N1, N4}, {asap}, {memory}>, o2 = <{N2, N5}, {bounded}, {disk}>, o3 = <{N2, N3}, {immediate}, {disk}>, o4 = <{N3}, {immediate}, {disk}>, o5 = <{N2, N3}, {immediate}, {memory}>, o6 = <{N1, N4}, {asap}, {memory}>.

The sort operation in the implemented algorithm still contributes the highest computational complexity. One sort operation is used for each segmentation of interest, and typically a reduced number of segmentations is used. Thus, the algorithm segments the database in O(o log o) for multiple properties as well, where o is the number of objects in the database. This is far better than the naive algorithm for multiple-property segmentation presented in [49], which has a computational complexity of O(o!). By selecting particular properties, segmentations for special purposes can be generated.
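Extending the single-property sketch above, the extended property sets can be segmented with one sort on a combined key. The following is our own illustration of that idea; the names and encodings are assumptions, not taken from the DeeDS implementation.

```python
# Sketch of segmentation on multiple properties: the sort key combines
# the access-set bit mask with the other property values.

properties = {  # object -> (node indices, consistency, storage medium)
    "o1": ({0, 3}, "asap", "memory"),
    "o2": ({1, 4}, "bounded", "disk"),
    "o3": ({1, 2}, "immediate", "disk"),
    "o4": ({2}, "immediate", "disk"),
    "o5": ({1, 2}, "immediate", "memory"),
    "o6": ({0, 3}, "asap", "memory"),
}

def multi_key(obj):
    nodes, consistency, medium = properties[obj]
    # Combined key: (access bit mask, consistency, medium).
    return (sum(1 << n for n in nodes), consistency, medium)

# One sort on the combined key, then one pass to group equal keys.
segments = {}
for obj in sorted(properties, key=multi_key):
    segments.setdefault(multi_key(obj), []).append(obj)
for key, objs in segments.items():
    print(objs, key)
```

Note how o3 and o5, which shared a segment when only allocation was considered, now fall into separate segments because they differ in storage medium: exactly the multiplication of segments that the text describes.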


More information

The Methodology Behind the Dell SQL Server Advisor Tool

The Methodology Behind the Dell SQL Server Advisor Tool The Methodology Behind the Dell SQL Server Advisor Tool Database Solutions Engineering By Phani MV Dell Product Group October 2009 Executive Summary The Dell SQL Server Advisor is intended to perform capacity

More information

COS 318: Operating Systems

COS 318: Operating Systems COS 318: Operating Systems File Performance and Reliability Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall10/cos318/ Topics File buffer cache

More information

Survey on Models to Investigate Data Center Performance and QoS in Cloud Computing Infrastructure

Survey on Models to Investigate Data Center Performance and QoS in Cloud Computing Infrastructure Survey on Models to Investigate Data Center Performance and QoS in Cloud Computing Infrastructure Chandrakala Department of Computer Science and Engineering Srinivas School of Engineering, Mukka Mangalore,

More information

High Availability Database Solutions. for PostgreSQL & Postgres Plus

High Availability Database Solutions. for PostgreSQL & Postgres Plus High Availability Database Solutions for PostgreSQL & Postgres Plus An EnterpriseDB White Paper for DBAs, Application Developers and Enterprise Architects November, 2008 High Availability Database Solutions

More information

EonStor DS remote replication feature guide

EonStor DS remote replication feature guide EonStor DS remote replication feature guide White paper Version: 1.0 Updated: Abstract: Remote replication on select EonStor DS storage systems offers strong defense against major disruption to IT continuity,

More information

USING REPLICATED DATA TO REDUCE BACKUP COST IN DISTRIBUTED DATABASES

USING REPLICATED DATA TO REDUCE BACKUP COST IN DISTRIBUTED DATABASES USING REPLICATED DATA TO REDUCE BACKUP COST IN DISTRIBUTED DATABASES 1 ALIREZA POORDAVOODI, 2 MOHAMMADREZA KHAYYAMBASHI, 3 JAFAR HAMIN 1, 3 M.Sc. Student, Computer Department, University of Sheikhbahaee,

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

A Framework for Highly Available Services Based on Group Communication

A Framework for Highly Available Services Based on Group Communication A Framework for Highly Available Services Based on Group Communication Alan Fekete fekete@cs.usyd.edu.au http://www.cs.usyd.edu.au/ fekete Department of Computer Science F09 University of Sydney 2006,

More information

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle Agenda Introduction Database Architecture Direct NFS Client NFS Server

More information

High Availability Design Patterns

High Availability Design Patterns High Availability Design Patterns Kanwardeep Singh Ahluwalia 81-A, Punjabi Bagh, Patiala 147001 India kanwardeep@gmail.com +91 98110 16337 Atul Jain 135, Rishabh Vihar Delhi 110092 India jain.atul@wipro.com

More information

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE

IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE White Paper IBM TSM DISASTER RECOVERY BEST PRACTICES WITH EMC DATA DOMAIN DEDUPLICATION STORAGE Abstract This white paper focuses on recovery of an IBM Tivoli Storage Manager (TSM) server and explores

More information

Automated Virtual Cloud Management: The need of future

Automated Virtual Cloud Management: The need of future Automated Virtual Cloud Management: The need of future Prof. (Ms) Manisha Shinde-Pawar Faculty of Management (Information Technology), Bharati Vidyapeeth Univerisity, Pune, IMRDA, SANGLI Abstract: With

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

IBM Global Technology Services March 2008. Virtualization for disaster recovery: areas of focus and consideration.

IBM Global Technology Services March 2008. Virtualization for disaster recovery: areas of focus and consideration. IBM Global Technology Services March 2008 Virtualization for disaster recovery: Page 2 Contents 2 Introduction 3 Understanding the virtualization approach 4 A properly constructed virtualization strategy

More information

2 Associating Facts with Time

2 Associating Facts with Time TEMPORAL DATABASES Richard Thomas Snodgrass A temporal database (see Temporal Database) contains time-varying data. Time is an important aspect of all real-world phenomena. Events occur at specific points

More information

Storage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann

Storage Systems Autumn 2009. Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Storage Systems Autumn 2009 Chapter 6: Distributed Hash Tables and their Applications André Brinkmann Scaling RAID architectures Using traditional RAID architecture does not scale Adding news disk implies

More information

Avoid a single point of failure by replicating the server Increase scalability by sharing the load among replicas

Avoid a single point of failure by replicating the server Increase scalability by sharing the load among replicas 3. Replication Replication Goal: Avoid a single point of failure by replicating the server Increase scalability by sharing the load among replicas Problems: Partial failures of replicas and messages No

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

MS SQL Performance (Tuning) Best Practices:

MS SQL Performance (Tuning) Best Practices: MS SQL Performance (Tuning) Best Practices: 1. Don t share the SQL server hardware with other services If other workloads are running on the same server where SQL Server is running, memory and other hardware

More information

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

More information

Massive Data Storage

Massive Data Storage Massive Data Storage Storage on the "Cloud" and the Google File System paper by: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung presentation by: Joshua Michalczak COP 4810 - Topics in Computer Science

More information

Distributed Databases. Concepts. Why distributed databases? Distributed Databases Basic Concepts

Distributed Databases. Concepts. Why distributed databases? Distributed Databases Basic Concepts Distributed Databases Basic Concepts Distributed Databases Concepts. Advantages and disadvantages of distributed databases. Functions and architecture for a DDBMS. Distributed database design. Levels of

More information

The Hadoop Distributed File System

The Hadoop Distributed File System The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS

More information

Database Replication Techniques: a Three Parameter Classification

Database Replication Techniques: a Three Parameter Classification Database Replication Techniques: a Three Parameter Classification Matthias Wiesmann Fernando Pedone André Schiper Bettina Kemme Gustavo Alonso Département de Systèmes de Communication Swiss Federal Institute

More information

The purpose of Capacity and Availability Management (CAM) is to plan and monitor the effective provision of resources to support service requirements.

The purpose of Capacity and Availability Management (CAM) is to plan and monitor the effective provision of resources to support service requirements. CAPACITY AND AVAILABILITY MANAGEMENT A Project Management Process Area at Maturity Level 3 Purpose The purpose of Capacity and Availability Management (CAM) is to plan and monitor the effective provision

More information

Chapter 18: Database System Architectures. Centralized Systems

Chapter 18: Database System Architectures. Centralized Systems Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

Using VMWare VAAI for storage integration with Infortrend EonStor DS G7i

Using VMWare VAAI for storage integration with Infortrend EonStor DS G7i Using VMWare VAAI for storage integration with Infortrend EonStor DS G7i Application Note Abstract: This document describes how VMware s vsphere Storage APIs (VAAI) can be integrated and used for accelerating

More information

SAN Conceptual and Design Basics

SAN Conceptual and Design Basics TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer

More information

BSC vision on Big Data and extreme scale computing

BSC vision on Big Data and extreme scale computing BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,

More information

The Service Availability Forum Specification for High Availability Middleware

The Service Availability Forum Specification for High Availability Middleware The Availability Forum Specification for High Availability Middleware Timo Jokiaho, Fred Herrmann, Dave Penkler, Manfred Reitenspiess, Louise Moser Availability Forum Timo.Jokiaho@nokia.com, Frederic.Herrmann@sun.com,

More information

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database WHITE PAPER Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

CRITICAL CHAIN AND CRITICAL PATH, CAN THEY COEXIST?

CRITICAL CHAIN AND CRITICAL PATH, CAN THEY COEXIST? EXECUTIVE SUMMARY PURPOSE: This paper is a comparison and contrast of two project management planning and execution methodologies: critical path methodology (CPM) and the critical chain project management

More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network

More information

SOFT 437. Software Performance Analysis. Ch 5:Web Applications and Other Distributed Systems

SOFT 437. Software Performance Analysis. Ch 5:Web Applications and Other Distributed Systems SOFT 437 Software Performance Analysis Ch 5:Web Applications and Other Distributed Systems Outline Overview of Web applications, distributed object technologies, and the important considerations for SPE

More information

Middleware and Distributed Systems. System Models. Dr. Martin v. Löwis. Freitag, 14. Oktober 11

Middleware and Distributed Systems. System Models. Dr. Martin v. Löwis. Freitag, 14. Oktober 11 Middleware and Distributed Systems System Models Dr. Martin v. Löwis System Models (Coulouris et al.) Architectural models of distributed systems placement of parts and relationships between them e.g.

More information

Windows Server 2008 R2 Hyper-V Live Migration

Windows Server 2008 R2 Hyper-V Live Migration Windows Server 2008 R2 Hyper-V Live Migration White Paper Published: August 09 This is a preliminary document and may be changed substantially prior to final commercial release of the software described

More information

- An Essential Building Block for Stable and Reliable Compute Clusters

- An Essential Building Block for Stable and Reliable Compute Clusters Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

More information

Microsoft SharePoint 2010 on VMware Availability and Recovery Options. Microsoft SharePoint 2010 on VMware Availability and Recovery Options

Microsoft SharePoint 2010 on VMware Availability and Recovery Options. Microsoft SharePoint 2010 on VMware Availability and Recovery Options This product is protected by U.S. and international copyright and intellectual property laws. This product is covered by one or more patents listed at http://www.vmware.com/download/patents.html. VMware

More information

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available: Tools Page 1 of 13 ON PROGRAM TRANSLATION A priori, we have two translation mechanisms available: Interpretation Compilation On interpretation: Statements are translated one at a time and executed immediately.

More information

High Availability Storage

High Availability Storage High Availability Storage High Availability Extensions Goldwyn Rodrigues High Availability Storage Engineer SUSE High Availability Extensions Highly available services for mission critical systems Integrated

More information

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of

More information

CHAPTER 7 SUMMARY AND CONCLUSION

CHAPTER 7 SUMMARY AND CONCLUSION 179 CHAPTER 7 SUMMARY AND CONCLUSION This chapter summarizes our research achievements and conclude this thesis with discussions and interesting avenues for future exploration. The thesis describes a novel

More information

Running a Workflow on a PowerCenter Grid

Running a Workflow on a PowerCenter Grid Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)

More information

Scalability and BMC Remedy Action Request System TECHNICAL WHITE PAPER

Scalability and BMC Remedy Action Request System TECHNICAL WHITE PAPER Scalability and BMC Remedy Action Request System TECHNICAL WHITE PAPER Table of contents INTRODUCTION...1 BMC REMEDY AR SYSTEM ARCHITECTURE...2 BMC REMEDY AR SYSTEM TIER DEFINITIONS...2 > Client Tier...

More information

Bigdata High Availability (HA) Architecture

Bigdata High Availability (HA) Architecture Bigdata High Availability (HA) Architecture Introduction This whitepaper describes an HA architecture based on a shared nothing design. Each node uses commodity hardware and has its own local resources

More information

Using HP StoreOnce Backup systems for Oracle database backups

Using HP StoreOnce Backup systems for Oracle database backups Technical white paper Using HP StoreOnce Backup systems for Oracle database backups Table of contents Introduction 2 Technology overview 2 HP StoreOnce Backup systems key features and benefits 2 HP StoreOnce

More information