Elasticity Primitives for Database as a Service
UNIVERSITY OF CALIFORNIA
Santa Barbara

Elasticity Primitives for Database as a Service

A Dissertation submitted in partial satisfaction of the requirements for the degree of
Doctor of Philosophy in Computer Science

by

Aaron J. Elmore

Committee in Charge:
Professor Divyakant Agrawal, Co-Chair
Professor Amr El Abbadi, Co-Chair
Professor Xifeng Yan
Professor Kenneth Salem

March 2014

The Dissertation of Aaron J. Elmore is approved:

Professor Xifeng Yan
Professor Kenneth Salem
Professor Divyakant Agrawal, Committee Co-Chair
Professor Amr El Abbadi, Committee Co-Chair

February 2014

Elasticity Primitives for Database as a Service
Copyright 2014 by Aaron J. Elmore

To Emily, my better half.
Acknowledgements

The company we keep through journeys is just as important as the journey we take. I have been extremely fortunate with the quality and quantity of those around me on my journey so far. I know words alone are not enough to properly thank everyone, but it is a good start.

First and foremost, I acknowledge my advisors Divy Agrawal and Amr El Abbadi. From the start, they displayed an uncanny ability to guide and lead while fostering independence. Their ability to understand problems and clarify the murky is something I will always strive towards. Their almost thirty-year partnership as academics is truly inspiring. Beyond being excellent mentors and researchers, the two of them have always demonstrated the importance of giving respect and support, taking time to enjoy life and your family, and remaining modest.

I am ever grateful for having great collaborations with Sudipto Das during my early doctorate years. I have no doubt that any success I achieve will have also been fostered by Sudipto's influence. Sudipto instilled an ability to critically question your own problems, assumptions, and solutions. He also taught me that working with people who are better than yourself is the surest way to grow. His perseverance and work ethic are the model of a great researcher.

In addition to Sudipto, I am extremely appreciative of all the lab mates I have had throughout my time at UC Santa Barbara. Having great people to share victories and defeats with has certainly enriched my experiences. Finishing the doctorate journey would not have been possible without their friendship, advice, contributions, and encouragement. Ceren, Shiyuan, Shoji, Sudipto, Shyam, Zhengkui, Hatem, Alex, Faisal, Vaibhav, Xiaofei, Theo, Siva, Jiaci, and Cetin have all made long hours and days much more enjoyable.

Towards the end of my doctorate I was fortunate to work with Andy Pavlo. While Andy is known for his personality, he is an excellent collaborator and mentor. His willingness to go above and beyond to help people around him is a gift. Without a doubt, I know Andy will produce amazing students throughout his new academic career, and I am indebted to him for his advice during my transition out of doctorate studies.

I am extremely grateful for my committee members Xifeng Yan and Ken Salem. As committee members they were wonderful in providing feedback, helping me question research directions, and being encouraging about taking new steps professionally. I appreciate all their understanding and patience.

Our department has been fortunate to have Janet Kayfetz help our students learn how to more effectively communicate ideas in both writing and presenting. Janet takes a deep interest in her students and I am extremely glad for the skills and knowledge she has imparted to me. Her vote of confidence is always a valuable one.

I have been extremely fortunate to work with amazing colleagues and mentors during my summers. Phil Bernstein graciously had me join him one summer at Microsoft Research. Working with such a titan of research was a transformative experience. Phil's approach to problems and research showed me the importance of details and clarity. I am indebted to his mentorship and career advice. I was also fortunate to be invited by Adam Silberstein to join Trifacta for one summer. Adam was an excellent mentor and was always happy to share his valuable insights and experiences with large scale systems. Along with Adam, I was lucky to work with a lot of very talented people at Trifacta during a formative time. I am especially grateful for my experiences with Joe Hellerstein, Jeff Heer, Eric Bothwell, and Sean Kandel. These fine people at Trifacta are not only in the midst of building a great company, but they all took the time and interest to mentor and guide me as I move to the next stage in life.

My time at school has been enriched by getting to interact with some incredible staff, faculty, and students. There are too many people to name, but I am extremely happy to have worked with Mark and Jim at NCEAS, Zarko at the Computation Institute, and Adam at UCSB. Soren Llyod and Leo Irakliotis were instrumental in encouraging me to reach my potential early on. My friends throughout California, Chicago, Denver, Seattle, New York, and Tel Aviv have been a constant source of support. I cannot express how much they have done for me throughout the years. My family has also been incredibly supportive. Most importantly, they encouraged me to start and finish the doctorate. My parents instilled an incredible sense of determination, confidence, and wonder that equipped me to get this far.

Lastly and most importantly, I must thank my partner and wife Emily. She has made me into a more thoughtful, compassionate, caring, and well-rounded individual. While undertaking a doctorate one can lose sight of the world around them as they dive deeper and deeper. Emily is always there trying to anchor me to the world around me. I am eternally grateful for her support and belief. She is honestly my better half.
Curriculum Vitæ

Aaron J. Elmore

Education

2014: Doctor of Philosophy in Computer Science, University of California, Santa Barbara
Master of Science in Computer Science, University of Chicago
Bachelor of Science in Electronic Commerce Technologies, DePaul University

Experience

2013: Software Engineering Intern, Trifacta, San Francisco, CA
2012: Research Intern, Microsoft Research, Redmond, WA
Software Engineering Intern, Amazon Web Services, Seattle, WA
Research Assistant, National Center for Ecological Analysis and Synthesis (NCEAS), Santa Barbara, CA
Teaching Assistant, University of California, Santa Barbara, Santa Barbara, CA
Research Assistant, Distributed Systems Lab, UCSB, Santa Barbara, CA
Research Assistant, University of Chicago, Chicago, IL
2007 - Present: Chief Architect, Customore, Chicago, IL
Software Engineer, 1SYNC / GS1US, Chicago, IL
Software Engineer, JC Whitney, Chicago, IL
Software Engineer, The Incrementum Group, LLC, Chicago, IL
Computer Science Tutor, DePaul University, Chicago, IL

Selected Publications

Aaron J. Elmore, Vaibhav Arora, Andrew Pavlo, Divyakant Agrawal, Amr El Abbadi. Squall: Fine-Grained Live Reconfiguration for Partitioned Main Memory Databases. In Submission.
Aaron J. Elmore, Carlo Curino, Divyakant Agrawal, Amr El Abbadi. Towards Database Virtualization for Database as a Service. VLDB 2013 (Tutorial).
Aaron J. Elmore, Sudipto Das, Alexander Pucher, Divyakant Agrawal, Amr El Abbadi, Xifeng Yan. Characterizing Tenant Behavior for Placement and Crisis Mitigation in Multitenant DBMSs. ACM International Conference on Management of Data (SIGMOD) 2013.
Stacy Patterson, Aaron J. Elmore, Faisal Nawab, Divyakant Agrawal, Amr El Abbadi. Serializability, not Serial: Concurrency Control and Availability in Multi-Datacenter Datastores. Very Large Data Bases (VLDB) 2012.
Aaron J. Elmore, Sudipto Das, Divyakant Agrawal, Amr El Abbadi. InfoPuzzle: Exploring Group Decision Making in Mobile Peer-to-Peer Databases. Very Large Data Bases (VLDB) 2012.
Divyakant Agrawal, Amr El Abbadi, Beng Chin Ooi, Sudipto Das, Aaron J. Elmore. The evolving landscape of data management in the cloud. International Journal of Computational Science and Engineering.
Aaron J. Elmore, Sudipto Das, Divyakant Agrawal, Amr El Abbadi. Zephyr: Live Migration in Shared Nothing Databases for Elastic Cloud Platforms. ACM International Conference on Management of Data (SIGMOD) 2011.
Divyakant Agrawal, Amr El Abbadi, Sudipto Das, Aaron J. Elmore. Database Scalability, Elasticity, and Autonomy in the Cloud. 16th International Conference on Database Systems for Advanced Applications (DASFAA) 2011.
Aaron J. Elmore, Sudipto Das, Divyakant Agrawal, Amr El Abbadi. Towards an Elastic and Autonomic Multitenant Database. 6th International Workshop on Networking Meets Databases (NetDB).

Honors and Awards

SIGMOD Student Travel Grant
UCSB Senate Travel Grant
Amazon Research Grant, Winter 2011
Outstanding Teaching Assistant, UCSB Computer Science
Merit Fellowship
Top of the Class Honors, DePaul University

Professional Activities

2014: ACM SIGMOD Demo PC Member
UCSB Computer Science Faculty Recruitment Graduate Representative
UCSB Computer Science Graduate Representative President
UCSB Computer Science Graduate Admissions Committee Representative
External reviewer for ODBASE 2011, VLDB 2012, COMAD, Middleware
Reviewer for Transactions on Computers, Transactions on Storage
2011: Helped organize the NSF Workshop Science of Cloud held in March 2011
Abstract

Elasticity Primitives for Database as a Service

Aaron J. Elmore

Transactional databases are a critical component in data intensive applications. They enable application developers to persist and query data without having to design for concurrency control, fault tolerance, atomic multi-operation transactions, or physical storage layout. Due to the utility of databases and their general purpose design, they are widely used within organizations. However, databases are predicated on an architecture that assumes one database instance is dedicated to hosting a single application. Organizations managing many small databases with fluctuating requirements face wasted resources and redundant costs. Building a database-as-a-service platform allows for the effective consolidation of many databases into a reduced number of servers.

This dissertation focuses on the primitives, or tools, required to transform traditional database architectures into a distributed, scalable, and self-managed data platform. The presented primitives enable system elasticity, or the ability for a system to dynamically adapt the available capacity in response to changing resource requirements. First, we propose a self-managed controller to leverage expert administrators in managing database placement and maintaining system performance. This controller provides a method to identify resource requirements at runtime and a method to empirically learn how various databases will behave when colocated. These techniques are utilized to place databases and load-balance the system when resources are constrained. Second, this dissertation presents two techniques to migrate databases between servers without making the system unavailable for applications. These advances include the live migration of shared nothing databases and the live reconfiguration of partitioned main-memory databases. The presented primitives are critical steps in building a scalable database platform to host many applications using existing database architectures.
Contents

Acknowledgements
Curriculum Vitæ
Abstract
List of Figures
List of Tables

1 Introduction
   The Need for Database-as-a-Service
   Challenges Faced with DBaaS
   The Need for Elasticity Primitives
   Dissertation Overview
   Modeling and Placement Primitives
   Movement Primitives
   Contributions

2 Background
   Multitenancy Models
   Multitenancy for the Cloud
   Recent Multitenant Systems

Part I: Modeling and Placement Primitives

3 Pythia
   Challenges in Multitenancy
   Controller for a Multitenant DBMS
   Delphi Architecture
   Service Level Objectives
   Effects of Colocation
   Problem Formulation
   Pythia: Learning Behavior
   Tenant Feature Selection
   Resource-based Tenant Model
   Resource-based classes
   Training the model
   Node Model for Tenant Packing
   Utilizing Machine Learning
   Delphi Implementation
   Statistics Collection
   Crisis Detection and Mitigation
   Monitoring and Crisis Detection
   Crisis Mitigation
   Experimental Evaluation
   Benchmark and Tenant Description
   Model Evaluation
   Tenant Model Evaluation
   Node Model Evaluation
   Crisis Mitigation
   Summary

Part II: Movement Primitives

4 Forms of Database Migration
   Asynchronous migration
   Synchronous migration
   Live migration

5 Zephyr
   Background
   System Architecture
   Migration Cost
   Known Migration Techniques
   Zephyr Design
   Design Overview
   Migration Cost Analysis
   Correctness and Fault Tolerance
   Isolation guarantees
   Fault tolerance
   Migration Safety and Liveness
   Optimizations and Extensions
   Replicated Tenants
   Sharded Tenants
   Data Sharing in Dual Mode
   Implementation Details
   Experimental Evaluation
   Benchmark Description
   Migration Cost
   Summary

6 Squall
   Background
   H-Store Architecture
   Database Partitioning
   Motivation
   The Need for Reconfiguration
   The Impact of Reconfiguration
   Overview of Squall
   Initialization
   Data Migration
   Termination
   Managing Data Migration
   Identifying Migrating Data
   Reactive Migration
   Asynchronous Migration
   Replication Management
   Fault Tolerance
   Failure Handling
   Crash Recovery
   Dynamic Data Chunking
   Experimental Evaluation
   Workloads
   Cluster Expansion
   Cluster Consolidation
   Database Size
   Sensitivity Analysis
   Future Work
   Summary

Part III: The End for Now

7 Conclusion and Future Work
   Conclusion
   Future Work

Bibliography
Appendices
List of Figures

1.1 A shared nothing multitenant DBMS architecture
Pythia incrementally learns behavior
Overview of Delphi's architecture
Effects of throughput on cache impedance
Tenant model resource consumption when run in isolation
Node model performance by label confidence
Comparing improvements to nodes in violation, and the impact on nodes not in violation
Tenant latencies by platform and total tenant count
Timeline for different phases during migration. Vertical lines correspond to the nodes, the broken arrows represent control messages, and the thick solid arrows represent data transfer. Time progresses from top towards the bottom
Ownership transfer of the database pages during migration. P_i represents a database page and a white box around P_i represents that the node currently owns the page
B+ tree index structure with page ownership information. A sentinel marks missing pages. An allocated database page without ownership is represented as a grayed page
Impact of the distribution of reads, updates, and inserts on migration cost; default configurations used for the rest of the parameters. We also vary the different insert ratios: 5% inserts corresponds to a fixed percentage of inserts, while 1/4 inserts corresponds to a distribution where a fourth of the write operations are inserts. The benchmark executes 60,000 operations
Impact of varying the transaction size and load on the number of failed transactions. We also report the slope of an approximate linear fit of the points in a series
5.6 Impact of the database page size and database size on the number of failed operations
The H-Store architecture from [66]
Simple TPC-C data, showing WAREHOUSE and CUSTOMER partitioned by warehouse IDs
A sample partition plan to control data layout. For TPC-C in this example, all tables are either replicated or partitioned by their foreign key relationship to the warehouse table
As workload skew increases on a single warehouse in TPC-C, the collocated warehouses experience reduced throughput due to contention
As a system's partition plan changes, Squall must manage and track the progress of reconfiguration at each node to ensure correct data ownership in a lightweight manner
Sample Updated Partition Plan
Tracking the partition's progress at different granularities
Partition Addition: A reconfiguration to expand a cluster with two nodes from 6 partitions to 8 partitions. This expansion acts as a reshuffle, as all data items are evenly distributed between 8 partitions after reconfiguration
Node Addition: A reconfiguration to expand from 4 partitions on one node to 8 partitions on two nodes. This expansion attempts to minimize the data movement by having each partition migrate half of its data to exactly one new partition
Node Removal: A reconfiguration to contract from 8 partitions on two nodes to 4 partitions on one node
The impact of migrating larger databases on mean throughput

List of Tables

2.1 Multitenant database models, how tenants are isolated, and the corresponding cloud computing paradigms
Summary of the forms of migration and the associated costs
Notational Conventions
Chapter 1

Introduction

The bureaucracy is expanding to meet the needs of the expanding bureaucracy.
Oscar Wilde

1.1 The Need for Database-as-a-Service

Transactional databases are a critical component in data intensive applications. They enable application developers to persist and query data without having to design for concurrency control, fault tolerance, atomic multi-operation transactions, or physical storage layout. Due to the utility of databases and their general purpose design, they are widely used within organizations. However, with disjoint project development teams, acquisitions, and distinct databases for development practices, large organizations can experience database proliferation. In one extreme case, a telecommunications company was found to manage 20,000 separate database instances [25]. This proliferation comes at a high cost to organizations faced with managing such a large number of database instances.

Databases are predicated on an architecture that assumes a server is primarily dedicated to hosting the database instance. Often each instance hosts a single application's database. This architecture is an artifact of decades of database research and development focused on providing general purpose, high performance databases that fully utilize a machine's resources to support high throughput applications with low-latency response times. Many database vendors have created highly configurable databases to support a wide variety of application requirements. Tuning configuration parameters, such as the amount of memory dedicated to caching data, how concurrent operations are serialized, or the amount of acceptable time to delay log flushes, has a significant impact on the performance and guarantees provided by a database system. Given the variety and ramifications of potential database configurations, organizations rely on skilled database administrators (DBAs) to properly tune databases by working closely with application developers and system architects. The ability to properly tune and configure a database is often gained through years of administration experience.

In addition to utilizing expert administrators to optimize performance, modern database systems rely on scaling up the capacity of powerful servers to handle demanding applications. Databases greedily consume and explicitly manage the physical resources of a server. Therefore adding memory, faster persistent storage, larger CPUs, or faster network devices can often resolve performance issues. Research has investigated the use of parallel [34] and distributed databases [60] to increase performance through scaling out across multiple servers. However, the performance implications of distributed transactions have limited the popularity of these databases for update intensive workloads.

The need for skilled DBAs and expensive dedicated hardware, combined with expensive licenses for popular commercial DBMSs, results in databases being an expensive part of software application stacks. With databases being an expensive component, organizations that host many databases incur exacerbated and redundant costs. Many factors contribute to database proliferation. Organizations can maintain many product licenses across different vendors to support distinct application requirements (e.g. spatial functionality, text search), support legacy applications (e.g. deprecated functionality), and support distinct databases for development practices (e.g. separate production, QA, and development databases). A large number of databases drives up capital expenditures not only for purchasing the servers and licenses, but also for support staff to manage the physical machines and recurring costs for the storage, power, and cooling of the servers. With an architecture that assumes a dedicated database instance or server per application, this proliferation results in high costs for managing all hosted databases.

The high costs and wasted resources associated with managing multiple databases create a demand for solutions that consolidate databases onto fewer servers. Implementing efficient consolidation at the database tier requires transforming a traditional DBMS into a multitenant DBMS that can effectively share resources between many hosted applications, or tenants. The rise of cloud computing as a successful computing paradigm has demonstrated the benefits of consolidating various compute components into a multitenant service offering. These offerings range from low level services that provide on-demand virtual computers in an Infrastructure-as-a-Service (IaaS) platform to a shared application stack hosted in a Software-as-a-Service (SaaS) platform. Cloud computing offerings are successful for service providers due to their ability to leverage economies of scale to amortize the cost of each service instance. The costs of running a server do not vary greatly with server utilization. The majority of costs derives from purchasing the server itself, power, cooling, physical space to store the server, and human administrators. These costs do not significantly change if the server is utilizing 5% or 80% of its resources. Services hosted on idle or low usage servers could be consolidated onto fewer machines to lower the total operating costs. Therefore, the specialization of large scale hosting encourages effective consolidation to maximize resource utilization in a shared infrastructure. Conversely, users of cloud service platforms are attracted by a pay-as-you-go model that does not require significant initial investment in capital or development. While an effective pricing model and low vendor tie-in attract users to cloud computing models, the performance and availability guarantees need to meet the application requirements. Striking a balance between service costs, which is largely determined by consolidation, and performance is a critical challenge in building a service offering.

With the popularity and high costs of databases, a Database-as-a-Service (DBaaS) offering is appealing to both application developers and organizations hosting many databases. Here applications, or users, rent a virtual database from the service provider. To the application it appears as if it has a dedicated and isolated database instance. In reality the user is unlikely to acquire a dedicated database instance, but rather a slice of a shared database system. Use of standard database APIs (e.g. JDBC or ODBC) reduces concerns about vendor tie-in and minimizes the modifications needed to migrate from a hosted database to a database service. A Database-as-a-Service can be offered as a public cloud service, such as Amazon's Relational Database Service (RDS), or as an internal service, such as a university offering a consolidated database platform for department usage.

1.2 Challenges Faced with DBaaS

A Database-as-a-Service platform orchestrates a cluster of multitenant database servers to appear as a monolithic database to application developers. Fig. 1.1 demonstrates a conceptual Database-as-a-Service architecture, which is composed of multiple servers, each hosting multiple database applications.

Figure 1.1: A shared nothing multitenant DBMS architecture.

In addition to a transformed DBMS engine to support multitenancy, new components such as query routers and system controllers are needed to manage a cluster of database servers. Designing such a database platform requires many architectural decisions, including mechanisms for how DBMSs are multiplexed between applications or how tenant placement decisions are made. To limit the scope of architectural decisions, a database platform will likely target one of two major database use cases. Online transaction processing (OLTP) systems represent a class of databases
that serve applications with frequent short read and write operations. Often these operations are rolled into an atomic transaction that is serialized against concurrent transactions. Online analytic processing (OLAP) systems focus on read-heavy analytic and data mining workloads, with updates batched through a load process. These systems may also be referred to as decision support systems (DSS). Since the use cases differ, organizations will run distinct analytic and transactional database systems to insulate the often customer facing OLTP workloads from the sporadic and resource intensive OLAP workloads. However, this separation does not preclude the use of analytic queries in an OLTP system, and vice-versa. While the challenges faced in building an OLTP platform and an OLAP platform are similar, the constraints, goals, and solutions will vary. This dissertation focuses on solutions for an OLTP database service.

To make an OLTP focused Database-as-a-Service offering practical for application developers, guarantees for availability and performance are needed to form expectations and promises between the parties. A service level objective (SLO) is a guarantee for a single performance metric. Common SLOs typically include uptime (availability) or an operation latency response time for simple operations. A service level agreement (SLA) can be synonymous with SLO, but often it is an agreement between the provider and user that encompasses multiple SLOs. SLOs and SLAs are provided in a variety of ways, but an economic incentive model is typically used. Here the user pays for the service, either per hour or per operation, and the provider suffers a penalty for SLO violations. One example is that a user pays per GB of data stored and a nominal fee for each issued query, and if the query violates a latency SLO the user is refunded a certain amount for that query.
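To make the economic incentive model above concrete, the following minimal sketch computes a bill under such a policy. It is purely illustrative and not drawn from this dissertation; the prices, the latency threshold, and the full refund per violating query are hypothetical assumptions.

```python
# Illustrative only: a toy pay-per-GB / pay-per-query bill with a per-query
# refund whenever a query's latency exceeds its SLO threshold. All prices and
# thresholds below are hypothetical.
def monthly_bill(stored_gb, query_latencies_ms,
                 gb_price=0.10, query_price=0.0001,
                 slo_latency_ms=100.0, refund_fraction=1.0):
    """Return (charge, refund) under a simple per-violation refund policy."""
    charge = stored_gb * gb_price + len(query_latencies_ms) * query_price
    violations = sum(1 for lat in query_latencies_ms if lat > slo_latency_ms)
    refund = violations * query_price * refund_fraction
    return charge, refund

charge, refund = monthly_bill(stored_gb=25,
                              query_latencies_ms=[12.0, 85.0, 140.0, 40.0])
print(f"charge=${charge:.4f} refund=${refund:.4f}")  # one of four queries violated
```

Real providers may cap the total refund or tier it by violation severity; those details are omitted here.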
For a system hosting multiple tenants there must be a provisioning strategy that allocates the number of physical servers required and maps tenants to each server. Provisioning strategies ensure that each tenant has a suitable amount of resources to process requests in a timely manner. When a tenant is added to the system, decisions about the tenant's initial placement must be made. A consolidation primitive will determine how to initially place tenants. As tenants consume a set of resources (e.g. CPU cycles, IOPS, or memory) and each server has a fixed amount of resources, the initial placement of tenants to servers is often viewed as a multidimensional knapsack problem. Several heuristics have been proposed to address the initial placement, or consolidation, of tenants [57, 25]; a simple greedy variant is sketched below.

While the initial tenant consolidation is a primary concern in multitenant systems, it alone does not provide continual resource effectiveness. Many tenant applications exhibit temporal usage patterns. These patterns may be recurring (e.g. diurnal) or seasonal (e.g. course registration systems). A traditional method for consolidation is to profile the application in isolation in order to identify the expected usage and resource requirements. This technique is referred to as sandbox profiling. Once the resource requirements are established for all applications, the tenants are provisioned to support their highest expected level of usage. While this peak provisioning approach ensures that under most circumstances a tenant has ample resources, it is an intensive and brittle approach to consolidation. If the application experiences variance in usage patterns, then during low periods of activity the system is over-provisioned and resources are idle. Since there are many fixed costs for running servers (e.g. power, cooling, and space), idle resources are effectively wasted resources. Additionally, if an application is web facing it is subject to sudden changes in its normal usage level (e.g. flash crowds). These shifts in usage, either sudden or gradual, can cause an application to use more resources than its historical peak. In these cases the system can become over-utilized and colocated tenants may not have sufficient resources to meet SLOs. This performance crisis requires an adjustment to the system's provisioning strategy. These scenarios highlight why consolidation alone is not enough to maximize physical resources while ensuring performance objectives.
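The greedy variant promised above is sketched next. It is a minimal illustration of multidimensional first-fit placement under the assumption that peak resource vectors are already known; the resource dimensions (CPU, memory, IOPS), the per-server capacity, the 80% headroom factor, and the example tenants are hypothetical values chosen for the example, not figures from this dissertation.

```python
# Illustrative sketch of a greedy, multidimensional first-fit placement
# heuristic. Tenants are sorted by normalized demand (first-fit decreasing)
# and placed on the first server with room in every resource dimension.
def first_fit_placement(tenants, capacity, headroom=0.8):
    """tenants: dict name -> (cpu_cores, memory_gb, iops) peak requirements.
    capacity: (cpu_cores, memory_gb, iops) available per server.
    Returns a list of servers, each a dict of tenant name -> requirement vector."""
    servers = []
    # Place the largest tenants first, measured by total normalized demand.
    order = sorted(tenants,
                   key=lambda t: sum(r / c for r, c in zip(tenants[t], capacity)),
                   reverse=True)
    for name in order:
        demand = tenants[name]
        placed = False
        for server in servers:
            used = [sum(dim) for dim in zip(*server.values())]
            # Admit only if every dimension stays under the headroom threshold.
            if all(u + d <= headroom * c for u, d, c in zip(used, demand, capacity)):
                server[name] = demand
                placed = True
                break
        if not placed:
            servers.append({name: demand})
    return servers

layout = first_fit_placement(
    {"t1": (1.5, 4, 300), "t2": (0.5, 2, 100), "t3": (2.0, 8, 500)},
    capacity=(8, 32, 2000))
print(len(layout), "server(s):", layout)
```

Such static packing only works as well as its resource estimates; the placement decisions developed in this dissertation (Chapter 3) are instead driven by tenant and node models learned at runtime.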
To support performance SLOs, a system controller will need to monitor tenant activity and react when an SLO violation does occur. If hardware failures or software bugs are ruled out as reasons for the SLO violation, then the system can assume a behavior change created the violation. To respond to this performance crisis, a controller must implement some form of resource isolation to ensure tenants receive ample resources to meet their performance objectives. Methods for implementing resource isolation include (i) changing the mapping of tenants to servers to change resource utilization patterns, (ii) adding additional servers to the system and placing some tenants on the new servers, (iii) implementing resource control mechanisms to limit the resources consumed by a given tenant, or (iv) rate limiting the client queries either at the database or at the query router. The class of solutions related to the placement of tenants to ensure resources or performance is often referred to as soft isolation [62], the class of solutions related to controlling how resources are shared between tenants is referred to as resource allocation [63], and solutions related to limiting or queueing queries are often referred to as admission control [86]. While research is ongoing for these orthogonal approaches to resource isolation, this thesis focuses on the elasticity primitives (or tools) needed to enable a soft isolation based multitenant database platform. We focus on soft isolation for three primary reasons. First, soft isolation enables a database platform to be built with little modification to existing database kernels. Using vanilla database releases or limited patches to popular releases increases the opportunity for adoption and impact. Second, if the database platform is built on an elastic infrastructure (i.e. one where it is easy to provision additional servers), then the system does not have to limit tenant requests, either queries or resource requests, to ensure resource isolation. Third, soft isolation enables the flexible sharing of resources, which allows tenants to consume additional resources when needed and to relinquish resources when not needed.

1.3 The Need for Elasticity Primitives

A Database-as-a-Service offering can host hundreds to thousands of databases across tens to hundreds of database servers. Due to the scale of tenants and the presence of dynamic workloads, manual administration of resource isolation is untenable. Orchestrating many database servers to host a large number of applications requires changes to existing database systems as well as the design and implementation of new tools and components to ensure that hosted applications continually meet performance objectives. To allow the platform to scale up in the number of tenants, it is important that system operations can be managed without the direct supervision of an administrator. Therefore, self-managed elasticity primitives are essential for building a scalable database platform that adapts to dynamic workloads while maximizing resource utilization. This section highlights the elasticity primitives needed to enable a Database-as-a-Service using soft isolation as the resource control mechanism.

A soft-isolation platform hosting a large number of small tenants across a cluster of servers must address several challenges. One key challenge is the ability to understand what amount of physical resources a tenant needs in order to meet target SLOs. These resources can include CPU cycles, memory, disk I/O operations, or other related physical resources. Without adequate resource access, a tenant is likely to violate SLOs during periods of high activity. As many applications can be multipurposed or have different users with distinct usage patterns, it can be difficult to discover the exact resource requirements for a tenant. In addition to identifying resource requirements, attributing current resource consumption to a given tenant is difficult for architectures where tenants share system processes. Therefore, a platform should have primitives to model resource requirements for a given tenant and attribute resource consumption to tenants.

Discovering resource requirements alone is not enough for a database platform to place tenants. In a soft isolation based architecture, resources are shared between tenants without controls for how resources will be shared. When two workloads are placed on a single machine, they will compete for the underlying resources. How the resources will be acquired and used by each tenant will depend heavily on the controls allowed, the database architecture, and the tenant workloads. Often the resources consumed by tenants are not additive when colocated, and a model for aggregating resource consumption is required [25, 6, 62]. As the system will make decisions about which tenants will be colocated, a system controller must have the ability to predict or model how various tenants will behave when colocated. Without any colocation primitive, a system would be blind when placing tenants in the absence of strong resource isolation. Such a blind placement would likely result in performance violations for periods of moderate activity.

With the presence of dynamic workloads, behavior can change in a manner that has not been previously observed. In these cases, performance violations can occur even with perfect resource and colocation modeling primitives enabled. Here, the system must react to a performance crisis resulting from the violated SLO. Several approaches have been proposed to deal with these violations in a soft isolation based platform. If the system utilizes primary copy replication [16], then one option is to shift workloads by promoting a secondary replica to become the new primary replica [61]. If a multi-master replication scheme is utilized, then the percentage of work allocated to each replica can be load-balanced [70]. If replication is not enabled, or if the replicas are not valid destinations due to the existing workloads at the secondaries, then a database must be migrated between servers to update the tenant to server mapping [38]. For this solution, a migration primitive must exist to migrate a tenant's persistent image and active state. Ideally, a live migration [21] primitive is supported to migrate the tenant's active state and image without stopping the database.

Databases amenable to a consolidated environment and hosted by a Database-as-a-Service platform are likely to have a small physical size (footprint) or low throughput. However, certain classes of applications may have data storage requirements or an active working set that spans the capacity of a single server. For these tenants the database will need to be partitioned across two or more database servers. If the tenant requires transactional support, a partitioning primitive will determine how to partition the data across servers to minimize distributed transactions while distributing load and storage across servers [29, 65]. As workloads evolve, the layout of data may need to change to maintain performance. Workload changes can result in a hotspot that needs to be split across servers, or changes in workload access patterns that result in too many distributed transactions. Similar to live migration, a live reconfiguration is needed to change partitioned data's layout without taking the system offline.

1.4 Dissertation Overview

This dissertation focuses on the design, implementation, and evaluation of primitives required for a soft isolation based database platform, in particular primitives related to the placement and movement of tenants. While these primitives are critical first steps in enabling a scalable and elastic Database-as-a-Service, there are other primitives desirable for a database platform. These include tools needed to handle the configuration of replication protocols, ensuring the privacy of each tenant's data, managing and updating scalable query routers, the generation of SLOs, stronger resource allocation mechanisms, and controlling elasticity to minimize operating costs. These tools provide a rich agenda for future research. This thesis shows that building a scalable, elastic, and autonomic database platform is achievable using existing database architectures by providing solutions to understand workload requirements, predict the impact of colocation, reactively load-balance tenant placement, and migrate persistent state in a lightweight manner. These issues are addressed in two parts. The first part addresses primitives related to modeling tenant resource consumption, modeling the impact of colocation on tenants, and a load-balancing primitive, which can also be used to incrementally place tenants for initial consolidation. The second part focuses on movement primitives to load balance tenants and partitioned databases.
1.4.1 Modeling and Placement Primitives

Databases are predicated on an architecture that assumes the server is dedicated to the hosted database. Therefore, a database is designed to consume the resources of the server regardless of whether it needs them. This architecture can make it difficult to attribute accurate resource requirements to a running tenant. As a motivating example, suppose a system hosts a database with a total storage footprint of ten GB, but the database only actively uses two GB of its storage. This means that if this database has access to two GB of cache, the majority of read queries would not result in disk I/O. However, the buffer pool, or database cache, will fill up to the total amount of allocated buffer space regardless of whether less is needed. This sample database would use up to ten GB of buffer pool for this tenant, even though the active set is a fraction of that size. In a multitenant environment, to place tenants with adequate server resources it is important to understand what resources a tenant will actually need to answer requests effectively. The greedy design of databases increases the difficulty of attributing resource requirements.

Previous research has shown how to identify the resource requirements of tenants when they are profiled in an isolated environment [25]. Profiling tenants in isolation works well when tenant behavior is static and predictable. However, when tenants are colocated in a process that shares resources, it is difficult to ascertain resource requirements without isolating tenants on a profiling server. When tenant behavior is dynamic, multipurpose, or subject to ad hoc usage, isolated profiling becomes untenable. Therefore, a technique is needed to estimate resource requirements at runtime in a consolidated environment. The first primitive presented estimates a tenant's resource requirements at runtime based on supervised learning techniques.

In addition to estimating tenant resource requirements, understanding how behavior changes with colocation is critical for making placement decisions. How tenants share resources is dependent on the database architecture and the tenants' current behavior. For example, if a buffer pool uses a least recently used (LRU) page replacement policy, then a tenant's throughput will be a significant factor in determining how it shares the buffer pool. Faster tenants are more likely to have their pages frequently accessed and therefore less likely to have their pages evicted. Understanding the effects of colocation is required to avoid resource starvation from over-consumed resources. Further complicating the problem is the difficulty in understanding the interactions and relationships between the database, the operating system, and the underlying hardware. Building precise models of tenant colocation dependent on these interactions can result in brittle hand-tuned interaction models. Instead, we propose a colocation model that is empirically learned by repeatedly observing how different sets of tenant classes behave together.
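To give a flavor of what such a runtime, learning-based estimate might look like, the sketch below assigns a tenant to a coarse resource class from a few runtime statistics using a nearest-centroid rule. This is only an illustration of the general supervised-learning idea; the feature set, class labels, and numbers are hypothetical, and Chapter 3 describes the models actually used by Pythia.

```python
# Illustrative sketch only: classify tenants into coarse resource classes
# ("light", "io-bound", "cpu-bound") from runtime statistics using a
# nearest-centroid rule learned from labeled examples.
from collections import defaultdict
import math

def train_centroids(samples):
    """samples: list of (feature_vector, label); returns label -> centroid."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for features, label in samples:
        if sums[label] is None:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {lbl: [s / counts[lbl] for s in vec] for lbl, vec in sums.items()}

def classify(centroids, features):
    # Assign the tenant to the class whose centroid is closest.
    return min(centroids, key=lambda lbl: math.dist(centroids[lbl], features))

# Hypothetical features: (queries/sec, cache hit ratio, dirty pages flushed/sec)
training = [((50, 0.99, 5), "light"), ((400, 0.60, 300), "io-bound"),
            ((900, 0.98, 20), "cpu-bound"), ((80, 0.97, 8), "light")]
model = train_centroids(training)
print(classify(model, (450, 0.55, 280)))   # -> "io-bound"
```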
Even with a perfect resource and colocation model, tenant behavior can evolve over time due to changes in access patterns, an increase in application traffic, or changes to application code. This behavior change can result in increased resource contention between tenants. With soft isolation, resolving this issue requires that one or more tenants be shuffled between servers to provide adequate resource availability to all tenants. Ideally these disruptions are infrequent and the overhead of moving tenants between servers is amortized over periods of regular activity. A technique is presented to identify a set of tenants to move, and their destinations, leveraging the resource and colocation models.

1.4.2 Movement Primitives

Since soft-isolation platforms do not rely on strong resource allocation mechanisms, load-balancing tenants is required to resolve performance crises. Here load-balancing will place workloads that are complementary with regard to resource consumption. There are several methods to enable load-balancing in this environment. As previously stated, replication based approaches are not adequate for all load-balancing scenarios the system may encounter. Therefore, a migration primitive is needed for a soft-isolation based database platform. Ideally, a migration technique will minimize disruption to the system and to active transactions. Such a technique is referred to as live migration [21]. This dissertation presents the first live migration technique for shared nothing databases that incurs no downtime for the active tenant.

Live migration focuses on migrating an entire tenant between two database servers. For partitioned databases this approach cannot be directly applied to reconfigure how the data is partitioned. Since the system is transactional and partitioned, a reconfiguration primitive should explicitly consider distributed transactions, a concern not present for the migration of an entire tenant database. Additionally, the data being moved is at a smaller granularity than a tenant. A live reconfiguration addresses these issues by changing the layout of data without needing to take the system offline or migrate an entire database or partition. A technique for live reconfiguration of partitioned main memory databases is addressed by this dissertation.
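The sketch below illustrates, in a deliberately simplified form, the on-demand data pull that makes such movement "live": during a move, data touched by a transaction is fetched from the source when first accessed, while cold data migrates asynchronously in bounded chunks. The class, its methods, and the in-memory key-value representation are hypothetical; they are not the Zephyr or Squall implementations described later in the dissertation.

```python
# Purely illustrative sketch of "pull on first access" during a live move:
# a destination node that receives a request for data it does not yet own
# fetches that data from the source before executing, while untouched data
# is moved in the background in small chunks.
class ReactiveDestination:
    def __init__(self, incoming_keys, fetch_from_source):
        self.pending = set(incoming_keys)      # assigned to us, not yet moved
        self.local = {}                        # data we already own locally
        self.fetch_from_source = fetch_from_source

    def execute(self, txn_keys, txn_fn):
        # Reactive migration: pull any keys the transaction touches that are
        # still owned by the source, then run the transaction locally.
        for key in txn_keys:
            if key in self.pending:
                self.local[key] = self.fetch_from_source(key)
                self.pending.discard(key)
        return txn_fn(self.local)

    def migrate_next_chunk(self, chunk_size=2):
        # Asynchronous migration: move a bounded chunk of cold data per call.
        for key in list(self.pending)[:chunk_size]:
            self.local[key] = self.fetch_from_source(key)
            self.pending.discard(key)

source = {"k1": 10, "k2": 20, "k3": 30, "k4": 40}
dest = ReactiveDestination(source.keys(), fetch_from_source=source.get)
print(dest.execute({"k2"}, lambda data: data["k2"] + 1))  # pulls k2 on demand -> 21
dest.migrate_next_chunk()                                  # moves some cold keys
```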
1.5 Contributions

This dissertation makes several contributions to enabling elasticity primitives for a multitenant, scalable, elastic, and self-managed data platform. The presented advances allow traditional relational databases to be used in a distributed scale-out architecture. A focus on minimizing problem constraints is emphasized throughout the presented solutions. While these contributions focus on a shared nothing architecture that serves web and OLTP style workloads, the tools can be applied to other target environments. The following contributions are essential in the realization of a virtualized Database-as-a-Service offering:

- An analysis of popular forms of multitenancy in database systems and how these forms align with cloud computing paradigms.
- A framework for analyzing database migration. The framework introduces several forms of migration to characterize existing and future migration techniques. The migration framework also identifies attributes to measure and evaluate different migration techniques against each other.
- An end-to-end multitenant database prototype that orchestrates a cluster of shared-nothing PostgreSQL servers to ensure latency based performance SLOs are maintained. The system includes mechanisms for database monitoring, crisis mitigation, and a supervised-learning based autonomic controller, Delphi.
- A technique, Pythia, for a multitenant controller to model tenant resource consumption and the impact of colocation. Pythia can approximate tenant resource consumption in a consolidated process, which externally reports aggregated resource consumption for all colocated tenants. Pythia leverages tenant resource models to learn the impact of tenant colocation. Pythia allows for configurable consolidation levels with minimal instructions from a database administrator.
- A load-balancing primitive to rapidly derive new tenant placement plans when latency SLOs are violated. The load-balancing algorithm is based on a time-bound local search heuristic that uses migrations as steps to improve the system's global state, with the optimal state being one in which each node is as unlikely as possible to have a resource violation.
- The first published live migration technique for shared nothing databases, Zephyr. The proposed approach has zero downtime for the migrating tenants and aborts an order of magnitude fewer transactions than a stop-and-copy based approach. A novel use of unique data page ownership enables lightweight synchronization between sites without the use of latency inducing techniques, such as two-phase commit.
- Squall, the first proposed live reconfiguration technique to update a main memory partitioned database's layout of tuples at runtime. Squall targets main memory databases that use a single threaded transaction manager per partition. The technique focuses on identifying all migrating data for partitioned relational data and on minimizing the disruption of reconfiguration in the presence of a single threaded partition.

Chapter 2

Background

A pint of sweat saves a gallon of blood.
George S. Patton

Building a database-as-a-service platform requires new components as well as primitives to adapt traditional stand-alone databases. A variety of architectures and systems have emerged to build such a database platform. These architectures have emerged in the context of cloud computing environments or for dedicated hosted platforms, or private clouds. In addition to new tools and architectures, a database system must be multiplexed to host multiple applications. This problem is often referred to as multitenancy, where each hosted application is a tenant. This chapter introduces database multitenancy models, basic cloud computing concepts, recent system architectures for multitenancy, and state of the art advancements in database tools for a database platform.

2.1 Multitenancy Models

Multitenancy in databases has been prevalent for hosting multiple tenants within a single DBMS while enabling effective resource sharing [50, 9, 68]. Sharing of resources at different levels of abstraction and with distinct isolation levels results in various multitenancy models. The three models explored in the past [50] consist of: shared machine (also referred to as shared hardware), shared process, and shared table. SaaS providers like Salesforce.com [81] are a common use case for database multitenancy, and traditionally rely on the shared table model. The shared process model has been recently proposed in a number of database systems for the cloud, such as RelationalCloud [27], SQLAzure [13], and ElasTraS [31]. Nevertheless, some features of cloud computing increase the relevance of the other models.
Soror et al. [75] propose using the shared machine model to improve resource utilization. To improve understanding of multitenancy, we use the classification recently proposed by Reinwald [68], which uses a finer sub-division (see Table 2.1), though some of these models can collapse to the more traditional models of multitenancy. However, the different isolation levels between tenants provided by these models make this classification interesting and helpful for selecting a target model when building a multitenant database.

Table 2.1: Multitenant database models, how tenants are isolated, and the corresponding cloud computing paradigms.

#   Sharing Mode      Isolation     Cloud Paradigm
1.  Shared hardware   VM            IaaS
2.  Shared VM         OS User       IaaS
3.  Shared OS         DB Instance   IaaS
4.  Shared instance   Database      PaaS
5.  Shared database   Schema        PaaS
6.  Shared table      Row           SaaS

Shared Hardware. The models corresponding to rows 1-3 share resources at the level of the same machine with different levels of abstraction, i.e., sharing resources at the machine level using multiple VMs (VM isolation) or sharing the VM by using different user accounts or different database installations (OS and DB Instance isolation). There is no database resource sharing. Rows 1-3 only share the machine resources and thus correspond to the shared machine model in the traditional classification. While these models offer strong isolation between tenants, they come at the cost of increased overhead due to redundant components and a lack of coordination, which uses the limited machine resources in an unoptimized way. The lack of coordination is prominent in the case of using a virtual machine for each tenant, row 1, where each tenant behaves as if it has exclusive disk access [27].
Shared Process. Rows 4-5 involve sharing the database process at various isolation levels, from sharing only the installation binary (database isolation), to sharing the database resources such as the logging infrastructure, the buffer pool, etc. (schema isolation), to sharing the same schema and tables (table row level isolation). How a database instance can be isolated between tenants varies between implementations. For example, with MySQL each tenant can be given their own schema with limited user permissions. Rows 4-5 thus span the traditional class of shared process.[1]

[1] The shared instance model is primarily supported by commercial databases that allow multiple databases (processes) to share a common installation (or binary). Example usage includes running isolated production and test databases. This model can map to both shared machine as well as shared process.

Shared Table. The shared table model uses a design which allows extensible data models to be defined by a tenant, with the actual data stored in a single shared table. The design often utilizes pivot tables to provide rich database functionality such as indexing and joins [9]. While this model offers the advantage of maintaining a single database instance, isolating tenants for migration becomes difficult due to shared locking mechanisms. The reliance on consolidated pivot and heap tables could lead to poor performance due to all tenants sharing index structures. Additionally, the shared table model requires that all tenants reside on the same database engine and release (version). This limits specialized database functionality, such as spatial or object based features, and requires that all tenants use a limited subset of functionality. This model is ideal when tenant data requirements follow similar structures or patterns, such as in the case of Force.com offering customizations on a customer relationship database [81].

With different forms of multitenancy, the components that constitute a tenant vary. We henceforth use the term cell to represent all information necessary to serve a tenant. A multitenant database instance consists of thousands of cells, and the actual physical interpretation of a cell depends on the multitenancy model.

Definition 1. A cell is the self-contained granule representing a tenant in the database.

The choice of multitenancy model has implications for a tenant's resource isolation, consolidation, functionality, and required development. A key trade-off to explore in multitenancy is the balance between the amount of consolidation that the system provides and the degree to which resource access is isolated between tenants. When a shared hardware model relies on virtualization to divide a server's resources between tenants, a strong level of resource isolation is enabled. Here, each tenant is guaranteed a specific amount of resources and the hypervisor provides the ability to monitor and control how resources are utilized. However, as shown in a recent study [26], such a model results in up to an order of magnitude lower performance and consolidation compared to the shared process model. This limited consolidation is largely driven by two factors. First, virtualization results in redundant database and OS processes, each of which requires dedicated resources to operate. Recent developments have attempted to limit redundancy due to this issue. Second and more importantly, when independent database processes reside on the same physical server they act in an uncoordinated manner. Databases are designed to greedily consume and explicitly manage resources to amortize the costs associated with various components. An example of this is delaying writes to the disk until periods of low activity or when writes can be batched. Many greedy and uncoordinated database processes will utilize the underlying resources in an ineffective manner that significantly limits the number of tenants that can be hosted on a single server.

On the other hand, the shared table model allows efficient resource sharing amongst tenants but restricts the tenant's schema and requires custom techniques for efficient query processing. Here the overhead of managing the metadata of many independent databases and tables is mitigated because data is shared within fewer tables. This approach maximizes the level of tenant consolidation, but has limited resource isolation. With tenants hosted within the same tables, the database has reduced options to ensure each tenant has adequate access to resources. Implementations of the shared table model can rely on additional components, such as the application layer, to provide resource isolation [81].

The shared process model, therefore, provides a good balance of effective resource sharing, schema diversity, performance, and scale. The shared process model has also been widely adopted in commercial and research systems [13, 27]. The shared process model also allows existing database systems to be used with little to no modification to the database kernel. We therefore focus on the shared process model for this dissertation.

2.2 Multitenancy for the Cloud

While broad in concept, three main paradigms have emerged for cloud computing: IaaS, PaaS, and SaaS. We now establish the connection between the database multitenancy models and the cloud computing paradigms (Table 2.1 summarizes this relationship), while analyzing the suitability of the models for various multitenancy scenarios.
41 Recent Multitenant Systems Section 2.3 IaaS provides the lowest level of abstraction such as raw computation, storage, and networking. Supporting multitenancy in the IaaS layer thus allows much flexibility and different schema for sharing. The shared hardware model is therefor best suited in IaaS. A simple multi-tenant system could be built of a cluster of high end commodity machines, each with a small set of virtual machines. Each virtual machine would host a single database tenants. This model provides isolation, security, and efficient migration for the client databases with an acceptable overhead, and is suitable for applications with lower throughput but larger storage requirements. PaaS providers, on the other hand, provide a higher level of abstraction to the tenants. There exist a wide class of PaaS providers, and a single multitenant database model cannot be a blanket choice. For PaaS systems that provide a single data store API, a shared table or shared instance could meet data needs for the platform. For instance, Google App Engine uses the shared table model for its data store referred to as MegaStore [10]. However, PaaS systems with the flexibility to support to a variety of data stores, such as AppScale [20], can leverage any multitenant database model. SaaS has the highest level of abstraction in which a client uses the service to perform a limited and focused task. Customization is typically superficial and workflows or data models are primarily dictated by the service provider. With rigid definitions of data and processes, and restricted access to a data layer through a web service or browser, the service provider has control over how the tenants will interact with a data store. The shared table model has thus been successfully used by various SaaS providers [9, 50, 81]. 2.3 Recent Multitenant Systems Many multitenant database systems have been proposed in recent years. This section introduces several of the systems to provide key contributions and any potential issues not addressed by the system. Yang et al. [87] outline a shared process model to build a scalable data platform. The presented system focuses on the system can enable replication within a datacenter and across datacenters. The presented project is one of the earlier papers to formally describe an architecture for a shared process data platform. This system assumes tenant resource requirements are readily available, and that all tenant resource consumption is additive. Tenant placement is treated as a multidimensional bin-packing problem. Similar to Yang et al. the RTP project seeks to dynamically configure tenant placement [70]. RTP focuses on how to place tenants and redistribute workloads between replicas, such that the system is robust to handle any single server failure. 17
A thorough evaluation of various placement strategies is presented. The target database is a main-memory system, so the authors focus on univariate workload requirements that are additive. Lang et al. [53] present a technique to place tenants and also provision the number and types of servers required. The server provisioning strategy accounts for multiple hardware classes with various performance characteristics. Lang et al. focus on a provisioning strategy that ensures performance service level objectives (SLOs) will be met. Here the workload classes are assumed to be known, and the provisioning can systematically explore how the various classes behave when colocated on a given hardware configuration. PMAX [57] also seeks to provision a multitenant environment, but its focus is on maximizing profit by factoring in SLO violation costs and the costs associated with running each server. Here, the workloads are known and fixed, but the arrival rate and value of queries can change. The impact of colocating workloads is assumed to be provided either by observation or through an oracle. PMAX demonstrates that popular multi-dimensional bin packing heuristics can be sub-optimal for tenant provisioning strategies. SQLVM [63] is a project to embed the resource allocation and isolation of virtualization technology into the database kernel. SQLVM focuses on how to meter tenants' resource utilization and schedule database requests in order to provide specific levels of resource access. This project examines how disk I/O, CPU cycles, and memory for caching can be limited by the database engine. DAX [56] is a multitenant environment that focuses on providing cross-datacenter replication for tenants. Here a single active tenant runs in only a single datacenter, but the persistent state of the tenant is replicated between many datacenters. This allows a tenant to move between datacenters in a lightweight manner. DAX replaces local block storage with a distributed key-value store where each block ID is the key and the block's data is the value. DAX leverages the semantics of transactional consistency to minimize the number of replicas required to acknowledge each operation. SQLAzure [13], ElasTras [31], and Relational Cloud [27] are multitenancy platforms that target the same environment as this dissertation. These projects address different issues related to building a multitenant environment, such as how to partition tenants [29, 13, 31] and how to initially consolidate tenants when the workload remains static [25]. Relational Cloud enables a tenant to be partitioned with distributed transactions, whereas the other systems focus on tenants that can be hosted by a single physical server. ElasTras relies on a shared storage layer to host the tenant's data and logs. Another similar project, Google's F1 [?], is a
scalable relational database system that enables automatic partitioning and SQL support.
Part I
Modeling and Placement Primitives
47 Chapter 3 Pythia The purpose of science is not to analyze or describe but to make useful models of the world. A model is useful if it allows us to get use out of it. Edward de Bono Cloud application platforms and large organizations face the challenge of managing, storing, and serving data for large numbers of applications with small data footprints. For instance, several cloud platforms such as Salesforce.com, Facebook, Google AppEngine, and Windows Azure host hundreds of thousands of small applications. Large organizations, such as enterprises or universities, also face a similar challenge of managing hundreds to thousands of database instances for different departments and projects. Allocating dedicated and isolated resources to each application s database is wasteful in terms of resources and is not costeffective. Multitenancy in the database tier, i.e., sharing resources among the different applications databases (or tenants), is therefore critical. We focus on such a multitenant database management system (DBMS) using the shared process multitenancy model where the DBMS comprises a cluster of database servers (or nodes) where each node runs a single database process which multiple tenants share. 3.1 Challenges in Multitenancy A multitenant DBMS must minimize the impact of colocating multiple tenants. The challenge lies in determining which tenants to colocate and how many to colocate at a given server, i.e., learn good tenant packings that balance between over-provisioning and over-booking. Furthermore, the colocated tenants resource 23
requirements must be complementary to avoid heavy resource contention after colocation. To ensure quality of service, a multitenant DBMS must also associate meaningful service level objectives (SLOs) in a consolidated setting. If a tenant's SLO is violated, the DBMS must adapt to this performance crisis. The challenge lies in mitigating the crisis, which might be caused by a change in this tenant's behavior, a change in a colocated tenant's behavior, or a degradation in the node's performance. A tenant's behavioral change might be due to a change in the query pattern, data access distribution, working set size, access rates, or queries issued on non-indexed attributes while typical queries are on indexed attributes; the complexity arises from the myriad of possibilities. Adapting to a crisis entails detecting changes, filtering erratic behavior, and devising mitigation strategies. Erratic behavior can arise from temporary shifts in application popularity, periodic analysis, or ad-hoc queries. The problem of designing a self-managing controller is further complicated by the variety of tenant workload types. Many applications use their databases for multiple purposes, such as using the same database for serving, analysis, and logging. Therefore, in addition to workload variations across tenants, a single tenant might also exhibit different behaviors at different time instances. Behavioral changes might have patterns (e.g., diurnal trends of serving and reporting workloads) or might be erratic (e.g., flash crowds). Moreover, dynamics in the workload might or might not be correlated across tenants. For instance, hosted business applications observe a spike in multiple tenants' activity at the start of the business day. These behavioral dynamics, the interplay of shared resources among colocated tenants, and the complex interactions between the DBMS, OS, and the hardware make analytical models and theoretical abstractions impractical. From the monitoring perspective, a system controller potentially receives hundreds of raw performance measures from the database process and the operating system (OS) at each node. Considering the scale of tens to hundreds of nodes, using all these raw signals to maintain an aggregate view of the entire system and the individual tenants results in an information deluge. One challenge in effective administration is to systematically filter, aggregate, and transform these raw signals into a manageable set of attributes and automate administration with minimal human guidance. An intelligent and self-managing system controller is a significant step towards achieving economies-of-scale and simplifying administration. More than a decade of research has focused on effective multitenancy at different layers of the stack, including sharing the file system or the storage layer [76, 64], sharing hardware through virtualization [45], and sharing in the application and web server layer [80]. Multitenancy in the database tier introduces novel challenges due to
the richer functionality supported by the DBMS compared to the storage layer, and the complex interplay between CPU, memory, and disk I/O bandwidth observed in a DBMS compared to stateless applications and web servers. Recent work has focused on various aspects of database multitenancy. Kairos [26] is a technique for tenant placement and consolidation for a set of tenants with known static workloads. Kairos uses direct measurements of the tenants' CPU, I/O, memory, and disk resource consumption to suggest consolidation plans. SmartSLA [85] is a technique for cost-aware resource management using direct resource utilization measurements to learn the average SLA penalty cost in a setting where each tenant has its own independent database process and virtual machine. Ahmad and Bowman [6] use machine learning techniques to predict aggregate resource consumption for various workload combinations and propose a technique that relies on static and known workloads. The authors argue that analytical models for performance and consolidation are hard due to complex component interactions and shifting bottlenecks. Lang et al. [53] propose an SLO-focused framework for static provisioning and placement where tenant workloads are known. In general, existing approaches do not target the problem of continuous tenant modeling, dynamic tenant placement, variable and unknown tenant workloads, and performance crisis mitigation in the shared process multitenancy model, which is critical for deploying shared database services. 3.2 Controller for a Multitenant DBMS We present the design and implementation of Delphi, an intelligent self-managing controller for a multitenant DBMS that orchestrates resources among the tenants. Delphi uses Pythia, a technique to learn behavior through observation (Delphi is an ancient Greek site, where the oracle Pythia resided). Pythia uses DBMS-agnostic database-level performance measures available in any standard DBMS and supervised learning techniques to learn a tenant model representing resource consumption. Pythia learns a node model to determine which combinations of tenant types perform well after colocation (good packings) and which combinations do not perform well (bad packings). Pythia continuously models behavior and maintains historical behavior, which allows it to detect a change in a tenant's behavior. Once Delphi detects a performance crisis, it leverages Pythia to suggest remedial actions. Identifying a set of tenants to relocate, and finding destinations for these tenants to alleviate latency violations, is the core challenge addressed by Pythia. Delphi employs a local search algorithm, hill-climbing, to prune the space of possible tenant packings and uses the node
50 Chapter 3. Pythia Figure 3.1: Pythia incrementally learns behavior. model to identify potential good packings. Pythia requires minimal human supervision, typically from a database administrator, only for training the supervised learning. Once the models are trained, Delphi can independently orchestrate the tenants, i.e., monitor the system to detect performance crises, load-balance and migrate tenants to mitigate a crisis and to ensure that tenant SLOs are being met. Figure 3.1 presents an overview of Delphi s design. In contrast to existing techniques that directly use OS or VM level resource utilization, such as Kairos [26] and SmartSLA [85], Pythia uses database-level performance measures such as cache hit ratio, cache size, read/write ratio, and throughput. This allows Pythia to maintain a detailed per-tenant profile even when tenants share a database process. OS level measures either provide aggregate resource consumption metrics of all tenants, and the alternative of hosting one tenant per database process degrades performance [26]. In contrast, Pythia results in negligible performance impact by using performance measures available from any standard DBMS implementation. Additionally, Pythia learns tenant behavior without any assumptions or in-depth understanding of the underlying systems. In addition, unlike workload driven techniques [26, 6, 53], Pythia does not require advanced knowledge of the tenants workload or limit the workload types. Moreover, Pythia does not require profiling tenants in a sandbox, a dedicated node for running tenants in isolation, thus making it applicable even in scenarios where production workloads cannot be replayed due to operational or privacy considerations [7]. Therefore, we expect Pythia to have applications in a variety of multitenant systems and environments, while requiring minimal changes to existing systems. Delphi is the first end-to-end framework for the accurate and continuous modeling of tenant behavior in a shared process multitenancy environment. We built 26
51 Delphi Architecture Section 3.3 Figure 3.2: Overview of Delphi s architecture. a prototype implementation of Delphi in a multitenant DBMS running a cluster of Postgres RDBMS servers. Our current implementation uses a set of classifiers to learn tenant and node models, although Pythia can be extended to use additional tenant resource models, or other machine learning techniques such as clustering or regression learning [83]. Pythia learns tenant models with a 92% accuracy, and node models with a 86% accuracy. Once a performance crisis is detected, Delphi can mitigate the crisis by reducing the 99th percentile latency violations by 80% on average. 3.3 Delphi Architecture Figure 3.2 provides an overview of the DBMS architecture we consider in this chapter. The system consists of a cluster of shared-nothing DBMS nodes where each node runs an RDBMS engine which multiple tenants share. Delphi is the overall framework for a self-managing system controller and comprises Pythia, which models tenant and node behavior and maintains historical information, a crisis detection and mitigation engine, and a statistics collector, which gathers system-wide performance statistics. Every DBMS node has a lightweight agent which collects usage statistics at that node. A lightweight embedded web server interfaces the agent to the statistics collector. Delphi periodically requests a snapshot of the performance measures from all nodes. On receipt of a snapshot request, the agent obtains pertenant and aggregate performance statistics from the database process and the OS, collectively called the performance features. The agent then returns the 27
snapshot to Delphi. We choose a pull-based approach over a push-based approach in which the agents periodically push the snapshot to Delphi. A pull-based approach enables an agent to operate without any persistent state information and allows the snapshot frequency to be configured by Delphi. The agent collects and aggregates statistics only when requested by Delphi. All snapshots collected by Delphi are stored in a history database that stores the per-tenant and per-node history and serves as a persistent repository for historical information such as tenant behavior, packings, and actions taken by Delphi to mitigate a performance crisis. The history database can also be used for off-line analysis of Delphi's actions and performance by an administrator. 3.4 Service Level Objectives In order for a shared multitenant DBMS to be attractive for the tenants, the DBMS must support some form of performance guarantees, such as response times. Typically, the tenants and the DBMS provider agree on a set of service level agreements (SLAs). However, agreeing on SLAs is typically an arms race between the provider and the tenant, often governed by business logic [7]. Automatically assigning performance SLAs based on workload characteristics is a hard problem and beyond the scope of this dissertation. Instead, we focus on service level objectives (SLOs) as a mechanism for quantifying the quality-of-service of a multitenant DBMS for a given tenant's workload. We rely on uniform percentile-based latency SLOs for all tenants. 3.5 Effects of Colocation In the shared process multitenancy model, tenants share the DBMS, OS, and hardware. This includes sharing the DBMS's buffer pool, the OS's file system cache (or page cache), the available I/O bandwidth, and CPU cycles. DBMSs are typically optimized for scaling up a single tenant and are not designed to fairly share resources among tenants. Therefore, it is critical to understand the impact of resource sharing and contention on performance. It is well understood that when colocating multiple tenants, no resource should be over-utilized. Approaches such as Kairos [26] determine which tenants can be colocated based on the tenants' resource consumption. Furthermore, it is also imperative that colocated tenants have complementary resource consumption. As an example, colocating a mix of disk-heavy and CPU-heavy tenants is probably better than colocating multiple disk-heavy tenants.
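To make the additive-capacity and complementarity heuristics above concrete, the following sketch checks whether a candidate packing over-commits any resource and scores how lopsided its aggregate demand is. It is a simplified illustration, not Delphi's placement logic; the capacity figures, per-tenant demand vectors, and the imbalance score are assumptions introduced only for this example.

```python
# Minimal sketch; all numbers below are illustrative assumptions.
def fits(tenants, capacity):
    """Reject a packing if any summed resource demand exceeds node capacity."""
    for resource in capacity:
        if sum(t[resource] for t in tenants) > capacity[resource]:
            return False
    return True

def imbalance(tenants, capacity):
    """Lower is better: a packing dominated by one resource scores worse than
    a complementary mix (e.g., disk-heavy tenants paired with CPU-heavy ones)."""
    utilizations = [sum(t[r] for t in tenants) / capacity[r] for r in capacity]
    return max(utilizations) - min(utilizations)

capacity = {"cpu": 1.0, "disk_iops": 1000.0}          # assumed node limits
disk_heavy = {"cpu": 0.10, "disk_iops": 400.0}
cpu_heavy = {"cpu": 0.45, "disk_iops": 50.0}

print(fits([disk_heavy, cpu_heavy], capacity), imbalance([disk_heavy, cpu_heavy], capacity))
print(fits([disk_heavy, disk_heavy, disk_heavy], capacity))
```

In practice, as the remainder of this chapter argues, such clean per-tenant resource figures are not directly observable in a shared database process, which is one of the reasons Pythia infers behavior from database-level measures instead.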
Figure 3.3: Effects of throughput on cache impedance. In addition to the above heuristics, when multiple tenants share the cache, it is important to consider how the tenants access the cache. Assume we have two tenant databases D1 and D2 colocated at a node with enough resources to serve both tenants. If D1's throughput is higher than that of D2, the least recently used (LRU) eviction policy, commonly used in buffer pool implementations, will result in D1 stealing cache pages from D2. This is because D1 is accessing more pages, thus making D2's pages candidates for likely eviction. This reduction in D2's cache hit ratio will affect its performance. Figure 3.3 demonstrates this behavior in an experiment on a node serving a set of identical tenants, where a tenant A has its throughput gradually increased while the other tenants (B) have a fixed throughput; B represents the average behavior of the remaining tenants. As the throughput difference increases, A steals more cache from the other tenants, resulting in an increase in A's cache hit ratio and a decrease in the cache hit ratio of all the remaining tenants (B-Avg), thus affecting the performance of the slower tenants. We introduce the concept of cache impedance as a measure of compatibility between tenants to effectively share a cache. A scenario where one tenant dominates the cache over the other tenants is due to a cache impedance mismatch. Therefore, to ensure that tenants continue to perform well after colocation, it is important that their cache impedances match. Four major aspects affect cache access: key or tuple access distribution, database size, throughput, and the read/write distribution; though from our experiments, we have seen throughput to be the most dominant factor. A similar impact of cache impedance is also observed for accesses to the OS page cache and how the tenants share the page cache. However, the interactions between the two levels of caches, the buffer pool and page cache, are complex due to the stateful interactions between the OS and DBMS [6]. We therefore
take an empirical approach to infer a tenant's cache impedance through observed behavior, thus avoiding solutions tailored to a specific implementation. We incorporate the knowledge of cache impedance into the design of Pythia by using signals from both the buffer pool and the page cache to determine the class labels that represent tenant behavior. Pythia learns which combinations of tenant classes are amenable to colocation. While specialized solutions can be built with strong resource isolation between tenants or by bypassing the page cache using direct I/O, we aim to present an approach requiring minimal changes to current DBMS architectures. 3.6 Problem Formulation Let D be the set of tenant databases and N be the set of DBMS nodes. Each tenant D_i ∈ D is represented by a vector of performance features F_i, and F = {F_i | D_i ∈ D}. Let C be a set of class labels corresponding to tenant behavior. Pythia learns the tenant model T : F → C, i.e., given a performance feature F_i of tenant D_i, the tenant model T assigns a label c ∈ C to D_i. Let P_j be the set of tenants at N_j, such that each tenant entry D_i has a corresponding class label C_i derived from T and a performance measure (e.g., latency) M_i that constitutes the SLO. Pythia also learns the node model P : P_j → G, i.e., given a set of tenants P_j, and thus corresponding labels, P assigns a label g ∈ G that indicates the packing's quality. We set G = {Under, Good, Over}, where Good denotes a good packing, Under is a good packing with under-utilized resources, and Over is a bad packing. Delphi periodically collects snapshots from all the DBMS nodes. Each node reports the performance features of all the tenants hosted at that node and the overall node-level usage statistics. Delphi monitors all the nodes and ensures that all the tenants' SLOs are being met and every node in the system is performing well, i.e., ∀ N_j ∈ N, P(P_j) ≠ Over. For every tenant, Pythia maintains a sliding window of the last W snapshots and their corresponding labels. Given a performance crisis at a node N_j, i.e., one or more tenants in the packing P_j are violating their SLOs, Delphi mitigates the crisis by finding a packing P'_j ⊂ P_j such that P(P'_j) ≠ Over. Delphi also finds destination node(s) N_k for the tenants P_j \ P'_j that must be moved out of N_j. 3.7 Pythia: Learning Behavior Pythia seeks to learn tenant and node models through observation. Using machine learning classification, Pythia assigns a class label to a tenant that ap-
55 Pythia: Learning Behavior Section 3.7 proximately describes the tenant s behavior and resource consumption. Pythia also learns which tenant packings are performing well and which ones are violating SLOs. Pythia s design goals are to learn behavior while: (i) requiring minimal changes to the underlying multitenant DBMS, (ii) having no foreknowledge or assumptions (such as static or predefined workloads or working set fitting in memory) on tenant workloads, and (iii) causing negligible performance impact. Pythia uses supervised learning techniques, specifically classifiers, to learn tenant behavior. We use tenant classes that are representative of their resource consumption. Pythia also learns which combination of tenant types perform well together (good packings) and which tenant packings are likely to over consume resources or violate latency SLOs (bad packings). We now explain our feature selection process to identify a small set of performance features that can accurately model a tenant s behavior and then explain how Pythia uses these features to learn the tenant and node models Tenant Feature Selection Pythia uses DBMS-agnostic database-level performance measures. Databaselevel performance measures allow per-tenant monitoring for detailed analysis even in shared process multitenancy. A plethora of performance measures can be extracted from any standard DBMS. Examples are number of cache accesses and cache misses, number of read/update/insert/delete requests, number of pages accessed by each transaction, average response time, and transactions per second. However, using all the measures to characterize a tenant is not desirable since the complex relationship between attributes can be difficult for classifiers to infer. We therefore use our domain knowledge to select a subset of measures. The challenge lies in selecting the measures that correlate to tenants behavior and resource requirements while ensuring high modeling accuracy. To allow Pythia to be used in a large variety of systems, we only select measures available from any standard DBMS implementation, be it SQL or NoSQL, with negligible or no impact on normal operation. We select the feature set guided by our knowledge of the attributes semantics. We now explain the different measures, some of which are derived from other raw measures, we choose as the tenant performance features and also explain the rationale behind their selection. Write percent. The percentage of operations issued by a tenant that are writes, i.e., inserts, deletes, and updates. This measure gives an estimate of the rate at which the pages are updated and is also an indirect indicator of the expected disk 31
traffic resulting from the writes due to cache pages being flushed, checkpointing, and appends to the transaction log. Average operation complexity. Average number of (distinct) database pages accessed by a single transaction. When transactions are specified in a declarative language, such as SQL, operation complexity is a measure of the resources consumed (such as CPU cycles) to execute that operation. This measure differentiates a tenant issuing simple read/write transactions from one issuing complex transactions performing joins or scans, even though both transaction types might issue the same number of SQL statements. Percent cache hits. Number of database pages accessed that were served from the cache. This measure approximates the number of disk access requests issued by the database process for the tenant. Buffer pool size. Number of pages allocated to the tenant in the DBMS buffer pool; approximates the tenant's memory footprint. OS page cache size. Number of pages allocated in the OS page cache to the tenant's database files. Along with buffer pool size, this provides an estimate of the tenant's total memory footprint. Database size. Size of the tenant's persistent data; representative of the disk storage consumption. Size has an indirect impact on the tenant's disk I/O. Throughput (Transactions per second). Average number of transactions completed for a given tenant in a second. Transactions completed include both the transactions committed and rolled back. Throughput is an indicator of behavior (such as cache impedance) and resource requirements (such as CPU and disk bandwidth). Our experiments (Section 3.11) demonstrate that these attributes allow Pythia to train accurate models, representative of resource consumption, that can be used by Delphi for effective crisis mitigation. Resource-based Tenant Model We characterize a tenant's behavior in terms of its expected resource consumption. The rationale for selecting resource-based models is to allow Pythia to associate semantics with a tenant's behavior and reason about the resources consumed. Furthermore, the number of critical resources is limited and known to a system administrator. Therefore, tenant behavior can be classified into a handful of resource-based classes without requiring knowledge of tenant workloads. Recent work has investigated workload-based modeling by classifying queries run in isolation [72]. However, using queries to build accurate models and understand tenant interaction assumes new queries follow existing patterns, and limits the
opportunity for ad-hoc queries. Additionally, queries that rely on caching to minimize expensive operations, such as nested loop joins, can make modeling interactions difficult. In designing Pythia we sought to avoid these assumptions. We now explain how we select the resource-based class labels, how we determine the rules to assign labels to the tenants, and how we train the tenant model. Resource-based classes A tenant's resource consumption has four important components: CPU consumed to fulfill the tenant's requests, main memory (RAM) consumed to cache data, disk capacity consumed to store persistent data, and the disk I/O bandwidth (or disk IOPS) consumed. While networking is an important resource, for now we assume that database connections are ample due to connection pooling and that workloads use minimal data transfer. In scenarios where this is critical, network attributes can be added. In most cases, commodity servers have disks (order of terabytes) much larger than the tenants (order of a few gigabytes). Therefore, disk capacity is almost always abundant. However, irrespective of how much RAM is provided, the DBMS and the OS in a running server will invariably use most of the available RAM to cache existing tenants. Therefore, RAM usage is almost always constrained and excess capacity for new tenants is not reserved. However, in practice, a large fraction of the cached pages are not actively used and a newly added tenant will carve out cache capacity, provided the tenant's cache impedance, as described in Section 3.5, matches that of the tenants already being served at the node. In addition to cache impedance, the disk IOPS and the available CPU capacity are two critical resources that determine how well the tenants perform after colocation. In a consolidated setting, we cannot directly determine the exact CPU and disk IOPS consumption or the cache impedance of each tenant, so we model them indirectly. We use the performance features to approximate resource consumption. We base our class labels on three dimensions: expected disk IOPS (D), throughput (T), and operation complexity (O). We loosely associate D, T, and O with disk bandwidth consumption, cache impedance, and CPU consumption. A human administrator is required to derive resource boundaries for the labels. These boundaries can be derived by observing resource distributions in a system where Pythia will be used. In our evaluation, we partition each continuous dimension into a few buckets whose boundaries are determined by analyzing the distribution of values obtained from an operational system. The D dimension is subdivided into four buckets: small (DS), medium (DM), large (DL), and extra-large (DXL). The T dimension is subdivided into three buckets: small (TS), medium (TM), and large (TL). The O dimension is subdivided into two buckets: small (OS) and
large (OL). The rationale for such a subdivision is that disk bandwidth is a critical resource for data intensive applications; a finer subdivision allows closer monitoring of a tenant's disk bandwidth consumption. Throughput impacts CPU consumption, disk IOPS, and cache impedance. We therefore consider throughput second after disk and use a coarser subdivision into three buckets. Operation complexity affects CPU consumption and is primarily targeted at differentiating tenants issuing complex queries such as scans, reports, or joins. Complexity comes last and is subdivided into two buckets. We use class labels composed of D, T, and O. For instance, DS-TS-OS represents a tenant with low expected disk consumption, low throughput, and low complexity. Among the possible 24 classes, in practice, some classes fold into one encompassing class as some dimensions override the others. As an example, a DXL tenant's resource consumption is high enough that the operation complexity does not matter; we only consider two throughput buckets for such tenants: DXL-TL and DXL-TMS (medium and small). Similarly, for a DL-TL tenant, complexity is irrelevant. Following is the set of class labels we used: C = {DL-TL, DL-TM-OL, DL-TM-OS, DL-TS-OL, DL-TS-OS, DM-TL-OL, DM-TL-OS, DM-TM-OL, DM-TM-OS, DM-TS-OL, DM-TS-OS, DS-TL-OL, DS-TL-OS, DS-TM-OL, DS-TM-OS, DS-TS-OL, DS-TS-OS, DXL-TL, DXL-TMS} Training the model Throughput and operation complexity are directly measured from the database process. The number of disk requests issued per tenant must be estimated, since the OS only provides aggregate measures and the DBMS we use does not directly report resource metrics per tenant. The database process provides a measure of the number of disk read requests issued to the OS. However, due to the OS page cache, an actual disk access happens only after a miss in the OS page cache. Our first approximation was that every access to the page cache has a uniform probability of a miss. This approach to modeling the OS page cache is inaccurate since it overlooks cache impedance at the page cache level, an artifact of using a DBMS that utilizes the page cache rather than direct I/O. If a tenant D1 misses the database cache more frequently than another tenant D2, D1 issues more requests to the page cache than D2, thus dominating D2 in the page cache. Therefore, a request by D2 has a higher likelihood of a miss compared to that of D1. The probability of a page cache miss is also dependent on other factors such as the tenant's database size, what fraction of it is cached, and the page cache eviction policy.
Let P(A) be the probability that an access to a page missed the buffer pool. If h is the cache hit ratio, then P(A) = 1 - h. Let P(B) be the probability of a miss in the page cache. If p is the total number of database pages for the tenant and m is the number of the tenant's pages in the page cache, then P(B) = 1 - m/p. The probability of a page being read from the disk is P(A ∩ B). For simplicity, if we assume that A and B are independent, then P(A ∩ B) = P(A)P(B) = (1 - h)(1 - m/p). The number of pages accessed per second is given as the product of the operation complexity (o) and throughput (t). Writes contribute to disk activity due to dirty buffer pages and WAL writes. Update operations incur fixed disk-write activity due to logging, and have a probability of creating another disk write if a clean buffer page is dirtied. Outside of logging, updates to an already dirty buffer page may not force a new disk write. This makes disk activity difficult to model accurately. Let u be the update operations per second, d be the percentage of pages that are dirty in the buffer pool, and α be a slack variable for updates causing additional disk writes. Therefore, we have: Expected Disk IOPS = (o · t)(1 - h)(1 - m/p) + u + α(u · d · (1 - h)) (3.1) This measure (3.1) is an approximation since it simplifies the effect of updates on disk writes and assumes buffer pool misses and page cache misses are independent. However, our experiments in Section 3.11 reveal that (3.1) is close enough as a guideline for labeling the training set. We provide all the performance measures to the tenant classifier that learns the function using the attributes, but use the transformed expected disk IOPS in generating labels for our training set. Training the tenant model requires minimal guidance from a human administrator. An administrator analyzes the distribution of the values along the dimensions D, T, and O to determine the boundaries for the buckets that form the class labels for training. For our evaluation, we derived boundaries by examining the distributions of attributes against server resource consumption when tenants were run in isolation. Once the bucket boundaries are determined, the administrator assigns labels to the tenants based on their respective performance features. Pythia trains a classifier on the labeled training set to learn the tenant model T. Pythia is designed to work with various tenant models, as long as a model label exists that captures the multiple dimensions of resource utilization. Modeling database interaction in a multitenant environment is challenging due to shifting bottlenecks and unforeseen interactions. This work focuses on demonstrating the potential of a framework that relies on machine learning to model tenant behavior and predict interaction. The tenant model presented in this dissertation is a representative example. For future work, we plan to examine additional tenant
models, such as online models based on query analysis [72], providing administrators feedback on models, and assisting tenant modeling through unsupervised learning. 3.8 Node Model for Tenant Packing Pythia uses the tenant model to learn which tenant classes perform well together and which do not. The goodness of a packing depends on the classes of tenants that comprise the packing. A node model is trained for a single hardware configuration. A set of tenants at a node is represented as the packing vector. If |C| is the number of tenant classes, then a tenant packing is represented by a vector of length |C|. A position in the vector represents a tenant class and the number of tenants of that class contained in the packing; if a type is absent from the packing, the corresponding count is set to 0. For example, if c_1, c_2, and c_3 are the known tenant classes and a packing had two tenants of type c_1 and three tenants of type c_3, the vector for the packing is [2, 0, 3]. The node feature representing a packing at a node is the packing vector. We train the node model (P) by providing a set of labels (G) representing the goodness of a packing. In its simplest form, G can be {Good, Over}, representing good and bad packings respectively. A packing is Good if all tenants meet their SLOs, and Under if SLOs are met but server resources are under-utilized. A latency SLO is composed of an upper bound on response time for a given percentile. Let S be the set of latency limits, composed of s_i, the latency limit for the i-th percentile in milliseconds. For our evaluation we set S = {s_95: 500, s_99: 2000} based on discussions with several cloud service providers. Relaxing or tightening SLOs depends entirely on the applications using the service. The binary labeling technique captures the SLOs but does not consider utilization of the node. For instance, a packing might be good but the node's resources might be under-utilized or over-utilized. We therefore augment G to include information about utilization. As noted earlier, disk IOPS and CPU are two critical resources. We use idle CPU percent and the percentage of CPU cycles spent waiting on I/O (IOWait) as indicators of node utilization in terms of both CPU and disk bandwidth; too many disk requests are reflected in high IOWait. A node's utilization is subdivided into three categories: if idle is above a certain upper bound (U_u) then the node is under-utilized (Under); if idle is below a lower bound (U_l) or IOWait is over a threshold U_w then the node is over-utilized (Over); an idle percent in the range (U_l, U_u] with IOWait less than U_w is considered good utilization (Good). If any tenant violates a latency SLO the node is labeled as Over, regardless of
resource consumption. Composing the utilization-based division with the SLO-based division results in a set of labels that captures both utilization and SLOs: G = {Under, Good, Over}. To train the node models, a human administrator specifies the parameters S, U_l, U_u, and U_w. A simple rule-based logic assigns labels to the node based on the node feature. Once the training set is labeled, Pythia trains a classifier to learn the node model P. The node model is incrementally updated to reflect new observations in the running system. Utilizing Machine Learning We use Weka, an open-source machine learning library, to train Pythia's models. We experimented with multiple classifiers such as decision trees, random forests, support vector machines, and classifiers based on regression [83]. The training data is obtained by augmenting an operational system, for which Pythia will be trained, to collect the tenant and node features, which are then labeled as described earlier. In our evaluation Pythia utilizes Random Forests, an ensemble decision tree classifier, due to its high accuracy and resistance to overfitting. Once the tenant and node models are trained, they are stored and served in-memory; Delphi uses these models for intelligent tenant placement. In this section, we presented one way of training the models in Pythia as a representative example. However, Pythia can be adapted to work with a different set of performance features, tenant and node labels, and semantics associated with the labels. The role of a domain expert or a system administrator is to determine representative features and assign labels so that the tenant and node models can be trained accordingly. For example, disk I/O is limited by the underlying hardware, and experienced administrators can easily identify and categorize ranges of disk consumption. Moreover, Pythia can also be extended to use other forms of machine learning such as clustering or regression learning [83]. Exploring such directions is left for future work; our initial focus was to leverage classification with domain knowledge to explore the end-to-end design space.
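To illustrate the mechanics described above, the sketch below shows how the expected-IOPS estimate of Equation 3.1 might be computed from a tenant's performance features, how a tenant could be bucketed into one of the D-T-O classes, and how a packing vector over the class set C is assembled for a node-model classifier. It is a simplified stand-in, not Delphi's implementation: the bucket boundaries, toy training data, and the reduced class set are invented for the example, and scikit-learn's random forest is used in place of the Weka classifier the prototype relies on.

```python
# Illustrative sketch only; thresholds and data are assumptions, not measured values.
from sklearn.ensemble import RandomForestClassifier

CLASSES = ["DS-TS-OS", "DS-TL-OS", "DM-TM-OL", "DL-TL", "DXL-TMS"]  # small subset of C

def expected_iops(o, t, h, m, p, u, d, alpha=0.5):
    # Equation 3.1: page-cache-aware read estimate plus write/logging activity.
    return (o * t) * (1 - h) * (1 - m / p) + u + alpha * (u * d * (1 - h))

def bucket(value, boundaries, names):
    # Map a continuous value into a named bucket; boundaries are administrator-chosen.
    for limit, name in zip(boundaries, names):
        if value <= limit:
            return name
    return names[-1]

def tenant_label(f):
    # Compose the D, T, and O buckets into a class label (class folding is omitted here).
    d = bucket(expected_iops(**f), [50, 200, 800], ["DS", "DM", "DL", "DXL"])
    t = bucket(f["t"], [20, 100], ["TS", "TM", "TL"])
    o = bucket(f["o"], [10], ["OS", "OL"])
    return "-".join([d, t, o])

def packing_vector(labels):
    # Count tenants of each class hosted on a node; classes not present stay 0.
    return [labels.count(c) for c in CLASSES]

features = {"o": 4, "t": 120, "h": 0.90, "m": 5_000, "p": 20_000, "u": 10, "d": 0.2}
print(tenant_label(features))                    # DS-TL-OS for these assumed features
print(packing_vector(["DS-TL-OS", "DS-TS-OS"]))  # [1, 1, 0, 0, 0]

# Toy node model: packing vectors labeled Under/Good/Over by the rule-based logic.
X = [[4, 0, 0, 0, 0], [2, 1, 1, 0, 0], [0, 0, 2, 2, 1]]
y = ["Under", "Good", "Over"]
node_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(node_model.predict([[1, 1, 1, 1, 0]]))
```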
3.9 Delphi Implementation We implemented Delphi on a multitenant DBMS with each node running Postgres, an open-source RDBMS engine. All the database-level performance features (F) are obtained using two Postgres extensions and without any modification to the Postgres code. Early in our project, we also explored MySQL and found that the majority of the performance measures comprising F are available through a third-party MySQL extension, ExtSQL. In this section, we explain Delphi's components other than Pythia, i.e., the statistics gathering component, and the crisis detection and mitigation component. Statistics Collection Each DBMS node has an agent which interfaces with Delphi. The agent collects the tenants' performance statistics by querying the database process. In our prototype, tenants share the same Postgres instance and have independent schemas (or databases, in Postgres terminology). To gather the performance statistics, we use two extensions to Postgres in addition to Postgres internal statistics. The extension pg_buffercache provides detailed information about the state of the database buffer by table, and the extension pgfincore peeks into the OS's page cache to determine which parts of a tenant's database are cached by the OS. Both extensions expose the statistics as a table which the agent queries using SQL issued through a local JDBC connection. A number of queries are issued to Postgres, pg_buffercache, and pgfincore to obtain statistics such as per-tenant database and OS page cache allocation, cache-hit ratios, number of dirty pages, and read/write ratios. The agent also collects aggregate node-level usage statistics such as percentage idle CPU, CPU cycles blocked on I/O calls, CPU clock speed and number of cores, memory usage, number of disk blocks read or written, and disk I/O operations per second for all drives hosting database files or the transaction log. The agent can also be configured to follow database logs to record database events such as a checkpoint initialization or slow queries. The statistics collector requests a snapshot via the agent's web server, which obtains the performance measures from the local database; statistics are reset after collection. The response by the agent wraps all the statistics and the time since the last report in a flexible interchange format, JSON in our case, allowing easy extensibility. This entire process takes on the order of a few milliseconds and
allows lightweight statistics collection. The impact of monitoring on database latency was observed to be a few milliseconds. 3.10 Crisis Detection and Mitigation Monitoring and Crisis Detection The statistics collector periodically gathers statistics from all the DBMS nodes to create an aggregate view of the system. For every incoming snapshot, Delphi uses Pythia's tenant model to determine each tenant's class. Delphi maintains a per-tenant sliding window of the last W snapshots; all of Delphi's actions are based on W. The class labels in W are used to determine a representative label for each tenant. For instance, assume Delphi maintains 5 snapshots, and in this window a tenant D_i has 4 labels corresponding to class c_j and 1 label corresponding to c_k. Delphi represents D_i as {0.8 c_j, 0.2 c_k}, i.e., D_i is of type c_j with confidence 80% and type c_k with confidence 20%. Delphi's use of a window W, rather than using only the last snapshot, provides a more confident view of shifts in behavior. It allows Delphi to filter spurious behavior, such as sudden spikes in activity or higher than average response times resulting from system maintenance activities such as checkpoints. Using percentile latency SLOs also limits the impact of a few queries with high latency. Crisis mitigation steps, such as migrating some tenants out of a node, are expensive, and hence Delphi must filter out spurious behavior and react only when a shifting trend is observed. For a given tenant packing P_j at node N_j, Delphi determines whether all tenants' SLOs, S, are being met. If all SLOs are met and no resource is being over-utilized, then this packing is an instance of a good packing for the node model. The slack in resource consumption determines the aggressiveness of consolidation. If one or more SLOs are violated, then this packing is an instance of a bad packing. Once a performance crisis corresponding to a bad packing is detected, Delphi searches for a good packing and takes remedial measures. The node model P receives continuous feedback about good and bad tenant packings and is incrementally updated as Delphi observes new packings and their outcomes. We incrementally re-train P using the negative examples, i.e., cases where the model's prediction was inaccurate, and a sampling of positive examples that are not repetitive, as many packings and their outcomes are repetitive during steady state.
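As a concrete illustration of the sliding-window labeling just described, the fragment below turns the last W per-snapshot class labels of a tenant into a representative label with a confidence, and applies the percentile-based SLO check. This is a minimal sketch under assumed data structures, not the Delphi implementation; the window size, the synthetic label sequence, and the crude percentile index are assumptions made only for the example.

```python
from collections import Counter, deque

W = 5  # assumed window size

def representative_label(window):
    """Return {label: confidence} over the last W snapshot labels."""
    counts = Counter(window)
    return {label: count / len(window) for label, count in counts.items()}

def slo_violated(latencies_ms, slos={95: 500, 99: 2000}):
    """Percentile-based SLO check mirroring S = {s_95: 500, s_99: 2000}."""
    ordered = sorted(latencies_ms)
    for pct, limit in slos.items():
        idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
        if ordered[idx] > limit:
            return True
    return False

window = deque(maxlen=W)
for snapshot_label in ["DM-TM-OS", "DM-TM-OS", "DM-TL-OS", "DM-TM-OS", "DM-TM-OS"]:
    window.append(snapshot_label)
print(representative_label(window))  # {'DM-TM-OS': 0.8, 'DM-TL-OS': 0.2}
```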
Crisis Mitigation Mitigating a performance crisis for a bad packing P_j at node N_j entails identifying a packing P'_j ⊂ P_j such that P(P'_j) ≠ Over, where Over corresponds to an over-packed node according to the node model P. We formulate this problem of finding the packing P'_j as a search problem through the combinations of the tenant packings, a well-studied problem in artificial intelligence [69]. The search algorithm performs what-if analysis using the node model P to determine potential destination nodes that can accommodate a subset of the tenants in P_j^M without themselves deteriorating to a bad packing as predicted by P. Once a good packing P'_j is determined, the tenants P_j^M = (P_j \ P'_j) must be migrated out of N_j. Pythia must find one or more destination nodes that can accommodate P_j^M. Therefore any tenant p_j^M ∈ P_j^M requires a destination node N_k serving a packing P_k such that P(P'_k) ≠ Over, where P'_k = P_k ∪ {p_j^M}. Destinations must be found for all tenants included in P_j^M. We implemented a few different search algorithms in designing Pythia. Breadth first search (BFS) [69] first tries all combinations of migrating one tenant, then combinations of a pair of tenants, and so on until it finds a good packing according to P. Using an exhaustive search algorithm would often not converge on a solution, either due to the search space complexity or due to not being able to satisfy a goal test of having no nodes in violation. Additionally, we do not expect BFS to scale with a large number of tenants and nodes. The local search algorithm hill-climbing becomes the natural selection, due to its ability to provide a time-bounded best solution and to treat packing as an optimization problem [69]. With hill-climbing, all immediate neighbors (potential migrations) are examined, and the move providing the largest improvement is selected. Therefore only the local state is considered when making search decisions. This process is repeated until no additional step can improve the state (a local maximum) or time has expired. Each step is evaluated with a heuristic cost estimate h to find a migration which provides the largest improvement to the tenant packing, by finding a local minimum for h. The naive cost function h attempts to minimize the number of nodes which are labeled as Over: h = |{N_i ∈ N : P(P_i) = Over}| (3.2) However, in packings with minimal excess capacity, this cost function (3.2), which minimizes the number of Over nodes, would simply overload one node to the point of being unresponsive. The next step would be to minimize the number of tenants in violation. This requires that we extend the node model to provide a confidence score λ for a given label. In place of a single label, most classifiers can produce a set
of labels and confidences. We denote the function that provides a confidence rating for a given packing and label as λ, with λ(P_i, g) ∈ [0, 1] for g ∈ G. The new cost function follows as: h = Σ_{N_i ∈ N} (λ(P_i, Over) · |P_i|)^2 (3.3) This cost function has the artifact of migrating tenants between nodes that were not in violation in order to reduce the overall score. Healthy nodes with a large number of tenants can be labeled as having a small confidence of being Over. This cost function would favor migrating from a large number of tenants with a low over-score rather than from a small number of tenants, in violation, with a higher over-score. We use a minimal threshold σ that λ must exceed to register a score. After examining latencies and I/O wait times for λ(·, Over), we settled on σ = 0.35, due to the reduced likelihood of high latencies or strained I/O. These results are included in Section 3.11. Our final cost function is set to: h = Σ_{N_i ∈ N : λ(P_i, Over) > σ} (λ(P_i, Over) · |P_i|)^2 (3.4) In case the search algorithm cannot find a suitable packing, or cannot converge after a few iterations, it concludes that new nodes need to be added to accommodate the changing tenant requirements and behaviors. In a cloud infrastructure, such as Amazon EC2, Delphi can automatically add new servers to elastically scale the cluster. In a statically-allocated infrastructure typical of classical enterprises, Delphi flags this event for capacity planning by a human administrator. Since the workloads are dynamic and we employ heuristics to find a solution, a stable state is not guaranteed. Additional heuristics, such as a maximum allowed number of moves in a time period, are used to prevent excessive tenant movement. Delphi must migrate the tenants p_j^M for which it found a potential destination node. This problem of migrating a tenant database in a live system is called live database migration [37]. Once a tenant is migrated, the outcome is recorded in the history database. Migrating a tenant incurs some cost, such as a few aborted transactions and an increase in response times. Our current search algorithm does not consider this cost in determining which tenants to migrate. Ideally, Delphi would factor migration cost into decision making; however, this requires accurate models to predict migration cost, which is influenced by tenant attributes, colocated workloads, and network congestion. We evaluated regression models based on the attributes in Section 3.7 to predict migration cost, but interference from source and destination workloads resulted in inaccurate models. Accurate models to predict migration cost, augmenting the search algorithm to consider a predicted
66 Chapter 3. Pythia migration cost, and techniques to recognize patterns in workloads are worthwhile directions of future extensions. It is possible to use replicas for crisis mitigation and load balancing in Pythia. Leveraging existing replicas to migrate workload instead of data migration reduces the migration cost. Synchronous replication protocols make this operation simple and quick, but at the cost of increased latency for update operations during normal operation. Asynchronous replication could also be used, but automating the process of migrating workload to a secondary replica, when the replicas can potentially be lagging, while preserving data consistency requires careful system design and additional approaches. Moreover, even in a system with multiple replicas, Pythia can help select a candidate replica to migrate the workload to. In scenarios where the existing replicas are not suitable destinations (since they might become overloaded as a result of this workload migration), Pythia can also help select a viable destination where a new replica can be regenerated. Our decision to focus on migration limits the problem s scope. In the future, we plan to explore how Pythia can be extended to support a hybrid of workload and data migration Experimental Evaluation We deployed and evaluated Delphi on a cluster of 16 servers dedicated to database processes, six servers dedicated to generating client workloads (workers), and one server dedicated to hosting Delphi and a benchmark coordinator. The database servers run PostgreSQL 9.1 on CentOS 6.3, with two quad-core Xeon processors, 32 GB RAM, 1 GB Ethernet, and 3.7 TB striped RAID HDD array. Adhering to PostgreSQL best practice, the buffer pool size is configured to 8 GB, with the remaining memory dedicated to use by the OS page cache. Default settings are used for other configuration options. The following section describes benchmarks used for workload generation, the methodology used to generate various tenant workloads, a validation of the tenant and node models, and a detailed evaluation of Pythia Benchmark and Tenant Description Delphi targets multitenant environments that serve a wide variety of tenant types and workloads where the tenants often use their database for multiple purposes. Classical database benchmarks, such as the TPC suite, focus on testing the performance and limits of a single high performance DBMS dedicated either for transaction processing (TPC-C) or for data analysis (TPC-H). Existing benchmarks provide little support for evaluating the effects of colocating multiple 42
67 Experimental Evaluation Section 3.11 tenants, systematically generating workloads for large numbers of small tenants, generating correlated and uncorrelated workload patterns, or generating workloads that change or evolve with time. We therefore designed and implemented a custom framework to evaluate and benchmark multitenant DBMSs. Our multitenant benchmark is capable of generating a wide variety of workloads, such as lightweight tenants with minimal load and resource consumption, a mix of reporting and transactional workloads, and workloads that change behavior with time, thus emulating the variety of tenants a multitenant DBMS can host [7]. The benchmark is a distributed set of load generator workers orchestrated by a master. The core of the load generator comprises a set of configurable predefined workload types. Our current implementation supports the following workload classes: a light workload of short transactions, composed of 1-4 read and write operations; a lightweight market-based web application which tracks frequent clicks, reads products for browsing, places transactional orders, and reports on related items and popular ads; a time series database with a heavy insert workload and periodic reporting; a YCSB-like [22] workload on larger databases with 80% of operations, on 20% of the data; and a set of YCSB-like workloads with bounded random configurations. A tenant s workload is specified as one workload type, a database size varying between 100MB to 14 GB, and a vector of randomized configuration parameters, including throughput and number of client threads. Therefore a tenant s workload can potentially comprise multiple combinations of workload configurations in different time intervals. Using the randomized configuration, 350 tenants were generated and associated with a random id. Each tenant was run in isolation for a warm-up period of at least 30 minutes, and then latencies were measured over 10 minutes. To ensure tenants are amenable to consolidation with our latency SLOs, tenants with latency 95th percentile greater than 500 milliseconds (ms) or 99th percentile greater than 2000 ms are removed from the set of candidate tenants, which left 314 tenants. While this workload combination does not encapsulate the variety of workloads encountered in a real database platform, it is more robust than using a combination of heterogeneous workloads, such as only using YCSB or TPC-C for all tenants. The benchmark master receives a mapping of tenants-to-servers from Delphi, and executes all workloads by distributing workloads at the thread granularity in round-robin to all worker nodes. Worker nodes connect directly to database servers via JDBC. Periodically the master collects rolling logs of all latencies, tagged by operation type, workload class, and tenant ID. Implementing such a distributed load testing environment is necessary to generate enough client load to stress 16 database servers concurrently, aggregate usage statistics, and emulate a distributed tenant query router. 43
Model Evaluation Before describing the experimental evaluation of Pythia, we briefly revisit model generation and provide a basic validation that the proposed models capture the behavior that Pythia utilizes for placement decisions. Pythia's models must be trained before Delphi can leverage Pythia for managing the tenants. Training data is collected from the operational system and a human administrator provides rules to label the training data, which is then fed into Pythia to learn tenant and node models. A model's accuracy is computed as the percentage of occurrences where Pythia's predicted tenant or node label matches that provided by the administrator. To measure accuracy, we used cross-validation, where the labeled data is partitioned into a training set and a validation set; the models are trained on the training set and tested on the validation set. Accuracy of the tenant model was about 92% while that of the node model was about 86%. We also validated that when a node's resources are not thrashing, a tenant's class label is static if the tenant's workload is static. When resource capacity becomes constrained, due to migration or thrashing, we did observe that tenant labels fluctuate. Labeling based on a sliding window can reduce fluctuation, but fluctuations could result in a misclassification of a node and require additional iterations to resolve. Improving tenant interference modeling is left for future work. Delphi is an initial step in building a multitenant controller, and to focus the problem on tenant placement we currently do not factor the cost of migration into placement decisions. Therefore in all evaluations, when any new combination of colocated tenants is evaluated (a run), the framework runs a staggered half-hour warm-up period followed by a statistic reset. After warm-up, evaluations run for a given time period with snapshots recorded every five minutes and node statistics captured every 30 seconds. Tenant Model Evaluation In determining which generated tenants are amenable to the defined latency SLOs, all tenants are run in isolation on a database server. Here the tenants' performance features are labeled using the rules described in Section 3.7. The labeled data set serves as our initial training set for tenant models. To validate that the resource-based tenant models do capture relative resource consumption, we examine the tenant models when run in isolation. The models are labeled using only the database features captured by an agent, and the server's resource utilization is only used for validation. Figure 3.4 shows that our models are representative of actual resource usage, without direct monitoring. Disk activity is
Figure 3.4: Tenant model resource consumption when run in isolation. (a) Expected disk IOPS and percentage of CPU on I/O wait; (b) operation complexity and CPU utilization.

Disk activity is examined in Figure 3.4a, where we compare label buckets of small, medium, large, and extra-large against average disk IOPS and the maximum CPU cycles blocked on I/O (IO Max). As predicted, the disk component of a tenant's label corresponds to the average observed disk activity. Figure 3.4b shows that average operation complexity (the number of pages accessed per transaction) translates into CPU consumption. Our hypothesis is that high operation complexity indicates CPU intensive queries, including reporting queries that access many pages, complex join operations, or long running transactions that require concurrency validation. Here we show labels with small and large operation complexity against mean and max CPU cycles used on user processes, which is primarily composed of the database processes. The range for CPU cycles appears low, but these percentages are across 16 OS threads; servers with fewer cores would exhibit higher percentages.

To experiment with the robustness of the tenant model, we ran a TPC-C-like tenant without having any TPC-C workloads in the training set. We ran the workload with five warehouses and a throttled single terminal. As expected, the model labeled the tenant DM-TS-OL, as having medium disk access, low throughput, and large operational complexity due to complex transactions.
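To illustrate how administrator-provided rules might map observed tenant features to a composite label such as DM-TS-OL, consider the following sketch; the feature names and thresholds are invented for illustration and are not the rules actually used by Pythia.

    def label_tenant(features):
        """Map observed tenant features to a disk/throughput/op-complexity label.
        Thresholds below are illustrative placeholders, not Pythia's actual rules."""
        def bucket(value, small, large):
            if value < small:
                return "S"
            return "M" if value < large else "L"

        disk = bucket(features["pages_read_per_sec"], 50, 500)
        tput = bucket(features["transactions_per_sec"], 10, 100)
        opcx = bucket(features["pages_per_transaction"], 5, 50)
        return "D%s-T%s-O%s" % (disk, tput, opcx)   # e.g. "DM-TS-OL"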
Node Model Evaluation

In contrast to the tenant model, labeling the node training data can be substantially automated. The input to the node model is a vector of counts of the tenant models that are colocated on the node. The training requires observing many combinations of colocated tenant workloads. An administrator sets the parameters for determining a node's health by defining acceptable ranges of resource consumption, such as disk IOPS or CPU consumption, and the percentile-based latency response time SLOs. With the model parameters defined and the ability to run synthetic workloads, Delphi is able to automatically build the node model. Node models are valid for one type of hardware configuration.

Figure 3.5: Node model performance by label confidence. (a) Average tenant 95th percentile latency; (b) average server max CPU cycles blocked on I/O.

Figure 3.5 presents average resource consumption by the node's label and Pythia's confidence in the provided label. Figure 3.5a shows the average tenant 95th percentile latency, and Figure 3.5b the average maximum percentage of CPU cycles blocked on I/O (IO Wait). The distribution for CPU utilization is similar to I/O, but with a sharp plateau for over labels; the corresponding graph is omitted for space. These results demonstrate that as Pythia becomes more confident about a predicted node label, the results trend towards expected behavior. For example, as a node label increases in confidence of being over, we observe that latencies, CPU utilization, and cycles blocked on I/O spike. As a node label becomes more confidently under, latencies, CPU utilization, and blocked I/O cycles decrease. These results imply that Pythia is able to predict the expected resource consumption for tenants. These figures include data collected from the crisis mitigation experiments described next.
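The following sketch shows one plausible way to assemble the node model's input, a vector of counts of colocated tenant labels, and to attach a prediction confidence. The label vocabulary and the predict_proba/classes_ interface on the classifier are assumptions for illustration, not the actual implementation.

    from collections import Counter

    def node_feature_vector(tenant_labels, vocabulary):
        """Count how many colocated tenants carry each tenant label."""
        counts = Counter(tenant_labels)           # e.g. {"DM-TS-OL": 2, "DS-TS-OS": 1}
        return [counts.get(label, 0) for label in vocabulary]

    def classify_node(model, tenant_labels, vocabulary):
        """Return (label, confidence) for a node, assuming a probabilistic
        classifier over the classes good/over/under (an illustrative interface)."""
        x = node_feature_vector(tenant_labels, vocabulary)
        probabilities = model.predict_proba([x])[0]
        best = max(range(len(probabilities)), key=lambda i: probabilities[i])
        return model.classes_[best], probabilities[best]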
Crisis Mitigation

To evaluate Pythia's effectiveness in mitigating a performance crisis, we provide a random tenant packing with a set of nodes in violation, and initiate load-balancing to resolve the crisis. We then iteratively add new tenants to the system, which can result in new tenant violations. If any step does not contain a violation, we continue to add tenants to trigger a violation. The process is repeated until a violation cannot be resolved.

We compare Pythia to a greedy baseline load-balancing algorithm, Hottest-to-Coldest (HtoC). HtoC is modeled on the greedy load-balancing baseline used in the evaluation of the large-scale storage system Ursa [84]. This algorithm attempts to iteratively balance load by moving tenants from over-loaded (hot) nodes to under-loaded (cold) nodes. Faced with a violation, non-violating nodes are inserted into a queue of possible destinations for violating tenants. The queue is sorted in descending order of excess CPU capacity (idle CPU). Idle CPU capacity is a natural single metric to use for resource capacity, as non-idle cycles include cycles for the database process, kernel usage, and CPU cycles blocked on I/O. HtoC iterates through violating nodes by lowest idle cycles, and migrates one random tenant to a node removed from the head of the destination queue. This process repeats until a solution is found, or until a maximum iteration count is reached and no solution is found. We compared moving a random tenant with moving all violating tenants, and found that moving a random tenant resulted in fewer violations, lower average latencies, and crises resolved with fewer iterations.
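As a rough sketch of this baseline, the following code captures the HtoC loop described above. The Node class and its fields are a minimal stand-in for the monitored server state, and the re-measurement of statistics between iterations is omitted; this is an illustration of the greedy policy, not the actual implementation.

    import random

    class Node:
        """Minimal stand-in for a database server's monitored state (illustrative)."""
        def __init__(self, tenants, idle_cpu, in_violation):
            self.tenants = tenants            # list of tenant ids
            self.idle_cpu = idle_cpu          # percentage of idle CPU cycles
            self.in_violation = in_violation  # True if any hosted tenant violates its SLO

    def hottest_to_coldest(nodes, max_iterations=6):
        """Greedy baseline: repeatedly move one random tenant from the hottest
        violating node to the coldest (most idle) non-violating node."""
        for _ in range(max_iterations):
            violating = [n for n in nodes if n.in_violation]
            if not violating:
                return True                                   # crisis resolved
            destinations = sorted([n for n in nodes if not n.in_violation],
                                  key=lambda n: n.idle_cpu, reverse=True)
            for node in sorted(violating, key=lambda n: n.idle_cpu):
                if not destinations:
                    break
                tenant = random.choice(node.tenants)
                dest = destinations.pop(0)
                node.tenants.remove(tenant)
                dest.tenants.append(tenant)
            # In the real system, statistics would be re-measured after a warm-up
            # period before the next iteration; this sketch omits that step.
        return not any(n.in_violation for n in nodes)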
For this evaluation we initially assign a small uniform number (two or three) of tenants to all servers and iteratively run the following steps. Initially, all tenants are warmed up for thirty minutes, agent statistics are reset, and a snapshot is collected after running all tenants for five minutes. Delphi then checks whether any node is experiencing a performance crisis, i.e., whether any tenant is violating its performance SLOs. If no node is in violation, we distribute one new random tenant per server and repeat, starting with a new warm up. If any node is in violation, we attempt to mitigate the crisis by balancing the load through tenant migration. After the tenant repacking is executed, a snapshot is measured after warm up. This rebalancing is allowed to repeat for six iterations; if the re-packing cannot converge by then, the experiment interval ends. We alternate complete incremental packing runs between Pythia and our greedy baseline HtoC, giving both algorithms an identical list of tenants to use. We start with a low number of initial tenants, and allow each load-balancing algorithm to pack tenants incrementally, to avoid using an arbitrarily dense starting point that may favor one solution. This also allows both algorithms to be evaluated in light and dense tenant packings.

Because Pythia is more judicious with tenant packings, we can also use this experiment to evaluate the ability to pack, or consolidate, tenants. Throughout all of these experiments, HtoC was never able to pack more tenants than Pythia. On average, Pythia was able to successfully pack 71 tenants, a 45% improvement over HtoC's 49 tenants. The maximum number of tenants packed using Pythia was 80, and 64 for HtoC. We expect a larger number of tenants could have been packed for both algorithms if smaller tenants were used, durability settings were relaxed, or an array of SSDs were used. On successful load-balances, Pythia converged after 1.75 iterations on average, whereas HtoC resolved after 2.25 iterations. Pythia migrated an average of 5.25 tenants per round, and HtoC migrated 2.25 tenants per round. One reason for the larger number of migrated tenants is that Pythia would often shuffle tenants between non-violating nodes in order to free capacity on ideal destinations for tenants from violating nodes.

Figure 3.6 demonstrates Pythia's and HtoC's ability to mitigate a crisis by examining the impact of load balancing on tenant latency and resource consumption. The data here is captured from all incremental growth rounds that are successfully load-balanced. Figure 3.6a shows the before and after median latencies of tenants hosted on nodes experiencing a performance crisis. Figure 3.6d shows the same data, but only the decrease in median latencies from load-balancing, instead of the before and after latencies. It is important to note that, while HtoC and Pythia exhibit comparable performance gains for mean and 95th percentile latencies, Pythia decreases the 99th percentile latency by about 50% more, despite hosting 45% more tenants on average. Figures 3.6b and 3.6e show the impact of load-balancing on nodes not experiencing an SLO violation. As expected, Pythia has a larger impact on latency due to the increased number of tenants migrated from violating nodes. For both approaches, the increase in latency is small compared to the decrease in latency of violating nodes. Again, we assume that any database hosted in a multitenant environment can tolerate some variance in latency, provided latency SLOs are met. While violating latencies are similar for both approaches, Figure 3.6c shows that HtoC's resource consumption on violating nodes is substantially worse than Pythia's. Additionally, the resolved state for Pythia has lower CPU usage than non-violating nodes (not depicted) by one percentage point on average.

A goal of load-balancing with Pythia is to implicitly consider cache impedance when matching tenants, in order to provide tenants with adequate cache access to meet latency SLOs. Figure 3.6f compares the relative differences in resource attributes between the resolved states of Pythia and HtoC. Pythia's resolved state results in a higher average tenant cache hit ratio and a higher percentage of the database that is in the page cache and buffer pool (cache coverage), which results in reduced disk activity for the tenants. An improvement of 13% on cache coverage and 8% on cache hit ratio alleviates a substantial number of disk seeks, thus reducing disk contention between tenants. Interestingly, after Pythia mitigates a performance crisis, the tenants remaining on previously violating nodes have a higher cache hit ratio, a smaller average buffer size, and a lower variation in buffer size, when compared with the resolved HtoC nodes. The higher cache hit ratio combined with smaller buffer size suggests that tenants with smaller working sets remain together. The reduced variation, measured by standard deviation, means
that the buffer size is more uniform, each tenant is getting a relatively equal share of the buffer pool, and the cache is not cannibalized by dominant tenants. We therefore conclude that Pythia is matching tenants' cache impedance when selecting the ideal packing to resolve a crisis.

To gather additional insight into the packing limits of both algorithms, we selected two successful packings of 64 tenants to grow at a smaller rate. We repeatedly grew these packings for both algorithms, by adding one tenant to four random nodes in each growth round. Pythia was able to successfully pack 72, 76, and 80 tenants; HtoC could not successfully pack beyond 68 tenants. Figure 3.7 shows latency distributions by total tenant count, for all successful growth rounds of both experiments described in this section. The boxplots show sampled minimum, lower quartile, median, upper quartile, and sampled maximum percentile latencies. The 95th percentile latencies for HtoC are very similar to Pythia's, so this graph is omitted. As we can see from Figure 3.7, the 99th percentile latencies are a primary driver for SLO violations, for both Pythia and HtoC. The sampled maximum and upper quartile latencies for HtoC rise much faster than Pythia's, resulting in violations with fewer tenants. Our belief is that Pythia is optimizing packings for the 99th percentile, as most packing violations are violations of the 99th percentile; this is a likely reason why this latency category shows the biggest gains for Pythia. Experimenting with a relaxed 99th percentile latency SLO of 10,000 ms and a 95th percentile SLO of 500 ms, Pythia successfully packed 88 tenants, while HtoC could still not pack more than 68 tenants due to 95th percentile violations.

3.12 Summary

Multitenant DBMSs consolidate large numbers of tenants with unpredictable and dynamic behavior. Designing a self-managing controller for such a system faces multiple challenges, such as characterizing tenants, reducing the impact of colocation, adapting to changes in behavior, and detecting and mitigating a performance crisis. The complex interplay among the tenants, the DBMS, and the OS, as well as aggregated resource consumption measures, makes the task of monitoring and load balancing difficult. We designed and implemented Delphi, a self-managing controller for a multitenant DBMS that monitors and models tenant behavior, ensures latency SLOs, and mitigates performance crises without requiring major modifications to existing systems. Delphi leverages Pythia, a technique to classify tenant behavior and learn good tenant packings. Pythia does not make assumptions about the tenant workloads or the underlying DBMS and OS implementations. Our analysis revealed unexpected interactions arising from tenant colocation and identified the tenant behaviors that are most sensitive to resource starvation.
Our experiments, using a variety of tenant types and workloads, demonstrated that Pythia can learn a tenant's behavior with more than 92% accuracy and learn the quality of packings with more than 86% accuracy. Using Pythia, Delphi can mitigate a performance crisis by selectively migrating tenants to improve 99th percentile response times by 80%.
Figure 3.6: Comparing improvements to nodes in violation, and the impact on nodes not in violation. (a) Median latency for nodes experiencing SLO violations, before and after load-balancing. (b) Median latency for nodes not initially experiencing SLO violations, before and after load-balancing. (c) Change in CPU utilization for nodes in violation. (d) Median latency decrease after crisis mitigation for nodes experiencing a crisis. (e) Median latency increase after crisis mitigation for nodes not initially in crisis. (f) Relative differences from HtoC's to Pythia's resolved state.
Figure 3.7: Tenant latencies by platform total tenant count. (a) Pythia 95th percentile; (b) Pythia 99th percentile; (c) HtoC 99th percentile.
Part II

Movement Primitives
Chapter 4

Forms of Database Migration

In the midst of movement and chaos, keep stillness inside of you. — Deepak Chopra

The unpredictable usage patterns of the tenants in a multitenant DBMS mandate the need for elasticity. Migration is a key component for elasticity and load balancing, and hence migration should be supported as a first-class notion in any multitenant DBMS. We now classify the forms of migration and identify state-of-the-art migration techniques. With this understanding, we propose a classification of migration techniques along with a set of metrics to compare the proposed forms. Downtime is the time a cell may be unavailable during migration. Interruption of service is the number of in-flight transactions of a tenant that fail during migration due to loss of transaction state or failure to meet the transactional requirements. Required coordination refers to the extent of coordination needed to initiate as well as complete the migration. Note that in a distributed autonomic system, a component within the DBMS should coordinate migration, i.e. determine when to migrate as well as the source and destination machines, and the cells to migrate. The migration overhead is the system overhead or performance penalty incurred during migration. The abstract form definitions below identify
the goals of migration and are independent of any multitenancy model. Table 4.1 summarizes these forms of migration and compares their relative costs.

Table 4.1: Summary of the forms of migration and the associated costs.

    Form of Migration | Downtime | Interruption of Service | External Coordination | Migration Overhead
    Asynchronous      | Moderate | Moderate                | High                  | High
    Synchronous       | Minimal  | Minimal                 | Moderate              | Moderate
    Live              | None     | Minimal                 | Minimal               | Minimal

4.1 Asynchronous migration

Asynchronous migration is an immediate, blocking migration which relies on a coordinating process to copy the cell from a source host to a destination host. (Here, blocking and non-blocking refer to potential blocking of client database calls, and not the internal implementation used to achieve the migration.) The blocking stems from disabling the source during the copy to ensure consistency, resulting in a period of downtime. This migration is immediate due to a prompt migration upon initiation. A naive implementation is to stop the database process and copy the database between nodes. Copying can be performed either by a file copy or via a backup and restore process. To minimize impact, a database could be flushed and set to read-only to allow some operations during migration. A stale replica (maintained by lazy replication) can be leveraged for migration; here a coordinator process disables the source to replay final updates at the destination. Once the migration has completed, the coordinator redirects traffic to the destination. As the coordinator has more control over the migration initialization, this form works well for large cells with regular periods of inactivity.

4.2 Synchronous migration

Synchronous migration is an eventual, non-blocking migration where a source and destination operate as a tightly coupled cluster. This requires the destination to act as an eager replica of the source, where updates must synchronously occur at the source and destination. If the destination host does not have an up-to-date replica of the cell, the source and destination hosts are configured to run as a synchronized cluster, and the destination gradually acquires a synchronized state by replaying writes that were performed on the source DBMS. Once a stable state is reached, the coordinating process notifies the source host to stop serving the cell, and all future connections are sent to the destination host. Many popular RDBMSs have the ability to run in a master-slave mode in order to efficiently replicate data across hosts in a cluster. Synchronous migration can be achieved using a method proposed by Yang et al. [87] which uses two-phase commit and a read one/write all master-slave mode.
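As a rough illustration of the synchronous form, the following sketch shows a coordinator that waits for an eager replica at the destination to catch up and then performs a brief cutover; the helper methods (start_eager_replication, replication_lag, set_read_only, redirect) are assumed abstractions and not the API of any particular DBMS.

    import time

    def synchronous_migrate(cell, source, destination, router, lag_threshold=0):
        """Sketch of synchronous migration: the destination runs as an eager
        replica, then the coordinator cuts over once it is fully caught up."""
        source.start_eager_replication(cell, destination)  # assumed: configure clustering
        while destination.replication_lag(cell) > lag_threshold:
            time.sleep(1)                                   # destination catching up
        source.set_read_only(cell)                          # brief window: block new writes
        destination.apply_pending_updates(cell)             # drain any in-flight changes
        router.redirect(cell, destination)                  # new connections go to destination
        source.stop_serving(cell)

The short read-only window during the cutover corresponds to the minimal downtime and interruption of service attributed to this form in Table 4.1.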
Even though many DBMSs support a clustered mode off the shelf, changing from lazy replication to synchronized, or eager, replication often requires short periods of downtime to change server states. Synchronous migration is eventual due to the synchronization period required to complete migration. A minimal amount of downtime and interruption of service may occur while switching the primary master to the destination. The minimal operational overhead originates from the hosts needing to run in a mode which is ready for clustering. The coordinator is responsible for redirecting client connections to the destination host for cells.

4.3 Live migration

Live migration is an immediate, non-blocking migration of a cell from a source host directly to a destination host with no downtime and minimal interruption of service. All client connections are migrated without the need to reconnect. To initiate migration, a coordinating process simply notifies the source host of the destination and relies on the live migration process to independently manage itself. Several existing techniques can be utilized for database migration. VM migration has been thoroughly researched and provides an effective means for live migration of a VM without interrupting processes [21, 55, 17]. If virtual machines are used for tenant isolation, live virtual machine migration can be leveraged for quick database migration with minimal interruption of service. We were able to migrate a running 1 GB TPC-C database in less than 20 seconds on average, with only a 5-10% increase in response time due to the VM overhead. However, this ease of migration and tenant isolation comes at the cost of increased overhead and limited consolidation due to duplicated OS and DB processes [28]. To allow more tenants to be consolidated at a single node, multiple cells must share the same database process and VM. In this case, VM migration does not allow fine-grained load balancing of cells, and all cells contained in a VM must then be migrated together.

Recent research has explored implementing live database migration cognizant of the semantics of the database process. We have proposed Zephyr [37], a technique to migrate a cell in a shared nothing database architecture with no downtime. Zephyr uses a synchronized dual mode where both the source and the destination nodes concurrently execute transactions on the cell; the source completes execution of the transactions that were active at the start of migration, while the destination executes new transactions. As the first step of migration, Zephyr copies a wireframe of the database to the destination node. This wireframe consists of the minimal information needed for the destination to start executing
transactions but does not include the actual application data stored in the cell. The wireframe includes database metadata to authenticate new connections and meta information about the tables and indices. For a database using B+-trees, the wireframe includes only the internal nodes of the tree; the leaf nodes containing the actual data are replaced by sentinels at the destination. Zephyr does not allow structural changes to the indices during migration. Once resources for the cell have been initialized, the destination starts executing new transactions while the source continues executing transactions that were active at the start of migration. Pages are pulled by the destination as transactions at the destination access them. Transactions may be aborted at the source when they access a page that has already been migrated, and at both nodes when they result in structural changes to the indices. Once transactions at the source complete, migration completes by pushing pages to the destination.

We have also proposed Albatross [33], a technique to migrate a cell in shared storage architectures with no aborted transactions and minimal performance impact. In a shared storage architecture, the persistent image of a cell is stored in a network addressable storage abstraction and hence does not need migration. Albatross focuses on migrating the database cache and the state of active transactions. In Albatross, the source takes a quick snapshot of a cell's cache and the destination warms up its cache with this snapshot. While the destination initializes its cache, the source continues executing transactions. The destination therefore lags the source. Albatross uses an iterative phase where changes made to the source node's cache are iteratively copied to the destination. When the same amount of data is being copied in consecutive iterations or a maximum number of iterations is reached, transactions are blocked at the source and an atomic handover completes migration. The state of active transactions is copied in the final handover phase to allow them to resume execution at the destination, which already has a warmed cache.

Live migration is the ideal candidate for database migration and is the hardest to implement. Asynchronous migration is at the other end of the spectrum and is the baseline form of migration in system implementations not designed for migration, while synchronous migration strikes a middle ground. Ideally, an autonomic DBMS is aware of a cell's service level agreement, and can leverage a migration form that minimizes the impact on performance. The choice also depends on the multitenancy model. At one extreme is the shared hardware model which uses virtualization to multiplex multiple VMs on the same machine with strong isolation. Each VM has only a single database process with the database of a single tenant. At the other extreme is the shared table model which stores multiple tenants' data on shared tables with the finest level of isolation. In the different models, tenants'
data is stored in various forms. For the shared machine model, an entire VM corresponds to a tenant, while for the shared table model, a set of rows in a table corresponds to a tenant. Thus, the association of a tenant to a database can be more than just the data for the client, and can include metadata or even the execution state. As the level of isolation moves away from shared hardware (row 1), the difficulty of cell migration increases; this is due to an increase in shared components, such as transaction managers, buffer pools, etc., which need to have a cell partitioned, or isolated, in order to migrate a tenant without interrupting co-located tenants. With this understanding of the models and the abstraction corresponding to tenants, we now delve into analyzing the interplay of the different forms of multitenancy and the cloud paradigms.
Chapter 5

Zephyr

May the wind always be at your back. — Irish Blessing

The increasing popularity of service oriented computing has seen hundreds of thousands of applications being deployed on various cloud platforms [40]. The sheer scale of the number of application databases, or tenants, and their small footprint (both in terms of size and load) mandate a shared infrastructure to minimize the operating cost [87, 50, 81, 30]. These applications often have unpredictable load patterns, such as flash crowds originating from sudden and viral popularity, resulting in the tenants' resource requirements changing with little notice. Load balancing is therefore an important feature to minimize the impact of a heavily loaded tenant on the other co-located tenants. Furthermore, a platform deployed on a pay-per-use infrastructure (like Amazon EC2) provides the potential to minimize the system's operating cost. Elasticity, i.e. the ability to scale up to deal with high load while scaling down in periods of low load, is a critical feature to minimize the operating cost. Elastic load balancing is therefore a first class feature in the design of modern database management systems for the cloud [30, 31], and requires a low cost technique to migrate tenants between hosts, a feature referred to as live migration [21, 55]. (Our use of the term migration here is different from migration between different database versions or schemas.)

Our focus is the problem of live migration in the database layer supporting a multitenant cloud platform where the service provider manages the applications' databases. Force.com, Microsoft Azure, and Google AppEngine are examples of such multitenant cloud platforms. Even though a number of techniques are prevalent to scale the DBMS layer, elasticity is often ignored, primarily due to static
infrastructure provisioning. In a multitenant platform built on an infrastructure as a service (IaaS) abstraction, elastic scaling allows minimizing the system's operating cost by leveraging the pay-per-use pricing. Most current DBMSs, however, only support heavyweight techniques for elastic scale-up, where adding new nodes requires manual intervention or long service disruption to migrate a tenant's database to these newly added nodes. Therefore, to enable lightweight elasticity as a first class notion, live migration is a critical functionality.

We present Zephyr, a technique for live migration in a shared nothing transactional database. (Zephyr, meaning a gentle breeze, is symbolic of the lightweight nature of the proposed technique.) Das et al. [32] proposed a solution for live database migration in a shared storage architecture, while Curino et al. [27] outlined a possible solution for live migration in a shared nothing architecture. Zephyr is the first complete solution for live migration in a shared nothing database architecture. Zephyr minimizes service interruption for the tenant being migrated by introducing a synchronized dual mode that allows both the source and destination to simultaneously execute transactions for the tenant. Migration starts with the transfer of the tenant's metadata to the destination, which can then start serving new transactions, while the source completes the transactions that were active when migration started. Read/write access (called ownership) on database pages of the tenant is partitioned between the two nodes, with the source node owning all pages at the start and the destination acquiring page ownership on-demand as transactions at the destination access those pages. The index structures are replicated at the source and destination and are immutable during migration. Lightweight synchronization between the source and the destination, only during the short dual mode, guarantees serializability, while obviating the need for two phase commit [43]. Once the source node completes execution of all active transactions, migration completes with the ownership transfer of all database pages owned by the source to the destination node. Zephyr thus allows migration of individual tenant databases that share a database process at a node and where live VM migration [21] cannot be used. Zephyr guarantees no service disruption for other tenants, no system downtime, minimizes data transferred between the nodes, guarantees safe migration in the presence of failures, and ensures the strongest level of transaction isolation. Zephyr uses standard tree based indices and lock based concurrency control, thus allowing it to be used in a variety of DBMS implementations. Zephyr does not rely on replication in the database layer, thus providing greater flexibility in selecting the destination for migration, which might or might not have the tenant's replica.
However, considerable performance improvement is possible in the presence of replication when a tenant is migrated to one of the replicas.

We implemented Zephyr in an open source RDBMS. Our evaluation using a variety of transactional workloads shows that Zephyr results in only a few tens of failed operations, compared to hundreds to thousands of failed transactions when using a simple heavyweight migration technique. Zephyr results in no operational overhead during normal operation, minimal messaging overhead during migration, and between a 10-20% increase in average transaction latency compared to an execution where no migration was performed. These results demonstrate the lightweight nature of Zephyr, allowing live migration with minimal service interruption.

5.1 Background

System Architecture

We use a standard shared nothing database model for transaction processing (OLTP) systems executing short running transactions, with a two phase locking [39] based scheduler, and a page based model with a B+ tree index [16]. Figure 3.2 provides an overview of the architecture. The following are the salient features of the system. First, clients connect to the database through query routers that handle client connections and hide the physical location of the tenant's database. Routers store this mapping as metadata, which is updated whenever there is a migration. Second, we use the shared process multitenancy model, which strikes a balance between isolation and scale. Conceptually, each tenant has its own transaction manager and buffer pool. However, since most current systems do not support this, we use a design where co-located tenants share all resources within a database instance, but the system is shared nothing across nodes. Finally, there exists a system controller that determines the tenant to be migrated, the initiation time, and the destination of migration. The system controller gathers usage statistics and builds a model to optimize the system's operating cost while guaranteeing the tenants' SLAs. Pythia, as presented in this dissertation, is such a controller.

Migration Cost

The goal of any migration technique is to minimize migration cost. Das et al. [32] discuss some measures to quantify the cost of migration. Low migration cost allows the system controller to effectively use it for elastic load balancing.
Service interruption: Live migration must ensure minimal service interruption for the tenant being migrated and should not result in downtime. (A longer interruption might result in a penalty; for instance, in platforms like Windows Azure, service availability below 99.9% results in a penalty: windowsazure/sla/.) We use downtime for an entire system outage and service interruption for a small interruption in service for some tenants. The number of transactions or operations aborted during migration is a measure of service interruption and is used to determine the impact of migration on the tenant's SLA.

Migration Overhead: Migration overhead is the additional work done or resources consumed to enable and perform migration. This cost also includes the performance impact as a result of migration, such as an increase in transaction latency or a reduction in throughput. This comprises:

- Overhead during normal operation: Additional work done during normal database operation to enable migration.
- Overhead during migration: Performance impact on the tenant being migrated as well as other tenants co-located at the source or destination of migration.
- Overhead after migration: Performance impact on transactions executing at the destination node after migration.

Additional data transferred: Since the source and destination of migration do not share storage, the persistent image of the database must be moved from the source to the destination. This measure accounts for any data transfer that migration incurs, in addition to transferring the persistent database image.

Known Migration Techniques

Most enterprise database infrastructures are statically provisioned for the peak capacity. Migrating tenants on-demand for elasticity is therefore not a common operation. As a result, live migration is not a feature supported off-the-shelf by most database systems, resulting in the use of heavyweight techniques. We now discuss two known techniques for database migration.

Stop and copy: This is the simplest and arguably most heavy-handed approach to migrate a database. In this technique, the system stops serving updates for the tenant, checkpoints the state, moves the persistent image, and restarts the tenant at the destination. This technique incurs a long service interruption and a high post migration penalty to warm up the cache at the destination. The advantage is
its simplicity and efficiency in terms of minimizing the amount of data transferred. However inefficient this technique might be, this is the only technique available in many current database systems (including RDBMSs like MySQL and key-value stores such as HBase) to migrate a tenant to a node which is not already running a replica.

Iterative State Replication: The long unavailability of stop and copy arises due to the time taken to create the checkpoint and to copy it to the destination. An optimization, Iterative State Replication (ISR), is to use an iterative approach, similar to [32], where the checkpoint is created and iteratively copied. The source checkpoints the tenant's database and starts migrating the checkpoint to the destination, while it continues serving requests. While the destination loads the checkpoint, the source maintains the differential changes, which are iteratively copied until the amount of change to be transferred is small enough or a maximum iteration count is reached. At this point, a final stop and copy is performed. The iterative copy can be performed using either page level copying or shipping the transaction log and replaying it at the destination.

Consider applications such as shopping cart management or online games such as Farmville that represent workloads with a high percentage of reads followed by updates, and that require high availability for continued customer satisfaction. In ISR, the tenant's database is unavailable for updates during the final stop phase. Even though the system can potentially serve read-only transactions during this window, all transactions with at least one update will be aborted during this small window. On the other hand, Zephyr does not render the tenant unavailable, by allowing concurrent transaction execution at both the source and the destination. However, during migration, Zephyr will abort a transaction in two cases: (i) if at the source it accesses an already migrated page, or (ii) if at either node it issues an update operation that modifies the index structures. Hence, Zephyr may abort a fraction of update transactions during migration. The exact impact of either technique on transaction execution will depend on the workload and other tenant characteristics, and needs to be evaluated experimentally.

The iterative copying of differential updates in ISR can lead to more data being transferred during migration, especially for update heavy workloads that result in more changes to the database state. Zephyr, on the other hand, migrates a database page only once and hence is expected to have lower data transfer overhead. Since ISR creates multiple checkpoints during migration, it will result in higher disk I/O at the source. Therefore, when migrating a tenant from a heavily loaded source node, this additional disk I/O can result in significant impact on co-located tenants which are potentially already disk I/O limited due to increased load. However, due to the log replay, the destination will start with a warm cache and hence will minimize the post migration overhead. On the other hand, Zephyr does not incur additional disk I/O at the source due to checkpointing, but the cold start at the destination results in higher post migration overhead and more I/O at the destination. Therefore, Zephyr results in less overhead at the source and is suitable for scale-out scenarios where the source is already heavily loaded, while ISR is attractive for consolidation during scale-down, where it will result in lower impact on tenants co-located at the destination. Finally, since ISR creates a replica of the tenant's state at another node, it can iteratively copy the updates to multiple nodes, thus creating replicas on the fly during migration. Zephyr however does not allow for this easy extension. It is therefore evident that ISR and Zephyr are both viable techniques for live database migration; a detailed experimental comparison between the two is left for future work. This chapter focuses on Zephyr since it is expected to have minimal service interruption, which is critical to ensure high tenant availability.
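As a rough sketch, the iterative copy loop at the heart of ISR can be summarized as follows; the checkpoint, diff, and handover interfaces are assumed abstractions for illustration, not the API of any particular system.

    def iterative_state_replication(cell, source, destination,
                                    max_iterations=10, small_enough=1024 * 1024):
        """Sketch of ISR: copy a checkpoint, iteratively ship differential
        changes until they are small, then finish with a brief stop-and-copy."""
        checkpoint = source.checkpoint(cell)            # assumed: create initial checkpoint
        destination.load(cell, checkpoint)
        previous = None
        for _ in range(max_iterations):
            diff = source.changes_since_last_copy(cell)  # pages or log records (assumed)
            destination.apply(cell, diff)
            if len(diff) <= small_enough or (previous is not None and len(diff) >= previous):
                break                                    # converged, or no longer shrinking
            previous = len(diff)
        source.stop_updates(cell)                        # short final stop phase
        destination.apply(cell, source.changes_since_last_copy(cell))
        hand_over(cell, source, destination)             # assumed: atomic switch of ownership

The same loop structure underlies the cache-copy phase of Albatross described in Chapter 4, with the database cache taking the place of the checkpointed state.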
5.2 Zephyr Design

In this section, we provide an overview of Zephyr, using some simplifying assumptions to ease presentation. We assume no failures, small tenants limited to a single node in the system, and no replication. Furthermore, the index structures are made immutable during migration. Failure handling and correctness are discussed in Section 5.3, while an extended design relaxing these assumptions is described in Section 5.4. The notational conventions used are summarized in Table 5.1.

Table 5.1: Notational conventions.

    D_M          The tenant database being migrated
    N_S          Source node for D_M
    N_D          Destination node for D_M
    T_Si, T_Di   Transactions executing at nodes N_S and N_D, respectively
    P_k          Database page k
Figure 5.1: Timeline for different phases during migration. Vertical lines correspond to the nodes, the broken arrows represent control messages, and the thick solid arrows represent data transfer. Time progresses from top to bottom.

Design Overview

Zephyr's main design goal is to minimize the service interruption resulting from migrating a tenant's database (D_M). Zephyr does not incur a stop phase where D_M is unavailable for executing updates; it uses a sequence of three modes to allow the migration of D_M while transactions are executing on it. During normal operation (called the Normal Mode), N_S is the node serving D_M and executing all transactions T_S1, ..., T_Sk on D_M. A node that has the rights to execute update transactions on D_M is called an owner of D_M. Once the system controller determines the destination for migration (N_D), it notifies N_S, which initiates migration to N_D.

Figure 5.1 shows the timeline of this migration algorithm and the control and data messages exchanged between the nodes. As time progresses from the top to the bottom, Figure 5.1 shows the progress of the different migration modes, starting from the Init Mode which initiates migration, the Dual Mode where both N_S and N_D share the ownership of D_M and simultaneously execute transactions on D_M, and the Finish Mode which is the last step of migration before N_D assumes full ownership of D_M. Figure 5.2 shows the transition of D_M's data through the three migration modes, depicted using ownership of database pages and executing transactions.

Init Mode: In the Init Mode, N_S bootstraps N_D by sending the minimal information (the wireframe of D_M) such that N_D can execute transactions on D_M.
Figure 5.2: Ownership transfer of the database pages during migration. (a) Dual mode; (b) finish mode. P_i represents a database page, and a white box around P_i indicates that the node currently owns the page.

The wireframe consists of the schema and data definitions of D_M, index structures, and user authentication information. The indices migrated include the internal nodes of the clustered index storing the database and all secondary indices. Non-indexed attributes are accessed through the clustered index. In this mode, N_S is still the unique owner of D_M and executes transactions (T_S1, ..., T_Sk) without synchronizing with any other node. Therefore, there is no service interruption for D_M while N_D initializes the necessary resources for D_M.

We assume a B+ tree index, where the internal nodes of the index contain only the keys while the actual data pages are in the leaves. The wireframe therefore only includes these internal nodes of the indices for the database tables. Figure 5.3 illustrates this, where the part of the tree enclosed in a rectangular box is the index wireframe. At N_S, the wireframe is constructed with minimal impact on concurrent operations using shared multi-granularity intention locks on the indices. When N_D receives the wireframe, it has D_M's metadata, but the data is still owned by N_S. Since migration involves a gradual transfer of page level ownership, both N_S and N_D must maintain a list of owned pages. We use the B+ tree index for tracking page ownership. A valid pointer to a database page implies unique page ownership, while a sentinel value (NULL) indicates a missing page. In the init mode, N_D therefore initializes all the pointers to the leaf nodes of the index to the sentinel value. Once N_D completes initialization of D_M, it notifies N_S, which then initiates the transition to the dual mode. N_S then executes the Atomic Handover protocol, which notifies the query router to direct all new transactions to N_D.

Dual Mode: In the dual mode, both N_S and N_D execute transactions on D_M, and database pages are migrated to N_D on-demand. All new transactions (T_D1, ..., T_Dm) arrive at N_D, while N_S continues executing transactions that were
active at the start of this mode (T_Sk+1, ..., T_Sl). Since N_S and N_D share ownership of D_M, they synchronize to ensure transaction correctness. Zephyr, however, requires minimal synchronization between these nodes.

Figure 5.3: B+ tree index structure with page ownership information. A sentinel marks missing pages. An allocated database page without ownership is represented as a grayed page.

At N_S, transactions execute normally using local index and page level locking, until a transaction T_Sj accesses a page P_j which has already been migrated. In our simplistic design, a database page is migrated only once. Therefore, such an access fails and the transaction is aborted. When a transaction T_Di executing at N_D accesses a page P_i that is not owned by N_D, it pulls P_i from N_S on demand (the pull phase shown in Figure 5.2a); this pull request is serviced only if P_i is not locked at N_S, otherwise the request is blocked. As the pages are migrated, both N_S and N_D update their ownership mapping. Once N_D receives P_i, it proceeds to execute T_Di. Apart from fetching missing pages from N_S, transactions at N_S and N_D do not need to synchronize. Due to our assumption that the index structure cannot change at N_S, local locking of the index structure and pages is enough. This ensures minimal synchronization between N_S and N_D only during this short dual mode, while ensuring serializable transaction execution. When N_S has finished executing all transactions T_Sk+1, ..., T_Sl that were active at the start of the dual mode (i.e. T(N_S) = φ), it initiates transfer of exclusive ownership to N_D. This transfer is achieved through a handshake between N_S and N_D, after which both nodes enter the finish mode for D_M.
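A minimal sketch of the on-demand ownership transfer in the dual mode is shown below. The index, locking, and networking details are abstracted away; the dictionary-based "index" and the function names are illustrative stand-ins rather than the actual implementation.

    MIGRATED = "sentinel"   # stands in for the NULL leaf pointer used for missing pages

    def access_page_at_destination(index_ND, page_id, fetch_from_source):
        """N_D side: if the leaf pointer is still a sentinel, pull the page from
        N_S and take ownership; otherwise use the locally owned page."""
        page = index_ND.get(page_id)
        if page is MIGRATED or page is None:
            page = fetch_from_source(page_id)   # blocked at N_S while the page is locked there
            index_ND[page_id] = page            # N_D now owns the page
        return page

    def access_page_at_source(index_NS, page_id):
        """N_S side: a page is migrated only once; accessing a migrated page
        aborts the transaction that requested it."""
        page = index_NS.get(page_id)
        if page is MIGRATED:
            raise RuntimeError("abort: page %s already migrated to N_D" % page_id)
        return page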
Finish Mode: In the finish mode, N_D is the only node executing transactions on D_M (T_Dm+1, ..., T_Dn), but does not yet have ownership of all the database pages (Figure 5.2b). In this phase, N_S pushes the remaining database pages to N_D. While the pages are migrated from N_S, if a transaction T_Di accesses a page that is not yet owned by N_D, the page is requested as a pull from N_S in a way similar to that in the dual mode. Ideally, N_S should migrate the pages at the highest possible transfer rate such that the delays resulting from N_D fetching missing pages are minimized. However, such a high throughput push can impact other tenants colocated at N_S and N_D. Therefore, the rate of transfer is a trade-off between the tenant SLAs and migration overhead. The page ownership information is also updated during this bulk transfer. When all the database pages have been moved to N_D, N_S initiates the termination of migration so that operation switches back to the normal mode. This again involves a handshake between N_S and N_D. On successful completion of this handshake, it is guaranteed that N_D has a persistent image of D_M, and so N_S can safely release all of D_M's resources. N_D executes transactions on D_M without any interaction with N_S. Once migration terminates, N_S notifies the system controller.

Migration Cost Analysis

Migration cost in Zephyr results from copying the initial wireframe, operation overhead during migration, and transactions or operations aborted during migration. In the wireframe transferred, the schema and authentication information is typically small. The indices for the tables, however, have a non-trivial size. A simple analysis provides an estimate of index sizes. Assuming 4 KB pages, 8 byte keys (integers or double precision floating point numbers), and 4 byte pointers, each internal node in the tree can hold about 4096/12 ≈ 341 keys. Therefore, a three-level B+ tree can have up to 341 × 341 ≈ 116,000 leaf nodes, which can index a (116,000 × 4 KB × 0.8) ≈ 360 MB database, assuming 80% page utilization. Similarly, a four-level tree can index a 125 GB database. For a three-level tree, the size of the wireframe is a mere 342 × 4 KB ≈ 1.3 MB, while for a 4-level tree, it is about 400 MB. For most multitenant databases whose representative sizes are in the range of hundreds of megabytes to a few gigabytes, an index size of the order of tens of megabytes is a realistic conservative estimate [87, 81]. These index sizes add up for the multiple tables and indices maintained for the database.
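The estimate above can be reproduced with a few lines of arithmetic; the assumptions (4 KB pages, 8 byte keys, 4 byte pointers, 80% page utilization) come directly from the text, and the result for the four-level tree is on the order of the 125 GB figure quoted above.

    PAGE = 4096          # bytes per page
    KEY, POINTER = 8, 4  # bytes per key and per child pointer
    fanout = PAGE // (KEY + POINTER)               # ≈ 341 keys per internal node

    leaves_3_level = fanout ** 2                   # root -> internal level -> leaves
    data_3_level_mb = leaves_3_level * PAGE * 0.8 / 2**20
    wireframe_3_level_mb = (1 + fanout) * PAGE / 2**20

    leaves_4_level = fanout ** 3
    data_4_level_gb = leaves_4_level * PAGE * 0.8 / 2**30

    print(fanout, leaves_3_level, round(data_3_level_mb),
          round(wireframe_3_level_mb, 1), round(data_4_level_gb))
    # ≈ 341 keys per node, ≈ 116,000 leaves, ≈ 360 MB indexed by a 3-level tree,
    # ≈ 1.3 MB wireframe for a 3-level tree, ≈ 120 GB indexed by a 4-level tree.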
Overhead during migration stems from creating the wireframe and fetching pages over the network. N_S uses standard multi-granularity locking [44] of the index to construct the index wireframe. This scan to create the wireframe needs intention read locks at the internal nodes, which only conflict with write locks [16] on the internal nodes. Therefore, this scan can execute in parallel with any transaction T_Si executing at N_S, only blocking update transactions that result in an update in the index structure that requires a conflicting write lock on an internal node. On the other hand, an on-demand pull of a page from N_S over the network is also not very expensive compared to fetches from the disk: disks have an access latency of about a millisecond, while most data center networks have round trip latencies of less than a millisecond. The cost incurred by this remote pull is therefore of the same order as a cache miss during normal operation resulting in a disk access. Assuming an OLTP workload with predominantly small transactions, the period for which D_M remains in the dual mode is expected to be small. Therefore, the cost incurred in this short period in the dual mode is expected to be small. Another contributor to the migration cost is failed transactions at N_S resulting from accesses to pages that have been migrated. In its simplest form as described, Zephyr does not guarantee zero transaction failure; this however can be guaranteed by an extended design as shown later in Section 5.4.

5.3 Correctness and Fault Tolerance

Any migration technique should guarantee transaction correctness and migration safety in the presence of arbitrary failures. We first prove that Zephyr guarantees serializable isolation even during migration. We then prove the atomicity and durability properties of both transaction execution as well as the migration protocol.

Isolation guarantees

Transactions executing with serializable isolation use two phase locking (2PL) [39] with multi-granularity [44]. In the init mode and finish mode, only one of N_S and N_D is executing transactions on D_M. The init mode is equivalent to normal operation, while in the finish mode N_S acts as the storage node for the database, serving pages on demand. Guaranteeing serializability is straightforward in these modes. We only need to prove correctness in the dual mode, where both N_S and N_D are executing transactions on D_M. In the dual mode, N_S and N_D share the internal nodes of the index, which are immutable in our design, while the leaf nodes (i.e. the data pages) are still uniquely owned by one of the two nodes. To guarantee serializability, we first prove that the phantom problem [39] is impossible, and then prove general serializability of transactions executing in the dual mode. The phantom problem arises from predicate based accesses where a transaction inserts or deletes an item that matches the predicate of a concurrently executing transaction.

Lemma 1. Phantom problem: Local predicate locking at the internal index nodes and exclusive page level locking between nodes is enough to ensure impossibility of phantoms.
Proof. Proof by contradiction: Assume for contradiction that a phantom is possible, resulting in predicate instability. Let T_1 and T_2 be two transactions such that T_1 has a predicate and T_2 is inserting (or deleting) at least one element that matches T_1's predicate. T_1 and T_2 cannot be executing at the same node, since local predicate locking would prevent such behavior. Therefore, these transactions must be executing on different nodes. Without loss of generality, assume that T_1 is executing at N_S and T_2 is executing at N_D. Let T_1's predicate match pages P_i, P_i+1, ..., P_j representing a range of keys. Since Zephyr does not allow an update that changes the index during migration, T_2 cannot insert into a newly created page at N_D. Therefore, if T_2 was inserting into (or deleting from) one of the pages P_i, P_i+1, ..., P_j while T_1 was executing, then it implies that both N_S and N_D have ownership of the page. This results in a contradiction. Hence the proof.

Lemma 2. Serializability at a node: Transactions executing at the same node (either N_S or N_D) cannot have a cycle in the conflict graph involving these transactions.

The proof of Lemma 2 follows directly from the correctness of 2PL [39], since all transactions executing at the same node use 2PL for concurrency control.

Lemma 3. Let T_Sj be a transaction executing at N_S and T_Di be a transaction executing at N_D; it is impossible to have a conflict dependency T_Di → T_Sj.

Proof. Proof by contradiction: Assume for contradiction that there exists a dependency of the form T_Di → T_Sj. This implies that T_Sj makes a conflicting access to an item in page P_i after T_Di accessed P_i. Due to the two phase locking rule, the conflict T_Di → T_Sj implies that the commit of T_Di precedes the conflicting access by T_Sj, which in turn implies that T_Sj accesses P_i after it was migrated to N_D as a result of an access by T_Di. This leads to a contradiction, since in Zephyr, once P_i is migrated from N_S to N_D, all subsequent accesses to P_i at N_S fail. Hence the proof.

Corollary 4 follows by applying induction on Lemma 3.

Corollary 4. It is impossible to have a path T_Di → ... → T_Sj in the conflict graph.

Theorem 5. Serializability in dual mode: It is impossible to have a cycle in the conflict graph of transactions executing in the dual mode.
Proof. Proof by contradiction: Assume for contradiction that there exists a set of transactions T_1, T_2, ..., T_k such that there is a cycle T_1 → T_2 → ... → T_k → T_1 in the conflict graph. If all transactions are executing at the same node, then this is a contradiction to Lemma 2. Consider the case where some transactions are executing at N_S and some at N_D. Let us first assume that T_1 executed at N_S. Let T_i be the first transaction in the sequence which executed at N_D. The above cycle implies that there exists a path of the form T_i → ... → T_1 where T_i executed at N_D and T_1 executed at N_S. This is a contradiction to Corollary 4. Similarly, if T_1 executed at N_D, then there exists at least one transaction T_j which executed at N_S, which implies a path of the form T_1 → ... → T_j, again a contradiction to Corollary 4. Hence the proof.

Snapshot Isolation (SI) [12], arguably the most commonly used isolation level, can also be guaranteed in Zephyr. A transaction T_i writing to a page P_i must have unique ownership of P_i, while a read can be performed from a snapshot shared by both nodes. This condition of unique page ownership is sufficient to ensure that during validation of transactions in SI, the transaction manager can detect two concurrent transactions writing to the same page and abort one. Zephyr therefore guarantees transactional isolation with minimal synchronization and without much migration overhead.

Fault tolerance

Our failure model assumes that all message transfers use reliable communication channels that guarantee in-order, at most once delivery. We consider node crash failures and network partitions; we however do not consider malicious node behavior. We assume that a node failure does not lead to loss of the persistent disk image. In case of a failure during migration, our design first recovers the state of the committed transactions and then recovers the state of migration.

Transaction State Recovery

Transactions executing during migration use write ahead logging for transaction state recovery [16, 59]. Updates made by a transaction are forced to the log before it commits, thus resulting in a total order on transactions executing at the node. After a crash, a node recovers its transaction state using standard log replay techniques, ARIES [59] being an example. In the dual mode, N_S and N_D append transactions to their respective node's local transaction log. Log entries in a single log file have a local order. However, since the log for D_M is spread over N_S and N_D, a logical global order of
transactions on D_M is needed to ensure that the transactions from the two logs are applied in the correct order to recover from a failure during migration. The ordering of transactions is important only when there is a conflict between two transactions. If two transactions, T_S and T_D, executing on N_S and N_D, conflict on item i, they must access the same database page P_i. Since at any instant of time only one of N_S and N_D is the owner of P_i, the two nodes must synchronize to arbitrate on P_i. This synchronization forms the basis for establishing a total order between the transactions. During migration, a commit sequence number (CSN) is assigned to every transaction at commit time, and is appended along with the commit record of the transaction. This CSN is a monotonically increasing sequence number maintained locally at the nodes and determines the order in which transactions commit. If P_i was owned by N_S and T_S was the last committed transaction before the migration request for P_i was made, then CSN(T_S) is piggy-backed with P_i. On receipt of a page P_i, N_D sets its CSN to the maximum of its local CSN and that received with P_i, such that at N_D, CSN(T_D) > CSN(T_S). This causal conflict ordering creates a global order per database page, where all transactions at N_S accessing P_i are ordered before all transactions at N_D that access P_i. We formally state this property as Theorem 6:

Theorem 6. The transaction recovery and the conflict ordering protocol ensure that for every database page, conflicting transactions are replayed in the same order in which they committed.
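The commit sequence number rule described above is tiny; the following sketch shows the update applied when a page arrives at N_D, with the surrounding bookkeeping (state object, log) reduced to assumed placeholders.

    def receive_page_at_destination(state_ND, page_id, page, csn_from_source):
        """Conflict-ordering rule: advance N_D's local CSN past the CSN
        piggy-backed with the migrated page, so every later commit at N_D is
        ordered after the source's conflicting commits on this page."""
        state_ND.pages[page_id] = page
        state_ND.csn = max(state_ND.csn, csn_from_source)

    def commit_at_destination(state_ND, transaction_log, commit_record):
        """Assign the next CSN at commit time and append it with the commit record."""
        state_ND.csn += 1
        transaction_log.append((state_ND.csn, commit_record))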
99 Correctness and Fault Tolerance Section 5.3 of this acknowledgment completes this transition and N S forces another entry to its log. If N S fails before sending the message to N D, the mode remains unchanged when N S recovers, and N S re-initiates the transition. If N S fails after sending the message, then it knows about the message after it recovers and establishes contact with N D. Therefore, a state transition results in two messages and two writes to the log. Logging of messages at N S and N D provides message idempotence, detects and rejects duplicate messages resulting from failure of N S or N D, and guarantees safety with repeating failures. Atomic Handover: A transition from the init mode to the dual mode involves three participants (N S, N D, and the query router metadata) that must together change the state. A one-phase handshake is therefore not enough. We use the two-phase commit (2PC) [43] protocol, a standard protocol for atomic commitment over multiple sites. Once N D has acknowledged the initialization of D M, N S initiates the transition and sends a message to the router to direct all future transactions accessing D M to N D, and a message to N D to start accepting new transactions for D M whose ownership is shared with N S. On receipt of the messages, both N D and the router log their messages and reply back to N S. Once N S has received messages from both N D and the router, it logs the successful handover in its own log, changes its state to dual mode and sends acknowledgments to N D and the router which update their respective states. Atomicity of this handover process follows directly from the atomicity proof of 2PC [43]. This protocol also exhibits the blocking behavior of 2PC when N S (the coordinator) fails. This blocking however only affects D M which is anyways unavailable as a result of N S s failure. Atomic handover therefore does not introduce any additional blocking when compared to traditional 2PC where a coordinator failure blocks any other conflicting transaction. Recovering Migration Progress: The page ownership information is critical for migration progress as well as safety. A simple fault-tolerant design is to make this ownership information durable any page (P i ) transferred from N S is immediately flushed to the disk at N D. N S also makes this transfer persistent, either by logging the transfer or by updating P i s parent page in the index, and flushing it to the disk. This simple solution will guarantee resilience to failure but introduces a lot of disk I/O which considerably increases migration cost and impacts other co-located tenants. An optimized solution uses the semantics of the operation that resulted in P i s on-demand migration. When P i is migrated, N S has its persistent (or at least recoverable) image. After the migration of P i, if a committed transaction at N D updated P i, then the update will be in N D s transaction log. Therefore, after a 75
100 Chapter 5. Zephyr failure, N D recovers P i from its log and the persistent image at N S. The presence of a log entry accessing P i at N D implies that N D owns P i, thus preserving the ownership information after recovery. In case P i was migrated only for a read operation or if an update transaction at N D did not commit, then this migration is not persistent at N D. When N D recovers, it synchronizes its knowledge of page ownership with that of N S, any missing page P i is detected during this synchronization. For these missing pages, either N S or N D can be assigned P i s ownership; assigning it to N D will need copying P i to N D yet again. On the other hand, if N S fails after migrating P i, it recovers and synchronizes its page ownership information with N D when the missing P i is detected, and N S updates its ownership mapping. Failure of both N S and N D immediately following P i s transfer is equivalent to the failure of N D without P i making it to the disk at N D, and all undecided pages can be assigned ownership as described earlier. Logging the pages at N D guarantees idempotence of page transfers, thus allowing migration to deal with repeated failures and prevent lost updates at N D. These optimizations considerably reduces the disk I/O during the dual mode. However, in the finish mode, since pages are transferred in bulk, the pages transferred can be immediately flushed to the disk; the large number of pages per flush amortizes the disk I/O. Since the transfer of pages to N D does not force an immediate flush, after migration terminates, N D must ensure a flush before the state of D M can be purged at N S. This is achieved through a fuzzy checkpoint [16] at N D. A fuzzy checkpoint is used by a DBMS during normal operation to reduce the recovery time after a failure. It causes minimal disruption to transaction processing, as a background thread scans through the database cache and flushes modified pages, while the database can continue to process updates. As part of the final state transition, N D initiates a fuzzy checkpoint and acknowledges N S only after the checkpoint completes. After the checkpoint, N D can independently recover and N S can safely purge D M s state. This recovery protocol guarantees that in the presence of a failure, migration recovers to a consistent point before the crash. Theorem 7 formalizes this recovery guarantee. Theorem 7. Migration recovery: At any instant during migration, its progress is recoverable, i.e. after transaction state recovery is complete, database page ownership information is restored to a consistent state and every page has exactly one owner. Failure and Availability: A failure in any migration mode results in partial or complete unavailability. In the init mode, N S is still the exclusive owner of D M. If N D fails, N S can single-handedly abort the migration or continue processing new 76
101 Correctness and Fault Tolerance Section 5.3 transactions until N D recovers and migration is resumed. In case this migration is aborted in the init mode, N S notifies the controller which might select a new destination. A failure of N S however makes D M unavailable and is equivalent to N S s failure during normal operation. In this case, N D can abort migration at its discretion. If N S fails in the dual mode or the finish mode, then N D can only process transactions that access pages whose ownership was migrated to N D before N S failed. This is equivalent to a disk failing, making parts of the database unavailable. When N D fails, N S can only process transactions that do not access the migrated pages. A failure of N D in the finish mode however makes D M unavailable since N D is now the exclusive owner of D M. This failure is equivalent to N D s failure during normal operation Migration Safety and Liveness Migration safety ensures correctness in the presence of a failure, while liveness ensures that something good will eventually happen. We first establish formal definitions for safety and liveness, and then show how Zephyr guarantees these properties. Definition 2. Safety of migration requires the following conditions: (i) Transactional correctness: serializability is guaranteed for transactions executing during migration; (ii) Transaction durability: updates from committed transactions are never lost; and (iii) Migration consistency: a failure during migration does not leave the system s state and data inconsistent. Definition 3. Liveness of migration requires the following conditions to be met: (i) Termination: if N S and N D are not faulty and can communicate with each other for a sufficiently long period during migration, this process will terminate; and (ii) Starvation Freedom: in the presence of one or more failures, D M will eventually have at least one node that can execute its transactions. Transaction correctness follows from Theorem 5. We now prove transaction durability and migration consistency. Theorem 8. Transaction durability: Changes made by a committed transaction are never lost, even in the presence of an arbitrary sequence of failure. Proof. The proof follows from the following two conditions: (i) during normal operation, transactions force their updates to the log before commit, making them durable; and (ii) on successful termination of migration, N S purges its transaction log and the database image only after the fuzzy checkpoint at N D completes, ensuring that changes at N S and N D during migration are durable. 77
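To make the commit-ordering machinery used in these arguments more concrete, the following is a minimal, illustrative sketch of the commit sequence number (CSN) bookkeeping described earlier in this section. It is not the actual Zephyr code, and the class and method names are assumptions: each node keeps a local monotonic counter, the CSN of the last committed transaction that accessed a page is piggy-backed on the page transfer, and the destination raises its counter to at least that value so that any transaction committing there afterwards receives a strictly larger CSN.

import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch (not actual Zephyr code) of per-node CSN bookkeeping.
public class CommitSequencer {
    private final AtomicLong localCsn = new AtomicLong(0);

    // At commit time: assign the next CSN and append it to the commit log record.
    public long assignCommitCsn() {
        return localCsn.incrementAndGet();
    }

    // At the source N S: the CSN to piggy-back when a page is migrated.
    public long csnForPageTransfer() {
        return localCsn.get();
    }

    // At the destination N D: on receiving a page, take the maximum of the
    // local counter and the piggy-backed CSN, so that later commits at N D
    // are ordered after the source's conflicting commits on that page.
    public void onPageReceived(long piggybackedCsn) {
        localCsn.accumulateAndGet(piggybackedCsn, Math::max);
    }
}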
102 Chapter 5. Zephyr Theorem 9. Migration consistency: In the presence of arbitrary or repeated failures, Zephyr ensures: (i) updates made to data pages are consistent even in the presence of failures; (ii) a failure does not leave a page P i of D M without an owner; and (iii) both N S and N D are in the same migration mode. The condition for exclusive page ownership along with Theorem 5 and 6 ensures that updates to the database pages are always consistent, both during normal operation and after a failure. Theorem 7 guarantees that no database page is without an owner, while the atomicity of the atomic handover and other state transition protocols discussed in Section guarantee that both N S and N D are in the same migration mode. Theorem 5, 8, and 9 therefore guarantee migration safety. Theorem 10. Migration termination: If N S and N D are not faulty and can communicate for a long enough period, Zephyr guarantees progress and termination. Proof. Zephyr successfully terminates if: (i) the set of active transactions (T) at N S at the start of the dual mode have completed, i.e. T = φ; and (ii) the persistent image of D M is migrated to N D and is recoverable. If N S is not faulty in the dual mode, all transactions in T will eventually complete, irrespective of whether N D has failed or not. If there is a failure of N S at any point during migration, after recovery, it is guaranteed that T = φ. Therefore, the first condition is guaranteed to be satisfied eventually. After the condition T = φ, if N S and N D can communicate long enough, all the pages of D M at N S will be migrated and recoverable at N D. Theorem 11. Starvation freedom: Even after an arbitrary sequence of failures, there will be at least one node that can execute transactions on D M. The proof of Theorem 11 follows from Theorem 9 which ensures that N S and N D are in the same migration mode, and hence have a consistent view of D M s ownership. Theorem 10 and 11 together guarantee liveness. Zephyr guarantees safety in the presence of repeated failures or a network partition between N S and N D, though progress is not guaranteed. Even though such failures are rare, proven guarantees in such scenarios improves the users reliance on the system. 5.4 Optimizations and Extensions We now discuss some extensions that relax some of the assumptions made to simplify our initial description of Zephyr. 78
103 Optimizations and Extensions Section Replicated Tenants In our discussion so far, we assume that the destination of migration does not have any prior information about D M. Many production database installations however use some form of replication for fault-tolerance and availability. In such a scenario, D M can be migrated to a node which already has its replica. Since most DBMS implementations use lazy replication techniques to circumvent the high cost of synchronous replication [16], replicas often lag behind the master. Zephyr can be adapted to leverage this form of replication. Since N D already has a replica, there is no need for the init mode. When N S is notified to initiate migration, it executes the atomic handover protocol to enter the dual mode. Since N D s copy of the database is potentially stale, when a transaction T Di accesses a page P i, similar to the original design, N D synchronizes with N S to transfer ownership. N D sends the sequence number associated with its version of P i to determine if it has the latest version of P i ; P i is transferred only if N D s version is stale. Furthermore, in the finish mode, N S only needs to send a small number of pages that were not replicated to N D due to a lag in replication. Replication can therefore considerably improve the performance of Zephyr Sharded Tenants Our initial description assumes that a tenant is small and is served from a single node, i.e. a single partition tenant. However, Zephyr can also handle a large tenant that is sharded across multiple nodes, primarily due to the fact that N S completes the execution of all transactions that were active when migration was initiated. Let D M consist of partitions D M1,..., D Mp and assume that we are migrating D Mi from N S to N D. Transactions accessing only D Mi are handled similar to the case of a single partition tenant. Let T i be a multi-partition transaction where D Mi is a participant. If T i was active at the start of migration, then N S is the node that executes T i, and D Mi will transition to finish mode only when all such T i s have completed. On the other hand, if T i started after D Mi had transitioned to the dual mode, then N D is the node executing T i. At any given node, T i is executed in the same way as in a small single partition tenant Data Sharing in Dual Mode In Dual Mode, both N S and N D are executing update transactions on D M. This design is reminiscent of data sharing systems [19], the difference being that our design does not use a shared lock manager. However, our design can be augmented to use a shared lock manager to support a larger set of operations 79
104 Chapter 5. Zephyr during migration, including arbitrary updates and minimizing transaction aborts at N S. In the modified design, we replace the concept of page ownership with page level locking, allowing the locks to be shared when both N S and N D are reading a page. Every node in the system has a Local Lock Manager (LLM) and a Global Lock Manager (GLM). The LLM is responsible for the local locking of pages while the GLM is responsible for arbitrating locks for remote pages. In all migration modes except dual mode, locks are local and hence serviced by the LLM. However, in the dual mode, N S and N D must synchronize through the GLMs. The only change needed is in the page ownership transfer, with the rest of the algorithm remains unchanged. Note that scalability limitations of a shared lock manager is not significant in our case since any instance of the lock manager is shared only between two nodes. We now describe how this extended design can remove some limitations of the original design. Details have been omitted due to space constraints. In the original design of Zephyr, when a transaction T Di requests access for a page P i, N D transfers ownership from N S. Therefore, future accesses to P i (even reads) must fail to ensure serializable isolation. In this extended design, if N D only needs a shared lock on P i to service reads, then N S can also continue processing reads from T Sk+1,..., T Sl that access P i. Furthermore, even if N D had acquired an exclusive lock, N S can request a lock to N D s GLM for the desired lock on P i. This allows processing transactions at N S that access a migrated page; the request to migrate the page back to N S might be blocked in case it is locked at N D. The tradeoff associated with this flexibility is the cost of additional synchronization between N S and N D to arbitrate shared locks, and the higher network overhead arising from the need to potentially copy P i multiple times, while in the initial design, P i was migrated exactly once. The original design made the index structure at both N S and N D immutable during migration and did not allow insertions or deletions that required a change in the index structure. The shared lock manager in the modified design circumvents this limitation by sharing locks at the index level as well, such that normal index traversal will use shared intention locks while an update to the index will acquire an exclusive lock on the index nodes being updated. Zephyr, adapted to the data sharing architecture, allows more flexibility by allowing arbitrary updates and minimizing transactions or operations aborted due to migration. The implication on the correctness is straightforward. Since page ownership can be transferred back to N S, Lemma 3 does not hold any longer. However, Theorem 5 still holds since page level locking is done in a two phase manner using the shared lock managers, which ensures that a cycle in the conflict graph is impossible. The detailed proof is omitted for space limitations. Similarly, 80
105 Implementation Details Section 5.5 the proof for Lemma 1 has to be augmented with the case for index changes. However, since index changes will need the transaction inserting an item (T 2 in Lemma 1) to acquire an exclusive on the index page being modified, it will be blocked by the predicate lock acquired by the transaction with the predicate (T 1 in Lemma 1) on the index pages. Therefore, transactional correctness is still satisfied in the modified design; the other correctness arguments remain unchanged. In summary, all these optimizations provide interesting trade-offs between minimizing the service disruption resulting from migration and the additional migration overhead manifested as higher network traffic and increased synchronization between N S and N D. A detailed analysis and evaluation is left for future work. 5.5 Implementation Details Our prototype implementation of Zephyr extends an open source OLTP database H2 [46]. H2 is a lightweight relational database with a small footprint built entirely in Java supporting both embedded and server mode operation. Though primarily designed for embedded operation, one of the major applications of H2 is as a replacement of commercial RDBMS servers for development and testing. It supports a standard SQL/JDBC API, serializable and repeatable reads isolation levels [12], tree indices, and a relational data model with foreign keys and referential integrity constraints. H2 s architecture resembles the shared process multitenancy model where an H2 instance can have multiple independent databases with different schemas. Each database maintains its independent database cache, transaction manager, transaction log, and recovery manager. In H2, a database is stored as a file on disk which is internally organized as a collection of fixed size database pages. The first four pages store the database s metadata. The data definitions and user authentication information is stored as a metadata table (called INFORMATION SCHEMA) which is part of the database. Every table in H2 is organized as a tree index. If a table is defined with a primary key which is of type integer or real number, then the primary key index stores data for the table. In case the primary key has other types (such as varchar) or if the primary key was not specified at table creation, the table s data are stored in a tree index whose key is auto-generated by the system. A table can have multiple indices which are maintained separate from the primary key index. The fourth page in the database file stores a pointer to the root of the INFORMATION SCHEMA table, which in turn stores pointers to the other user tables. H2 supports classic multi-step transactions with serializable and read committed isolation level. 81
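Since H2 organizes each tenant's database as a file of fixed-size pages, the per-page ownership bookkeeping used in the dual mode can be illustrated with a small sketch. The following is a minimal, hedged example rather than the actual Zephyr/H2 code; the class and method names are assumptions, and page identifiers are assumed to be dense integers so a bitmap suffices. The transaction manager at N D would consult such a structure before touching a page, pulling the page from N S on a miss as described in Section 5.3.

import java.util.BitSet;

// Illustrative sketch (not actual Zephyr/H2 code) of page-ownership
// tracking at the destination node N D during the dual mode.
public class PageOwnershipTable {
    private final BitSet ownedPages = new BitSet();  // pages already pulled to N D

    // Called when a page arrives from the source node N S.
    public synchronized void markOwned(int pageId) {
        ownedPages.set(pageId);
    }

    // Called before a transaction at N D accesses a page; true means the
    // page must first be pulled from N S (i.e., ownership must be transferred).
    public synchronized boolean needsPull(int pageId) {
        return !ownedPages.get(pageId);
    }

    // True once every page in [0, totalPages) is owned locally, i.e. the
    // finish mode can terminate and N S can purge its copy after N D's
    // fuzzy checkpoint completes.
    public synchronized boolean migrationComplete(int totalPages) {
        return ownedPages.nextClearBit(0) >= totalPages;
    }
}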
106 Chapter 5. Zephyr We use SQL Router 4, an open source package, to implement the query router. It is a JDBC wrapper that transparently migrates JDBC connections from N S to N D. This SQL router runs a server listener that is notified when D M s location changes. When migration is initiated, N S spawns a migration thread T. In init mode, T transfers the database metadata pages, the entire INFORMATION SCHEMA table of H2, and the internal nodes of the indices. Conceptually, this wireframe can be constructed by traversing the index trees to determine the internal index nodes. This however might incur a large number of random disk accesses for infrequently accessed parts of the index, which can considerably increase the migration overhead. We therefore use an optimization in the implementation where T sequentially scans through the database file and transfers only the internal nodes of the indices. When processing a database index page, it synchronizes with any concurrent transactions and obtains the latest version from the cache, if needed. Since the index structure is frozen during migration, this scan uses shared locking, allowing other update transactions to proceed. T notifies N D of the number of pages skipped, which is used to update page ownership information at N D. In the dual mode, N D pulls pages from N S on-demand while N S continues transaction execution. Before a page is migrated, N S obtains an exclusive lock on the page, updates the ownership mapping, and then sends it to N D. This ensures that the page is migrated only if it is not locked by any concurrent transaction. In the finish mode, N S pushes all remaining pages that were not migrated in the dual mode, while serving any page fetch request from N D ; pages transferred twice as a result of both the push from N S and pull from N D are detected at N D and duplicate pages are rejected. Since N S does not execute any transactions in finish mode, this push does not require any synchronization at N S. 5.6 Experimental Evaluation We now present a thorough experimental evaluation of Zephyr for live database migration using our prototype implementation. We compare Zephyr with the offthe-shelf stop and copy technique that stops the database at N S, flushes all changes, copies over the persistent image, and restarts the database at N D. Our evaluation uses two server nodes that run the database instances and a separate set of client machines that generate load on the database. Each server node has a 2.40GHz Intel Core 2 Quad processor, 8 GB RAM, a 7200 RPM SATA hard drive with 32MB Cache, and runs a 64-bit Ubuntu Server Edition with Java 1.6. The nodes are connected via a gigabit switch. Workload is generated from a
different set of client machines. Since migration only involves N S and N D, our evaluation focuses only on these two nodes and is oblivious to other nodes. We measure the migration cost as the number of failed operations, the amount of data transferred during migration, and the impact on transaction latency during and after migration.

Figure 5.4: Impact of the distribution of reads, updates, and inserts on migration cost; default configurations are used for the rest of the parameters. Panels plot (a) failed operations, (b) failed operations under different insert ratios, (c) average transaction latency, and (d) the percentage of database pages pulled in the dual mode, each against the percentage of read operations. 5% inserts corresponds to a fixed percentage of inserts, while 1/4 inserts corresponds to a distribution where a fourth of the write operations are inserts. The benchmark executes 60,000 operations.

5.6.1 Benchmark Description

We use the Yahoo! cloud serving benchmark (YCSB) [22] in our evaluation. YCSB emulates a synthetic workload generator that can be parameterized to vary the read/write ratio, access distributions, etc. Since the underlying database layer is multitenant, we run one benchmark instance for each tenant database. YCSB was originally designed to evaluate Key-Value stores, and hence primarily designed
108 Chapter 5. Zephyr for single key operations or scans. We augmented this workload model and added multi-step transactions, where each transaction consists of multiple operations, the number of operations in a transaction (called transaction size) is another workload parameter. The number of read operations in a transaction is another parameter and so is the access distribution to select the rows accessed by a transaction. These parameters allow us to evaluate the behavior of the migration cost for different workloads and access patterns. We use the cost measures discussed in Section The workload emulates multiple user sessions where a user connects to a tenant s database, executes hundred transactions and then disconnects. A workload consists of sixty such sessions, i.e. a total of 6, 000 transactions. The default configurations use transactions with ten operations, 80% being read operations, 15% update operations and 5% new rows inserted. Each tenant s database consists of a single table with an integer primary key and ten columns of type varchar. Keys accessed by a transaction are chosen from a Zipfian distribution over a database with 100, 000 rows ( 250 MB on disk); the Zipfian co-efficient is set to 1.0. The workload generator is multi-threaded with target throughput of 50 transactions per second (TPS). The default database page size is set to 16 KB and the cache size is set to 32 MB. These default configurations are representative of medium sized tenants [87, 81]. We vary these parameters, one at a time, to analyze their impact on migration cost Migration Cost Our first experiment analyzes the impact on migration cost when varying the percentage read operations in a transaction. Figure 5.4a plots the number of failed operations during migration; clients continue issuing operations on the tenant even during migration. A client thread sequentially issues the operations of a transaction. All operations are well-formed, and any error reported by the database server after an operation has been issued account for a failed operation. As is evident from Figure 5.4a, the number of failed operations in Zephyr is one to two orders of magnitude lesser when compared to stop and copy. Two reasons contribute to more failed operations in stop and copy: (i) abortion of all transactions active at the start of migration, and (ii) abortion of all new transactions that access the tenant when it is unavailable during migration. Zephyr does not incur any unavailability; operations fail only when they result in a change to the index structure during migration. Figure 5.4b plots the number of failed operations when using Zephyr for workloads with different insert ratios. Zephyr results in only few tens of failed oper- 84
109 Experimental Evaluation Section 5.6 ations when the workload does not have a high percentage of inserts, even for cases with a high update proportion. As the workload becomes predominantly read-only, the probability of an operation resulting in a change in the index structure decreases. This results in a decrease in the number of failed operations in Zephyr. Stop and copy also results in fewer failed operations for higher values of read percentages, the reason being the smaller unavailability window resulting from fewer updates that need to be flushed before migration. Figure 5.4c plots the average transaction latency as observed by a client during normal operation (i.e. when no migration is performed) and that with a migration occurring midway; the two bars correspond to the two migration techniques used. We report latency averaged over all the 6, 000 transactions that constitute the workload. We only report latency of committed transactions; aborted transactions are ignored. When compared to normal operation, the increased latency in stop and copy results from the cost of warming up the cache at N D and the cost of clients re-establishing the database connections after migration. In addition to the aforementioned costs, Zephyr fetches pages from N S on-demand during migration; the page can be fetched from N S s cache or from its disk. This results in additional latency overhead in Zephyr when compared to stop and copy. Figure 5.4d shows the percentage of database pages pulled during the dual mode of Zephyr. Since the dual mode runs for a very short period, only a small fraction of pages are pulled on demand. In our experiments, stop and copy took 3 to 8 seconds to migrate a tenant. Since all transactions in the workload have at least one update operation, when using stop and copy, all transactions issued during migration are aborted. On the other hand, even though Zephyr requires about 10 to 18 seconds to migrate the tenant, there is no downtime. As a result, the tenants observe few failed operations. Zephyr also incurs minimal messaging overhead beyond that needed to migrate the persistent database image. Every page transferred is preceded with its unique identifier; a pull request in the dual mode requires one round trip of messaging to fetch the page from N S. Stop and copy only requires the persistent image of the database to be migrated and does not incur any additional data transfer/messaging overhead. We now evaluate the impact of transaction sizes and load (see Figure 5.5). Varying the transaction size implies varying the number of operations in a transaction. Since the load is kept constant at 50 TPS, a higher number of operations per transaction implies more operations issued per unit time. Varying the load implies varying the number of transactions issued. Therefore, higher load also implies more operations issued per unit time. Moreover, since the percentage of updates is kept constant, more operations result in more updates. For stop and 85
copy, more updates result in more data to be flushed before migration. This results in a longer unavailability window, which in turn results in more operations failing. On the other hand, for Zephyr, more updates imply a higher probability of changes to the index structure during migration, resulting in more failed operations. However, the rate of increase in failed operations is lower in Zephyr when compared to stop and copy. This is evident from the slope of an approximate linear fit of the data points in Figure 5.5; the linear fit for Zephyr has a considerably smaller slope than that for stop and copy. This shows that Zephyr is more robust across a variety of workloads. The effect on transaction latency is similar and hence is omitted.

Figure 5.5: Impact of varying the transaction size and load on the number of failed transactions. We also report the slope of an approximate linear fit of the points in each series. (Axes: number of failed operations vs. (a) operations per transaction and (b) transactions per second; the reported slopes for Zephyr are 2.48 in (a) and 0.48 in (b).)

We also varied the cache size allocated to the tenants; however, the impact of cache size on service interruption was not significant. Even though a large cache size will result in potentially more changes to be flushed to the disk, the Zipfian access distribution coupled with a high percentage of read operations results in very few changed objects in the cache.

Figure 5.6a plots the impact of the database size on failed operations. In this experiment, we increase the database size up to 500K rows (about 1.3 GB). As the database size increases, more time is needed to copy the database's persistent image, resulting in a longer unavailability window for stop and copy. On the other hand, for Zephyr, a larger database implies a longer finish mode. However, since Zephyr does not result in any unavailability, the database size has almost no impact on the number of failed operations. This is again evident from the slope of the linear fit of the data points; the slope is considerably higher for stop and copy, while that of Zephyr is negligible. Therefore, Zephyr is more robust for larger databases when compared to stop and copy.
Figure 5.6: Impact of the database page size and database size on the number of failed operations. (Axes: number of failed operations vs. (a) number of rows in thousands and (b) page size in KB; the reported slope of the linear fit for Zephyr in (a) is 0.047.)

Figure 5.6b shows an interesting effect of the database page size on the number of failed operations. As the database page size increases, the number of failed operations decreases considerably for Zephyr, while that of stop and copy remains almost unaffected. When the page size is small, each page can fit only a few rows. For instance, in our setting, each row is close to a kilobyte, and a 2 KB page is already full with two rows. As a result, a majority of inserts result in structural changes to the index, causing many of these inserts to fail during migration. In the experiment with a 2 KB page size, more than 95% of the failed operations were inserts. However, as the page size increases, the leaf pages have more unused capacity. Therefore, only a few inserts result in a change to the index structure. Since stop and copy is oblivious to the page size and transfers the raw bytes of the database file, its performance is almost unchanged by a change in the page size. However, when the page size is increased beyond the block size of the underlying file system, reading a page from disk becomes more expensive, which increases transaction latency.

In summary, Zephyr results in minimal service interruption. In a cloud platform, high availability is extremely critical for customer satisfaction, thus making Zephyr more attractive. Even though Zephyr does not allow changes to the index structure during migration, it resulted in very few failed operations. A significant failure rate was observed only with a high row-size-to-page-size ratio. Zephyr is therefore more robust to variations in read-write ratios, database sizes, and transaction sizes when compared to stop and copy, thus making it suitable for a variety of workloads and applications.
112 Chapter 5. Zephyr 5.7 Summary Live migration is an important feature to enable elasticity as a first class feature in multitenant databases for cloud platforms. We presented Zephyr, a technique to efficiently migrate a tenant s live database in a shared nothing architecture. Our technique uses a combination of on-demand pull and asynchronous push to migrate a tenant with minimal service interruption. Using light weight synchronization, we minimize the number of failed operations during migration, while also reducing the amount of data transferred during migration. We also provided a detailed analysis of the guarantees provided and proved the safety and liveness of Zephyr. Our technique relies on generic structures such as lock managers, standard B+ tree indices, and minimal changes to write ahead logging, thus making it suitable to be used in a variety of standard database engines with minimal changes to the existing code base. Our implementation in a standard lightweight open source RDBMS implementation shows that Zephyr allows lightweight migration of a live tenant database with minimal service interruption, thus allowing migration to be effectively used for elastic load balancing. In the future, we plan to augment this technique with the control logic that determines which tenant to migrate and where to migrate. This control logic along with the ability to migrate live tenants, together form the basis for autonomous elasticity in multitenant databases for cloud platforms. 88
113 Chapter 6 Squall The pessimist complains about the wind; the optimist expects it to change; the realist adjusts the sails. William Arthur Ward Changes in Internet usage trends in the last decade have given rise to numerous Web-based, front-end applications that support a large number of concurrent users. Creating large-scale, data-intensive applications is easier now than it has ever been, in part due to the proliferation of open-source distributed system tools, cloud-computing platforms, and affordable mobile devices. Developers are able, in a short amount of time, to deploy applications that have the potential to reach millions of users and collect large amounts of data from a variety of sources. Many of the database management systems (DBMS) used in these applications are based on the system architectures developed in the early 1980s. But now the processing and storage needs of emerging Internet-scale, Big Data applications are surpassing the limitations of these legacy systems. There is substantial evidence that shows that many organizations struggle with scaling traditional DBMSs for modern OLTP applications [73, 74]. One approach to overcome this impediment is to switch to a main memory DBMS [42, 35]. Such systems achieve better performance for these front-end workloads by eschewing the legacy, disk-oriented architecture that slows down traditional systems, such as heavy-weight concurrency control and recovery algorithms [48, 78]. Although large-memory compute nodes are more affordable today, the amount of memory on a single node might not be enough for some applications. Furthermore, if the database is only stored on a single node, then it may take a long time to get the database back on-line if that node fails. This downtime is unacceptable in today s world where services are expected to be continuously available and the cost of such an outage can be significant [8]. 89
114 Chapter 6. Squall This argues for the use of a distributed DBMS architecture oriented around main memory storage where the database is deployed on a cluster of sharednothing nodes. Recent examples of these distributed DBMSs include H-Store [52], MemSQL [2], and SQLFire [4]. These systems spread databases across sharednothing nodes into disjoint segments called partitions. This approach has been shown to significantly outperform traditional DBMSs for OLTP applications [78]. Some DBMSs, such as H-Store (and its commercial version VoltDB [5]) and SQL- Fire, provide support for ACID transactions through a SQL interface. Relational DBMSs that achieve performance and scalability similar to NoSQL DBMSs [18] without sacrificing the benefits of strong transactional guarantees are colloquially known as NewSQL [65]. Even if a database is in main memory, however, does not mean that the DBMS is immune to problems resulting from changes in workload demands or access patterns. Sudden increases in load or in the popularity of a particular item in the database can negatively impact the performance of the overall DBMS. Modern distributed systems can, in theory, add and remove resources dynamically, but in practice it is difficult to scale databases in this manner [51]. Traditionally, increasing system capacity involves either scaling up a server by upgrading hardware, or scaling out by adding additional servers to the system in order to distribute load. Either scenario involves migrating data and bringing servers off-line during maintenance windows [36]. Previous work has shown how to migrate a database from one node to another incrementally to avoid having to shutdown the system [37, 11]. This approach allows the DBMS to move partitions between nodes, but it does not allow the DBMS to split one partition into multiple partitions. For example, if there are particular entities in a partition that are extremely popular (e.g., Justin Bieber s Twitter account, Tone Loc s Tumblr page), then instead of migrating the entire partition to a new node it is better to move those entities to their own partition. In other words, if a hotspot forms within a partition, the system may need to split the partition to alleviate the hotspot. A better approach therefore is to dynamically reconfigure the physical layout of the database while the system is live. Some distributed NoSQL DBMSs, such as MongoDB [3], can split and migrate partitions to new nodes based on their storage size [41]. These systems do not support atomic operations on multiple objects, and thus it is easy to reconfigure the database in these environments. Although such trade-offs are appropriate for many situations, NoSQL DBMSs are insufficient for OLTP applications that need multi-operation transactions that may span multiple partitions. Another approach is to pre-allocate multiple virtual partitions for each real partition at start-up and then migrate some of them to new nodes when one 90
115 Background Section 6.1 needs to balance the load [67]. However, the DBMS has no knowledge or control of the contents of these partitions. Thus, it has no way of knowing whether the migration will result in the desired change in performance until after the virtual partitions have been migrated. To the best of our knowledge no DBMS today supports fine-grained, tuple-level load balancing that is needed for the system to be truly autonomous. These previous solutions also do not address the problem of managing data replication or support systems with multi-partition transactions. Given the lack of solutions for transactionally safe live reconfigurations, we have developed Squall, a lightweight and efficient technique for migrating data in distributed, main memory DBMSs. Squall s key contributions are (1) an efficient process for determining what data needs to be migrated during a reconfiguration and (2) a non-blocking mechanism for performing live reconfigurations that minimizes the impact on data migration on the overall throughput and latency of the DBMS. We implemented Squall in the H-Store [1] NewSQL DBMS and measured the system s performance using two OLTP workloads. Our results demonstrate that we are able to reconfigure a DBMS with no downtime and only an average 37% decrease in throughput for TPC-C and 21% throughput decrease for YCSB. 6.1 Background We begin with an overview of the underlying architecture of H-Store. Although we use H-Store in our analysis, our work is applicable to any partitioned, main memory OLTP DBMS H-Store Architecture We define an H-Store instance as a cluster of two or more nodes deployed within the same administrative domain. A node is a single physical computer system that contains a transaction coordinator that manages one or more partitions. H-Store is optimized for the efficient execution of transactions as pre-defined stored procedures 1. Each stored procedure is comprised of (1) parameterized queries and (2) control code that contains application logic intermixed with invocations of those queries. We use the term transaction to refer to an invocation of a stored procedure. Client applications initiate transactions by sending the procedure name and input parameters to any node in the cluster. The location where the transaction s control code executes is known as its base partition [66]. 1 Although H-Store supports ad-hoc transactions, we assume that the majority of transactions are executed as stored procedures. 91
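As an illustration of this execution model, the following is a minimal sketch of a stored procedure written against the VoltDB-style API that H-Store shares: parameterized queries declared as SQLStmt objects plus Java control code in run(). The procedure name, SQL text, and schema follow the simplified TPC-C example used later in this chapter and are illustrative assumptions, not code taken from any benchmark.

import org.voltdb.ProcInfo;
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Illustrative H-Store/VoltDB-style stored procedure (names are assumptions).
// The W_ID argument doubles as the transaction routing parameter, so the
// coordinator can choose the base partition that owns that warehouse.
@ProcInfo(partitionInfo = "WAREHOUSE.W_ID: 0", singlePartition = true)
public class GetCustomersForWarehouse extends VoltProcedure {
    public final SQLStmt getCustomers = new SQLStmt(
        "SELECT C_ID, C_NAME FROM CUSTOMER WHERE W_ID = ?;");

    public VoltTable[] run(long wId) {
        voltQueueSQL(getCustomers, wId);  // queue the parameterized query
        return voltExecuteSQL(true);      // execute the batch and return results
    }
}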
Figure 6.1: The H-Store architecture from [66]. (A client application submits a procedure name and input parameters to a node's transaction coordinator, which dispatches work to per-core execution engines, each responsible for a partition of data stored in main memory.)

The base partition ideally will have most (if not all) of the data the transaction needs [65]. Any other partition involved in the transaction that is not the base partition is referred to as a remote partition. As shown in Fig. 6.1, each partition is assigned a single-threaded execution engine that is responsible for executing transactions and queries for that partition. A partition is protected by a single lock managed by its coordinator that is granted to transactions one-at-a-time based on the order of their arrival timestamp [14, 24, 82]. A transaction acquires a partition's lock if (1) the transaction has the lowest timestamp that is not greater than the one for the last transaction that was granted the lock and (2) it has been at least 5 ms since the transaction first entered the system [78]. This wait time ensures that distributed transactions that send their lock acquisition messages over the network to remote partitions are not starved. We assume that the standard clock-skew algorithms are used to keep the nodes' CPU clocks synchronized.

Serializing transactions at each partition in this manner has several advantages for OLTP workloads. In these applications, most transactions only access a single entity in the database at a time. That means that H-Store is faster than a traditional DBMS if the database is partitioned in such a way that most transactions only access a single partition [65]. The downside of this approach, however, is that transactions that need to access data at two or more partitions are slow. If a transaction attempts to access data at a partition that it does not have the lock for, then the DBMS aborts that transaction (releasing all of the locks that it holds), reverts any changes, and then restarts it once the transaction re-acquires
all of the locks that it needs again. This removes the need for distributed deadlock detection, resulting in better throughput for short-lived transactions in OLTP applications [48].

All data in H-Store is stored in main memory. To ensure that all modifications to the database are durable and persistent, each H-Store node continuously writes asynchronous snapshots of the entire database to disk at fixed intervals [54, 78]. In between these snapshots, the DBMS writes out a record to a command log for each transaction that completes successfully [58]. This record only contains the original request information sent from the client. The DBMS combines multiple records together and writes them in a group to amortize the cost of writing to disk [49, 82]. Any modifications that are made by a transaction are not visible to the application until this record has been written.

In addition to snapshots and command logging, main memory databases often use replication to provide durability and high availability. Each partition is fully replicated by another secondary partition that is hosted on a different node. All transactions for a replica are executed by the primary copy. The primary partition synchronously (eagerly) replicates the transaction at the secondary partitions [58]. Heartbeat and watchdog processes detect failures and promote secondaries.

6.1.2 Database Partitioning

A partition plan for a database in H-Store is comprised of (1) partitioned tables, (2) replicated tables, and (3) transaction routing parameters [65]. A table can be horizontally divided into multiple, disjoint fragments whose boundaries are based on the values of one (or more) of the table's columns (i.e., the partitioning attributes). Alternatively, the DBMS can replicate non-partitioned tables across all partitions. This table-level replication is useful for read-only or read-mostly tables that are accessed together with other tables but do not fit into the overall partitioning plan of the database. A transaction's routing parameters identify the transaction's base partition. Administrators deploy databases using a partition plan that minimizes the number of distributed transactions by collocating the records that are used together often in the same partition [29, 65].

A partition plan can be implemented in several ways, such as using hash, range, or round-robin partitioning [34]. We assume that the DBMS maintains an internal catalog that maps partitions to nodes in the cluster. For this project, we modified H-Store to support the explicit mapping of data to partitions. Fig. 6.3 shows a sample partition plan for the simplified TPC-C database shown in Fig. 6.2. The WAREHOUSE table is partitioned by its id column (W ID). Since there is a foreign key relationship between these two tables, the CUSTOMER
118 Chapter 6. Squall Figure 6.2: Simple TPC-C data, showing WAREHOUSE and CUSTOMER partitioned by warehouse IDs. plan{ "warehouses (W ID)": { "Partition 1" : 0-2, "Partition 2" : 3-4, "Partition 3" : 5-6, "Partition 4" : 7- } } Figure 6.3: A sample partition plan to control data layout. For TPC-C in this example, all tables are either replicated or partitioned by their foreign key relationship to the warehouse table. table is also partitioned by its W ID attribute. Hence, all data related by a given W ID (WAREHOUSE and CUSTOMER) are collocated on a single partition. Any stored procedure or transaction that attempts to read or modify either table should include the transaction routing parameter (W ID). If this is not included, then all partitions must be involved in the transaction. We will use this simplified example throughout the chapter for exposition. Since there is a relationship between a C ID to W ID, the CUSTOMER table does not need an explicit mapping in Fig
Figure 6.4: As workload skew increases on a single warehouse in TPC-C, the collocated warehouses experience reduced throughput due to contention. (Axes: throughput in TPS vs. the percentage of operations targeting one warehouse, from 20% to 100%.)

6.2 Motivation

Partitioned, main memory DBMSs like H-Store are able to execute single-partition transactions more efficiently than systems that use a heavyweight concurrency control scheme. However, they are still susceptible to performance degradation due to changes in workload access patterns [65]. For example, the way the database is partitioned may need to adapt to changes in an application's behavior. Such a change could either cause a larger percentage of transactions to access multiple partitions or cause the partitions to grow larger than the amount of memory available on their node. As with any distributed system, DBMSs need to react to such changes to avoid nodes becoming overloaded. Failing to do so in a timely manner can have a significant impact not only on performance but also on availability in distributed DBMSs [41]. Ideally, the system can modify the physical layout of the database without needing to take the application off-line. We now demonstrate the impact of this problem with a series of experiments that are designed to motivate Squall.

6.2.1 The Need for Reconfiguration

To demonstrate the effect of overloaded partitions on performance, we ran a series of benchmarks to measure the amount of time transactions spend at stall points created by skewed hot-spots. For these experiments, we used a variant of the TPC-C benchmark on H-Store. Transaction requests are submitted from up to 64 clients running on a separate node in the same cluster. We postpone the details of these workloads and the execution environment until Section
120 Chapter 6. Squall In one experiment, we measure the impact of skew on H-Store s throughput. We vary the percentage of transactions submitted by the clients that target a single tuple in the database. As shown in Fig. 6.4, the throughput of the system degrades by 33% when 80% of the workload targets a single warehouse. The fundamental problem with main memory DBMSs is that their improved performance is only achievable when the database is smaller than the amount of physical memory available in the system. If the database does not fit in memory, then the operating system will start to page virtual memory, and main memory accesses will cause page faults. These faults cause the execution of transactions to stall while the page is fetched from disk. This is a significant problem in a DBMS, like H-Store, that executes transactions serially without the use of heavyweight locking and latching. The above examples show that both workload skew and partition overloading have a significant impact on the throughput of a distributed DBMS like H-Store. The solution to this problem is for the DBMS to respond to these adverse conditions by migrating data to either re-balance existing partitions or to offload data to new partitions. But as we now discuss, achieving this goal is non-trivial as it can put additional strain on an already overloaded system The Impact of Reconfiguration A change in the number of nodes or workload patterns can create the need for new partition plans and to shuffle data items between partitions. In the worst case scenario, the majority of data on every partition needs to be migrated to a new partition or every partition has data migrating out. This can result in significant impact on the system s throughput and latency. The impact of any reconfiguration solutions can be measured using the following metrics [36]: Service Interruption: The number of transactions or operations that are aborted due to a reconfiguration. Downtime: The amount of time that the service is not available to service any requests. External Coordination Overhead: The amount of work an external service must do to manage the reconfiguration. Reconfiguration Overhead: Any additional latency incurred during the reconfiguration. Reconfiguration Time: The amount of time that a reconfiguration takes to complete. 96
Figure 6.5: As a system's partition plan changes, Squall must manage and track the progress of reconfiguration at each node to ensure correct data ownership in a lightweight manner.

In addition to these issues, the DBMS must ensure the correctness of the results for all queries executed during the reconfiguration. This is non-trivial because during a reconfiguration the active location of a tuple needed for a particular query may not be known. For example, suppose the sample database in Fig. 6.2 is switched to use the partitioning plan in Fig. 6.6. With this change, all of the data associated with W ID 2 (i.e., Warehouse, Customers) will be migrated from partition 1 to partition 3. If a transaction arrives at partition 2 that executes a query that accesses a customer associated with W ID 2, then the system does not know whether the data that it needs has already been moved or not. A false negative occurs if this query executes at partition 3, which owns the CUSTOMER tuples where W ID=2 under the new plan, before these tuples have migrated from their current location at partition 1. Likewise, a false positive occurs if this query executes at partition 1 during the reconfiguration and the result includes tuples that should have been moved to partition 3.

6.3 Overview of Squall

Squall is a method for performing live reconfigurations in a distributed DBMS. We define a live reconfiguration as the ability of a DBMS to change the assignment of data to partitions and migrate data without taking any part of the system off-line. Squall defines the runtime steps to complete the reconfiguration in a transactionally consistent and safe manner. Squall ensures that the DBMS does not incur false negatives (i.e., that a tuple is assumed not to exist at a partition
122 Chapter 6. Squall when it actually does) or false positives (i.e., that a tuple is incorrectly assumed to exist at partition). Although there are many challenges associated with live reconfiguration and load-balancing, determining when a reconfiguration should occur and how the partition plan should evolve are beyond the scope of this dissertation. We instead focus on how to address live reconfiguration in the presence of distributed transactions, replication, and partitioned data access. In this work, we assume that a separate system controller initiates the reconfiguration process by providing the DBMS with the new partition plan. A reconfiguration can cause the number of partitions in the cluster to increase (i.e., data from existing partitions are sent to a new, empty partition), decrease (i.e., data from a partition being removed is sent to other existing partitions), or stay the same (i.e., data from an existing partition is sent to another existing partition). The key advantage of Squall over previous approaches is that it does not require the DBMS to pause or block transactions when moving data between partitions, thereby minimizing downtime and latency overhead during the reconfiguration. Squall processes a live reconfiguration in three stages: (1) initializing the reconfiguration process at a leader and notifying the DBMS cluster, (2) migrating data between partitions, and (3) identifying when the reconfiguration process has terminated. In this section, we begin by discussing these three steps. We provide further details of Squall s data migration protocol in Section 6.4. We discuss additional aspects of this process, such as fault tolerance and dynamic chunking, in Sections 6.5 and Initialization A new reconfiguration begins when the DBMS is notified by an external system controller. This notification contains (1) the new partition plan for the database and (2) the designated leader node for the operation. The leader can be any node in the cluster, but it is typically one of the nodes that contains a partition affected by the reconfiguration. If a new node is added to the cluster for reconfiguration, then the node must be on-line and initialized before the reconfiguration can begin. This includes populating schema, authentication, and communication information for the new node. The leader node invokes a special transaction that exclusively locks every partition in the cluster and checks to see whether it is allowed to start the reconfiguration process. The locking used by this transaction is identical to a normal distributed transaction that touches every partition in the system. The request 98
is allowed to proceed if (1) the system has terminated all previous reconfigurations and (2) the DBMS is not writing out a snapshot of the database to disk. If either of these two conditions is not satisfied, then the transaction aborts and is re-queued after the operation blocking it finishes (i.e., the current in-flight reconfiguration or snapshot). If two concurrent reconfiguration transactions are issued, the request with the larger timestamp is rejected. This ensures that all partitions have a consistent view of data ownership and prevents deadlocks caused by concurrent reconfigurations. If all of the partitions agree to start the reconfiguration, each partition enters a special reconfiguration mode. When a partition enters this mode, Squall examines the new partition plan to determine which of its data (if any) is affected by the reconfiguration. If the partition is involved, it determines which data is leaving the partition and which data will be moving into the partition. The incoming and outgoing tuples are broken into ranges on the partitioning attributes and are associated with the source and destination partition for each range. As we discuss in Section 6.6, this step is necessary because Squall may need to split tuple ranges into smaller chunks and build temporary indexes for certain tables to keep track of individual tuples during the migration process. After this local data analysis is complete, each partition notifies the leader and waits for a response. The leader will either (1) acknowledge the reconfiguration to all partitions or (2) send an abort message to reset the reconfiguration state and release the lock. If all of the partitions agree to proceed with the reconfiguration, then Squall begins the data migration. Since the reconfiguration transaction only modifies meta-data, the transaction is extremely short and has a negligible impact on performance.

Data Migration

Migration is the process of moving tuples from their current partition to another location based on the new partition plan. The easiest way to transfer data from one partition to another in a distributed DBMS is to stop executing all transactions at those two partitions during the transfer process. This ensures that all transactions that execute either before or after the stop have a consistent view of the database at those partitions and that all updates are propagated accordingly. But this stop-and-copy approach is unacceptable for OLTP applications that cannot tolerate any downtime. It is non-trivial, however, to transfer data while the system is still executing transactions. For example, if half of the data that a transaction needs to access
124 Chapter 6. Squall "plan":{ "plan":{ "warehouses":{ "warehouses":{ "Partition 1" : 0-2 "Partition 1" : 0-1 "Partition 2" : 3-4 "Partition 2" : 3-4 "Partition 3" : 5-8 "Partition 3" : 2,5 "Partition 4" : 9- "Partition 4" : 6- } } } } Old Plan New Plan Figure 6.6: Sample Updated Partition Plan. has already been migrated to a different partition, it is not obvious whether it is better to propagate changes to that partition or restart the transaction and re-execute it at the new location. The challenge is in how to coordinate live data reconfiguration between partitions to ensure that no data is loss or duplicated, and with minimal impact to the DBMS s performance. To overcome these problems, Squall tracks the location of migrating tuples during the migration process at each partition. This allows the execution engines at each partition to determine whether it has all of the tuples that are needed for a particular transaction. With Squall, transactions are scheduled at the partition according to the new plan, which reactively pulls data on-demand [71, 37]. While this method introduces latency for the on-demand data pulls, it always advances the progress of data migration, only migrates the active data being accessed, has no external coordination, and results in no downtime to synchronize the state of data. Squall uses an on-demand mechanism to pull data from the destination in response to a transaction that accesses migrating data. To ensure that the reconfiguration process completes in a timely manner, Squall migrates additional data during periods of inactivity in the system. Each partition is only responsible for tracking the progress of migrating data between itself and other partitions. In other words, each partition tracks the status of only tuples it will be migrating. Through the use of data structures to track migrating ranges and keys, a partition can identify if data is actively local or whether another partition must be involved to either migrate the data or access the data Termination Once a partition recognizes that it has received, or pulled, all of the data affected by the new partition plan, it notifies the current leader that the data mi- 100
When the leader receives acknowledgments from all of the partitions in the cluster, it notifies all partitions that the reconfiguration process is complete. Each partition then removes all of its tracking data structures and temporary indexes, and exits the reconfiguration mode.

6.4 Managing Data Migration

The migration of data between partitions in a transactionally safe and consistent manner is the most challenging aspect of the live reconfiguration problem in a distributed DBMS. We now discuss this facet of Squall in greater detail. We first describe how Squall divides each partition's data into ranges and maintains this tracking information. We then describe two methods for moving data in Squall: reactive migration and asynchronous migration. The former ensures that hot data is moved to its new location quickly without the use of complex usage modeling [77], while the asynchronous variant ensures that the reconfiguration completes. This entire process is completely autonomous; the administrator does not need to provide Squall with any information about how to split tables into chunks or the ordering of migrations between partitions. For this exposition, we use the sample shown in Fig. 6.6. We also refer to the partition losing data as the source partition and the partition receiving data as the destination partition. Although a partition can be both a source and destination during a single reconfiguration, we refer to a partition as either a source or destination for a particular tuple in the database.

Identifying Migrating Data

When a new reconfiguration process begins, Squall calculates the difference between the original partition plan and the new one to determine (1) the set of incoming tuples per partition and (2) the set of outgoing tuples per partition. These sets are defined as ranges on the values of the tables' partitioning attributes. For example, for the database shown in Fig. 6.6, the incoming warehouses for partition 3 are noted as W ID = [2, 3) and the outgoing warehouses as W ID = [6, ). For tables partitioned on a non-unique foreign key, such as the CUSTOMER table partitioned by its WAREHOUSE id, the partitioning ranges for the foreign key relationship are used for this table. In other words, every table that has migrating data must independently have ranges on its partition attribute specified. When reconfiguring between two partition plans, each partition independently identifies all the data ownership changes for itself. Each partition uses foreign key relationships and the new partition plan to derive each table's incoming and outgoing tuple reconfiguration ranges.
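As a rough illustration of this plan differencing, the sketch below derives source-to-destination ranges by comparing tuple ownership under the old and new plans of Fig. 6.6. It is a simplified sketch (assuming a recent JDK and a small, bounded integer key domain), not Squall's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// A minimal sketch of deriving reconfiguration ranges as the difference
// between two partition plans; the names are illustrative, not Squall's API.
public class PlanDiff {
    // A half-open range [low, high) of partitioning-key values that moves
    // from one partition to another.
    record MigratingRange(int low, int high, int source, int destination) {}

    // Walk the key domain and group contiguous keys whose owner changes
    // into (source -> destination) ranges.
    static List<MigratingRange> diff(IntUnaryOperator oldOwner,
                                     IntUnaryOperator newOwner,
                                     int minKey, int maxKeyExclusive) {
        List<MigratingRange> out = new ArrayList<>();
        int start = -1, src = -1, dst = -1;
        for (int k = minKey; k < maxKeyExclusive; k++) {
            int o = oldOwner.applyAsInt(k), n = newOwner.applyAsInt(k);
            boolean moving = o != n;
            boolean sameRun = start != -1 && moving && o == src && n == dst;
            if (sameRun) continue;
            if (start != -1) out.add(new MigratingRange(start, k, src, dst));
            start = -1;
            if (moving) { start = k; src = o; dst = n; }
        }
        if (start != -1) out.add(new MigratingRange(start, maxKeyExclusive, src, dst));
        return out;
    }

    public static void main(String[] args) {
        // Warehouse ownership from Fig. 6.6 (old plan vs. new plan).
        IntUnaryOperator oldOwner = w -> (w <= 2) ? 1 : (w <= 4) ? 2 : (w <= 8) ? 3 : 4;
        IntUnaryOperator newOwner = w -> (w <= 1) ? 1 : (w == 2 || w == 5) ? 3 : (w <= 4) ? 2 : 4;
        // Prints [2,3) : partition 1 -> 3 and [6,9) : partition 3 -> 4
        // for a ten-warehouse domain.
        for (MigratingRange r : diff(oldOwner, newOwner, 0, 10)) {
            System.out.printf("[%d,%d) : partition %d -> partition %d%n",
                    r.low(), r.high(), r.source(), r.destination());
        }
    }
}
```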
Following the same example, partition 3's outgoing reconfiguration range for WAREHOUSE is [6, ) = partition 4 and for CUSTOMER is [6, ) = partition 4. Squall splits large ranges into smaller, disjoint chunks. This is necessary because the distribution of partition keys can create discrepancies in the physical size of the tuples covered by a range. For example, there could be as few as 500 customers for W ID 6 and 7, but as many as 24,000 customers associated with W ID 8. It is important that the size of each reconfiguration range is properly calculated to reduce the impact on performance due to the data migration. In Section 6.6, we discuss how Squall computes near-optimal sizes for these chunks and a way to efficiently migrate small partition key ranges that contain a large number of physical tuples while maintaining correctness. As the DBMS migrates ranges from one partition to another, Squall tracks their status. This is necessary in order for Squall to determine when the reconfiguration is complete, what data the partition has received, and what outgoing data has been successfully migrated. To enable predicate queries at a different granularity than the reconfiguration range, Squall can use either individual keys or split ranges for tracking ownership. For example, in Fig. 6.6, partition 4 is pulling W ID = 6-8 for multiple tables from partition 3. If a query scheduled at partition 4 touches WAREHOUSE with W ID = 6, partition 4 should pull this warehouse from partition 3 without the need to pull other warehouse tuples. After migrating the data, partitions 3 and 4 should record that the WAREHOUSE table with W ID = 6 has been migrated, and that W ID = [7, ) still remains at partition 3. On the other hand, if a query at partition 4 accessed W ID >= 6 and W ID <= 7, the corresponding range [6,8] could be split into two ranges [6,7] and [8,8]. Squall can then migrate only the required data and annotate the new WAREHOUSE [6,7] range as migrated. Splitting reconfiguration ranges in conjunction with tracking individual keys enables range predicate queries on partition keys that are not discrete (i.e., strings or floats). Squall tracks the migration progress of each non-replicated table with two data structures: a sorted list of reconfiguration ranges and a hash map of individual keys that have been migrated. For a partition p j, the incoming range structure, pulled ranges, is initialized to the ranges identified by the initialization step. Every range is annotated to indicate whether it has been fully migrated or partially migrated (e.g., individual keys within the range have been pulled). Each table's pulled keys hash map specifies whether an individual tuple has been migrated for this table. Similar to the incoming structures, the transferred ranges structure specifies the ranges of keys that still need to be migrated out and the transferred keys structure stores the individual keys that have already been migrated out. Since the tracking keys are based on partitioning attributes, there can be multiple tuples associated with a single key.
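The following sketch illustrates the kind of per-table tracking state described above, with a sorted range structure and a key-level hash set; all of the names are illustrative, and the range bookkeeping is deliberately simplified (it assumes a key always falls inside some tracked incoming range).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;

// A minimal sketch of per-table migration tracking at a destination partition.
public class MigrationTracker {
    enum RangeState { NOT_MIGRATED, PARTIALLY_MIGRATED, MIGRATED }

    // Incoming reconfiguration ranges, keyed by each range's lower bound and
    // annotated with its migration state ("pulled ranges").
    private final TreeMap<Integer, RangeState> pulledRanges = new TreeMap<>();
    // Individual partitioning keys that have already been pulled in.
    private final Set<Integer> pulledKeys = new HashSet<>();

    public void addIncomingRange(int lowerBound) {
        pulledRanges.put(lowerBound, RangeState.NOT_MIGRATED);
    }

    // Record that a single key from a larger range arrived; the enclosing
    // range becomes partially migrated until everything in it has moved.
    public void markKeyPulled(int key) {
        pulledKeys.add(key);
        Integer lower = pulledRanges.floorKey(key);
        if (lower != null && pulledRanges.get(lower) == RangeState.NOT_MIGRATED) {
            pulledRanges.put(lower, RangeState.PARTIALLY_MIGRATED);
        }
    }

    public void markRangePulled(int lowerBound) {
        pulledRanges.put(lowerBound, RangeState.MIGRATED);
    }

    // True if the tuple for this key is already present locally, either because
    // its whole range migrated or the key was pulled individually.
    public boolean isLocal(int key) {
        if (pulledKeys.contains(key)) return true;
        Integer lower = pulledRanges.floorKey(key);
        return lower != null && pulledRanges.get(lower) == RangeState.MIGRATED;
    }

    public static void main(String[] args) {
        MigrationTracker customer = new MigrationTracker();
        customer.addIncomingRange(6);              // incoming W_ID range starting at 6
        customer.markKeyPulled(6);                 // warehouse 6 pulled on demand
        System.out.println(customer.isLocal(6));   // true  (key-level tracking)
        System.out.println(customer.isLocal(7));   // false (still at the source)
    }
}
```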
Fig. 6.7 demonstrates how each partition identifies the incoming (and outgoing) ranges for a reconfiguration. As the reconfiguration progresses, Squall will track data at the key and range level for each table by its partition key. For replicated tables, Squall schedules ranges to be copied, not migrated, in a round-robin manner from the partitions with the fewest outgoing ranges. The mechanism for copying ranges is similar to migrating a range, except that when a partition sends the copied range it does not delete the range locally.

Reactive Migration

The primary migration mechanism in Squall is for the destination to reactively pull data from the source on behalf of a query executing at the destination. When a transaction t executes at p j, Squall traps tuple accesses for each query (i.e., an operation in a transaction that reads or writes data) invoked by t to check if any tuples are migrating in the reconfiguration. For this discussion, we define x as a single tuple in the database, p s as a source partition, and p d as a destination partition. Squall determines whether a query touches migrating data by extracting the specified partitioning values for the tables targeted by that query and checking them against the reconfiguration tracking ranges. Because a transaction may have been scheduled before the reconfiguration began, Squall must take this into account when processing and routing transactions during a reconfiguration. If any value matches an outgoing range (i.e., a value is scheduled to leave this partition during the reconfiguration), then Squall treats this as a source data access and therefore must determine if the data is currently local or remote. If any partitioning value matches an incoming reconfiguration range (i.e., a value is scheduled to migrate in during the reconfiguration), then we treat the operation as a destination data access and block the query execution until the requested data is pulled. If a transaction arrives at p j that touches a data item moving from p s to p d, and p j is neither the source nor the destination, then p j must route the transaction to one of those partitions. If p s or p d is on the same physical node as p j, Squall checks with the local partition on the current location of the migrating data item. On the other hand, if neither partition is local to p j, it will assume the data is at p d. This procedure is followed regardless of whether t is a single-partition transaction, a distributed transaction originating at p j, or a distributed transaction that originated at p s. When a transaction t at p j accesses a data item x that is migrating away from p j to p d, Squall checks if the data item is still present locally. While any transactions at a remote partition involving x are scheduled at p d after the reconfiguration begins, t may have been scheduled before the reconfiguration was initialized, or the local partition may not have migrated out x yet.
Figure 6.7: Tracking the partition's progress at different granularities.

Squall will check whether x's partitioning value is in the transferred key map or the transferred ranges. If so, x has already been migrated out and t must be restarted at p d. If t requires other data items at p j, then t is restarted as a distributed transaction that involves p d. If no match is found, then x has not been migrated out and p j can process the transaction locally. When any query of t involves range or non-equality predicates, the transferred key map is not consulted to check ownership, only the transferred ranges. If a corresponding range has been migrated, then the source must restart t. When a query of t at p j is trapped and touches a tuple x that is migrating to p j from p s, Squall must verify that the data item has been received before allowing the query to proceed. If x's partitioning value is found in the pulled key map or if x's corresponding pulled range has been migrated, then the data is present and t is immediately processed. If a query of t uses a range or non-equality predicate, the pulled key map is not used and only the pulled ranges are checked. If x does not match any range in the pulled ranges, then t involves no data items migrating in and can be processed. Otherwise, x matches a pulled range that has not been migrated, meaning that there is at least one tuple that must be migrated in to process t. The partition blocks the execution of t and initiates a data pull request for x at p s. Upon receipt of the data pull response, t is unblocked and processed. Squall allows for multiple tuples to be pulled from multiple partitions for a single transaction t. In this case, t is blocked until all data pull requests are completed. A data pull request from partition p d to p s contains the ID of the requesting transaction and a list of data items to pull. Each data item specifies a table and either a single partition key or a range of partition keys. To minimize the latency of the pulling transaction, p s schedules data pull requests as a special transaction ahead of any scheduled transaction, but does not preempt any active transaction.
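A highly simplified sketch of the destination-side check is shown below. It collapses range-level tracking down to individual keys and uses invented names and a made-up pull-request shape, so it should be read as an illustration of the decision flow rather than Squall's actual logic.

```java
import java.util.HashSet;
import java.util.Set;

// A minimal, self-contained sketch of the destination-side ownership check.
public class ReactivePull {
    record PullRequest(long txnId, String table, int key) {}

    // Simplified tracking state for one table at the destination partition.
    private final Set<Integer> incomingKeys = new HashSet<>();  // scheduled to migrate in
    private final Set<Integer> pulledKeys   = new HashSet<>();  // already migrated in

    // Returns null when the query may proceed locally; otherwise the data pull
    // request that must complete before the blocked query can resume.
    PullRequest onAccess(long txnId, String table, int key, boolean rangePredicate) {
        if (!incomingKeys.contains(key)) return null;      // not migrating data
        // Key-level tracking may only be consulted for equality predicates;
        // range predicates must wait for the whole range to arrive.
        if (!rangePredicate && pulledKeys.contains(key)) return null;
        return new PullRequest(txnId, table, key);          // block and pull on demand
    }

    // Applying a pull response makes the tuple visible to later queries.
    void onPullResponse(int key) { pulledKeys.add(key); }

    public static void main(String[] args) {
        ReactivePull dest = new ReactivePull();
        dest.incomingKeys.add(6);                                      // W_ID 6 migrating in
        System.out.println(dest.onAccess(1L, "WAREHOUSE", 6, false));  // pull needed
        dest.onPullResponse(6);
        System.out.println(dest.onAccess(2L, "WAREHOUSE", 6, false));  // null: proceed
    }
}
```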
When the data pull request is processed, the appropriate data for the request is extracted from the storage module via the execution engine. If an ordered index on the partition column exists for the target table, then the execution engine scans the index to find the appropriate tuples. If there is no index or only a hash index, a full table scan is used to identify the migrating tuples. Since the tables are entirely in memory, a table scan is not prohibitively expensive. While Squall could construct a temporary index to alleviate excessive table scans, determining when to construct temporary indexes is beyond the scope of this dissertation. During this extraction, the storage module immediately deletes the matching tuples from tables and indexes to free memory, and stores a copy of the extracted data in a temporary table with the pull's metadata. The appropriate transferred data structures are updated to reflect that the data items have been migrated out. The response for the data pull request is packaged and sent to p d. Once p d acknowledges the receipt of the response, the entry in the temporary extract table can be deleted. Since the partition executors are single-threaded, there is no concurrent access allowed at p s from the time the pull request is processed until the time the response is sent to p d. As both partitions are locked during the migration of data items, no other transaction can concurrently issue a read or update query at these partitions, thereby preventing any transaction anomalies due to reconfiguration. When p d receives the response for a data pull, the storage module inserts the tuples into the appropriate tables, updates indexes, updates the pulled data structures, records that the pull request ID was answered, and acknowledges the receipt of data to p s. Once all the data pull requests for t have been responded to, p d will resume processing the blocked transaction t. Squall uses deadlock detection to ensure that p s and p d are not blocked waiting for data items from each other or in a cyclic chain. In our current implementation, this is based on timeouts for transactions. Additionally, any data pull request that is not acknowledged within a timeout window will be resent. A data pull request can be initiated by p d on behalf of a multi-partition query. If p s is participating in the same transaction as p d, a pull request from p d to p s would unnecessarily time out. This is due to the pull request at p s being blocked until the active transaction completes. Since the pull request is required by p d for the transaction to complete, a deadlock occurs until the transaction timeout is reached. Since the originating transaction ID is included in the data pull request, Squall avoids this deadlock by allowing p s to process data pull requests if the ID matches the transaction currently being processed and p s has not modified the requested data item. If p s has modified such an item, we restart the transaction.
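The source-side handling can be pictured roughly as follows; the request and table representations are illustrative assumptions, and real tuple extraction, index maintenance, and replica coordination are omitted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of the source side of a data pull request: matching tuples
// are extracted, deleted from the live table, and staged in a temporary
// extract table until the destination acknowledges receipt.
public class SourcePullHandler {
    // Staging area keyed by pull request ID, cleared on acknowledgment.
    private final Map<Long, List<String>> extractTable = new HashMap<>();

    // Process a pull for one partitioning key: extract and delete locally,
    // stash a copy for a possible re-send, and return the response payload.
    List<String> processPull(long requestId, Map<Integer, List<String>> table, int key) {
        List<String> tuples = table.remove(key);       // delete from the live table
        List<String> payload = (tuples == null) ? List.of() : tuples;
        extractTable.put(requestId, payload);          // keep until acknowledged
        return payload;                                // sent to the destination
    }

    // The destination's acknowledgment lets the source (and its replicas)
    // drop the staged copy.
    void onAck(long requestId) {
        extractTable.remove(requestId);
    }

    public static void main(String[] args) {
        SourcePullHandler src = new SourcePullHandler();
        Map<Integer, List<String>> warehouse = new HashMap<>();
        warehouse.put(6, List.of("warehouse-6-row"));
        System.out.println(src.processPull(7L, warehouse, 6));  // [warehouse-6-row]
        System.out.println(warehouse.containsKey(6));           // false: deleted at source
        src.onAck(7L);                                          // staged copy removed
    }
}
```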
Asynchronous Migration

Although reactive migration allows the DBMS to transfer the most popular tuples immediately, it does not guarantee that the reconfiguration will complete. If Squall only uses a reactive approach, then data that is not accessed by transactions during the reconfiguration is never migrated. Therefore, during periods of inactivity, Squall also migrates data asynchronously, pulling tuples from the source partitions to their destinations. When the reconfiguration is initialized at p d, all incoming tuple ranges identified for p d are individually scheduled as asynchronous data pull requests in a local queue. The asynchronous pull requests operate similarly to live pulls but are given the lowest priority in the transaction scheduling. Squall attempts to spread out asynchronous pulls between transaction requests to amortize the impact of pulling cold tuples. The pulling partition (p d) only removes and executes the next asynchronous pull from the queue after a minimum amount of time has passed since the last successful asynchronous data pull, so that these pulls do not throttle the performance of the system. Once the asynchronous pull is issued, the execution engine for p d resumes processing transactions without waiting for the response. When the asynchronous pull arrives at p s, it is scheduled as a normal transaction, and not ahead of any transactions currently queued. When the DBMS processes the asynchronous pull at p s, it verifies that the requested range has not already been migrated by a data pull request. If not, the execution engine attempts to extract the matching tuples as before. However, the extraction process only extracts tuples up to the chunk size limit. If this limit is reached during the extraction, the execution engine checks whether there is another tuple matching the pull request; if so, the response is marked as incomplete and the tracking range is marked as partially migrated. The response is sent (along with the incomplete annotation), and the asynchronous pull request is rescheduled at p s. This process repeats until the extraction is complete. p d schedules the pull responses as a normal transaction. If any query involves a range that has been partially migrated, Squall must invoke a data pull request and flush matching asynchronous data pull reply messages to ensure correctness. Any pending asynchronous requests for the range must also be canceled so as to not have unnecessary pulls affecting performance.
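The chunk-limited extraction loop can be sketched as below, with the incomplete flag driving the rescheduling described above; the queue-of-tuples representation is an illustrative stand-in for the execution engine's actual extraction.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// A minimal sketch of the source side of an asynchronous, chunk-limited pull:
// each request extracts at most chunkLimit tuples and is rescheduled until
// the range is exhausted.
public class AsyncPull {
    record PullResponse(List<Integer> tuples, boolean incomplete) {}

    // Extract up to chunkLimit tuples for the requested range; if more remain,
    // flag the response as incomplete so the destination keeps the range
    // marked partially migrated and the source reschedules the request.
    static PullResponse extractChunk(Queue<Integer> remainingTuples, int chunkLimit) {
        List<Integer> out = new ArrayList<>();
        while (out.size() < chunkLimit && !remainingTuples.isEmpty()) {
            out.add(remainingTuples.poll());   // also deleted from the source table
        }
        return new PullResponse(out, !remainingTuples.isEmpty());
    }

    public static void main(String[] args) {
        // Tuples of one outgoing range, e.g. all CUSTOMER rows for W_ID = 3.
        Queue<Integer> range = new ArrayDeque<>(List.of(1, 3, 1004, 2007, 2010));
        int chunkLimit = 2;
        PullResponse resp;
        do {
            resp = extractChunk(range, chunkLimit);
            System.out.println("sent " + resp.tuples() + " incomplete=" + resp.incomplete());
            // An incomplete response causes the asynchronous pull to be
            // rescheduled instead of treating the range as fully migrated.
        } while (resp.incomplete());
    }
}
```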
Replication Management

As replication is configured at the granularity of partitions and, during a reconfiguration, tuples migrate between partitions, it is important that all of a partition's replicas reflect data ownership throughout the reconfiguration. To guarantee recovery and consistency for a migrating tuple x, partition p s continues to send updates to its secondary replicas until x is pulled by p d. When p s receives a data pull request for x, it notifies the secondary replicas to extract x and move it to a temporary extract table. When p d receives the data pull response, it must forward the response to its secondary replicas so that they insert x. The secondary replica inserts must be acknowledged before p d finishes the migration of x and the data pull response acknowledgment is sent to p s. Receipt of this acknowledgement allows p s and its replicas to delete x from the temporary extract table. Replicas independently track the reconfiguration process through local copies of the tracking data structures. This process ensures that there are only active replicas for each data item. If there is a failure during a reconfiguration, a partition's secondary replica is upgraded to a primary with the normal mechanism [15], and continues the reconfiguration. The same process is applicable when more than one tuple is being migrated, as a single data pull request is always limited to a pair of partitions.

6.5 Fault Tolerance

H-Store enables high availability and fault tolerance through replicating partitions on other nodes [78]. All reads and writes are serviced by the primary replica copy and updates are synchronously applied by the replicas. Heartbeats between nodes and watchdog processes determine when a server has failed. On failure, a replica takes over the primary role. All future requests for this replica are directed to the new primary replica. In addition to replication, command logging and snapshots enable fault tolerance for system-wide crashes, such as data center or power outages. This section discusses how Squall uses these mechanisms to manage fault tolerance during the reconfiguration process.

Failure Handling

There are three cases of partition failure that Squall handles: (1) the reconfiguration leader failing, (2) a source partition failing, and (3) a destination partition failing. A partition failure can involve all three scenarios (e.g., the leader fails and has both in-going and out-going data). Since replicas independently track the progress of
reconfiguration, they are able to replace the primary replica during reconfiguration if needed. If any node fails during a reconfiguration, it is not allowed to rejoin the cluster until the reconfiguration has completed. Afterwards it will recover the updated state from its primary node. During the data migration phase, the leader's state is replicated by notifying its replicas when a partition completes its local migration. If the leader fails during the data migration phase, a hot replica will be able to resume tracking the progress of the reconfiguration. If, after fail-over, the new primary leader replica detects that all replicas have already sent messages to indicate completion, the new leader will resend the reconfiguration complete message to all partitions. The former leader may have failed while sending out this message, and with the updated partition plan included in the termination request, the partitions can idempotently apply the termination. If the leader fails during the initialization phase, the new primary replica does not need to resume the initialization, as the reconfiguration transaction never completed. The system's timeout mechanism will restart the initialization request at the new leader. When a partition fails while migrating data out, the secondary replica replacing the partition's primary replica must reconcile any potential data pull requests that occurred during the failure. If any tuples are in the temporary extract table (the staging area for deleting migrating tuples), the replica infers that the primary had the same tuples in the extract table, but it is unclear whether the pull response was sent or acknowledged. The replica must therefore re-send the pull data response to p d. Since pull responses contain the request ID, p d can determine if the data has already been applied. Regardless, p d acknowledges the pull data to enable the new p s and its replicas to remove the associated extract table entry. At this point, since the replicas are tracking the progress of outgoing data ranges, they are able to reconstruct the identical state of outgoing ranges. If a replica is elected as the primary and the partition was migrating data in, the replica must reconcile its state with the failed primary. The replica is uncertain whether the transaction that pulled the data completed or whether the data pull response was acknowledged. If the failure occurred before either event, the corresponding event would be restarted by the original caller. A restarted transaction would be processed without migrating the data, and a duplicate data pull response can be answered by the replica keeping a list of processed data pull requests. The only operation required by the new primary is to reconstruct and schedule the list of asynchronous data pull requests based on the original incoming ranges.
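The request-ID-based deduplication that makes re-sent pull responses safe can be sketched as follows; the method names are assumptions, and only the idempotence check is shown.

```java
import java.util.HashSet;
import java.util.Set;

// A minimal sketch of how a destination can apply re-sent pull responses
// idempotently after a source fail-over, keyed by the pull request ID.
public class PullResponseHandler {
    private final Set<Long> appliedRequestIds = new HashSet<>();

    // Returns true if the tuples were applied, false if this was a duplicate
    // delivery from a promoted replica; either way the caller acknowledges so
    // that the source side can clear its temporary extract table.
    boolean applyOnce(long pullRequestId, Runnable applyTuples) {
        if (!appliedRequestIds.add(pullRequestId)) {
            return false;              // already applied before the failure
        }
        applyTuples.run();             // insert tuples, update indexes and tracking
        return true;
    }

    public static void main(String[] args) {
        PullResponseHandler dest = new PullResponseHandler();
        System.out.println(dest.applyOnce(42L, () -> System.out.println("applied pull 42")));
        // The same response re-sent by the new primary is acknowledged
        // but not applied a second time.
        System.out.println(dest.applyOnce(42L, () -> System.out.println("applied pull 42")));
    }
}
```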
Crash Recovery

During a reconfiguration, the DBMS suspends all of its checkpoint operations. This ensures that the partitions' checkpoints stored on disk are consistent (i.e., a tuple does not exist in two partitions at the same time). The DBMS continues to write transaction entries to its command log during data migration. If the entire system crashes due to a power failure after a reconfiguration completes but before a new snapshot is taken, then the DBMS recovers the database from the last checkpoint and performs the migration process again. The DBMS first scans the command log to find the starting point after the last checkpoint entry. It then searches for an entry of the first reconfiguration transaction that started after the checkpoint. If one is found, then the DBMS extracts the partition plan information from the entry and uses that as the current partition plan for the database. The execution engine for each partition then reads in the contents from the last snapshot taken at its partition. For each tuple in a snapshot, Squall determines which partition should store that tuple, since it may not be the same partition that is reading in that snapshot. Once the snapshot has been loaded into memory from the file on disk, the DBMS then replays the command log to restore the database to the state that it was in before the crash. The DBMS's coordinator ensures that these transactions are executed in the exact order that they were originally executed the first time. Hence, the state of the database after this recovery process is guaranteed to be correct, even if the number of partitions changed due to the reconfiguration. This is because (1) transactions are logged and replayed in serial order, so the re-execution occurs in exactly the same order as in the initial execution, and (2) replay begins from a transactionally-consistent snapshot that does not contain any uncommitted data, so no rollback is necessary at recovery time [47, 58].

6.6 Dynamic Data Chunking

With a single-threaded system design, large reconfiguration ranges can increase the latency of transactions by blocking the partitions involved in the migration for the duration of the data pull request and response, essentially acting as long-running transactions that affect the throughput of the system. To minimize the overhead of reconfiguration, the size of each reconfiguration range should ideally be smaller than a certain max range size. This max range size is a function of the I/O and network bounds of the DBMS. When a table is partitioned by a primary key or the table has a clustered index on the partitioned column, determining reconfiguration ranges to respect a
max range size is straightforward. Squall searches the clustered index to find the offsets in the clustered index that divide the table into such ranges. With an unclustered index on the partitioned column, such as a hash index, estimating the size of ranges is difficult. For example, using the example from Fig. 6.2, assume we are moving warehouse 3 from partition 2 to partition 3, and let us say a pull of more than 2 MB introduces high latency. The average size of a CUSTOMER tuple is 1 MB, thus migrating more than two tuples at a time creates an unacceptable latency overhead. But the customer table requires three tuples to be migrated (C ID = (1, 3, 1004)). Therefore, Squall must break the CUSTOMER range into two chunks. Looking at the C ID distribution, one way to split the CUSTOMER range into two chunks is 1-999 and 1000-KEY MAX. Splitting the keys to respect the max range size is difficult without knowing the key distribution. The solution described in Section 6.4 is to have the source partition fill up a data pull request until the max range size is reached. This approach, however, has a limitation: once a range begins migrating chunks, neither the source nor the destination can safely execute a query that accesses data within this range. Taking the example described above, consider an insert that is scheduled at partition 2 with C ID = 10 and W ID = 3 during the reconfiguration. If any portion of the CUSTOMER table has been migrated, this insert cannot be allowed, as the record may exist and have been migrated, or a conflicting insert may be scheduled at partition 3. This operation is not allowed to proceed until the entire part of the table corresponding to W ID = 3 has been migrated, to avoid conflicts. We propose a scheme which chunks reconfiguration ranges and enables the source and destination to deterministically derive ownership of tuples during the reconfiguration at a granularity finer than the partition key. During initialization, the leader includes the max range size and each table's average tuple size with the initial request. For tables partitioned on a clustered index, the chunked reconfiguration ranges are determined as described earlier. For tables partitioned on non-clustered indexes, the partitioned data is sampled to approximate how much data will be migrated. If the data to be migrated is approximated to be larger than the max range size, then Squall will chunk the table's ranges into multiple data pulls instead of using a single data pull. Each partition responds to the leader during initialization of the reconfiguration, indicating each table's outgoing ranges and the corresponding number of chunks for each range based on the sampling. This information is aggregated by the leader and is included in the message sent to all partitions to start the reconfiguration.
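A minimal sketch of this sampling-based chunk estimation is shown below. The sampling rate and sample count are invented for illustration, chosen so that, together with the 1 MB average CUSTOMER tuple and 2 MB max range size from the example above, the estimate works out to the 10 chunks used in the sample response that follows.

```java
// A minimal sketch of estimating how many chunks an outgoing range needs
// when only a sample of the data is available; all numbers are illustrative.
public class ChunkEstimator {
    // Estimate the number of chunk-sized data pulls for one reconfiguration
    // range of a table without a clustered index on its partitioning column.
    static int estimateChunks(long sampledTuples, double samplingRate,
                              long avgTupleBytes, long maxRangeBytes) {
        long estimatedTuples = Math.round(sampledTuples / samplingRate);
        long estimatedBytes  = estimatedTuples * avgTupleBytes;
        // Round up so that no single pull exceeds the max range size.
        return (int) Math.max(1, (estimatedBytes + maxRangeBytes - 1) / maxRangeBytes);
    }

    public static void main(String[] args) {
        // Suppose a 10% sample found 2 CUSTOMER tuples for W_ID = 3, the
        // average tuple size is 1 MB, and each pull must stay under 2 MB.
        int chunks = estimateChunks(2, 0.10, 1_048_576, 2_097_152);
        System.out.println("CUSTOMER range for W_ID = 3 needs " + chunks + " chunks");
    }
}
```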
A sample response, for migrating W ID 3-6 with the customers for W ID = 3 needing 10 chunks, could be:

WAREHOUSE (W ID): [3-7)
CUSTOMERS (W ID): 3 (10 Chunks), [4-7)

When migrating customers with W ID = 3, both the source and destination know there are 10 chunks. A tuple is associated with a chunk by hashing a joined primary key and partitioned key, modulo the number of chunks. During the reconfiguration, the data pull requests for chunked tables include the partition ID and the chunk ID. Both the source and destination are able to identify which chunk a tuple belongs to at runtime. Since the DBMS can associate any tuple with a chunk, this allows both the source and destination to safely execute a query that accesses an individual tuple of a table that has been partially migrated. Any query that uses a range predicate on chunked data is executed as a distributed transaction that involves both the source and destination.

6.7 Experimental Evaluation

To evaluate Squall's techniques, we integrated it with H-Store [1] using the simple, non-hash chunking approach as described in Section 6.6 and ran several experiments using two OLTP benchmarks with differing workload complexities. For this evaluation, we implemented an external controller that initiates the reconfiguration procedure at a fixed time in the benchmark trial. We also implemented a stop-and-copy protocol in H-Store as a baseline for the comparison with Squall. In this baseline implementation, a distributed transaction locks the entire cluster and then performs the data migration. All transactions are blocked at all partitions until this process completes. The experiments were conducted on a cluster with the following specifications. Each node has an Intel Xeon E5620 CPU running 64-bit CentOS Linux with OpenJDK 1.7. We used the April 2013 release of H-Store. The nodes are in a single rack connected by a 10Gb switch with an average RTT of 0.15 ms.

Workloads

We now describe the two workloads from H-Store's built-in benchmark framework that we used in our evaluation.

YCSB: The Yahoo! Cloud Serving Benchmark is a collection of workloads that are representative of large-scale services created by Internet-based companies [23]. For all of the YCSB experiments in this chapter, we use a 4GB YCSB database containing a single table with 4 million records. Each YCSB tuple has a primary key and 10 columns each with 100 bytes of randomly generated
string data. The workload consists of two types of transactions: one that reads a single record and one that updates a single record. Our YCSB workload generator supports executing transactions with either a uniform access pattern or with Zipfian-skewed hotspots.

Figure 6.8: Partition Addition. A reconfiguration to expand a cluster with two nodes from 6 partitions to 8 partitions. This expansion acts as a reshuffle, as all data items are evenly distributed among the 8 partitions after the reconfiguration. (Panels: (a) Throughput, YCSB Uniform; (b) Latency, YCSB Uniform; (c) Throughput, YCSB Skewed; (d) Latency, YCSB Skewed; (e) Throughput, TPC-C; (f) Latency, TPC-C. Each panel compares Stop-and-Copy with Squall over elapsed time in seconds.)
Figure 6.9: Node Addition. A reconfiguration to expand from 4 partitions on one node to 8 partitions on two nodes. This expansion attempts to minimize the data movement by having each partition migrate half of its data to exactly one new partition. (Panels: (a) Throughput, YCSB Uniform; (b) Latency, YCSB Uniform; (c) Throughput, YCSB Skewed; (d) Latency, YCSB Skewed; (e) Throughput, TPC-C; (f) Latency, TPC-C. Each panel compares Stop-and-Copy with Squall over elapsed time in seconds.)

TPC-C: This benchmark is the current industry standard for evaluating the performance of OLTP systems [79]. It consists of nine tables and five procedures that simulate a warehouse-centric order processing application. The key aspect of this benchmark is that the two most frequently executed transactions vary in whether
they touch multiple partitions based on their input parameters. In order to support fine-grained reconfigurations, we use a database with 32 warehouses and scale down the amount of data associated with each warehouse (approximately 250MB). For both benchmarks, transaction requests are submitted from 60 up to 180 client threads running on five dedicated nodes in the same cluster. Each client submits transactions to any DBMS node in a closed loop (i.e., it blocks after it submits a request until the result is returned). For aggregate results we execute each benchmark three times and report the average results. In each trial, the DBMS warms up for 30 seconds and then the performance metrics are collected for five minutes. The throughput results are the number of transactions completed divided by the total time (excluding the warm-up period). We use the mean throughput of all clients within a one-second interval. For all time series graphs, the dashed vertical line denotes the interval with the start of a reconfiguration, and the light dotted line is the interval containing the end of the reconfiguration. The latency results are measured as the time from when the client submits a request to when it gets the transaction's result.

Cluster Expansion

We first evaluate how well Squall is able to increase the number of partitions during a live reconfiguration in H-Store with minimal impact to the throughput and latency of transactions. We test two different reconfiguration scenarios: (1) one partition is added to each node in the cluster and (2) one node with four partitions is added to the cluster. The reconfiguration request is issued by the experiment controller 90 seconds into the benchmark trial. We now discuss several aspects of the results from these experiments.

Partition Addition: The initial configuration of the DBMS is two nodes with three partitions each, with the database being split evenly amongst the partitions. Upon reconfiguration, each node adds a fourth partition and evenly redistributes all data amongst all partitions in the cluster. In this data reshuffling, a majority of the partitions are exchanging data between two partitions. The results in Fig. 6.8 demonstrate the low impact of reconfiguration experienced by YCSB. Here, the reconfiguration progresses much more slowly than the baseline, but avoids the downtime of two to five seconds experienced by stop-and-copy. This downtime results in thousands of aborted transactions due to the off-line system. For TPC-C, Squall experiences a disruption similar to stop-and-copy. This is primarily due to two of the transaction classes accessing two larger tables (customer and stock) by the partition key. Since our implementation uses the static chunking
described in Section 6.6, for correctness we must pull the entire range for a given partition key. The extract and load for larger tables can each take anywhere from 500 ms to 2000 ms to move the data and update local indexes. During this process neither involved partition may process any local or distributed transaction, resulting in a slight back-up in the transaction queue.

Node Addition: The initial configuration of the DBMS is one node with four partitions. The reconfiguration causes a new node with four partitions to be added to the cluster. Each partition sheds half of its data to a new partition, reducing the amount of data being reconfigured. The DBMS will migrate an equal amount of data from the four original partitions to these four new partitions. Since clients connect to any partition and are not aware of the data partitioning, a large percentage of transactions now require a latency-inducing redirect. The performance characteristics in Fig. 6.9 are similar to the previous experiments, outside of the more pronounced reconfiguration transition in the uniform YCSB experiment. The lack of a hot dataset results in a more drawn-out reconfiguration, whereas in the skewed YCSB and TPC-C benchmarks reactive pulls migrate hot data items promptly.

Cluster Consolidation

Next, we investigate the performance impact of Squall and the stop-and-copy scheme when the number of partitions in the cluster contracts. As in the previous experiment, the initial configuration of the DBMS is two nodes each with four partitions. We test two different reconfiguration scenarios: (1) each node removes one partition and (2) one node is removed from the cluster. Since the results are consistent with the previous experiments, we do not show (1) and the latency graphs of (2) due to space constraints. The contraction experiments on average require a longer window to complete the reconfiguration, which is likely due to client transactions that can continue to arrive at the source until the reconfiguration completes. The consistent performance impact of Squall implies that it is well suited for both expansion and contraction reconfigurations that do not have an immediate deadline. When using Squall, the reconfiguration takes longer than stop-and-copy, as data is either moved in response to a transaction requesting data at the source, or moved by asynchronous pulls that are intentionally spaced apart to minimize the impact of pulling ranges with a single-threaded design. However, during the entire reconfiguration process the nodes are available to process transactions. Since the data
is entirely in-memory, a stop-and-copy approach quickly migrates data between partitions with only a few seconds of downtime.

Figure 6.10: Node Removal. A reconfiguration to contract from 8 partitions on two nodes to 4 partitions on one node. (Panels: (a) Throughput, YCSB Uniform; (b) Latency, YCSB Uniform; (c) Throughput, YCSB Skewed; (d) Latency, YCSB Skewed; (e) Throughput, TPC-C; (f) Latency, TPC-C. Each panel compares Stop-and-Copy with Squall over elapsed time in seconds.)

For highly available systems this downtime may be unacceptable. For example, if a system desires five nines (99.999%) of uptime, each week the system can only afford 6 seconds of downtime.
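For reference, this weekly budget follows directly from the availability target: a week contains 7 × 24 × 3600 = 604,800 seconds, so an uptime of 99.999% leaves (1 − 0.99999) × 604,800 ≈ 6.05 seconds of allowable downtime per week.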
Figure 6.11: The impact of migrating larger databases on mean throughput (mean TPS percentage decrease versus the amount of data, in MB, pulled during a TPC-C reconfiguration).

Database Size Sensitivity Analysis

As discussed in Section 6.4.1, there are several aspects of the database that can affect the runtime behavior of Squall. The most substantial of these is the amount of data that has to be migrated. In this next experiment, we vary the size of the migrated data to measure its impact on throughput. We use the TPC-C benchmark with various amounts of data associated with each warehouse. We run several different reconfigurations and measure the impact of migrating data on the system's throughput. As shown in the results in Fig. 6.11, the larger database sizes disrupt transaction throughput due to having more data to migrate overall and more data required to answer transactions that access multiple tables or tables partitioned on non-unique columns.

6.8 Future Work

There are several areas of future research that we are interested in exploring with Squall. With Squall we have addressed how to reconfigure a partitioned database in a live manner. Further, we want to develop on-line models to detect when the DBMS should perform a reconfiguration and then to automatically generate a new near-optimal partition plan for the database [36]. Because Squall supports fine-grained reconfigurations and partition assignment, we are able to evaluate a variety of possible scenarios and models. We are also extending the operational side of Squall to include several optimizations to reduce the total time of a reconfiguration. For example, we are interested in employing machine learning (ML) techniques to infer what data transactions will access immediately when they are queued and then prefetch it from the
source partitions [66]. We are also investigating the use of ML for implementing dynamic chunking, where the size of chunks can change automatically in response to variations in the load at runtime. Lastly, we are interested in expanding the number of reconfiguration protocols implemented in H-Store (beyond Squall and stop-and-copy) to perform a more thorough evaluation. From this investigation, we hope to develop techniques that will allow the DBMS to automatically choose the right protocol to employ at runtime for an arbitrary application.

6.9 Summary

We introduced a new approach, called Squall, for fine-grained reconfiguration of OLTP databases in partitioned, main memory distributed DBMSs. Squall supports the migration of data between partitions in a cluster in a transactionally safe manner even in the presence of distributed transactions. We performed an extensive evaluation of our approach on a main memory distributed DBMS using OLTP benchmarks. We compared Squall with the naïve stop-and-copy technique. The results from our experiments show that Squall can reconfigure a database with no downtime and a minimal overhead on transaction latency.
Part III
The End for Now
Chapter 7
Conclusion and Future Work

Not all those who wander are lost.
J. R. R. Tolkien

7.1 Conclusion

Database proliferation within organizations drives increased costs and wasted resources. These costs arise from human administration, capital expenditures for hardware, licensing, and the resources required to power, cool, and store the servers. The costs associated with running each database server do not heavily fluctuate with the server's utilization. While high levels of data processing will require additional cooling to mitigate generated heat, the majority of the aforementioned costs remain regardless of the server's utilization level. With a historical architecture that assumes each database instance is dedicated to hosting one application, hosting applications that do not always require the resources of a dedicated server results in wasted resources and costs. Building a database-as-a-service offering to consolidate hosted databases onto a reduced number of servers improves the effectiveness of database management and reduces costs. Moving to a database service platform benefits both the service provider and service users. Service providers that host many database applications are able to leverage economies of scale to amortize the costs associated with hosting and can become specialized in the automation of administrative tasks. Users of a database service can reduce the upfront investment typically associated with provisioning and setting up a new database system. Additionally, the user can leverage a pay-as-you-go model, which enables them to pay based on how much they are utilizing the hosted database.
A database platform must balance the level of tenant consolidation and how resources are shared and isolated between hosted tenants. This dissertation focuses on a database platform that relies on soft isolation, or the intelligent placement of tenants, to control how tenants receive the required resources to process their database requests in a timely manner. This dissertation also focuses on a multitenancy model that uses a single database process to host multiple tenants on a single server. We target this shared process model for its ability to leverage unmodified database engines and to provide effective tenant consolidation from limited and coordinated database activities. In other multitenancy models, such as shared hardware, where multiple database processes exist on a single server, redundant components and uncoordinated resource utilization limit the consolidation of tenants. In order to effectively consolidate tenants in a soft isolation environment, a system must have a mechanism to estimate what physical resources (e.g. memory size or disk IOPS) a tenant will need to answer queries in a responsive manner. An accurate resource estimation per tenant is not readily available in our target environment. When tenants share a single database process, resource attribution is reported to the OS at the database process level, and not by the individual tenants. Additionally, resources consumed by tenants are not always additive when colocated [25, 6]. Therefore, along with estimating resource requirements, an effective tenant placement strategy must account for the impact of colocation when placing tenants together. To address these issues we presented Pythia, a technique that leverages supervised learning to model tenant resource consumption and to model how various tenant classes will colocate. Pythia leverages an expert system administrator to identify key database-level attributes (e.g. cache hit ratio or database size) and to provide training data to identify how tenants consume resources based on the database-level attributes. The administrator also specifies performance objectives and how aggressively the system should consolidate tenants by setting resource consumption targets. Pythia empirically learns how tenants colocate with regard to the guidelines set by the administrator. This allows Pythia to incrementally learn ideal tenant workloads to colocate. Our presented system controller, Delphi, monitors tenant performance to ensure that query latency service level objectives (SLOs) are continually being met. When a performance objective is no longer being met, either due to the introduction of new tenants or due to an evolution in a colocated workload, Delphi uses Pythia's models to identify a new tenant-to-server mapping. When a violation occurs, Delphi identifies a set of tenants to remove from the server in violation. This set of tenants is identified by using Pythia's colocation model to find the minimal set to remove which is expected to put the server back to acceptable resource consumption.
The removed tenants are then assigned new destination servers that can receive them without creating a new violation, again by using the colocation model. Delphi employs a local search heuristic, hill climbing [69], to identify the set of tenant migrations that improves the overall balance of load on the system. Because soft isolation uses the placement of tenants to ensure resource access, a migration primitive must be enabled to move tenants between servers when the system must be load balanced. While replication mechanics will enable some load balancing [61, 70], there will be scenarios when the replicas are not viable destinations due to existing workloads on the replica. This dissertation analyzes various migration techniques and presents forms of migration to abstract and categorize them. The presented forms of migration include asynchronous, synchronous, and live migration. An asynchronous migration relies on stopping workload execution at the source node while the state of the data is copied to the destination. This copy can either be performed by copying the local database files or by using asynchronous replication [16] to ship updates to the destination. A synchronous migration relies on using active, or synchronous, replication [16] to synchronize the state of the source and destination. Afterwards, the source server can be disabled. Live migration copies the persistent state between servers without making the database unavailable for any period of time. In order to evaluate different migration techniques, a framework for evaluating migration is presented. This framework outlines key metrics to determine the impact of a migration on an active tenant. Attributes include the amount of time the system is unavailable (downtime), the number of failed operations, the external coordination required to complete the migration, and the latency overhead incurred on transactions. This dissertation presents Zephyr, the first live migration technique for shared-nothing databases. Zephyr uses a reactive, pull-based migration that shifts the workload to the destination as soon as possible. The migration is broken into phases that signify where workload execution can occur. During the phase with both source and destination executing the workload, unique page ownership is used to ensure data consistency between the servers. While disk-based, shared-nothing databases work well for systems that host many tenants with low throughput requirements, main memory databases can provide a database platform with the ability to provide a higher throughput capacity. One popular architecture for main memory databases is to partition the data with a single-threaded execution model per partition. These partitions can be distributed across different servers. These partitioned databases are subject to performance
148 Chapter 7. Conclusion and Future Work problems that arise from excessive distributed transactions or hotspots on partitions. To respond to such performance issues, these systems require an ability to change how data is partitioned. To address this challenge, we present a live reconfiguration technique, Squall. While similar to migration, a reconfiguration for a partitioned database migrates data at any granularity (i.e. any subset of the local data), while addressing the presence of distributed transactions and working within the confines of a single threaded access model. Squall focuses on how to maintain correctness while minimizing disruption to the single thread partition executor by chunking migrating ranges into smaller sections. 7.2 Future Work The promise of system elasticity is to dynamically scale up and down capacity in response to application needs. This dissertation presents several critical primitives to enable elasticity in a soft isolation based database platform. With these primitives, a system can implement higher level algorithms to control elasticity in a database platform. The design and implementation of these elastic control algorithms provides an extremely rich area for future research. An elastic database will need to self-manage the decision on when to expand and contract server resources. This calls for solutions within a system controller to make these decisions with limited input from a system administrator. The research presented explores reactive load-balancing when performance objectives are violated. During periods of steady state, a controller must model if the system can be contracted into a fewer number of servers without creating performance violations due to the reduced resources. A cost benefit analysis based on server operating costs and potential violation costs should be utilized when making any decision about consolidation. By analyzing historical trends a controller will predict if and when another increase in activity is likely to occur. With this information the controller can determine if the costs associated with consolidating (in terms of migration impact) are justified by being able to retain the consolidated state for a long enough time period while maintaining performance objectives. Instead of a reactive load-balancing and expansion detection, a system controller can proactively load-balance tenants or expand the system capacity. A proactive expansion algorithm requires the ability to detect temporal trends in tenant utilization. This trend analysis will have to filter out sporadic bursts of activity that can come from ad-hoc utilization or short lived spikes in utilization. Long term shifts in tenants that result in increased resource utilization, allows for the system to expand the capacity and load-balance tenant workloads before 124
Similar to contraction, expansion decisions should also analyze historical patterns to identify likely growth periods. Decisions about expansion and contraction will also need to factor the cost of migration into the decision making. During migration, the moving tenant is likely to suffer degraded performance, and the colocated tenants at the source and destination are also likely to incur disruption due to the impact of the migration. Modeling the costs associated with a migration must account not only for attributes of the migrating tenant, but also for the workloads at the involved servers, the servers' available capacity, and the network capacity. A complete elasticity solution must factor in the costs of migrating tenants when making decisions about load balancing or changing system capacity.

While an elastic transactional database can reduce the costs associated with hosting a database platform, the benefits are not as prominent for organizations with a statically provisioned infrastructure. For systems with static resources (e.g., two racks of servers dedicated to database hosting), a database platform could instead integrate transactional and analytic workloads to maximize resource utilization. Here, the system can dynamically adjust the resource capacity dedicated to the transactional workload to ensure that performance objectives are met. If the system has excess capacity, the remaining resources should be dedicated to analytic workloads. If there are no active analytic workloads, the system can pre-compute materialized views or perform common aggregation queries for future analysis. Such a system would also need to gracefully degrade its analytic components when the transactional component consumes the majority of available resources; for example, the system can explore using older snapshots or views to provide analysis, or use approximate results to answer queries with reduced resources. Building a data platform that elastically integrates analytic and transactional workloads is a promising area of research that demands new solutions across the data platform.
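To make the capacity split just described concrete, the following is a minimal sketch of such an allocation policy. The function name, headroom factor, and thresholds are illustrative assumptions, not a design proposed in this dissertation.

```python
def allocate_capacity(total_units, txn_demand_units, txn_headroom=1.2):
    """Split statically provisioned capacity between OLTP and analytics.

    total_units:      total resource units in the static cluster
    txn_demand_units: capacity currently needed to meet OLTP objectives
    txn_headroom:     safety factor reserved for transactional bursts
    """
    txn_share = min(total_units, txn_demand_units * txn_headroom)
    analytic_share = total_units - txn_share
    if analytic_share <= 0:
        # Transactional side needs everything: degrade analytics gracefully,
        # e.g., answer from older snapshots or return approximate results.
        mode = "degraded (stale snapshots / approximate answers)"
    elif analytic_share < 0.1 * total_units:
        mode = "background only (pre-compute views, common aggregates)"
    else:
        mode = "full analytic queries"
    return {"transactional": txn_share,
            "analytic": max(analytic_share, 0),
            "analytic_mode": mode}

# Example: a 100-unit cluster under light vs. heavy OLTP load.
print(allocate_capacity(100, txn_demand_units=40))  # capacity left for analytics
print(allocate_capacity(100, txn_demand_units=90))  # analytics degraded
```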
Bibliography

[1] H-Store.
[2] MemSQL.
[3] MongoDB.
[4] VMware vFabric SQLFire.
[5] VoltDB.
[6] M. Ahmad and I. T. Bowman. Predicting system performance for multitenant database workloads. In ACM DBTest, pages 1–6,
[7] Personal communications with Jerry Zheng, VP Web Operations at AppFolio Inc., April
[8] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. A view of cloud computing. Commun. ACM, 53(4):50–58, Apr
[9] S. Aulbach, T. Grust, D. Jacobs, A. Kemper, and J. Rittinger. Multi-tenant databases for software as a service: schema-mapping techniques. In ACM International Conference on Management of Data (SIGMOD), pages ,
[10] J. Baker, C. Bond, J. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. In Conference on Innovative Data Systems Research (CIDR), pages ,
[11] S. K. Barker, Y. Chi, H. J. Moon, H. Hacigümüş, and P. J. Shenoy. "Cut me some slack": latency-aware live migration for databases. In International Conference on Extending Database Technology (EDBT), pages ,
[12] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, and P. O'Neil. A critique of ANSI SQL isolation levels. In ACM International Conference on Management of Data (SIGMOD), pages 1–10,
[13] P. A. Bernstein, I. Cseri, N. Dani, N. Ellis, A. Kalhan, G. Kakivaya, D. B. Lomet, R. Manner, L. Novik, and T. Talius. Adapting Microsoft SQL Server for Cloud Computing. In International Conference on Data Engineering (ICDE), pages ,
[14] P. A. Bernstein and N. Goodman. Timestamp-based algorithms for concurrency control in distributed database systems. In Very Large Data Bases (VLDB), pages ,
[15] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison Wesley, Reading, Massachusetts,
[16] P. A. Bernstein and E. Newcomer. Principles of Transaction Processing. Morgan Kaufmann Publishers Inc., second edition,
[17] R. Bradford, E. Kotsovinos, A. Feldmann, and H. Schiöberg. Live wide-area migration of virtual machines including local persistent state. In Virtual Execution Environments, pages ,
[18] R. Cattell. Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39:12–27,
[19] S. Chandrasekaran and R. Bamford. Shared cache - the future of parallel databases. In International Conference on Data Engineering (ICDE), pages ,
[20] N. Chohan, C. Bunch, S. Pang, C. Krintz, N. Mostafa, S. Soman, and R. Wolski. AppScale: Scalable and Open AppEngine Application Development and Deployment. In CloudComp,
[21] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages ,
[22] B. F. Cooper et al. Benchmarking Cloud Serving Systems with YCSB. In Symposium on Cloud Computing (SoCC), pages ,
[23] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking Cloud Serving Systems with YCSB. In Symposium on Cloud Computing (SoCC), pages ,
[24] J. Cowling and B. Liskov. Granola: low-overhead distributed transaction coordination. In USENIX Annual Technical Conference, pages 21–34, June
[25] C. Curino, E. Jones, S. Madden, and H. Balakrishnan. Workload-Aware Database Monitoring and Consolidation. In ACM International Conference on Management of Data (SIGMOD),
[26] C. Curino, E. Jones, S. Madden, and H. Balakrishnan. Workload-Aware Database Monitoring and Consolidation. In ACM International Conference on Management of Data (SIGMOD),
[27] C. Curino, E. Jones, R. Popa, N. Malviya, E. Wu, S. Madden, H. Balakrishnan, and N. Zeldovich. Relational Cloud: A Database Service for the Cloud. In Conference on Innovative Data Systems Research (CIDR), pages ,
[28] C. Curino, E. P. C. Jones, S. Madden, and H. Balakrishnan. Workload-aware database monitoring and consolidation. In ACM International Conference on Management of Data (SIGMOD),
[29] C. Curino, Y. Zhang, E. P. C. Jones, and S. Madden. Schism: a workload-driven approach to database replication and partitioning. Proc. Very Large Data Bases (VLDB), 3(1):48–57,
[30] S. Das, S. Agarwal, D. Agrawal, and A. El Abbadi. ElasTraS: An Elastic, Scalable, and Self Managing Transactional Database for the Cloud. Technical Report , CS, UCSB,
[31] S. Das, D. Agrawal, and A. El Abbadi. ElasTraS: An Elastic Transactional Data Store in the Cloud. In USENIX HotCloud,
[32] S. Das, S. Nishimura, D. Agrawal, and A. El Abbadi. Live Database Migration for Elasticity in a Multitenant Database for Cloud Platforms. Technical Report , CS, UCSB,
[33] S. Das, S. Nishimura, D. Agrawal, and A. El Abbadi. Albatross: Lightweight Elasticity in Shared Storage Databases for the Cloud using Live Data Migration. Proc. Very Large Data Bases (VLDB), 4(8): , May
[34] D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35(6):85–98,
[35] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. R. Stonebraker, and D. Wood. Implementation techniques for main memory database systems. ACM SIGMOD Record, 14(2):1–8,
[36] A. J. Elmore, S. Das, D. Agrawal, and A. El Abbadi. Towards an elastic and autonomic multitenant database. NetDB,
[37] A. J. Elmore, S. Das, D. Agrawal, and A. El Abbadi. Zephyr: Live Migration in Shared Nothing Databases for Elastic Cloud Platforms. In ACM International Conference on Management of Data (SIGMOD), pages ,
[38] A. J. Elmore, S. Das, A. Pucher, D. Agrawal, A. El Abbadi, and X. Yan. Characterizing tenant behavior for placement and crisis mitigation in multitenant DBMSs. ACM International Conference on Management of Data (SIGMOD), pages ,
[39] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. The notions of consistency and predicate locks in a database system. Commun. ACM, 19(11): ,
[40] Facebook Statistics. statistics, Retrieved Nov 30,
[41] N. Folkman. So, that was a bummer. 10/05/so-that-was-a-bummer/, October
[42] H. Garcia-Molina and K. Salem. Main memory database systems: An overview. IEEE Trans. on Knowl. and Data Eng., 4(6): , Dec
[43] J. Gray. Notes on data base operating systems. In Operating Systems, An Advanced Course, pages , London, UK, Springer-Verlag.
[44] J. N. Gray, R. A. Lorie, and G. R. Putzolu. Granularity of locks in a shared data base. In Very Large Data Bases (VLDB), pages ,
[45] L. Grit, D. Irwin, A. Yumerefendi, and J. Chase. Virtual machine hosting for networked clusters: Building the foundations for autonomic orchestration. In International Workshop on Virtualization Technology in Distributed Computing,
[46] H2 Database Engine.
[47] T. Haerder and A. Reuter. Principles of transaction-oriented database recovery. ACM Comput. Surv., 15(4): , Dec
[48] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. OLTP through the looking glass, and what we found there. In ACM International Conference on Management of Data (SIGMOD), pages ,
[49] P. Helland, H. Sammer, J. Lyon, R. Carr, P. Garrett, and A. Reuter. Group commit timers and high volume transaction systems. In High Performance Transaction Systems (HPTS),
[50] D. Jacobs and S. Aulbach. Ruminations on multi-tenant databases. In Database Systems for Business, Technology and Web (BTW), pages ,
[51] E. P. Jones. Fault-Tolerant Distributed Transactions for Partitioned OLTP Databases. PhD thesis, MIT,
[52] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. B. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. H-Store: a high-performance, distributed main memory transaction processing system. Proc. Very Large Data Bases (VLDB), 1(2): ,
[53] W. Lang, S. Shankar, J. Patel, and A. Kalhan. Towards multi-tenant performance SLOs. In International Conference on Data Engineering (ICDE), pages ,
[54] K. Li and J. F. Naughton. Multiprocessor main memory transaction processing. Symposium on Databases in Parallel and Distributed Systems, pages ,
[55] H. Liu, H. Jin, X. Liao, L. Hu, and C. Yu. Live migration of virtual machine based on full system trace and replay. In ACM International Symposium on High Performance Distributed Computing, pages ,
[56] R. Liu, A. Aboulnaga, and K. Salem. DAX: A widely distributed multi-tenant storage service for DBMS hosting. In Very Large Data Bases (VLDB),
[57] Z. Liu, H. Hacigümüş, H. J. Moon, Y. Chi, and W.-P. Hsiung. PMAX: tenant placement in multitenant databases for profit maximization. In International Conference on Extending Database Technology (EDBT), pages ,
[58] N. Malviya. Recovery algorithms for in-memory OLTP databases. Master's thesis, MIT,
[59] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems (TODS), 17(1):94–162,
[60] C. Mohan, B. G. Lindsay, and R. Obermarck. Transaction Management in the R* Distributed Database Management System. ACM Transactions on Database Systems (TODS), 11(4): ,
[61] H. J. Moon, H. Hacigümüş, Y. Chi, and W.-P. Hsiung. SWAT: a lightweight load balancing method for multitenant databases. In International Conference on Extending Database Technology (EDBT), pages 65–76,
[62] B. Mozafari, C. Curino, and S. Madden. DBSeer: Resource and performance prediction for building a next generation database cloud. In Conference on Innovative Data Systems Research (CIDR),
[63] V. Narasayya, S. Das, M. Syamala, B. Chandramouli, and S. Chaudhuri. SQLVM: Performance isolation in multi-tenant relational database-as-a-service. In CIDR,
[64] O. Ozmen, K. Salem, M. Uysal, and M. H. S. Attar. Storage workload estimation for database management systems. In ACM International Conference on Management of Data (SIGMOD), pages ,
[65] A. Pavlo, C. Curino, and S. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In ACM International Conference on Management of Data (SIGMOD), pages 61–72,
[66] A. Pavlo, E. P. Jones, and S. Zdonik. On predictive modeling for optimizing transaction execution in parallel OLTP systems. Proc. Very Large Data Bases (VLDB), 5:85–96, October
[67] T. Rafiq. Elasca: Workload-aware elastic scalability for partition based database systems. Master's thesis, University of Waterloo,
[68] B. Reinwald. Database support for multi-tenant applications. In IEEE Workshop on Information and Software as Services,
[69] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition,
[70] J. Schaffner, T. Januschowski, M. Kercher, T. Kraska, H. Plattner, M. J. Franklin, and D. Jacobs. RTP: Robust tenant placement for elastic in-memory database clusters. In ACM International Conference on Management of Data (SIGMOD), pages ,
[71] O. Schiller, N. Cipriani, and B. Mitschang. ProRea: live database migration for multi-tenant RDBMS with snapshot isolation. In International Conference on Extending Database Technology (EDBT), pages 53–64,
[72] M. B. Sheikh, U. F. Minhas, O. Z. Khan, A. Aboulnaga, P. Poupart, and D. J. Taylor. A Bayesian approach to online performance modeling for database appliances using Gaussian models. In International Conference on Autonomic Computing (ICAC), pages , ACM,
[73] R. Shoup and D. Pritchett. The eBay architecture. SD Forum, November
[74] J. Sobel. Scaling Out (Facebook). April
[75] A. A. Soror, U. F. Minhas, A. Aboulnaga, K. Salem, P. Kokosielis, and S. Kamath. Automatic virtual machine configuration for database workloads. In ACM International Conference on Management of Data (SIGMOD), pages ,
[76] G. Soundararajan, D. Lupei, S. Ghanbari, A. D. Popescu, J. Chen, and C. Amza. Dynamic resource allocation for database servers running on virtual storage. In USENIX Conference on File and Storage Technologies (FAST), pages 71–84,
[77] R. Stoica, J. J. Levandoski, and P.-A. Larson. Identifying hot and cold data in main-memory databases. In International Conference on Data Engineering (ICDE), pages 26–37,
[78] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The End of an Architectural Era (It's Time for a Complete Rewrite). In Very Large Data Bases (VLDB), pages ,
[79] The Transaction Processing Performance Council. TPC-C benchmark (Version ),
[80] B. Urgaonkar, P. Shenoy, and T. Roscoe. Resource overbooking and application profiling in shared hosting platforms. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages ,
[81] C. D. Weissman and S. Bobrowski. The design of the Force.com multitenant internet application development platform. In ACM International Conference on Management of Data (SIGMOD), pages ,
[82] A. Whitney, D. Shasha, and S. Apter. High Volume Transaction Processing Without Concurrency Control, Two Phase Commit, SQL or C++. In High Performance Transaction Systems (HPTS),
[83] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Data Management Systems. Morgan Kaufmann Publishers Inc., second edition,
[84] G. won You, S. won Hwang, and N. Jain. Scalable load balancing in cluster storage systems. In Middleware, pages ,
[85] P. Xiong, Y. Chi, S. Zhu, H. J. Moon, C. Pu, and H. Hacigümüş. Intelligent management of virtualized resources for database systems in cloud environment. In International Conference on Data Engineering (ICDE), pages 87–98,
[86] P. Xiong, Y. Chi, S. Zhu, J. Tatemura, C. Pu, and H. Hacigümüş. ActiveSLA: A profit-oriented admission control framework for database-as-a-service providers. In Symposium on Cloud Computing (SoCC), pages 15:1–15:14,
[87] F. Yang, J. Shanmugasundaram, and R. Yerneni. A scalable data platform for a large number of small applications. In Conference on Innovative Data Systems Research (CIDR),