Report for the seminar Algorithms for Database Systems
F1: A Distributed SQL Database That Scales
Bogdan Aurel Vancea
May 2014

1 Introduction

F1 [1] is a distributed relational database developed by Google, used mainly for the Google AdWords business. F1 combines the scalability of NoSQL systems with the consistency offered by SQL databases. The name F1 is an abbreviation of Filial 1 hybrid, which in biology denotes the first generation of offspring of two very different parent species. The name symbolizes the fact that F1 is the result of combining NoSQL and SQL databases. One of the most important design aspects of F1 is that it is built on top of a distributed key-value store. This key-value store, called Spanner [2], is a NoSQL database created by Google that provides synchronous cross-datacenter replication and strong consistency. This design choice results in a relatively high commit latency for transactions, which F1 mitigates through several design optimizations. As a result, the latency of applications using F1 is similar to the latency of the previous database solution used by the Google AdWords product. In addition, F1 provides better scalability, reliability and availability.

2 Goals

The goal of this report is to analyze the approach taken by F1 for designing scalable SQL databases. For the purpose of this report, a scalable SQL database is defined as a database that:

- provides strong consistency semantics. This means that the system should always present a consistent state. A strongly consistent state is the basis for ACID transactions; however, this report will not go into the details of such transactions.
- is scalable both geographically and in terms of data storage and request load. The system should be able to scale transparently, simply by adding additional nodes, in order to handle more data or an increasing number of requests per second.
The architecture proposed by F1 builds such a scalable SQL database by adding a layer of SQL processing on top of a distributed key-value store. The distributed key-value store fulfills the scalability and consistency requirements for simple operations on key-value pairs, while relational abstractions such as tables, SQL processing and ACID transactions are implemented by an additional database middleware. The main issue in such a system is the extra network latency introduced by the middleware server layer. Since distributed databases can be deployed across multiple data centers, this extra network latency can add a significant penalty to read and write latencies, and databases with high latencies and potentially low throughput are not acceptable for large-scale workloads. In this context, this report presents the architecture of the multilayer system outlined above, analyzes the impact of the additional layer on read and write latencies, and describes some of the solutions used to mitigate the extra latency.

3 System Architecture

Figure 1: The architecture of F1

This section details the distributed database architecture proposed by F1. The architecture of the system is presented in Figure 1. In this model, the data is stored on the distributed key-value store servers, while the query processing is performed by the middleware servers. In the case of the F1 database, the key-value store servers are Spanner servers and the middleware servers are F1 servers.
The Spanner key-value store servers use a distributed file system implemented by Google, called Colossus [3], the second generation of the Google File System [4]. Conceptually, the relational data is stored as rows in tables; however, at the level of the key-value store servers each table row is stored as multiple key-value pairs. This implementation detail is abstracted by the middleware servers, which convert SQL queries operating on table rows into low-level operations on key-value pairs. The database is accessed by sending SQL queries through a client library to one of the middleware servers. The middleware servers process the SQL queries and produce a list of low-level operations on key-value pairs. These low-level operations are then forwarded by the middleware server to one or more of the key-value store nodes that hold the data affected by them. Because of the strong consistency requirement, a consistent view of the data must be maintained at all times, and the middleware server can consider a write operation completed only after it has received an acknowledgement from the key-value store servers signifying that the write finished successfully. A common way to increase the availability and fault tolerance of a distributed database system is to replicate the stored data across multiple nodes. In that case, to maintain strong consistency, the middleware node also has to obtain an acknowledgement from all of the replicas of the node that holds the data to be written before considering the write completed.

The main advantage of this multilayer architecture is that the data processing components are physically separated from the data storage components. Because the data is stored only on the key-value store nodes, the data storage capacity of the system can be scaled independently of the query processing capacity. To increase the query processing power of the system, one only needs to add more middleware nodes. Because these middleware nodes do not store any relational data, this operation does not incur a data redistribution cost. To increase the amount of data that can be stored by the system, more key-value store nodes need to be added. It is important to note that adding new storage nodes does incur a data redistribution cost. If the new node is a replica of an existing node, it needs to load the state of the node it replicates. If the new node is a non-replica node, the data already stored by the system needs to be redistributed among the nodes.

The disadvantage of this architecture is that every data access operation needs at least two network round-trips in addition to the disk operation. For read operations, a network request is made from the client to the middleware server, followed by an additional network request from the middleware to the key-value store server holding the requested data. In this case, having multiple replicas of key-value store nodes allows requests from different clients for the same data to be served by different nodes, which mitigates some of the extra latency. For write operations, in addition to the network request between the middleware and the key-value store server, network requests need to be made to all replicas of the node storing the data to be updated. Unlike for read operations, the presence of replica nodes therefore affects the write latency negatively.
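As an illustration of the write path just described, the following is a minimal, self-contained sketch of a toy in-process model of the system. The class and method names (StorageNode, MiddlewareServer, write_row) and the mapping of one key-value pair per column are illustrative assumptions, not F1's or Spanner's actual interfaces.

```python
# A minimal sketch of the write path described above: the middleware
# translates a row write into key-value writes and reports completion only
# after the storage node and all of its replicas have acknowledged them.

class StorageNode:
    """A key-value store node together with its synchronous replicas."""

    def __init__(self, name, replicas=()):
        self.name = name
        self.store = {}                 # key -> value
        self.replicas = list(replicas)  # synchronously maintained copies

    def put(self, key, value):
        """Apply the write locally and return an acknowledgement."""
        self.store[key] = value
        return True


class MiddlewareServer:
    """Translates row-level writes into key-value writes and enforces the
    strong-consistency rule: a write is complete only after the node holding
    the data and every one of its replicas has acknowledged it."""

    def __init__(self, nodes):
        self.nodes = nodes  # maps a table name to the node holding its rows

    def write_row(self, table, primary_key, columns):
        node = self.nodes[table]
        acks = []
        # Assumed mapping: one key-value pair per column,
        # keyed by (table, primary key, column name).
        for column, value in columns.items():
            key = (table, primary_key, column)
            acks.append(node.put(key, value))
            # Synchronous replication: wait for every replica to acknowledge.
            for replica in node.replicas:
                acks.append(replica.put(key, value))
        # Only now can the middleware report the write as completed.
        return all(acks)


# Example usage: two replicas of the node storing the Manufacturer table.
replica_1, replica_2 = StorageNode("r1"), StorageNode("r2")
leader = StorageNode("leader", replicas=[replica_1, replica_2])
middleware = MiddlewareServer({"Manufacturer": leader})
ok = middleware.write_row("Manufacturer", 1, {"Name": "Samsung", "Country": "KR"})
print("write completed:", ok)
```

In this toy model the middleware returns only after the node and every replica have acknowledged the write, which is exactly the behaviour that makes replication costly for write latency.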
The next section will analyze the impact of replication on write requests in a strongly consistent system.
Figure 2: Replication of a write operation using the Paxos algorithm

4 Synchronous Replication

There are multiple models for replicating writes to replicas, each providing a certain level of consistency. The synchronous replication model ensures that every write request is atomically performed on all replicas. In this model, the node that contains the main copy of the data is called the leader node, while the nodes containing copies of the data are simply called replicas. The replication process is initiated by the leader and is finished once all of the replicas have applied the write. If multiple writes need to be replicated by a middleware server, it is possible to initiate the replication of each write on a different node in order to increase parallelism. In such a case, a consensus algorithm needs to be used for replication. F1 uses the Paxos consensus algorithm for the replication process. This algorithm ensures that replication finishes successfully even in the presence of multiple leaders. Figure 2 shows the messages exchanged during a replication round of the Paxos algorithm. First, an SQL query is sent from the client to one of the middleware servers. This update query is then converted into a single key-value operation, which is sent to the key-value store servers. The best-case scenario for the algorithm is the following: the key-value store server receiving the update initiates the replication process and sends propose messages with the new value to the replicas. The replicas accept the newly proposed value and send an acknowledge message back to the leader. The leader counts the acknowledge messages from the replicas as votes. If a majority of the replicas have accepted the proposed update, the leader sends a commit message to the replicas. Only after the commit has been performed by all of the replicas is the replication process finished.
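The best-case round described above can be summarized in a short sketch. The snippet below is a heavily simplified, single-process illustration of the propose/acknowledge/commit exchange; the Replica class and replicate_write function are hypothetical, and this is not the Paxos implementation used by Spanner (in particular, it ignores leader election, ballot numbers and failure handling).

```python
# A highly simplified sketch of the best-case replication round described
# above (propose, acknowledge, commit), seen from the leader's side.

class Replica:
    def __init__(self, name):
        self.name = name
        self.proposed = None   # value received in the propose phase
        self.committed = None  # value made durable in the commit phase

    def propose(self, value):
        """Best case: the replica always accepts and acknowledges."""
        self.proposed = value
        return True

    def commit(self):
        self.committed = self.proposed


def replicate_write(value, replicas):
    """Run one best-case replication round.

    Round-trip 1: propose the new value and collect acknowledgements.
    Round-trip 2: if a majority acknowledged, commit on every replica.
    """
    votes = sum(1 for r in replicas if r.propose(value))
    if votes <= len(replicas) // 2:
        return False  # no majority: the round fails and must be retried
    for r in replicas:
        r.commit()
    return True


replicas = [Replica("replica-%d" % i) for i in range(3)]
assert replicate_write(("Phone", 42, "Name", "Galaxy"), replicas)
print([r.committed for r in replicas])
```

Even in this best case the leader needs the two round-trips counted below, which is the cost the rest of the report tries to amortize.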
If multiple updates are initiated at the same time, a single leader is chosen to perform all the updates in a valid order. This case will not be covered in this report. However, one can see that even in the simplest case, when a single write is replicated successfully, the algorithm needs two network round-trips between the leader and each replica: one for the propose message and another for the commit message. These additional network round-trips increase the latency of write operations. There are several ways in which a high write latency can be mitigated in a database system. The following sections analyze the optimizations proposed by F1 to deal with the high write latency.

5 Data Model

F1 proposes a hierarchical data model to reduce the number of write requests required for update operations. The data model used by F1 is very similar to the data model of modern relational databases. F1 stores data as rows in tables; however, the internal storage is slightly different from that of traditional databases. F1 provides some extensions to the traditional data model: explicit table hierarchies and column support for Protocol Buffers. From the logical point of view, in the clustered hierarchical model the tables are organized in a hierarchy, in which each table can be the parent of one or more child tables. A table that has no parent is called a root table. From the physical point of view, all child tables are stored clustered with their parent tables, meaning that the rows of the parent and child tables are interleaved. The remainder of this section presents the differences between the traditional,
normalized relational model and the hierarchical clustered schema proposed by F1, using as an example a database that holds data about mobile manufacturers.

Figure 3: A possible normalized relational schema for mobile manufacturers

An example relational schema for a mobile manufacturer database is illustrated in Figure 3. There are four tables: Manufacturer, Phone, Tablet and an additional SIM Support table that tracks the types of SIM cards that can be used with each mobile phone. In this traditional, normalized relational model, all rows that belong to the same table are usually stored in the same file on disk. Figure 4 shows an example table hierarchy for the mobile database. In this hierarchy, Manufacturer is the root table, while the Tablet and Phone tables are its child tables. The SIM Support table is a child table of the Phone table, as the SIM information is only related to phones.

Figure 4: A possible hierarchy for a mobile manufacturer database

The on-disk layout of rows in a clustered hierarchical schema differs from the layout used by the relational schema. While in the relational schema the rows of each table are stored one after the other on disk, in the clustered hierarchical schema the child rows are stored interleaved with the parent rows. The storage layout for the mobile manufacturer hierarchy is shown in Figure 5. In the example, the rows of the Phone and Tablet tables are stored right after the rows of the corresponding Manufacturer entries, and the rows of the SIM Support table are stored right after the corresponding rows of the Phone table. An additional storage constraint set by the hierarchical clustered schema is that all rows associated with a root row must be stored on the same node. This includes not only the direct children, but also the children's children and so on. The main advantage of such a hierarchical schema is that all the rows belonging to a
single root row are accessible using a range scan starting from that root row. For example, updating attributes belonging to all tablets or phones manufactured by Samsung can be done with a single scan starting from the Manufacturer row of Samsung. Because of the constraint that all child rows of a root row are stored on the same node, if a transaction needs to apply multiple updates to a root row hierarchy, all of the writes are directed to a single node. This is important because multiple updates belonging to the same transaction can then be batched in a single network message. This request batching is described in the following section. The main disadvantage of this hierarchical model is that the domain data needs to exhibit a suitable hierarchy. If the tables cannot be grouped into such hierarchies, the schema degenerates into a traditional normalized schema, where all tables are root tables and no table has any child tables. Such a schema cannot benefit from the advantages of the hierarchical model. Moreover, the fact that all child rows of a root row need to be stored on the same node limits the maximum size of a root row hierarchy to the storage space available on a single node. This could pose a problem for hierarchies in which root rows have very many child rows.

Figure 5: The storage layout for the hierarchical schema for mobile manufacturers
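The interleaved layout and the root-row range scan can be illustrated with a small sketch. In the snippet below, keys are encoded as tuples that begin with the root row's primary key; this tuple encoding and the root_row_scan helper are illustrative assumptions, not the exact key format used by F1 or Spanner.

```python
# A minimal sketch of the interleaved key layout described above, using the
# mobile manufacturer hierarchy as an example.

# Keys are tuples that start with the root row's primary key, so sorting the
# keys physically interleaves child rows under their root row.
rows = {
    (1,): {"Name": "Samsung"},                      # Manufacturer 1
    (1, "Phone", 10): {"Name": "Galaxy"},           #   child Phone
    (1, "Phone", 10, "SIM", 1): {"Type": "nano"},   #     child SIM Support
    (1, "Tablet", 20): {"Name": "Tab"},             #   child Tablet
    (2,): {"Name": "Apple"},                        # Manufacturer 2
    (2, "Phone", 30): {"Name": "iPhone"},           #   child Phone
}

def root_row_scan(rows, root_key):
    """Return the whole hierarchy of a root row with one ordered range scan:
    every key whose prefix is the root row's primary key."""
    return [(k, v) for k, v in sorted(rows.items()) if k[:len(root_key)] == root_key]

# All rows belonging to manufacturer 1 (its phones, tablets and SIM entries)
# come back from a single contiguous scan, without touching any other node.
for key, value in root_row_scan(rows, (1,)):
    print(key, value)
```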
6 Request Batching

This section details the request batching used by F1 to mitigate the high write latency. In traditional SQL databases, where data storage and query processing are done on the same node, the write latency is mainly caused by disk latency, which in turn depends on the write capacity of the device and the contention of database processes for the I/O device. In the multilayer architecture, the duration of network messages makes up an important component of the write latency. This means that the write latency can be mitigated by batching multiple write commands into a single network message. For example, if a transaction contains multiple write operations applied to data belonging to a single root row, all of these write commands can be batched into a single network message from the query processing node to the key-value store node. Moreover, this batch of updates can be replicated at the same time. Another example are updates applied to different root rows that reside on the same node. This case is illustrated in Figure 6, where two separate updates need to be applied to root rows stored on the same node. The SQL updates are translated into two write commands operating on key-value pairs, and these two commands can be batched into a single network message sent from the F1 middleware server to the key-value store server.

Figure 6: Illustration of the request batching process
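The batching step itself amounts to grouping the key-value write commands of a transaction by the node that stores their root row, so that each node receives a single network message. The sketch below illustrates this grouping; the node_for_root mapping and the send_message placeholder are hypothetical and not part of F1's actual interface.

```python
# A small sketch of the batching idea described above: key-value write
# commands produced by one transaction are grouped by the storage node that
# owns their root row, so each node receives one network message.
from collections import defaultdict

def batch_writes(writes, node_for_root):
    """Group (key, value) writes by destination node.

    Each key is assumed to start with the primary key of its root row, and
    node_for_root maps a root row's primary key to the node storing its
    whole hierarchy.
    """
    batches = defaultdict(list)
    for key, value in writes:
        root_key = key[0]  # first key component: the root row
        batches[node_for_root[root_key]].append((key, value))
    return batches

def send_message(node, batch):
    # Placeholder for one network round-trip carrying the whole batch.
    print("-> %s: %d write command(s) in one message" % (node, len(batch)))

# Two updates on different root rows (manufacturers 1 and 2) that happen to
# live on the same node are sent in a single message, as in Figure 6.
node_for_root = {1: "node-A", 2: "node-A", 3: "node-B"}
writes = [((1, "Phone", 10, "Name"), "Galaxy"),
          ((2, "Phone", 30, "Name"), "iPhone")]
for node, batch in batch_writes(writes, node_for_root).items():
    send_message(node, batch)
```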
7 Drawbacks and Alternatives

The system proposed in the F1 paper manages to provide both scalability and strong consistency. However, this comes at a certain cost:

- Higher single-row read and write latencies. In this system, read and write commands operating on single rows have a high latency. The authors report that the latency of these operations is larger than with the previous database system used for AdWords. On the other hand, reads or writes to the full row hierarchy of a single root row can also be done with a single network request.
- Higher resource cost. This architecture requires more physical nodes, since SQL query processing and data storage are done on different machines. This means that at least one middleware node is needed for query processing, in addition to the key-value store nodes.
- Need for hierarchical structure in the data. The clustered hierarchical data model is a key concept used to reduce the number of write requests associated with each transaction. If the data stored by the database cannot be grouped into an appropriate hierarchy, the reduced latency offered by this storage optimization will not be achieved.

In the architecture proposed by F1, the data is remote from the nodes that perform the query processing. An alternative architecture is to keep the data on the same nodes that perform the query processing. The authors of [5] have identified the main bottlenecks of traditional relational databases to be write-ahead logging, two-phase locking, data structure latching and buffer management. The VoltDB database [6] was implemented in the spirit of these ideas and proposes a distributed in-memory architecture. In this system, nodes are single-threaded, eliminating the need for locking and latching, while the fully in-memory architecture simplifies buffer management.

8 Conclusions

This report has analyzed F1, a distributed SQL database. Its authors successfully combine the advantages of SQL and NoSQL systems in a system that provides transparent scalability, strong consistency and very high availability. This is done using a multilayer architecture, in which the query processing components are physically separated from the data storage components. Such an architecture provides good scalability and availability, but the additional physical layer impacts the write latency negatively. This additional network latency is mitigated by using a clustered hierarchical schema instead of a traditional relational schema. Request batching is also used to group multiple commands into a single network request in order to reduce the impact of network latency. The authors report that the system has been used successfully in production and that the user-facing latency of their application is on par with the latency of the previous database system.

References

[1] Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, and Himani Apte. F1: A distributed SQL database that scales. In VLDB, 2013.

[2] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson C. Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. Spanner: Google's globally distributed database. ACM Trans. Comput. Syst., 31(3):8, 2013.
[3] Andrew Fikes. Storage architecture and challenges. http://static.googleusercontent.com/media/research.google.com/en//university/relations/facultysummit2010/storage_architecture_and_challenges.pdf, July 2010.

[4] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29-43, October 2003.

[5] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker. OLTP Through the Looking Glass, and What We Found There. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 981-992, New York, NY, USA, 2008. ACM.

[6] Michael Stonebraker and Ariel Weisberg. The VoltDB main memory DBMS.