Exploration of Non-Relational Database Models. Swayze Smartt. Department of Computer Science. Wake Forest University. Spring 2011 Honors Thesis


Advised by Dr. Stan Thomas

Abstract

While relational databases have been popular for at least the last quarter-century, new non-relational models are quickly gaining popularity, largely due to the cloud computing movement and increased reliance on distributed computing environments. These new models, relying heavily on variations of the Google-inspired MapReduce functionality, promise new efficiencies and capabilities not available in older DBMS models. [1] This paper discusses the advantages and disadvantages of this new database paradigm and examines distributed database implementations currently on the market, including Amazon's SimpleDB, Google's Bigtable, and Apache CouchDB. The paper also discusses the importance of this new database model and its likely ascent as the preferred model for distributed environments.

1. Introduction

Since the 1970s, the relational database and the associated entity-relationship models have together been the standard for database development. [2] Recent trends in both software and hardware have opened up exploration of a new non-relational database model, relying on key-value pairs instead of the combination of primary and foreign keys stored in the familiar table format. [3] This new model, driven in large part by the cloud-computing movement, promises a more efficient and scalable database structure, with faster processing through the powerful map-reduce function. [4] Despite its many benefits, the non-relational model has several drawbacks, the most notable being its frequent inability to structure data in a semantically meaningful way.

[1] Bhat and Jadhav, p.
[2] Harrington, p.
[3] Dean and Ghemawat, p.
[4] Dean and Ghemawat, p. 77

Regardless of whether the non-relational model becomes the new database standard, an unlikely event, the new model already has widespread use in specific applications and is projected to become more popular based on computing trends. Given these trends, a well-informed database administrator should have at least a basic understanding of the non-relational model and its desirability in many situations.

2. The Relational Database Model

2.1 Overview

The relational database model is over 40 years old, developed in 1970 at IBM by Dr. Edgar "Ted" Codd as an extremely intuitive way to store, process, and query data. [5] The model follows a common way of displaying data, in tables (or entities). The tables have rows (also called records or tuples) with various columns (also called fields or attributes) and relationships to other tables. [6]

2.2 Normalization Rules

Relational DBMS design is governed by various rules for creating efficient, low-redundancy databases. Following a sequential process of normalization eliminates the chance of data anomalies occurring when queries update and delete data. [7] This structuring of the database provides an additional layer of data integrity independent of any program logic. This distinction is important, as implementing the same integrity constraints in a non-relational model is much more difficult. [8] Normalization, however, also poses potential problems, as the process can be complex for large sets of data.

[5] Burleson, p.
[6] Harrington, p.
[7] Bhat and Jadhav, p.

3. The Non-Relational Model

3.1 Overview

To database administrators and many others familiar with the traditional relational model of database structure, the non-relational model may not seem like a database at all. One of the most ubiquitous applications of this paradigm is Google's massive archive of the internet, which reportedly takes up petabytes of space. [9] According to Stonebraker, all the major Web-search engines use home-brew text software to serve us search results. None use relational DBMSs. [10] Several features, discussed below, make the non-relational model a desirable choice in database development.

3.2 Key-value Pairs and MapReduce

The fundamental relationship present in a non-relational database is the key-value relationship, in which some index, the key, is associated with a data item or set of data items, the value. The simplicity of this relationship allows for faster processing of data as compared to the traditional relational model.

Although not a database implementation per se, Google's MapReduce technique implements two simple functions, map and reduce, which allow distributed programming to occur automatically without the programmer having any specific knowledge about the underlying architecture or implementation. [11] The map function performs some operation on a key-value pair, generating intermediate key-value pairs. The reduce function then performs an operation on the intermediate values that share a key, consolidating them into a single value associated with that key. The most common illustration of the MapReduce function is an algorithm to count the occurrences of each word in a document.

Figure 1: MapReduce Illustration*

    Function map(String documentName, String document) {
        for each word w in document:
            EmitIntermediate(w, 1);
    }

    Function reduce(String word, Int[] partialCounts) {
        int result = 0;
        for each partialCount in partialCounts:
            result += partialCount;
        Emit(result);
    }

*Adapted from Dean and Ghemawat

The implementation of MapReduce is incredibly simple given the complexity of what actually takes place at the hardware level. Google's Dean and Ghemawat eloquently describe this seemingly magical process of automatic parallelization: "MapReduce automatically parallelizes and executes the program on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing required intermachine communication." [12] Because of the simplicity of implementing MapReduce, it has grown in popularity among programmers who have little experience with parallelization but want to benefit from its superior performance over non-parallel relational models.

[9] Chang et al., p.
[10] Stonebraker, p.
[11] Dean and Ghemawat, p. 72
[12] Dean and Ghemawat, p. 72

4. Comparison of Different Non-Relational Databases

This discussion will next look at three popular non-relational models and examine the advantages and disadvantages of each.

4.1 CouchDB

CouchDB is an open source distributed database management system that uses a RESTful HTTP API and stores data in the JavaScript Object Notation (JSON) format. [13] Couch is an acronym for Cluster of Unreliable Commodity Hardware, emphasizing both its commitment to being a distributed DBMS and its fault tolerance despite the use of commodity hardware. [14] As previously alluded to, CouchDB, like other non-relational DBMSs, is schema-less and requires that relationships between data objects be defined by the developer at a higher level than is typical of a relational database. [15] CouchDB uses views, which are server-side JavaScript functions, to enforce relationship constraints defined by the developer. [16] This schema-less system allows more flexibility and efficiency in data storage and processing; however, it potentially sacrifices data integrity if constraints are not properly enforced by the high-level developer.

4.1.1 CouchDB Structure

CouchDB databases work by storing documents, each with a unique ID and a revision number; all data in a CouchDB database is stored within one of these documents. [17] When updating, CouchDB simply increments the version number of a document. Concurrent updating is allowed; however, data control is lockless, meaning update conflict resolutions are all-or-nothing. [18]

Bhat and Jadhav, p.
Leff and Rayfield, p.
Bhat and Jadhav, p.
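The lockless, all-or-nothing update behavior described above can be illustrated with a minimal sketch. This is a toy in-memory model, not CouchDB's actual API: the names DocumentStore, ConflictError, get, and put are hypothetical, and revisions are modeled as plain integers that increment on each successful write. A write succeeds only if it is based on the current revision; otherwise it is rejected whole.

```python
class ConflictError(Exception):
    """Raised when an update is based on a stale revision (hypothetical name)."""


class DocumentStore:
    """Toy in-memory store: every document has a unique ID and a revision number."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        """Return (revision, copy of body) for an existing document."""
        revision, body = self._docs[doc_id]
        return revision, dict(body)

    def put(self, doc_id, body, expected_revision=0):
        """Write the document only if the caller saw the current revision.

        No locks are taken: a writer holding a stale revision simply fails,
        and none of its changes are stored (all-or-nothing).
        """
        current_revision = self._docs.get(doc_id, (0, None))[0]
        if expected_revision != current_revision:
            raise ConflictError(
                f"{doc_id}: expected revision {expected_revision}, "
                f"current revision is {current_revision}"
            )
        new_revision = current_revision + 1
        self._docs[doc_id] = (new_revision, dict(body))
        return new_revision


store = DocumentStore()
rev1 = store.put("doc1", {"name": "Wake Forest"})

# Two users read the same revision of the document at the same time.
rev_a, doc_a = store.get("doc1")
rev_b, doc_b = store.get("doc1")

# The first writer wins and the revision number is incremented.
doc_a["city"] = "Winston-Salem"
rev2 = store.put("doc1", doc_a, expected_revision=rev_a)

# The second writer now holds a stale revision and is rejected.
doc_b["city"] = "Raleigh"
try:
    store.put("doc1", doc_b, expected_revision=rev_b)
    second_write_ok = True
except ConflictError:
    second_write_ok = False
```

In this sketch the rejected writer would have to re-read the document and reapply its change, which is the kind of conflict handling the developer, rather than the DBMS, must plan for.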

Figure 2: CouchDB Implementation (diagram: HTTP Request, CouchDB Engine, and Replica Databases #1 through #3, each containing documents Doc1 and Doc2)

CouchDB replicates a database across multiple hosts, each with complete read/write access to the database. Conflicts are automatically resolved, and all prior versions of documents are retained and routinely compacted. [19]

4.1.2 CouchDB Advantages

CouchDB is an extremely simple implementation of a non-relational database that allows for an easy and intuitive way to store data in semantically significant documents. The DBMS also allows much more flexibility in adding and removing attributes in documents, as well as having different attributes across documents in the same database. This flexibility minimizes the need for complex design decisions in implementing a database.

4.1.3 CouchDB Disadvantages

Besides the typical disadvantages associated with non-structured data, the lockless read/write controls allow conflicts to occur frequently. While this is a purposeful decision that allows substantial performance improvements over traditional locking methods, the developer must plan for conflicts: when two users update a document simultaneously, one of them will receive an error for attempting to update a dirty document, and that user's changes will not be committed to the database.

[19] Apache CouchDB: Introduction, online

4.2 Amazon's SimpleDB

Amazon's SimpleDB is a highly scalable database-as-a-service implementation of a non-relational database. Although non-relational, SimpleDB still allows a developer to follow some of the traditional rules of the relational model. In SimpleDB, traditional tables are called domains, with rows of data called items and corresponding columns called attributes. [20]

4.2.1 Amazon's SimpleDB Structure

Although similar to the traditional database model, SimpleDB deviates in many ways from the rules of relational databases. First normal form prevents two values from being stored in one column of data, while SimpleDB allows this by design. Furthermore, two items, even in the same domain, can have different attributes. [21] This would be analogous to a typical entity-relationship database allowing different columns for each row in one table. SimpleDB indexes domains automatically to allow for efficient querying without much concern for how the underlying data are stored. [22] The schema can also change as the needs of the database change, a characteristic of the SimpleDB model that would be almost impossible to implement in traditional relational databases. [23] Should a new need arise for the database to store a completely different attribute, a simple query allows this attribute to be stored in the relevant items without disturbing either the structure of the database or the efficiency provided by the indices.

4.2.2 Amazon's SimpleDB Advantages

Amazon's SimpleDB integrates well with existing Amazon Web Services (AWS) and allows for completely cloud-based development solutions that extend well beyond database implementation. Usage of SimpleDB follows the same pay-as-you-go model as other software-as-a-service products. [24] As SimpleDB is very similar to the open source CouchDB, the advantages discussed earlier are also relevant here.

4.2.3 Amazon's SimpleDB Disadvantages

Some administrators may have concerns about entrusting data to a third-party cloud-based solution, and these concerns are well founded given recent problems Amazon has faced with its web services. [25] Aside from these concerns, having unstructured data means that the integrity of data must be preserved outside of the database environment, when writing to the database. When querying data, aggregate functions may be more difficult to implement because the syntax supported by SimpleDB is not traditional SQL. Frequently, data must first be queried from the database and then processed by the application to arrive at the desired aggregate value. The disadvantages associated with CouchDB are also relevant here.

4.3 Google's BigTable

Responding to the increasing petabytes of data gathered as it archived the World Wide Web, Google created BigTable, a reliable and highly scalable non-relational database-like storage system. [26] A number of features make Google's BigTable a good test case for the effectiveness of non-relational models in specific instances.

4.3.1 BigTable Structure

The BigTable data model is essentially a three-dimensional version of the typical database table, with loose schema requirements. One dimension stores both the webpage contents and the anchors that link to a particular page, another dimension stores other websites (arranged in alphabetical order, allowing webpages from the same website to be stored closely in memory), and the final dimension stores the cache of prior versions of a particular website.

Figure 3: BigTable Structure* (diagram: a row keyed edu.wfu.www among other websites; a webpage contents column holding the page's <html>; and anchor columns anchor:collegeboard.com/wfu and anchor:wakesg.com for the websites that link to wfu.edu, with anchor text "Wake Forest University" and "WFU Homepage")

*Adapted from Chang et al.

4.3.2 BigTable Advantages

Most of the advantages associated with BigTable relate to the efficiency of queries and storage. Because data is stored contiguously in tablets based on some commonality in the data, fewer queries need to be performed to retrieve information.

4.3.3 BigTable Disadvantages

While BigTable is desirable for large-scale database implementations, the expense associated with setting up a distributed environment may not be justified for small-scale implementations. As with the other non-relational models, data integrity can be an issue. BigTable does have some structure, however, as data types are supported for each attribute.

[20] Kavanagh, online
[24] Amazon SimpleDB Pricing, online
[25] Goldman, online
[26] Chang et al., p. 2

5. Advantages of the Non-Relational Model

5.1 Efficiency

Perhaps the greatest impetus behind the growing popularity of the non-relational model is the increased efficiency associated with the structuring of the data. In many cases, these efficiencies can be substantial.

5.1.1 Row vs. Column Stores

Row stores are characterized by the physical data relating to a record's attributes being stored contiguously in memory. [27] This storage technique is common to typical relational DBMSs and prioritizes the speed of writing data over reading data. From the row store diagram, it is easy to understand why this storage technique is more effective at writing data, as data would commonly be written in logical groupings that follow a record's attributes.

Figure 4: Row Store Physical Data Storage (diagram: columns employeeid, fname, lname, with the sample records John Robinson, Bob Johnson, Sue Peterson, Tim Mead, and Steve Bandow each stored contiguously)

However, in applications where the attributes of data are constantly changing, it is equally understandable why row stores would not be desirable. Consider the impact of adding another attribute, for example phoneno, to our sample database above. To maintain the physical grouping of the data, a new portion of memory must be allocated every n bytes of data, creating complexity. Another issue with this storage technique is that the entirety of each tuple must be brought into memory for a given query, including the irrelevant attributes. [28]

Figure 5: Column Store Physical Data Storage (diagram: the same sample data, with the values of each of employeeid, fname, and lname stored contiguously)

Column stores, in contrast, are characterized by the values of each attribute being stored contiguously in memory, prioritizing read operations. [29] Most commonly implemented in data warehousing systems, this storage technique allows much faster processing of queries and has the added benefit of holding in memory only those attributes relevant to the query. Many non-relational models organize data in this manner, focusing on the efficiency of querying the data rather than organizing the data in a way that necessarily follows the logical structure of the record and its attributes. This implementation allows new attributes to be added as database needs change without much disruption of the physical storage.

In data warehousing, for example, Stonebraker reports that the non-relational column store is 50 times faster than a relational row store. [30] The relational approach requires every column to be read, while the non-relational approach allows only the columns relevant to the query to be read. While relational databases with column store implementations do exist, namely among newer DBMSs, most legacy systems in place today still rely on code written in the 1980s, which implements the row store technique. [31]

5.1.2 Hashing

Efficiencies extend beyond data warehousing to many other applications. Even the most basic web crawling algorithms that utilize non-relational databases for storage are at least two orders of magnitude faster than the relational databases marketed by major vendors. [32] The discussion of Google's BigTable earlier provides support for this claim, as efficiencies in hashing allow faster lookups of data. Instead of storing all attributes contiguously, BigTable will store references to the final location of a data item on disk, allowing more efficient read-optimized disks to be used.

[27] Stonebraker et al., p.
[30] Stonebraker, p.
Bain, online

5.2 Scalability

While relational databases do scale well up to a certain point, they are limited when expansion needs grow beyond one server. [33] As more and more servers are added, the relational model becomes increasingly complex, as database administrators must carefully plan how to properly balance system demands across hundreds or thousands of servers.

Non-relational models, on the other hand, are designed to scale well by implementing the key-value paradigm. Because all data associated with a particular key is stored together on the same server, there is no need to join tables across systems, dramatically reducing the complexity of adding new servers.

5.3 Simplicity

The final major benefit of the non-relational database model is its general simplicity. Because non-relational DBMSs are schema-less, there is no need to go through the normalization process or even to know what the final database will look like. All of the non-relational models discussed previously allow new attributes to be added by design without compromising the structure of the database. Furthermore, because normalization is not required, complex relationships do not need to be planned out as in the typical relational model. Because the structure of a database must often evolve to meet changing needs, this feature is especially important.

6. Disadvantages of the Non-Relational Model

6.1 Lack of Structure/Data Integrity

Because non-relational databases share data across different application platforms, data integrity is difficult to enforce and normalization is often sacrificed for performance. [34] While having a loosely defined structure may be beneficial in certain situations, constraints exist in the relational model to preserve the integrity of the data. With multiple applications performing frequent read/write queries on a database, ensuring that all of these applications properly preserve the integrity of the database can prove difficult. Instead of database integrity being controlled at the database administrator level, it is now controlled at the developer level, which could lead to problems in standardizing data.

[34] Bhat and Jadhav, p.
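To make the shift of integrity control to the developer level concrete, the following is a minimal sketch of the kind of check an application would have to perform itself before writing to a schema-less store. Everything here is an assumption for illustration: the store is a plain dictionary, and the names REQUIRED_FIELDS, validate_employee, save_employee, and the departmentid attribute are hypothetical. In a relational DBMS the same rules would typically be declared once as column types, NOT NULL constraints, and a foreign key.

```python
# Hypothetical rules that a relational schema would normally enforce.
REQUIRED_FIELDS = {"employeeid": int, "fname": str, "lname": str}


def validate_employee(item, known_department_ids):
    """Return a list of integrity problems found in one item (empty if none)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in item:
            errors.append(f"missing field: {field}")
        elif not isinstance(item[field], expected_type):
            errors.append(f"wrong type for {field}")
    # Application-level stand-in for a foreign key constraint.
    department = item.get("departmentid")
    if department is not None and department not in known_department_ids:
        errors.append(f"unknown departmentid: {department}")
    return errors


def save_employee(store, item, known_department_ids):
    """Write the item only if it passes every check; otherwise raise."""
    errors = validate_employee(item, known_department_ids)
    if errors:
        raise ValueError("; ".join(errors))
    store[item["employeeid"]] = dict(item)


store = {}          # stand-in for a schema-less key-value store
departments = {10, 20}

save_employee(
    store,
    {"employeeid": 1, "fname": "John", "lname": "Robinson", "departmentid": 10},
    departments,
)

# This item lacks a last name and points at a department that does not exist.
bad_item = {"employeeid": 2, "fname": "Bob", "departmentid": 99}
bad_errors = validate_employee(bad_item, departments)
```

Every application that writes to the same store would need to apply equivalent checks; if one of them skips or changes a rule, nothing in the store itself prevents inconsistent data, which is the standardization problem noted above.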

6.2 Implementation/Migration

A more practical concern is that most major corporations, academic institutions, and government agencies have well-established database implementations. Migrating from a relational DBMS to a non-relational DBMS is a difficult process, and one which may deter many from even attempting it. Before considering conversion, these organizations must weigh the costs associated with the transition against the benefits provided by the non-relational model to determine whether such a move is practical. Although the transition may be difficult, because non-relational models are much more liberal than their relational counterparts, transitioning from a relational model to a non-relational one may prove less difficult than even the transition between two relational models.

7. Conclusion

Recent trends in both cloud and distributed computing are making the non-relational model more desirable. The need to scale quickly, disassociate hardware from the data model, and provide more efficient databases are all contributing factors in this transition. When demands outside of the relational realm have come about in the past, administrators pursued less-than-desirable bolt-on approaches. The most important aspect of the non-relational database movement has been the many varieties of databases that are available to developers outside the legacy systems. Now developers do not have to settle for the relational model when data needs dictate a different approach to storage.

While the relational model will likely continue to exist for the foreseeable future, the non-relational model will only grow in popularity. As demands for performance and scalability increase, variations of the key-value database will continue to evolve.

Works Cited

Amazon SimpleDB Pricing. Amazon Web Services. Web. 25 April.

Apache CouchDB: Introduction. Apache Software Foundation. Web. 25 April.

Bain, Tony. Is the Relational Database Doomed? Readwriteweb.com. ReadWrite Enterprise, 12 February. Web. 24 April.

Bhat, Uma and Shraddha Jadhav. Moving Towards Non-Relational Databases. International Journal of Computer Applications 1.13 (2010):

Burleson, Donald. Inside the Database Object Model. Boca Raton: CRC Press,

Chang, Fay, et al. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems 26.2/4 (2008):

Dean, Jeffrey and Sanjay Ghemawat. MapReduce: A Flexible Data Processing Tool. Communications of the ACM 53.1 (2010):

Goldman, David. Why Amazon's Cloud Titanic Went Down. CNN.com. CNN, 22 April. Web. 24 April.

Harrington, Jan. Relational Database Design and Implementation: Clearly Explained. Burlington: Morgan Kaufmann Publishers,

Kavanagh, David. Relating to Amazon SimpleDB. March 4. Web. 28 April.

Leff, Avraham and James Rayfield. EDS: An Elastic Data-Service for Situational Applications. IEEE International Conference on Web Services (2010).

Stonebraker, Mike. Saying Good-bye to DBMSs. Communications of the ACM 52.9 (2010):

Stonebraker, Mike et al. C-Store: A Column-oriented DBMS. Proceedings of the 31st VLDB Conference (2005):