NoSQL Databases

Mrs. Archana Kalia
Lecturer in Information Technology Dept., VPM's Polytechnic College, Thane

Abstract: NoSQL (Not only SQL) is a class of databases used to store large amounts of data. NoSQL databases are distributed, non-relational, usually open source, and horizontally scalable (in a linear way). NoSQL systems generally do not follow the ACID properties as SQL databases do. In this paper, we survey NoSQL, its background, and its fundamentals: ACID, BASE and the CAP theorem. Since it is very difficult to choose a suitable database for a specific use case, this paper also evaluates the underlying techniques of NoSQL databases with regard to their applicability for certain requirements.

Introduction: In computing systems (web and business applications), an enormous amount of data is generated on the web every day. A large share of this data is handled by relational database management systems (RDBMS). The relational model originated with E. F. Codd's 1970 paper "A relational model of data for large shared data banks", which made data modeling and application programming much easier. Beyond the intended benefits, the relational model is well suited to client-server programming, and today it is the predominant technology for storing structured data in web and business applications.

NoSQL denotes non-relational database management systems that differ from traditional relational database management systems in some significant ways. They are designed for distributed data stores with very large-scale data storage needs (for example Google or Facebook, which collect terabits of data every day for their users). This type of data storage may not require a fixed schema, avoids join operations and typically scales horizontally.

Nowadays data is becoming easier to access and capture through third parties such as Facebook, Google+ and others.
Personal user information, social graphs, geolocation data, user-generated content and machine logging data are just a few examples where data has been increasing exponentially. To provide such services properly, huge amounts of data must be processed, a task for which SQL databases were never designed. NoSQL databases evolved to handle these data volumes properly.

NoSQL systems generally have six key features:
1. the ability to horizontally scale simple operation throughput over many servers,
2. the ability to replicate and to distribute (partition) data over many servers,
3. a simple call level interface or protocol (in contrast to a SQL binding),
4. a weaker concurrency model than the ACID transactions of most relational (SQL) database systems,
5. efficient use of distributed indexes and RAM for data storage, and
6. the ability to dynamically add new attributes to data records.
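The first two features in the list above (horizontal scaling, and replication with partitioning) can be illustrated with a minimal sketch. It assumes a simple hash-modulo placement scheme; the node names, the replica count and the names `nodes_for_key` and `Cluster` are hypothetical and do not describe any particular NoSQL product.

```python
import hashlib

# Hypothetical cluster: the node names are illustrative only.
NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICAS = 2  # assumed number of copies kept of each record


def nodes_for_key(key, nodes=NODES, replicas=REPLICAS):
    """Return the nodes that should hold `key`: a primary chosen by hashing
    the key, followed by the next nodes in list order as replicas."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


class Cluster:
    """Toy cluster in which every node is just an in-memory dict."""

    def __init__(self, nodes=NODES, replicas=REPLICAS):
        self.nodes = list(nodes)
        self.replicas = replicas
        self.storage = {name: {} for name in self.nodes}

    def put(self, key, value):
        # Write the value to the primary and to each replica.
        for name in nodes_for_key(key, self.nodes, self.replicas):
            self.storage[name][key] = value

    def get(self, key):
        # Read from the first responsible node that has the key.
        for name in nodes_for_key(key, self.nodes, self.replicas):
            if key in self.storage[name]:
                return self.storage[name][key]
        return None
```

With this placement rule, adding a node changes the modulus and therefore moves most keys; schemes such as consistent hashing are commonly used to limit that movement, which this sketch does not attempt.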
ACID free
ACID stands for Atomicity, Consistency, Isolation and Durability. The ACID concept comes from the SQL environment. NoSQL systems generally do not adopt the full ACID concept, mainly because of its strict consistency requirement. In a distributed environment, data is spread across different machines, each machine stores its own data, and consistency has to be maintained between them. For example, if one tuple of a table changes, the change must be applied on every machine on which that data resides. If information about an update spreads immediately, consistency is given; if not, inconsistency results.

BASE
BASE stands for Basically Available, Soft state, and Eventual consistency. BASE is the opposite of ACID, and NoSQL databases are positioned at different points along the road from ACID to BASE. After a transaction, the state reached is a soft state rather than a solid one, since it may still change while updates propagate. The main idea behind BASE is permanent availability. For example, consider the databases of banks: if two people access the same account from different cities, updates must not merely arrive in time but call for real-time databases, and they have to be applied frequently on all machines. Further examples are online railway reservation, online book trading, etc.

SCALABILITY
In electronics (including hardware, communication and software), scalability is the ability of a system to expand to meet business needs. For example, scaling a web application is all about allowing more people to use the application. We scale a system either by upgrading the existing hardware without changing much of the application, or by adding extra hardware. There are two ways of scaling: vertical and horizontal.

Vertical scaling
To scale vertically (or scale up) means to add resources within the same logical unit to increase capacity, for example adding CPUs to an existing server, increasing the memory of the system, or expanding storage by adding hard drives.
Horizontal scaling
To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application. In a NoSQL system, the data store can be much faster because it takes advantage of scaling out, i.e. it adds more nodes to the system and distributes the load over those nodes.

CAP
CAP stands for Consistency, Availability and Partition tolerance. The CAP theorem concerns these three properties:
Consistency - This means that the data in the database remains consistent after the execution of an operation. For example, after an update operation all clients see the same data.
Availability - This means that the system is always on (guaranteed service availability), with no downtime.
Partition Tolerance - This means that the system continues to function even if the communication among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.

Theoretically, it is impossible to fulfill all three requirements at once. According to the CAP theorem, a distributed system can satisfy only two of the three. Therefore, current NoSQL databases follow different combinations of C, A and P from the CAP theorem. Here is a brief description of the three combinations CA, CP and AP:
CA - Single site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
CP - Some data may not be accessible, but the rest is still consistent/accurate.
AP - The system is still available under partitioning, but some of the data returned may be inaccurate.

NoSQL data store types
On the basis of the CAP theorem, NoSQL databases are divided into a number of categories. There are four different types of data stores in NoSQL.

A. Key Value Stores
Key value stores are similar to maps or dictionaries, where data is addressed by a unique key. Since values are uninterpreted byte arrays, which are completely opaque to the system, keys are the only way to retrieve stored data. Values are isolated and independent from each other, so relationships must be handled in application logic. Due to this very simple data structure, key value stores are completely schema free. New values of any kind can be added at runtime without conflicting with any other stored data and without influencing system availability. Grouping key value pairs into collections is the only available way to add some kind of structure to the data model. Key value stores are useful for simple operations that are based on key attributes only. In order to speed up the rendering of a user-specific webpage, parts of the page can be calculated in advance and served quickly and easily out of the store by user ID when needed.
Since most key value stores hold their dataset in memory, they are often used for caching the results of more time-intensive SQL queries.

B. Document Stores
Document stores encapsulate key value pairs in JSON or JSON-like documents. Within a document, keys have to be unique. Every document contains a special key "ID", which is also unique within a collection of documents and therefore identifies a document explicitly. In contrast to key value stores, values are not opaque to the system and can be queried as well. Therefore, complex data structures like nested objects can be handled more conveniently. Storing data in interpretable JSON documents has the additional advantage of supporting data types, which makes document stores very developer-friendly. Similar to key value stores, document stores do not have any schema restrictions. Storing new documents containing any kind of attributes is as easy as adding new attributes to existing documents at runtime. Document stores offer multi-attribute lookups on records that may have completely different kinds of key value pairs. Therefore, these systems are very convenient for data integration and schema migration tasks.

(JSON is the data structure of the Web. It is a simple data format that allows programmers to store and communicate sets of values, lists, and key-value mappings across systems. As JSON adoption has grown, database vendors have sprung up offering JSON-centric document databases.)
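The behaviour of a document store described above (a unique "ID" per document, no fixed schema, attributes added at runtime, multi-attribute lookups) can be sketched with a toy in-memory collection. The class `DocumentCollection`, its method names and the sample records are invented for illustration and are not the API of any real document database.

```python
import json


class DocumentCollection:
    """Toy in-memory document collection (illustrative only)."""

    def __init__(self):
        # Maps the special "ID" key of each document to its JSON text.
        self._docs = {}

    def insert(self, doc):
        if "ID" not in doc:
            raise ValueError("every document needs an 'ID' key")
        if doc["ID"] in self._docs:
            raise KeyError(f"duplicate ID: {doc['ID']!r}")
        self._docs[doc["ID"]] = json.dumps(doc)

    def get(self, doc_id):
        return json.loads(self._docs[doc_id])

    def update(self, doc_id, **new_attributes):
        # Attributes can be added at runtime; no schema change is involved.
        doc = self.get(doc_id)
        doc.update(new_attributes)
        self._docs[doc_id] = json.dumps(doc)

    def find(self, **criteria):
        """Return the documents whose top-level attributes match all criteria.
        Documents that lack one of the attributes simply do not match."""
        results = []
        for text in self._docs.values():
            doc = json.loads(text)
            if all(k in doc and doc[k] == v for k, v in criteria.items()):
                results.append(doc)
        return results


people = DocumentCollection()
people.insert({"ID": 1, "name": "Asha", "city": "Thane"})
# The second document has a nested object that the first one lacks.
people.insert({"ID": 2, "name": "Ravi", "city": "Thane",
               "address": {"pin": "400601"}})
people.update(1, email="asha@example.com")
matches = people.find(city="Thane", name="Ravi")
```

Because the values are interpretable JSON rather than opaque byte arrays, `find` can inspect them; a pure key value store would only support the `get` path.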
C. Column Family Stores
Column family stores are also known as column oriented stores, extensible record stores and wide columnar stores. All of these stores are inspired by Google's Bigtable, which is a "distributed storage system for managing structured data that is designed to scale to a very large size". Bigtable is used in many Google projects, whose requirements vary from high throughput to latency-sensitive data serving. The data model is described as a "sparse, distributed, persistent multidimensional sorted map". In this map, an arbitrary number of key value pairs can be stored within rows. Since values cannot be interpreted by the system, relationships between datasets and any data types other than strings are not supported natively. Similar to key value stores, these additional features have to be implemented in the application logic. Multiple versions of a value are stored in chronological order, to support versioning on the one hand and to achieve better performance and consistency on the other. Columns can be grouped into column families, which is especially important for data organization and partitioning.

D. Graph Databases
In contrast to relational databases and the key oriented NoSQL databases introduced above, graph databases specialize in the efficient management of heavily linked data. Therefore, applications based on data with many relationships are better suited to graph databases, since cost-intensive operations like recursive joins can be replaced by efficient traversals. Property graphs (graphs whose nodes and edges carry key value pairs) are distinct from resource description framework (RDF) stores like Sesame [18] and Bigdata, which specialize in querying and analyzing subject-predicate-object statements. Since the whole set of triples can be represented as a directed multi-relational graph, RDF frameworks are also considered a special form of graph database in this paper.
In contrast to property graphs, these RDF graphs do not offer the possibility of adding additional key value pairs to edges and nodes. On the other hand, by use of RDF Schema and the Web Ontology Language it is possible to define a more complex and more expressive schema than property graph databases allow. Use cases for graph databases are location based services, knowledge representation, path finding problems raised in navigation systems, recommendation systems, and all other use cases that involve complex relationships. Property graph databases are more suitable for large relationships over many nodes, whereas RDF is used for certain details in a graph.

CHARACTERISTICS OF NoSQL
* NoSQL does not use the relational data model and thus does not use the SQL language.
* NoSQL stores large volumes of data.
* NoSQL is intended for distributed environments (data spread across different machines), where consistency is typically reached eventually rather than immediately.
* If a fault or failure occurs on one machine, work continues without interruption on the others.
* Many NoSQL databases are open source, i.e. their source code is available to everyone and is free to use without any overheads.
* NoSQL allows data to be stored in any record, i.e. it does not have a fixed schema.
* NoSQL generally does not use the concept of ACID properties.
* NoSQL is horizontally scalable, leading to performance that grows in a linear way.
* It has a more flexible structure.
SQL vs NoSQL
SQL (relational) versus NoSQL scalability is a controversial topic. This paper argues against both extremes. Here is some more background to support this position.

The argument for relational over NoSQL goes something like this:
* If new relational systems can do everything that a NoSQL system can, with analogous performance and scalability, and with the convenience of transactions and SQL, why would you choose a NoSQL system?
* Relational DBMSs have taken and retained majority market share over other competitors in the past 30 years: network, object, and XML DBMSs.
* Successful relational DBMSs have been built to handle other specific application loads in the past: read-only or read-mostly data warehousing, OLTP on multi-core multi-disk CPUs, in-memory databases, distributed databases, and now horizontally scaled databases.
* While we don't see "one size fits all" in the SQL products themselves, we do see a common interface with SQL, transactions, and relational schema that gives advantages in training, continuity, and data interchange.

The counter-argument for NoSQL goes something like this:
* We haven't yet seen good benchmarks showing that RDBMSs can achieve scaling comparable with NoSQL systems like Google's BigTable.
* If you only require a lookup of objects based on a single key, then a key-value store is adequate and probably easier to understand than a relational DBMS. Likewise for a document store on a simple application: you only pay the learning curve for the level of complexity you require.
* Some applications require a flexible schema, allowing each object in a collection to have different attributes. While some RDBMSs allow efficient packing of tuples with missing attributes, and some allow adding new attributes at runtime, this is uncommon.
* A relational DBMS makes expensive (multi-node multi-table) operations too easy. NoSQL systems make them impossible or obviously expensive for programmers.
* While RDBMSs have maintained majority market share over the years, other products have established smaller but non-trivial markets in areas where there is a need for particular capabilities, e.g. indexed objects with products like BerkeleyDB, or graph-following operations with object-oriented DBMSs.

Both sides of this argument have merit.
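Two of the counter-arguments above (lookup by a single key, and objects with differing attributes) can be made concrete with a small comparison. This is only a sketch under stated assumptions: SQLite (through Python's standard `sqlite3` module) stands in for a relational DBMS, a plain dict stands in for a key-value store, and the table, keys and sample records are invented. SQLite happens to allow adding a column at runtime; the point is only that the new attribute becomes part of the schema for every row, whereas each value in the dict carries its own attributes.

```python
import sqlite3

# Relational side: the schema is declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Asha"))

# A new attribute is a schema change that applies to every row.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
conn.execute("UPDATE users SET email = ? WHERE id = ?",
             ("asha@example.com", 1))
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (2, "Ravi"))
sql_rows = conn.execute(
    "SELECT id, name, email FROM users ORDER BY id"
).fetchall()

# Key-value side: a dict stands in for the store. Each value may have
# different attributes, and retrieval is by key only.
kv_store = {}
kv_store["user:1"] = {"name": "Asha", "email": "asha@example.com"}
kv_store["user:2"] = {"name": "Ravi"}
kv_value = kv_store["user:1"]
```

In the relational version the row for the second user has an `email` column holding NULL, while in the key-value version the second value simply has no such attribute. Conversely, the relational version can answer arbitrary queries over `name` or `email`, which the dict cannot do without scanning every value.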
CONCLUSION AND FUTURE WORK
The main aim of this paper is to give an overview of NoSQL databases, of how they have challenged the dominance of SQL, and of their background and characteristics. The paper also describes the fundamentals that form the base of NoSQL databases: ACID, BASE and the CAP theorem. The full ACID properties are not adopted in NoSQL databases because of the cost of strict data consistency in a distributed environment, which shows how NoSQL relaxes the consistency that SQL enforces. On the basis of the CAP theorem, we then described the different types of NoSQL databases: key value stores, document stores, column family stores and graph databases. Further research is going on into the new technologies arising alongside or after NoSQL, such as polyglot persistence.

REFERENCES
1. Rick Cattell, "Scalable SQL and NoSQL Data Stores". Originally published in 2010.
2. Jaroslav Pokorny, "NoSQL databases: a step to database scalability in web environment". Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Praha, Czech Republic.
3. Robin Hecht, "NoSQL Evaluation: A Use Case Oriented Survey". Chair of Applied Computer Science IV, University of Bayreuth, Germany, robin.hecht@uni-bayreuth.de.
4. Stefanie Scherzinger (Regensburg University of Applied Sciences, stefanie.scherzinger@hs-regensburg.de), Meike Klettke (University of Rostock, meike.klettke@uni-rostock.de) and Uta Störl (Darmstadt University of Applied Sciences, uta.stoerl@h-da.de), "Managing Schema Evolution in NoSQL Data Stores".