The Quest for Extreme Scalability

As their audiences grow, very successful internet applications have all faced the same database issue: web servers can be multiplied without much trouble (scale out), but relational databases cannot. Sustaining a growing database workload requires either buying more powerful hardware (scale up) or relying on clustering. Both solutions increase complexity and cost. In this context, developers realized that relational databases and query languages might be the bottleneck. They are fundamentally designed for stable workloads and complex data extraction, which are less common in modern applications, where the ability to handle very large data sets while maintaining speed and scalability matters more. This realization led to the creation of the NoSQL movement, based on innovative, open source, non-relational database systems designed to meet specific requirements and manage extreme scalability on very large data sets.

Build to scale

«NoSQL» is a label for database management systems that enable the implementation of databases that are not only based on SQL. NoSQL is usually associated with extreme performance and/or the ability to manage extremely large data sets. NoSQL emerged as a movement in early 2009 during a meetup organized in San Francisco to discuss the growing number of open source distributed database management systems that do not attempt to comply with ACID guarantees (atomicity, consistency, isolation, durability). There are close to 100 open source NoSQL projects under development today. Many NoSQL projects start by implementing a specific data structure to solve a specific problem that SQL databases cannot solve. Despite this common starting point, NoSQL databases vary quite a bit, reflecting the fact that NoSQL is more a label for a variety of atypical databases than a uniform set of equivalent solutions. What are the main categories of NoSQL databases?

Extreme and affordable scalability

There are actually four major types of NoSQL data models:
- Key-value stores, which provide a giant hash table to store data, very useful for high-audience applications constantly broadcasting data to their users: Memcached, Redis,
- Bigtable clones, which store data in a large, multi-dimensional sorted map, very useful to store, analyze and retrieve large amounts of data: HBase, Cassandra,
- Document stores, designed for semi-structured information: CouchDB, MongoDB,
- Graph databases, probably the most experimental type, designed for graph-like data: Neo4J.

According to Bruno Michel, lead developer in af83's R&D department, who has taken part in all of af83's NoSQL-related projects, «Solutions like Cassandra are designed as an effective answer to a specific problem, which is scalability with large data sets. Others, like MongoDB, are designed for Web application development in general, allowing more flexibility and performance. Others are meant to be solutions to very specific types of data, like graph databases or projects specialized in geographic data». What is the real-life impact of this variety of approaches?
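To make the four data models listed above more concrete, here is a minimal Python sketch that imitates each one with in-memory structures. It is an illustration only: the names (`kv_store`, `SortedMap`, `find_by_tag`, `friends_of_friends`) and the sample data are hypothetical, and none of this reflects the actual APIs of Redis, HBase, MongoDB or Neo4J.

```python
from bisect import bisect_left

# 1. Key-value store: one opaque value per key, fetched by exact key.
kv_store = {}
kv_store["session:42"] = "alice"


# 2. Bigtable-style store: a sorted map keyed by (row, column, timestamp).
class SortedMap:
    def __init__(self):
        self._keys = []    # list of keys, always kept in sorted order
        self._values = {}  # key -> stored value

    def put(self, row, column, timestamp, value):
        key = (row, column, timestamp)
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def scan_row(self, row):
        """Return every (key, value) pair of one row, in sorted key order."""
        start = bisect_left(self._keys, (row,))
        result = []
        for key in self._keys[start:]:
            if key[0] != row:
                break
            result.append((key, self._values[key]))
        return result


# 3. Document store: semi-structured records that need not share fields.
documents = [
    {"_id": 1, "name": "Chez Paul", "tags": ["bistro", "wine"]},
    {"_id": 2, "name": "Noodle Bar", "location": {"lat": 48.86, "lon": 2.35}},
]


def find_by_tag(docs, tag):
    """Return the documents whose optional 'tags' field contains tag."""
    return [doc for doc in docs if tag in doc.get("tags", [])]


# 4. Graph database: nodes plus explicit relationships between them.
graph = {"alice": {"bob"}, "bob": {"carol"}, "carol": set()}


def friends_of_friends(g, person):
    """Return nodes two hops away, excluding direct friends and the person."""
    direct = g.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= g.get(friend, set())
    return two_hops - direct - {person}
```

The sketch shows why the categories are not interchangeable: each structure makes one access pattern cheap (exact lookup, ordered scan, field query, traversal) and leaves the others to the application.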

The right tool for the right context

Bruno Michel says «Benefits of NoSQL solutions depend on use cases». When considering NoSQL, users must select the database that best fits their applications. The proper choice leads to higher performance or lower cost. Olivier Desmoulin, founder of a location-based social network for foodies, certainly understands that: «We are serving up to 60,000 customers per day, using just one low-cost server which not only serves the Rails applications, but also the whole MongoDB database. MongoDB's ability to handle geolocation was also very helpful. For a small start-up like us, NoSQL was critical to ensure scalability». There are many examples such as these. Bruno Michel: «MongoDB is used by very popular websites like bit.ly, foursquare or disqus for its ability to deliver performance and scalability by using sharding.»

Despite the good news, choosing a NoSQL database is tough. According to Ori Pekelman, CTO of af83, who has extensive experience with NoSQL technologies and has spearheaded the use of NoSQL on numerous projects with af83 customers, «There are more than a hundred NoSQL projects going on, most solutions are very new to the market, mature projects are only two or three years old, and the hierarchy is constantly moving. This is a time of better opportunities for customers, but the choice is tough». Bruno Michel: «In the last three years, a lot of very promising solutions have appeared; some of them proved to be extremely hard to sustain, due to very slow development or project instability». What could be the approach to select and leverage the power of NoSQL?
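As an aside on the sharding mentioned in the quotes above, the following is a minimal sketch of the general idea: each key is hashed to pick one of several partitions, so data and load are spread across machines. It is a generic hash-based illustration and not MongoDB's actual mechanism; `shard_for` and `ShardedStore` are hypothetical names, and plain dictionaries stand in for separate servers.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a string key to a shard index in a stable, repeatable way."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


class ShardedStore:
    """Toy key-value store split across several in-memory shards."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate database server (assumption).
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

A caller would create `ShardedStore(4)` and use `put` and `get` as with a single store. Note that with this naive modulo scheme, changing the shard count moves most keys to a different shard, which is one reason production systems use more elaborate partitioning.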

Our Recommendations

1. Assess your situation
NoSQL databases were designed to handle very specific tasks: MongoDB was meant to be the database engine of a cloud-based application platform, Cassandra was designed to manage inbox search at Facebook, Memcached was designed to improve caching at LiveJournal. Unless you are going to implement your own NoSQL database management system, you should clearly define your functional requirements first, and then look for the appropriate solution, using NoSQL or not: NoSQL should be used when scalability is a requirement, and avoided when running complex queries is the requirement.

2. Be realistic with your scalability requirements
NoSQL is not appropriate for every application and project. Typical use cases are public-facing internet applications with a really large audience and very large data sets. Typical NoSQL databases range from dozens of gigabytes to petabytes. Very few applications fit that definition. Smaller applications may also benefit from the sometimes comparatively lower requirements of NoSQL databases, but should assess whether they are able to sustain the NoSQL choice: NoSQL expertise is more difficult to find.

3. Check the performance thoroughly
Ori Pekelman: «We have been testing dozens of solutions; performance varies tenfold from one solution to another. Many benchmarks are available online, but due to the variety of approaches, it is sometimes very difficult to find proper feedback from experience on specific data volumes, read/write ratios or query distributions».

4. Pay attention to the Open Source projects that are tied to the solutions
Bruno Michel: «NoSQL solutions are only starting to be mature and production-ready, but customers need to be cautious as solutions and projects are not equal and evolve very quickly. Some solutions received a lot of visibility without the ability to deliver».

5. Check the availability of proper tools and documentation
Bruno Michel: «An issue that may be underestimated is the status of tools and documentation related to your solutions. This sort of shortcoming leaves your project at risk, whatever the performance levels of the database».
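Recommendation 3 above stresses that published benchmarks rarely match your own data volumes and read/write ratios. One way to act on it is to write a small workload driver and run it against each candidate store. The sketch below is an assumption-laden illustration: `run_workload` is a hypothetical helper, the key space of 1,000 keys is arbitrary, and a plain Python dict stands in for the database client, which in practice you would replace with a real driver.

```python
import random
import time


def run_workload(store, operations, read_ratio, seed=0):
    """Run a mixed read/write workload and report counts and elapsed time.

    store      -- any dict-like object (stand-in for a database client)
    operations -- total number of operations to issue
    read_ratio -- fraction of operations that are reads, between 0.0 and 1.0
    seed       -- fixed seed so that runs are repeatable
    """
    rng = random.Random(seed)
    reads = 0
    writes = 0
    start = time.perf_counter()
    for i in range(operations):
        key = f"key:{rng.randrange(1000)}"  # arbitrary key space (assumption)
        if rng.random() < read_ratio:
            store.get(key)
            reads += 1
        else:
            store[key] = i
            writes += 1
    elapsed = time.perf_counter() - start
    return {"reads": reads, "writes": writes, "seconds": elapsed}
```

Calling `run_workload({}, 100_000, read_ratio=0.9)` and then the same call with `read_ratio=0.1` gives a first impression of how a store behaves under read-heavy versus write-heavy traffic. A serious test would also vary data volume, key distribution and concurrency.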

Famous NoSQL users

Facebook, Foursquare, Google and Yahoo are all NoSQL users. A quick search on Google will turn up plenty of coverage of their trials, errors and successes with NoSQL.

NoSQL is not always a one-stop data solution

There are cases where NoSQL will meet all your data requirements, but more often than not, NoSQL will only solve part of your requirements and you will need to implement a combination of solutions including NoSQL.

Innovative features

Depending on the implementation, NoSQL databases often include non-traditional features such as the ability to run in memory (also known as «NoDisk»), sharding or optimized mechanisms for geolocation.

Required reading

- The Dynamo paper, about Amazon's own highly scalable data store
- The Bigtable paper, about Google's own DBMS
- SQL Databases Don't Scale
- The slideshow from the June 11, 2009 NoSQL meetup in San Francisco