INTERSYSTEMS WHITE PAPER

INTERSYSTEMS CACHÉ AS AN ALTERNATIVE TO IN-MEMORY DATABASES

David Kaaret, InterSystems Corporation



Introduction

To overcome the performance limitations of traditional relational databases, applications ranging from those running on a single machine to large, interconnected grids often use in-memory databases to accelerate data access. While in-memory databases and caching products increase throughput, they suffer from a number of limitations, including lack of support for large data sets, excessive hardware requirements, and limits on scalability.

InterSystems Caché is a high-performance object database with a unique architecture that makes it suitable for applications that typically use in-memory databases. Caché's performance is comparable to that of in-memory databases, but Caché also provides:

- Persistence: data is not lost when a machine is turned off or crashes
- Rapid access to very large data sets
- The ability to scale to hundreds of computers and tens of thousands of users
- Simultaneous data access via SQL and objects: Java, C++, .NET, etc.

This paper explains why Caché is an attractive alternative to in-memory databases for companies that need high-speed access to large amounts of data.

Unique data engine enables persistence and high performance

Caché is a persistent database, which means that data maintained in RAM is written to disk by background processes. So how can Caché provide performance comparable to that of in-memory databases, which only periodically write data to some permanent data store? Part of the answer lies in Caché's unique architecture. Instead of the rows and columns of a traditional database, Caché uses multidimensional arrays whose structure is based on object definitions. Data is stored the way the architect designs it, and the same structures used for the in-memory cache are used on disk. Data that should be stored together is stored together. As a result, Caché can access data on disk very quickly.
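The multidimensional storage model can be pictured as a sparse, hierarchical key/value tree rather than as flat rows and columns. As a rough illustration only (this is not Caché's actual API), here is a Python sketch of storing an order and its line items together under nested subscripts:

```python
# Rough, hypothetical illustration of multidimensional (hierarchical)
# storage: data that belongs together is stored together under nested
# subscripts, so one logical record is read as a single subtree rather
# than joined from several tables.

from collections import defaultdict

def tree():
    """An auto-vivifying nested dictionary."""
    return defaultdict(tree)

db = tree()

# Store an order and its line items under one hierarchical key.
db["Order"][1001]["Customer"] = "Acme Corp"
db["Order"][1001]["Item"][1] = ("Widget", 5)
db["Order"][1001]["Item"][2] = ("Gadget", 2)

# Reading the whole order touches one subtree, not several tables.
order = db["Order"][1001]
print(order["Customer"])     # Acme Corp
print(dict(order["Item"]))
```

Because the on-disk layout mirrors this same shape, a read of one logical record stays within one region of storage instead of scattering across normalized tables.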
The requirement that multiple in-memory caches be synchronized when data is updated also reduces the performance of many distributed cache products. With Caché, the updating of data and the distribution of data to caches are logically separate. This gives it a much simpler workflow, which allows for superior performance. Caché also provides in-process bindings to C++ and Java that allow applications written in those languages to directly populate Caché's internal data structures.
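The separation of updating from distribution described above can be sketched as follows. This is a hypothetical illustration of the general pattern, not Caché's internals: a write commits to the primary store immediately, while invalidation of remote cached copies is queued as a separate, asynchronous step.

```python
# Hypothetical sketch: decoupling data updates from cache distribution.
# The write path commits to the primary store at once; stale remote
# copies are dropped later by a separate propagation step.

store = {"px:IBM": 182.5}
remote_caches = [{"px:IBM": 182.5}, {"px:IBM": 182.5}]
pending_invalidations = []

def update(key, value):
    store[key] = value                  # commit to the primary store now
    pending_invalidations.append(key)   # distribution is handled later

def propagate():
    """Asynchronous step: drop stale copies from remote caches."""
    while pending_invalidations:
        key = pending_invalidations.pop()
        for cache in remote_caches:
            cache.pop(key, None)

update("px:IBM", 183.0)
print(store["px:IBM"])      # the write is durable before any cache traffic
propagate()
print(remote_caches[0])     # stale copy removed
```

The point of the split is that the writer never waits on cache synchronization, which is the cost that slows many distributed cache products.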

The benefits of persistence

Given that Caché provides comparable performance, its ability to access data on disk confers some significant advantages over in-memory databases. The most obvious is that there is no need for a separate permanent data store. Caché is the permanent store, and it is always current. Data is not lost when a machine is turned off or crashes.

Another advantage is that, with Caché, the size of data sets is not limited by the amount of available RAM. If data is not in a local cache, it is seamlessly obtained from either a remote cache or disk. Because it is not RAM-limited, a Caché-based system can handle petabytes of data; in-memory databases cannot.

Adding RAM to a system in an attempt to increase capacity is more expensive than adding disk storage. (A terabyte of disk storage is cheaper than a terabyte of RAM.) In addition, many in-memory systems must keep redundant copies of data on separate machines to safeguard against the effects of a computer crash. Using a persistent database like Caché instead of a distributed cache system therefore often results in reduced hardware costs.

Seamless SQL and object data access

One problem shared by most in-memory databases is that, because their data structures are optimized for high-speed processing, the data is usually not readily accessible via SQL. To be compatible with most analysis and reporting tools, the data must first be mapped into relational tables. This is usually done when data is transferred from the in-memory database to the permanent data store, and typically involves an ETL (extract, transform, and load) process. (The processing overhead and additional time required for mapping is the main reason relational databases are not fast enough for extremely high-speed distributed applications, and why in-memory databases are often used instead.) A few in-memory databases are based on the relational model and offer SQL data access.
Such systems suffer from the opposite problem: their data is not readily accessible to the object-oriented technologies typically used for application development. In addition, most relational in-memory databases are not designed for multi-computer configurations. They run on only one machine, and are RAM-limited.

Caché is different, because the multidimensional arrays it uses can be exposed simultaneously as relational tables and as objects. Caché's Unified Data Architecture maintains both object and relational views of data at all times, without mapping.
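The idea of dual access — the same stored data reachable both as SQL rows and as application objects, with no ETL step in between — can be sketched in miniature. The following uses Python's sqlite3 purely as a stand-in store; the `Trade` class and this arrangement are illustrative, not Caché's Unified Data Architecture itself:

```python
import sqlite3

# Hypothetical sketch of dual SQL/object access to one copy of the data.

class Trade:
    """Application-side object view of a trade record."""
    def __init__(self, ticker, qty, price):
        self.ticker, self.qty, self.price = ticker, qty, price

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trade (ticker TEXT, qty INTEGER, price REAL)")
conn.execute("INSERT INTO trade VALUES ('IBM', 100, 182.5)")

# Relational view: ordinary SQL, usable by reporting and analysis tools.
total = conn.execute("SELECT SUM(qty * price) FROM trade").fetchone()[0]
print(total)  # 18250.0

# Object view: the same row materialized as an application object,
# with no separate ETL/mapping pass in between.
row = conn.execute("SELECT ticker, qty, price FROM trade").fetchone()
t = Trade(*row)
print(t.ticker, t.qty)  # IBM 100
```

In this sketch both views read the same single copy of the data; the contrast is with architectures where the object store and the SQL-queryable store are separate systems kept in sync by ETL.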

Fig. 1: Caché's Unified Data Architecture enables multiple ways to access data

Caché's SQL access is compatible with both ODBC and JDBC. On the object side, Caché provides bindings to any number of object-oriented languages, including Java, .NET, and C++. Caché's object representation is full-featured and supports object-oriented concepts like inheritance, polymorphism, and encapsulation.

Enterprise Cache Protocol

In multi-computer applications, Caché automatically maintains caches by use of its Enterprise Cache Protocol (ECP). With ECP, Caché instances can be configured as data servers and/or application servers. Each piece of data is owned by a data server. Application servers understand where data is located and keep local caches of recently used data. If an application server cannot satisfy a request from its local cache, it requests the necessary data from a remote data server. ECP automatically manages cache consistency.

ECP requires no application changes; applications simply treat the entire database as if it were local. This is a major distinction from some distributed cache systems, where each client needs to specify what subset of data it is interested in before any queries are performed.

One machine, one cache

Another key difference between Caché and other distributed cache products is that most other products maintain a separate cache for each process running on a machine. For example, if a single machine has eight clients, then eight individual caches will be maintained on that machine.

In contrast, Caché maintains its cache in shared memory and provides bindings that allow processes running in their own memory address spaces to access the data. Data can be accessed simultaneously through TCP-based protocols like JDBC, through language bindings, and, for exceptionally high performance, through bindings that allow applications to directly manipulate the cache.

Allowing multiple clients to share a single cache provides a number of benefits. One is that a shared-cache system has reduced memory requirements. When, as is often the case, individual clients require access to overlapping data, other distributed cache products maintain multiple copies of that data. With Caché, only a single copy needs to be maintained per machine.

Having one cache per machine also results in reduced network I/O. In high-performance applications, the network traffic associated with cache maintenance can be a major issue. With a single cache per machine, however, only that cache needs to be updated as the underlying data changes, rather than making overlapping updates to multiple caches.

Even with multi-core processors, a Caché-based system uses only one shared cache per machine, resulting in superior scalability compared with other distributed cache products. For example, in a Caché-based system of 250 machines, each with 8 cores, only 250 caches need to communicate with each other to maintain cache coherence. But systems that require a separate cache for each core would need to coordinate 2,000 caches. As modern computers may have eight, sixteen, or even more cores, the scalability advantage of Caché becomes increasingly important.

Fig. 2a: Cache coherency without InterSystems Enterprise Cache Protocol.
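The scaling difference in the 250-machine example above is worth making concrete. If every cache must be able to coordinate with every other, the number of pairwise coherence channels grows quadratically with the number of caches:

```python
# Worked arithmetic for the 250-machine, 8-core example above.
# With all-to-all coordination, n caches need n * (n - 1) / 2
# pairwise coherence channels.

def coherence_channels(n_caches):
    return n_caches * (n_caches - 1) // 2

per_machine = coherence_channels(250)      # one shared cache per machine
per_core = coherence_channels(250 * 8)     # one cache per core

print(per_machine)                 # 31125
print(per_core)                    # 1999000
print(per_core // per_machine)     # roughly 64x more channels
```

Under this all-to-all assumption, going from 250 caches to 2,000 multiplies the coordination burden by roughly 64, not merely by 8, which is why one cache per machine scales so much better.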

Fig. 2b: Cache coherency in a Caché-based system.

Populating the cache

In many distributed cache applications, pre-loading the cache can be a lengthy process. This may be due to the sheer amount of data, and/or the time required to map data from a relational store into the object-oriented structures used by the application. For some data-intensive applications, more time is spent populating in-memory caches than actually running calculations against them.

Not so with Caché. Caché's exceptional SQL capabilities allow it to easily pull data from relational primary data sources. And of course, as a persistent database, Caché may itself be the primary source. In that case, there is no need to pre-load caches at all. Local caches will automatically load the data they need as queries are run.

Another consideration is how many machines are involved in the task of populating caches. With Caché, primary ownership of the data is held by a small percentage of the computers in a distributed grid environment. Populating that environment requires access only to the ECP data servers, and they can be loaded in the background while the other computers are used for other tasks. When the application servers come online, their caches are repopulated automatically as data is requested.

In contrast, when data is loaded in most in-memory products, it is partitioned to be spread across the distributed cache so that all, or virtually all, of the data is in the memory of at least one machine. As a result, it is often not feasible to do data loads with a small subset of the computers while bringing the rest online as needed.
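The on-demand population described above — an application server serves from its local cache when it can, and fetches from the owning data server on a miss — is essentially a read-through cache. A minimal Python sketch of that pattern follows; the `DataServer` and `AppServer` classes are hypothetical illustrations, not actual ECP interfaces:

```python
# Illustrative read-through caching, in the spirit of the ECP
# application server / data server split (class names are hypothetical).

class DataServer:
    """Owns the authoritative copy of the data."""
    def __init__(self, records):
        self.records = records
        self.fetches = 0            # count remote requests, for demonstration

    def fetch(self, key):
        self.fetches += 1
        return self.records[key]

class AppServer:
    """Keeps a local cache; misses go to the remote data server."""
    def __init__(self, data_server):
        self.remote = data_server
        self.cache = {}

    def get(self, key):
        if key not in self.cache:               # local miss
            self.cache[key] = self.remote.fetch(key)
        return self.cache[key]                  # local hit thereafter

ds = DataServer({"acct:1": 500})
app = AppServer(ds)
app.get("acct:1")   # miss: fetched from the data server
app.get("acct:1")   # hit: served locally
print(ds.fetches)   # prints 1
```

Note that nothing was pre-loaded: only the data server held data up front, and the application server's cache filled itself as queries arrived.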

Conclusion

The primary reason for using in-memory databases is speed. But although they are fast, in-memory databases often suffer from poor scalability, lack of SQL support, excessive hardware requirements, and the risk of losing data due to unplanned outages.

Caché is the only persistent database that provides performance equal to that of in-memory databases. It also supports extremely large data sets, seamlessly allows data access via both SQL and objects, enables distributed systems of hundreds of machines, and is highly reliable. All of this makes Caché an attractive alternative for applications that must process very high volumes of data at very high speed.

About InterSystems

InterSystems Corporation is a global software technology leader with headquarters in Cambridge, Massachusetts, and offices in 23 countries. InterSystems provides innovative products that enable fast development, deployment, and integration of enterprise-class applications. InterSystems Caché is a high-performance object database that makes applications faster and more scalable. InterSystems Ensemble is a rapid integration and development platform that enriches applications with new functionality and makes them connectable. InterSystems HealthShare is a platform that enables the fastest creation of an Electronic Health Record for regional or national health information exchange. InterSystems DeepSee is software that makes it possible to embed real-time business intelligence in transactional applications, enabling better operational decisions. For more information, visit InterSystems.com.

InterSystems Corporation
World Headquarters
One Memorial Drive
Cambridge, MA 02142-1356
Tel: +1.617.621.0600
Fax: +1.617.494.1631
InterSystems.com

InterSystems Ensemble and InterSystems Caché are registered trademarks of InterSystems Corporation. InterSystems DeepSee and InterSystems HealthShare are trademarks of InterSystems Corporation. Other product names are trademarks of their respective vendors.
Copyright © 2010 InterSystems Corporation. All rights reserved. 1-10