Composite Data Virtualization and NOSQL Data Stores
Composite Software
October 2010
TABLE OF CONTENTS

INTRODUCTION
BUSINESS AND IT DRIVERS
NOSQL DATA STORES LANDSCAPE
    TABULAR / COLUMNAR DATA STORES
    DOCUMENT STORES
    GRAPH DATABASES
    KEY/VALUE STORES
    OBJECT AND MULTI-VALUE DATABASES
    MISCELLANEOUS NOSQL SOURCES
INTEGRATING NOSQL DATA STORES USING DATA VIRTUALIZATION
    TABULAR/COLUMNAR DATA STORES
    XML DOCUMENT STORES
    KEY/VALUE STORES
SUMMARY
INTRODUCTION

There is a trend in the data storage and management arena to consider data storage options beyond the traditional SQL-based relational database. The overall movement began in 2009 and was known as NoSQL (meaning "no SQL"), but that label has since evolved into NOSQL (meaning "not only SQL"). Unfortunately, both labels say more about what these data stores are not than what they are, and this is the source of ongoing confusion about this whole class of data stores.

The general definition of a NOSQL data store is that it manages data that is not strictly tabular and relational, so it does not make sense to use SQL for the creation and retrieval of the data. More specifically, NOSQL data stores are usually non-relational, distributed, open-source, and horizontally scalable, although there are exceptions to each of these for specific NOSQL data stores.

While NOSQL access standards have yet to fully develop, each implementation provides some sort of Java-based development API appropriate for accessing that type of NOSQL data. The Composite Data Virtualization Platform typically uses these APIs to access and integrate NOSQL data, and three kinds of NOSQL data sources are a natural fit for this integration approach.

This paper describes the primary NOSQL data sources in the market today and how to integrate them with other sources using the Composite Data Virtualization Platform.
BUSINESS AND IT DRIVERS

The main driver for the creation of NOSQL data stores was the emergence of web-scale data, i.e., massive amounts of data at large web sites and services like Amazon, Google, Yahoo!, and Facebook. A number of NOSQL data stores emerged from custom engineering development done at these large companies. Recently, predictive analytics, voice-of-the-customer, churn, fraud and other big data use cases have emerged to further accelerate demand. Storing and processing this data revealed several specific motivations for these new data stores, including:

Cost per Terabyte: Many of the NOSQL data sources were invented to handle web-scale data that is created in enormous volumes (e.g., web site click streams), and storing this much data in a traditional relational database would be expensive and inefficient. Many of the NOSQL data sources are open source and run on commodity hardware, making them considerably less expensive per terabyte than traditional databases from vendors like Oracle and Teradata.

Distributed Processing: Web-scale data is so large that the traditional database approach to storage, indexing, and retrieval does not work very well with this class of data. NOSQL data sources introduce storage architectures that scale horizontally, along with parallel algorithms designed to efficiently process the distributed data (map-reduce being the most prominent example).

Data Shape Appropriateness: Many successful web-based services have introduced data that is not efficiently represented as relational, motivating new data structures more appropriate to the data. For example, social media web sites employ graph databases to represent the social relationships inherent in these services.
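
As a rough illustration of the map-reduce idea mentioned under Distributed Processing, the sketch below counts page views in a click stream that is split into partitions. It is a single-process toy, not how any particular NOSQL product implements map-reduce; the partition contents and the function names map_partition and merge_counts are assumptions made up for this example.

```python
from collections import Counter
from functools import reduce

# Hypothetical click-stream records, split into partitions as if each
# partition were stored on a separate node.
partitions = [
    ["/home", "/cart", "/home"],
    ["/cart", "/home", "/checkout"],
]

def map_partition(urls):
    """Map step: count page views within one partition only."""
    return Counter(urls)

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Each partition is processed independently, then the partial results are merged.
partial_counts = [map_partition(p) for p in partitions]
page_views = reduce(merge_counts, partial_counts, Counter())
```

Because each map step only touches its own partition, the work could in principle run on the node that holds the data, with only the small partial counts sent over the network for merging.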
NOSQL DATA STORES LANDSCAPE

Although the original emergence of NOSQL data stores was motivated by web-scale data, the movement has grown to encompass a wide variety of data stores that just happen not to use SQL as their processing language (making it difficult to characterize exactly what a NOSQL data store is). There is no general agreement on the taxonomy of NOSQL data stores, but the categories below capture much of the landscape.

Tabular / Columnar Data Stores

Storing sparse tabular data, these stores look most like traditional tabular databases. Examples include Hadoop/HBase (Yahoo!), BigTable (Google), Hypertable and VoltDB. Their primary data retrieval paradigm utilizes column filters, generally leveraging hand-coded map-reduce algorithms.

Document Stores

These NOSQL data sources store unstructured (e.g., text) or semi-structured (e.g., XML) documents. Examples include MongoDB, Mark Logic and CouchDB. Their data retrieval paradigm varies widely, but documents can always be retrieved by unique handle. XML data sources leverage XQuery. Text documents are indexed, facilitating keyword search-like retrieval.

Graph Databases

These NOSQL sources store graph-oriented data with nodes, edges, and properties and are commonly used to store associations in social networks. Examples include Neo4J, AllegroGraph and FlockDB. Data retrieval focuses on retrieving associations from a particular node.

Key/Value Stores

These sources store simple key/value pairs like a traditional hashtable. They are further subdivided into in-memory and disk-based solutions. This category of NOSQL systems probably has the largest number of members, each embodying slightly different characteristics. Examples include Memcached, Cassandra (Facebook), SimpleDB, Dynamo (Amazon), Voldemort (LinkedIn) and Kyoto Cabinet. Their data retrieval paradigm is simple: given a key, return the value.
Some offer more complex querying mechanisms that can look inside the value, but normally the value is considered opaque.

Object and Multi-value Databases

These types of stores preceded the NOSQL movement, but they have found new life as part of it. Object databases store objects (as in object-oriented programming). Multi-value databases store tabular data, but individual cells can store multiple values. Examples include Objectivity, GemStone and Unidata. Proprietary query languages are used to retrieve data.

Miscellaneous NOSQL Sources

Several other data stores can be classified as NOSQL stores, but they don't fit into any of the categories above. Examples include GT.M, IBM Lotus/Domino, and the ISIS family.
INTEGRATING NOSQL DATA STORES USING DATA VIRTUALIZATION

The Composite Data Virtualization Platform provides a complete development and runtime environment for discovering, accessing, federating, abstracting and delivering data from diverse sources. Access is typically done via standards-based protocols and APIs, for example JDBC and ODBC for SQL-based sources, HTTP and SOAP for Web services, JMS for messages, and APIs for enterprise and cloud-based applications. Through these methods, source data is securely exposed from a single virtual location, regardless of how and where it is physically stored.

While NOSQL access standards have yet to fully develop, each implementation provides some sort of Java-based development API appropriate for accessing that type of NOSQL data. The Composite Data Virtualization Platform uses these APIs as well as Composite's Custom Java Procedure (CJP) resource to access and integrate NOSQL data. Three kinds of NOSQL systems are a particularly natural fit for this integration approach: Tabular/Columnar Data Stores, XML Document Stores, and Key/Value Stores. A more detailed integration approach for each of these is outlined below. Over time, as NOSQL leaders emerge and usage patterns solidify, Composite may elect to provide more in-depth integrations with particular NOSQL data stores through the creation of fully supported adapters.

Tabular/Columnar Data Stores

Because the original implementation of the Composite Data Virtualization Platform integrated tabular data, retrieving and processing data from this category of NOSQL data store is an easy fit. This approach leverages Composite's ability to incorporate table functions in the FROM clause of a SQL statement. That is, any Composite procedure resource that returns a cursor can be dropped into the View editor as a table, where it will show up in the FROM clause of the SQL statement.
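
The table-function idea can be sketched in a language-neutral way as follows. This is not the CJP interface, which is Java-based and not shown in this paper; it is a minimal Python analogy in which a generator plays the role of a procedure that returns a cursor, and an in-memory dictionary stands in for the NOSQL store. The names CLICK_TABLE, CUSTOMERS, clicks_by_region and federated_view, and all sample data, are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

# Stand-in for a tabular/columnar NOSQL store: row key -> sparse column map.
# A real implementation would call the store's own client API instead.
CLICK_TABLE: Dict[str, Dict[str, str]] = {
    "row1": {"customer_id": "c1", "page": "/home", "region": "EMEA"},
    "row2": {"customer_id": "c2", "page": "/cart", "region": "APAC"},
    "row3": {"customer_id": "c1", "page": "/cart"},  # sparse: no region column
}

# Stand-in for relational data from another source (hypothetical names).
CUSTOMERS: Dict[str, str] = {"c1": "Acme Corp", "c2": "Globex"}

def clicks_by_region(region: str) -> Iterator[Tuple[str, str, str]]:
    """Hypothetical table function: yields fixed-shape rows like a cursor.

    The region argument is applied while scanning the store, so only
    matching rows are handed back to the caller.
    """
    for row_key, columns in CLICK_TABLE.items():
        if columns.get("region") == region:
            yield (row_key, columns["customer_id"], columns["page"])

def federated_view(region: str) -> List[Tuple[str, str]]:
    """Joins the table function's rows with the customer data, much as a
    view with the function in its FROM clause would."""
    return [
        (CUSTOMERS[customer_id], page)
        for _, customer_id, page in clicks_by_region(region)
    ]

emea_clicks = federated_view("EMEA")
```

The point of the sketch is the shape of the contract: the function takes filter arguments, returns rows with a fixed set of columns, and can then be combined with other sources as if it were a table.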
For a specific NOSQL data store, a collection of CJP table functions can be implemented that leverage the NOSQL system's Java API. Each CJP would provide access to a different table in the underlying NOSQL data store. The CJPs can take input arguments to filter the data from the table, further leveraging the NOSQL system's processing capability. The values of the filters can even be specified at run-time from a client query by leveraging the virtual column capability of Views.

It is worth remembering that these tabular/columnar NOSQL data sources store very large data sets, so caution must be used on large queries. The table function implementation should ensure sufficient data reduction in the target data source by leveraging input parameters. Also, the processing of requests to these data sources can take a very long time (more like batch jobs than live queries), so employing some form of caching would probably be prudent.

This approach provides full access to the data in the underlying NOSQL system and it will likely meet most near-term needs. There are, however, some disadvantages and inefficiencies in this approach. For example, all the columns specified in the CJP's cursor would always be retrieved, even if they weren't all necessary for the current query. Also, more generic filtering and aggregation might be possible with the underlying system, but the CJP provides only a limited interface to expose that capability to Composite. If a particular NOSQL tabular data
store becomes quite popular, it would be an ideal candidate for Composite to develop a custom adapter that would fully integrate and leverage that specific data source's capabilities.

XML Document Stores

Because XML document stores utilize XQuery as their preferred data retrieval paradigm, the Composite Data Virtualization Platform leverages its embedded XQuery engines and native XML data type to easily retrieve and further process documents from this category of NOSQL data store.

For a specific NOSQL XML document store with a Java API, at least two CJPs are required. Both CJPs return an XML document that can be further manipulated by any of the upstream XML manipulation functionality (e.g., XSLT transformations). The first CJP would take a document handle (unique identifier) as its only input argument, and then leverage the API to retrieve and return that document. The second CJP would take an XQuery specification as its only input argument, and then leverage the API to execute the query and return the results as a single document. Of course, additional CJPs accepting more specific parameters could also be implemented, facilitating easier integration into multiple views.

This approach provides full access to the data in the underlying XML data source, and it will likely be sufficient for most needs.

Key/Value Stores

The Composite Data Virtualization Platform can integrate key/value stores in two ways. The first is through a custom SQL function. That is, a function can be created that takes the key as a parameter and returns the value. This function can then be used in multiple SQL statements throughout Composite.

In the second, Composite leverages the in-memory key/value store as a cache target. This is the primary use case typically described by our enterprise customers. This approach is best for small data sets or procedure results, but it doesn't work as well for large tabular data sets.
Further, this form of cache integration is often challenged by the impedance mismatch between tabular data and key/value data: the cached data is opaque inside the key/value store, so the entire cached set must be retrieved before it can be processed. This form of integration is available today from our professional services organization.
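
Both key/value integration styles, and the impedance mismatch noted above, can be illustrated with a small sketch. It uses a plain Python dictionary as a stand-in for an in-memory key/value store and JSON as an assumed serialization format; the names kv_store, kv_lookup, cache_result_set and read_cached_rows are hypothetical and do not correspond to any Composite or vendor API.

```python
import json
from typing import Dict, List, Optional

# Stand-in for an in-memory key/value store; values are opaque strings.
kv_store: Dict[str, str] = {}

def kv_lookup(key: str) -> Optional[str]:
    """First style: a function that takes a key and returns the value
    (None when the key is absent)."""
    return kv_store.get(key)

def cache_result_set(cache_key: str, rows: List[dict]) -> None:
    """Second style: cache a whole tabular result as one serialized value."""
    kv_store[cache_key] = json.dumps(rows)

def read_cached_rows(cache_key: str, region: str) -> List[dict]:
    """The store cannot filter inside the value, so the entire cached set
    is fetched and deserialized before any rows are filtered."""
    raw = kv_lookup(cache_key)
    if raw is None:
        return []
    return [row for row in json.loads(raw) if row["region"] == region]

cache_result_set("sales_view", [
    {"order_id": 1, "region": "EMEA"},
    {"order_id": 2, "region": "APAC"},
])
emea_rows = read_cached_rows("sales_view", "EMEA")
```

Even though only one row is wanted, read_cached_rows has to pull back and parse every cached row, which is why this style suits small result sets better than large tabular ones.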
SUMMARY

NOSQL data stores are proliferating as a means of supporting web-scale data. Recently, predictive analytics, voice-of-the-customer, churn, fraud and other big data use cases have emerged to further accelerate demand. There is a wide variety of NOSQL systems, each with its own set of use cases and advantages. Each NOSQL data store has a unique, non-standard API that can be used to access and integrate its data.

The Composite Data Virtualization Platform is well suited for integrating data from these NOSQL sources with other data within and outside the enterprise. This paper describes integrations for three flavors of NOSQL data stores: Tabular/Columnar Data Stores, XML Document Stores, and In-Memory Key/Value Stores.

Today, Composite can provide basic access to data from any of these NOSQL data stores with minimal programming, using standard resources. In the longer term, when leaders in particular areas of the NOSQL landscape emerge, Composite may provide deeper integrations through standard product adapters within the Composite Application Data Services product line.
ABOUT COMPOSITE SOFTWARE

Composite Software, Inc. is the data virtualization gold standard at ten of the top 20 banks, six of the top ten pharmaceutical companies, four of the top five energy firms, major media and technology organizations, and multiple government agencies. These are among the hundreds of global organizations with disparate, complex information environments that count on Composite to increase their data agility, cut costs and reduce risk. Backed by nearly a decade of pioneering R&D, Composite is the data virtualization performance leader, scaling from project to enterprise for data federation, data warehouse extension, enterprise data sharing, real-time and cloud computing data integration. Founded in 2002, Composite Software is a privately held, venture-funded corporation based in Silicon Valley. For more information, please visit www.compositesw.com.