Composite Data Virtualization and NOSQL Data Stores




Composite Software, October 2010

TABLE OF CONTENTS

INTRODUCTION
BUSINESS AND IT DRIVERS
NOSQL DATA STORES LANDSCAPE
    Tabular / Columnar Data Stores
    Document Stores
    Graph Databases
    Key/Value Stores
    Object and Multi-value Databases
    Miscellaneous NOSQL Sources
INTEGRATING NOSQL DATA STORES USING DATA VIRTUALIZATION
    Tabular/Columnar Data Stores
    XML Document Stores
    Key/Value Stores
SUMMARY

INTRODUCTION

There is a trend in the data storage and management arena to consider data storage options beyond the traditional SQL-based relational database. The overall movement began in 2009 and was known as NoSQL (meaning "no SQL"), but that label has since evolved into NOSQL (meaning "not only SQL"). Unfortunately, both labels say more about what this class of data stores is not than about what it is, and this is a source of ongoing confusion.

The general definition of a NOSQL data store is that it manages data that is not strictly tabular and relational, so it does not make sense to use SQL for the creation and retrieval of the data. More specifically, NOSQL data stores are usually non-relational, distributed, open-source, and horizontally scalable, although there are exceptions to each of these for specific NOSQL data stores.

While NOSQL access standards have yet to fully develop, each implementation provides some sort of Java-based development API appropriate for accessing that type of NOSQL data. The Composite Data Virtualization Platform typically uses these APIs to access and integrate NOSQL data, and three kinds of NOSQL data sources are a natural integration fit.

This paper describes the primary NOSQL data sources in the market today and how to integrate them with other sources using the Composite Data Virtualization Platform.

BUSINESS AND IT DRIVERS

The main driver for the creation of NOSQL data stores was the emergence of web-scale data, i.e., massive amounts of data at large web sites and services such as Amazon, Google, Yahoo!, and Facebook. A number of NOSQL data stores emerged from custom engineering development done at these large companies. Recently, predictive analytics, voice-of-the-customer, churn, fraud, and other big data use cases have emerged to further accelerate demand. Storing and processing this data revealed several specific motivations for these new data stores, including:

Cost per Terabyte: Many of the NOSQL data sources were invented to handle web-scale data that is created in enormous volumes (e.g., web site click streams), and storing this much data in a traditional relational database would be expensive and inefficient. Many of the NOSQL data sources are open source and run on commodity hardware, making them considerably less expensive per terabyte than traditional databases from vendors like Oracle and Teradata.

Distributed Processing: Web-scale data is so large that the traditional database approach to storage, indexing, and retrieval does not work very well with this class of data. NOSQL data sources introduce storage architectures that scale horizontally, along with parallel algorithms designed to efficiently process the distributed data ("map-reduce" being the most prominent example).

Data Shape Appropriateness: Many successful web-based services have introduced data that is not efficiently represented as relational, motivating new data structures more appropriate to the data. For example, social media web sites employ graph databases to represent the social relationships inherent in these services.

NOSQL DATA STORES LANDSCAPE

Although the original emergence of NOSQL data stores was motivated by web-scale data, the movement has grown to encompass a wide variety of data stores that just happen not to use SQL as their processing language (making it difficult to characterize exactly what a NOSQL data store is). There is no general agreement on the taxonomy of NOSQL data stores, but the categories below capture much of the landscape.

Tabular / Columnar Data Stores

Storing sparse tabular data, these stores look most like traditional tabular databases. Examples include Hadoop/HBase (Yahoo!), BigTable (Google), Hypertable and VoltDB. Their primary data retrieval paradigm utilizes column filters, generally leveraging hand-coded map-reduce algorithms.

Document Stores

These NOSQL data sources store unstructured (e.g., text) or semi-structured (e.g., XML) documents. Examples include MongoDB, MarkLogic and CouchDB. Their data retrieval paradigm varies widely, but documents can always be retrieved by unique handle. XML data sources leverage XQuery. Text documents are indexed, facilitating keyword search-like retrieval.

Graph Databases

These NOSQL sources store graph-oriented data with nodes, edges, and properties and are commonly used to store associations in social networks. Examples include Neo4j, AllegroGraph and FlockDB. Data retrieval focuses on retrieving associations from a particular node.

Key/Value Stores

These sources store simple key/value pairs like a traditional hashtable. They are further subdivided into in-memory and disk-based solutions. This category of NOSQL systems probably has the largest number of members, each embodying slightly different characteristics. Examples include Memcached, Cassandra (Facebook), SimpleDB, Dynamo (Amazon), Voldemort (LinkedIn) and Kyoto Cabinet. Their data retrieval paradigm is simple: given a key, return the value. Some offer more complex querying mechanisms that can look inside the value, but normally the value is considered opaque.

Object and Multi-value Databases

These types of stores preceded the NOSQL movement, but they have found new life as part of the movement. Object databases store objects (as in object-oriented programming). Multi-value databases store tabular data, but individual cells can store multiple values. Examples include Objectivity, GemStone and Unidata. Proprietary query languages are used to retrieve data.

Miscellaneous NOSQL Sources

Several other data stores can be classified as NOSQL stores, but they don't fit into any of the categories above. Examples include GT.M, IBM Lotus/Domino, and the ISIS family.
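To make the retrieval paradigms of the four main categories concrete, the following is a minimal sketch using plain in-memory Python structures. Every name and data value here is hypothetical and chosen only for illustration; none of it corresponds to the API of any product named above.

```python
# Illustrative in-memory stand-ins; none of these are real product APIs.

# Tabular/columnar: sparse rows, retrieved with a column filter.
sparse_rows = {
    "row1": {"name": "Ann", "city": "Oslo"},
    "row2": {"name": "Bob"},  # no "city" column is stored for this row
}

def column_filter(rows, column, value):
    """Return the keys of rows whose given column equals value."""
    return sorted(k for k, cols in rows.items() if cols.get(column) == value)

# Document store: a document is always retrievable by its unique handle.
documents = {"doc-17": "<order><id>17</id></order>"}

def get_document(handle):
    return documents[handle]

# Graph database: retrieve associations starting from one node.
edges = [
    ("ann", "follows", "bob"),
    ("ann", "follows", "cy"),
    ("bob", "follows", "cy"),
]

def associations(node, label):
    """Return the nodes reached from `node` over edges with this label."""
    return sorted(dst for src, lbl, dst in edges if src == node and lbl == label)

# Key/value store: given a key, return the (opaque) value.
kv = {"session:42": b"\x00\x01opaque-bytes"}

def get_value(key):
    return kv[key]
```

The sketch only contrasts access patterns; real stores add distribution, persistence, and indexing that are not modeled here.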

INTEGRATING NOSQL DATA STORES USING DATA VIRTUALIZATION

The Composite Data Virtualization Platform provides a complete development and runtime environment for discovering, accessing, federating, abstracting and delivering data from diverse sources. Access is typically done via standards-based protocols and APIs, for example JDBC and ODBC for SQL-based sources, HTTP and SOAP for Web services, JMS for messages, and APIs for enterprise and cloud-based applications. Through these methods, source data is securely exposed from a single virtual location, regardless of how and where it is physically stored.

While NOSQL access standards have yet to fully develop, each implementation provides some sort of Java-based development API appropriate for accessing that type of NOSQL data. The Composite Data Virtualization Platform uses these APIs, together with Composite's Custom Java Procedure (CJP) resource, to access and integrate NOSQL data. Three kinds of NOSQL systems are a particularly natural fit for this integration approach: Tabular/Columnar Data Stores, XML Document Stores, and Key/Value Stores. A more detailed integration approach for each of these is outlined below. Over time, as NOSQL leaders emerge and usage patterns solidify, Composite may elect to provide more in-depth integrations with particular NOSQL data stores through the creation of fully supported adapters.

Tabular/Columnar Data Stores

Because the original implementation of the Composite Data Virtualization Platform integrated tabular data, retrieving and processing data from this category of NOSQL data store is an easy fit. This approach leverages Composite's ability to incorporate table functions in the FROM clause of a SQL statement. That is, any Composite procedure resource that returns a cursor can be dropped into the View editor as a table, where it will show up in the FROM clause of the SQL statement.
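The table-function idea can be sketched as follows. This is a Python analogy only: actual CJPs are written in Java against Composite's own interfaces, which this paper does not show, so the function names, the sample data, and the join below are all assumptions made for illustration. The point is simply that a procedure returning a cursor of rows can be treated as a table and joined with other data.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical stand-in for data held in a tabular NOSQL store.
_CLICKS = [
    {"customer_id": 1, "page": "/home", "hits": 120},
    {"customer_id": 2, "page": "/cart", "hits": 15},
    {"customer_id": 1, "page": "/cart", "hits": 40},
]

def clicks_table(page: Optional[str] = None) -> Iterator[Tuple[int, str, int]]:
    """Procedure that returns a cursor: yields rows (customer_id, page, hits).

    The optional input argument is applied during the scan, so only
    matching rows are ever handed back to the caller.
    """
    for rec in _CLICKS:
        if page is None or rec["page"] == page:
            yield (rec["customer_id"], rec["page"], rec["hits"])

# A conventional relational table to federate with (also hypothetical).
CUSTOMERS: Dict[int, str] = {1: "Acme", 2: "Globex"}

def customer_hits(page: Optional[str] = None) -> List[Tuple[str, str, int]]:
    """Roughly: SELECT c.name, t.page, t.hits
                FROM clicks_table(page) t JOIN customers c
                  ON c.id = t.customer_id"""
    return [(CUSTOMERS[cid], pg, hits) for cid, pg, hits in clicks_table(page)]
```

Calling `customer_hits("/cart")` joins only the rows for that page, which mirrors how a filtered table function keeps the amount of data pulled from the source small.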
For a specific NOSQL data store, a collection of CJP table functions can be implemented that leverage the NOSQL system's Java API. Each CJP would provide access to a different table in the underlying NOSQL data store. The CJPs can take input arguments to filter the data from the table, further leveraging the NOSQL system's processing capability. The values of the filters can even be specified at run-time from a client query by leveraging the virtual column capability of Views.

It is worth remembering that these tabular/columnar NOSQL data sources store very large data sets, so caution must be used on large queries. The table function implementation should ensure sufficient data reduction in the target data source by leveraging input parameters. Also, the processing of requests to these data sources can take a very long time (more like batch jobs than live queries), so employing some form of caching would probably be prudent.

This approach provides full access to the data in the underlying NOSQL system and it will likely meet most near-term needs. There are, however, some disadvantages and inefficiencies in this approach. For example, all the columns specified in the CJP's cursor would always be retrieved, even if they weren't all necessary for the current query. Also, more generic filtering and aggregation might be possible with the underlying system, but the CJP provides only a limited interface to expose that capability to Composite. If a particular NOSQL tabular data store becomes quite popular, it would be an ideal candidate for Composite to develop a custom adapter that would fully integrate and leverage that specific data source's capabilities.

XML Document Stores

Because XML document stores utilize XQuery as their preferred data retrieval paradigm, the Composite Data Virtualization Platform leverages its embedded XQuery engines and native XML data type to easily retrieve and further process documents from this category of NOSQL data store.

For a specific NOSQL XML document store with a Java API, a minimum of two CJPs is required. Both CJPs return an XML document that can be further manipulated by any of the upstream XML manipulation functionality (e.g., XSLT transformations). The first CJP would take a document handle (unique identifier) as its only input argument, and then leverage the API to retrieve and return that document. The second CJP would take an XQuery specification as its only input argument, and then leverage the API to execute the query and return the results as a single document. Of course, additional CJPs accepting more specific parameters could also be implemented, facilitating easier integration into multiple views.

This approach provides full access to the data in the underlying XML data source, and it will likely be sufficient for most needs.

Key/Value Stores

The Composite Data Virtualization Platform can integrate key/value stores in two ways. The first is through a custom SQL function. That is, a function can be created that takes the key as a parameter and returns the value. This function can then be used in multiple SQL statements throughout Composite.

In the second, Composite leverages the in-memory key/value store as a cache target. This is the primary use case typically described by our enterprise customers. This approach is best for small data sets or procedure results, but it doesn't work as well for large tabular data sets.
Further, this form of cache integration is often challenged by the impedance mismatch between cached tabular data and cached key/value data: the cached data is opaque inside the key/value store, so the entire set must be retrieved for processing. This form of integration is available today from our professional services organization.
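The two key/value integration styles, and the impedance mismatch noted above, can be illustrated with a minimal sketch. It is written in Python for brevity and does not represent Composite's actual function or cache interfaces; the `KeyValueStore` class, the function names, and the choice of JSON as the serialization format are all assumptions made for this example.

```python
import json

class KeyValueStore:
    """Hypothetical in-memory key/value store; values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

store = KeyValueStore()

def kv_lookup(key):
    """Style 1, custom-function access: take a key, return the value."""
    return store.get(key)

def cache_result(cache_key, rows):
    """Style 2, cache target: a whole tabular result becomes one opaque value."""
    store.put(cache_key, json.dumps(rows).encode("utf-8"))

def query_cached(cache_key, predicate):
    """Filter a cached result set.

    The store cannot look inside the value, so the entire set is
    retrieved and decoded before any row can be filtered.
    """
    blob = store.get(cache_key)
    if blob is None:
        return None  # cache miss
    rows = json.loads(blob.decode("utf-8"))
    return [row for row in rows if predicate(row)]
```

Because `query_cached` must decode every row even when the predicate keeps only one, the cost grows with the size of the cached set, which is consistent with the observation that this approach suits small data sets better than large tabular ones.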

SUMMARY

NOSQL data stores are proliferating as a means of supporting web-scale data. Recently, predictive analytics, voice-of-the-customer, churn, fraud, and other big data use cases have emerged to further accelerate demand. There is a wide variety of NOSQL systems, each with its own set of use cases and advantages. Each NOSQL data store has a unique, non-standard API that can be used to access and integrate it. The Composite Data Virtualization Platform is well suited for integrating data from these NOSQL sources with other data within and outside the enterprise.

This paper describes integrations for three flavors of NOSQL data stores: Tabular/Columnar Data Stores, XML Document Stores, and In-Memory Key/Value Stores. Today, Composite can provide basic access to data from any of these NOSQL data stores with minimal programming, using standard resources. In the longer term, when leaders in particular areas of the NOSQL landscape emerge, Composite may provide deeper integrations through standard product adapters within the Composite Application Data Services product line.

ABOUT COMPOSITE SOFTWARE

Composite Software, Inc. is the data virtualization gold standard at ten of the top 20 banks, six of the top ten pharmaceutical companies, four of the top five energy firms, major media and technology organizations, and multiple government agencies. These are among the hundreds of global organizations with disparate, complex information environments that count on Composite to increase their data agility, cut costs and reduce risk. Backed by nearly a decade of pioneering R&D, Composite is the data virtualization performance leader, scaling from project to enterprise for data federation, data warehouse extension, enterprise data sharing, and real-time and cloud computing data integration. Founded in 2002, Composite Software is a privately held, venture-funded corporation based in Silicon Valley. For more information, please visit www.compositesw.com.