Focus: Multiparadigm Programming

Multiparadigm Data Storage for Enterprise Applications




Debasish Ghosh, Anshin Software

Storing data the same way it's used in an application simplifies the programming model, making it easier to decentralize data processing. Emerging NoSQL data-storage engines support this strategy.

Regardless of the paradigm used to model the application domain, most enterprise applications use the relational model for data storage. Relational technology is mature, widely understood, and successfully deployed in countless applications. However, its dominance has also had some undesirable consequences for application development.

For an application that models the business logic in an object-oriented way, the developer faces an impedance mismatch between the application's object model and the data's relational model. Object-relational mapping (ORM) frameworks exist to bridge this divide, but ORMs aren't trivial to use and often introduce more complexity than the problem they solve [1].

Relational databases use a generic storage model based on tables and columns, so it's no wonder they don't scale well for all data types. Google designed BigTable [2] and Amazon came up with Dynamo [3] specifically to address these issues. Both were implemented as distributed data stores designed to scale to a very large size. Recent experiences in data modeling with social networking applications such as Facebook [4], Twitter [5, 6], and Digg [7] also demonstrate this deficiency.

One way to address this problem is to model data so that it remains closer to the way the application uses it. On the basis of practical experiences, I describe a multiparadigm data-storage model that permits applications to work with data in a way that's semantically closer to its usage patterns.
IEEE Software, September/October 2010, pp. 57-60. 0740-7459/10/$26.00 © 2010 IEEE.

Toward a Semantically Richer Data Storage

Consider an enterprise application module that must handle document sets that can have many optional attributes: for example, an Address Book that can have fields like Name, House Number, Street Name, City, Zip Code, and Country Code. It can optionally include a list of telephone numbers and email addresses as well. With a relational model, you design it as a table with nullable attributes for optional items. For repeating entries, you use normalization techniques and store data in multiple tables to avoid data redundancy and inconsistent updates and deletes.

Distributing data across multiple tables has the effect of destructuring the semantics of the way your application looks at the data. The application would prefer to store the Address Book in a single structure that keeps the document whole and consistent with the way the domain model would use it. Today, we have data stores that let us model our data layer exactly this way. CouchDB (http://couchdb.org) and MongoDB (http://mongodb.org) store data in JavaScript Object Notation (JSON, http://json.org) documents and let applications manipulate document structures directly through their query engines.

Consider another example where your application must store various routes across cities to find optimal shipping strategies for your clients. Typically, you think of this as a graph, with the cities being the nodes and the connecting routes being the edges. You could store the structure in a relational database and use SQL queries that employ complicated joins across multiple tables. Or you could store the data in a graph database like Neo4J (http://neo4j.org) that offers nodes and relationships as first-class abstractions and various graph-manipulation APIs for use directly within the application layer.

Both of these examples have one thing in common. They express a need for a data model much richer than the universal table/column-based representation that a relational database offers: a need for a specialized representation of each individual data type used in your application.

Specialized representation also implies specific query languages for each data store. This is a benefit in that you can use the most expressive language to query your data structures, but it also means learning a multitude of languages and their best practices. SQL is no longer the universal query interface, so these new data stores are popularly referred to as NoSQL stores. NoSQL has many connotations, but the most popular one today is "Not Only SQL."

Multiparadigm Programming with Data

When you're using multiple data-representation techniques, you need the right tool for the right job. When you're working on a large-scale application, use data stores that meet your application's access-pattern requirements and offer the desired level of scalability and performance guarantees.
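To make the "right tool for the right job" point concrete, here is a minimal Python sketch of the Address Book example stored both ways. It is an illustration under stated assumptions, not the API of any product named in this article: the sample entry, the table layout, and the helper names (store_relational, load_relational, store_document, load_document) are all hypothetical, SQLite stands in for the relational database, and a plain dictionary of JSON strings stands in for a document store such as CouchDB or MongoDB.

```python
import json
import sqlite3

# Scalar Address Book fields; everything except the name is optional.
FIELDS = ("name", "house_number", "street_name", "city", "zip_code", "country_code")

# Hypothetical sample entry, shaped the way the application uses it.
entry = {
    "name": "Ada Example",
    "house_number": "12",
    "street_name": "Main Street",
    "city": "Kolkata",
    "zip_code": "700001",
    "country_code": "IN",
    "phones": ["+91-33-5550-0100", "+91-33-5550-0101"],
    "emails": ["ada@example.com"],
}


def store_relational(conn, doc):
    """Destructure one entry into three normalized tables."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS contact (
            id INTEGER PRIMARY KEY, name TEXT NOT NULL, house_number TEXT,
            street_name TEXT, city TEXT, zip_code TEXT, country_code TEXT);
        CREATE TABLE IF NOT EXISTS phone (contact_id INTEGER, number TEXT);
        CREATE TABLE IF NOT EXISTS email (contact_id INTEGER, address TEXT);
        """
    )
    cur = conn.execute(
        "INSERT INTO contact (name, house_number, street_name, city, zip_code, country_code) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        tuple(doc.get(field) for field in FIELDS),  # missing fields become NULL
    )
    contact_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO phone (contact_id, number) VALUES (?, ?)",
        [(contact_id, number) for number in doc.get("phones", [])],
    )
    conn.executemany(
        "INSERT INTO email (contact_id, address) VALUES (?, ?)",
        [(contact_id, address) for address in doc.get("emails", [])],
    )
    conn.commit()
    return contact_id


def load_relational(conn, contact_id):
    """Reassemble the entry: one query per table, then stitch the parts together."""
    row = conn.execute(
        "SELECT name, house_number, street_name, city, zip_code, country_code "
        "FROM contact WHERE id = ?",
        (contact_id,),
    ).fetchone()
    doc = {field: value for field, value in zip(FIELDS, row) if value is not None}
    phones = [r[0] for r in conn.execute(
        "SELECT number FROM phone WHERE contact_id = ? ORDER BY rowid", (contact_id,))]
    emails = [r[0] for r in conn.execute(
        "SELECT address FROM email WHERE contact_id = ? ORDER BY rowid", (contact_id,))]
    if phones:
        doc["phones"] = phones
    if emails:
        doc["emails"] = emails
    return doc


def store_document(store, key, doc):
    """Document-style storage: the entry is written whole, as JSON."""
    store[key] = json.dumps(doc)


def load_document(store, key):
    """Document-style retrieval: the entry comes back in the shape it was stored."""
    return json.loads(store[key])
```

Both paths round-trip the same entry, but the relational path needs a schema, nullable columns, two child tables, and reassembly code, while the document path stores and returns the structure as the application sees it. What the sketch leaves out is what the relational side gives in return, such as constraints and ad hoc joins across entries.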
A relational database management system (RDBMS) engine is the right tool for handling relational data used in transactions requiring atomicity, consistency, isolation, and durability (ACID). However, an RDBMS isn't an ideal platform for modeling complicated social data networks that involve huge volumes, network partitioning, and replication. Graph databases like Neo4J model such relationships much better. CouchDB offers offline data-processing capabilities through replication techniques and allows synchronization with other copies at a later time. MongoDB has blazing-fast in-memory operations. Cassandra (http://cassandra.org) supports decentralized data storage for efficient columnar access from your application. It has the fault tolerance of Dynamo while offering a more advanced data model. For applications that need huge write scalability, Cassandra has proved to be a very good option. If you need write scalability, Riak (http://riak.basho.com) is another option that models a key-value data store and offers decentralized access, availability, and network-partition tolerance.

The underlying idea is to use each data store's strengths to meet your data model's requirements. This brings the data's storage model closer to the application-domain model that uses it. Plus, it gives you the scalability benefits these engines offer. The result is a multiparadigm strategy for data management.

With all these NoSQL stores acting as the interface to your application's domain model, you might still need an underlying relational database to serve as the system of record for generating reports and audit trails, running other batch processes, and so on. NoSQL stores don't scale well for such jobs, and the RDBMS world offers lots of tool support in these areas.

Eventual Consistency

However, one question still remains: How do you keep the underlying relational store in sync with the other data stores? One option is to use asynchronous messaging as the backbone for an integration layer.
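A minimal, in-process Python sketch of that option might look like the following. Everything in it is an assumption made for illustration: the FrontalStore class, the orders table, and the replicate worker are hypothetical names, a dictionary stands in for the online NoSQL store, SQLite stands in for the relational system of record, and a standard-library queue with one worker thread stands in for real message-oriented middleware (it is not an actor framework and it is not durable).

```python
import queue
import sqlite3
import threading


class FrontalStore:
    """In-memory stand-in for an online document store (hypothetical)."""

    def __init__(self, outbox):
        self._docs = {}
        self._outbox = outbox

    def save(self, key, doc):
        self._docs[key] = dict(doc)         # the online write completes immediately
        self._outbox.put((key, dict(doc)))  # the change is queued for later replication

    def get(self, key):
        return self._docs.get(key)


def replicate(outbox, conn):
    """Worker: drain queued changes and upsert them into the system of record."""
    while True:
        item = outbox.get()
        if item is None:  # sentinel: stop the worker
            outbox.task_done()
            break
        key, doc = item
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR REPLACE INTO orders (id, customer, amount) VALUES (?, ?, ?)",
                (key, doc["customer"], doc["amount"]),
            )
        outbox.task_done()


def demo():
    outbox = queue.Queue()
    # check_same_thread=False lets the worker thread use this connection; the
    # main thread reads only while the worker is idle (before start, after join).
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer TEXT, amount REAL)")
    conn.commit()

    frontal = FrontalStore(outbox)
    worker = threading.Thread(target=replicate, args=(outbox, conn), daemon=True)

    frontal.save("o-1", {"customer": "acme", "amount": 120.0})
    frontal.save("o-1", {"customer": "acme", "amount": 150.0})

    # The worker has not run yet, so the two stores disagree for the moment.
    rows_before = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

    worker.start()
    outbox.join()  # block until every queued change has been applied
    rows_after = conn.execute("SELECT id, customer, amount FROM orders").fetchall()

    outbox.put(None)
    worker.join()
    return rows_before, rows_after
```

The window between save returning and the worker applying the change is the period during which the two stores are inconsistent. A production setup would need what this sketch lacks: a durable broker, retry and failure handling, and an application-level decision about which changes are worth propagating.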
By combining asynchronous messaging with the actor communication model [8], we can establish an architecture for achieving eventual consistency between the underlying database and the online domain-specific data stores. Figure 1 gives the overall architecture for such a multiparadigm database infrastructure. In this approach, asynchronous messaging replicates the necessary changes from the individual data stores to the main relational database. The application determines which changes must be propagated to the underlying store as the system of record.

For example, if you're using MongoDB as a data store for online processing, the business components will store all your transaction data in MongoDB collections. When the application logic updates a collection online, the messaging system will trigger downstream updates to schedule jobs on the queue. These jobs are processed asynchronously and keep the underlying relational data store consistent with the frontal online stores.

The overall application architecture benefits from asynchronous updates in a couple of ways. First, it scales well, because asynchronous processing doesn't block and can even be scheduled as offline threads of execution. Second, it guarantees delivery of updates to the underlying store within a specified time interval. The two stores might be inconsistent for a short time, but many applications can tolerate this delay to meet other scalability objectives. In other words, the consistency isn't instant, so it's called eventual consistency.

In a sense, the frontal data store is like a cache between the application and the database of record, that is, the SQL store. However, unlike a traditional cache, the frontal store can provide a persistence abstraction that best fits the application. This multilevel model is also useful when the application requires an SQL store of record for nontechnical business reasons.

NoSQL Stores: A User's Point of View

You must consider two important aspects of your application and infrastructure requirements before selecting a NoSQL storage engine.

First, every data store needs a user-friendly query interface. Unlike the relational world, where SQL provides the universal query language, the NoSQL world has no such unifying query language. Every data store provides query interfaces in multiple host languages: Java, Ruby, Python, and so on. The languages vary significantly across storage engines, and the query mechanisms differ across the various stores. Stores like MongoDB offer user-friendly query APIs as part of the client library implementation. In CouchDB, you must write views, using a map/reduce paradigm, to get data from the database [9]. CouchDB comes with a default view-engine implementation in JavaScript. However, its view architecture is decoupled from the core server, so you can write your own view server using your preferred language.
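For readers who haven't met the map/reduce style of querying, the following Python sketch emulates the idea in memory. It is not CouchDB's API or its view-server protocol: the sample documents, map_emails_by_city, reduce_sum, and run_view are hypothetical names chosen for illustration. The map function emits key/value rows for each document, the view keeps those rows sorted by key, and an optional reduce function aggregates the values that share a key.

```python
from collections import defaultdict

# Hypothetical documents of mixed shape, as a document store might hold them.
docs = [
    {"_id": "1", "type": "contact", "city": "Kolkata", "emails": ["a@example.com"]},
    {"_id": "2", "type": "contact", "city": "Pune"},
    {"_id": "3", "type": "contact", "city": "Kolkata",
     "emails": ["b@example.com", "c@example.com"]},
    {"_id": "4", "type": "route", "from": "Kolkata", "to": "Pune"},
]


def map_emails_by_city(doc):
    """Map step: for each contact, emit (city, number of email addresses)."""
    if doc.get("type") == "contact":
        yield doc["city"], len(doc.get("emails", []))


def reduce_sum(values):
    """Reduce step: combine all values emitted under one key."""
    return sum(values)


def run_view(documents, map_fn, reduce_fn=None):
    """Apply map_fn to every document; optionally group by key and reduce."""
    rows = sorted(
        (key, value) for doc in documents for key, value in map_fn(doc)
    )
    if reduce_fn is None:
        return rows
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}
```

Calling run_view(docs, map_emails_by_city) returns the sorted rows [("Kolkata", 1), ("Kolkata", 2), ("Pune", 0)], and passing reduce_sum as the third argument collapses them to {"Kolkata": 3, "Pune": 0}. The route document is skipped because the map function emits nothing for it, which is how a view filters a mixed collection without a schema.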
So, every data store has its own query model and language that might be optimal for its underlying engine. Yet the absence of a unified query model like SQL in the relational world could deter early adoption of NoSQL data stores.

The second important consideration when selecting a data store is scalability requirements. Each candidate data store has different strengths and weaknesses, which you will need to align with your application requirements. Some features for determining your selection are the throughput requirements of reads and writes for your application; whether your application needs to handle data distributed across nodes and serve query requests from users even when some nodes fail; whether your application needs offline data-processing capabilities; your application's availability requirements; and your application's data consistency requirements.

A complete analysis of all the stores with respect to distribution and scalability is beyond this article's scope. For more details, see the documentation for each product.

Figure 1. Architecture for a multiparadigm database infrastructure. The application uses multiple frontal data stores, depending on the way each component uses the data. Message-oriented middleware uses asynchronous replication to achieve eventual consistency with the underlying relational database management system. (Diagram labels: application Modules 1 to 4; Neo4J, graph-structured domain rules; Cassandra, columnar data access with decentralization; MongoDB, document structures; CouchDB, document structures with offline processing; asynchronous message passing to the relational database.)

Advantages

Architectural blueprints like the one in Figure 1 are becoming more common. In one project at Anshin Software, we use MongoDB for document storage in collaboration with Oracle as the underlying system of record. The implementation architecture benefited from this model, first, because it stores the data in a model that's closest to the data-access pattern. This minimizes the impedance mismatch between the data and the application layer. Furthermore, because application-layer data access is simpler and more direct, the application code base is much more expressive, concise, and maintainable. Using asynchronous messaging as the binding glue to manage back-end data consistency also gives us a horizontally scalable infrastructure. Finally, it distributes online data-processing load across multiple data stores. There's no single point of failure as there is when a single RDBMS handles all the loads.

Challenges

As with any architectural paradigm, there are a host of pitfalls to be aware of.

First, not all systems are suitable for enforcing an eventually consistent model. If your application requires all operations to be immediately available from the underlying relational database, this model isn't for you. A typical example of such a use case is handling a banking system's debit and credit transactions, which must be ACID consistent.

Second, the NoSQL systems discussed here are relatively immature compared to the well-known SQL systems. Many of them are still evolving rapidly. All of them are being used in production systems, but few have reached a version 1.0 level. You can expect these systems to make some incompatible changes in the near term.

Furthermore, if the application or data architect doesn't use the architecture pattern carefully, the result could turn into a cacophony. Finally, traditional architects, accustomed to using a single RDBMS for an application, will have to be convinced of the wisdom of a multiparadigm strategy.

When eventual consistency is sufficient for your application, asynchronous messaging provides a robust, scalable way to synchronize data between different data stores. It lets you decentralize data processing and store data in structures that are more closely aligned with the application logic. This leads to prospects for a much simpler programming model without the incidental complexities that additional glue frameworks bring to an application.

References

1. T. Neward, "The Vietnam of Computer Science," blog, June 2006; http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx.
2. F. Chang et al., "BigTable: A Distributed Storage System for Structured Data," Proc. 7th Symp. Operating System Design and Implementation (OSDI 06), Usenix Assoc., 2006; www.usenix.org/events/osdi06/tech/chang.html.
3. G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store," ACM SIGOPS Operating Systems Rev., vol. 41, no. 6, 2007, pp. 205-220; http://s3.amazonaws.com/allthingsdistributed/sosp/amazon-dynamo-sosp2007.pdf.
4. A. Lakshman, P. Malik, and K. Ranganathan, "Cassandra: A Structured Storage System on a P2P Network," slide presentation at ACM SIGMOD Int'l Conf. Management of Data (SIGMOD 08), 2008; www.slideshare.net/jhammerb/data-presentations-cassandra-sigmod.
5. J. Adams, "Billions of Hits: Scaling Twitter," slide presentation at the Chirp 2010 Official Twitter Developer Conf., 2010; www.slideshare.net/netik/billions-of-hits-scaling-twitter.
6. N. Kallen, "Big Data in Real-Time at Twitter," slide presentation, 2010; www.slideshare.net/nkallen/q-con-3770885.
7. I. Eure, "Looking to the Future with Cassandra," blog, 9 Sept. 2009; http://about.digg.com/blog/looking-future-cassandra.
8. C. Hewitt, P. Bishop, and R. Steiger, "A Universal Modular ACTOR Formalism for Artificial Intelligence," Proc. 3rd Int'l Joint Conf. Artificial Intelligence, Morgan Kaufmann, 1973, pp. 235-245.
9. S. Helmberger, "Introduction to CouchDB Views," 2 Apr. 2010; http://wiki.apache.org/couchdb/introduction_to_couchdb_views.

About the Author

Debasish Ghosh is the chief technology evangelist at Anshinsoft (www.anshinsoft.com), where he specializes in leading delivery of enterprise-scale solutions for clients ranging from small to Fortune 500 companies. His research interests are functional programming, domain-specific languages, and NoSQL databases. Debasish received his bachelor's degree in computer science and engineering from Jadavpur University, India. He's a senior member of the ACM and author of the book DSLs In Action, to be published this year by Manning. Read his programming blog at http://debasishg.blogspot.com and contact him at dghosh@acm.org.
