How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns




Table of Contents

Abstract
Introduction
Definition
The Expanding Digitization of Business
The Core of the Internet Enterprise
Requirements leading to radical change
Success Factors for the Internet Enterprise
Global Scaling
Customer-Driven Development
Micro Thinking
Rise of the Global Database
Roadmap Toward the Internet Enterprise
How DataStax Helps Power the Internet Enterprise
Conclusion
About DataStax

Abstract

Data, a key strategic asset, must be used more effectively than ever before if businesses are to compete in today's Internet economy. Modern enterprises must leverage data collected from operational (transactional) systems to achieve fast time-to-insight that results in better decisions and better customer service. This paper explores how DataStax Enterprise makes it easy for Internet Enterprises to run operational analytics on data stored in Cassandra, and to integrate that data with historical Hadoop data warehouses/lakes, so that online applications can lead to better business.

Introduction

No one questions that data is a key strategic asset that businesses must use effectively to compete in today's Internet economy. Modern enterprises must use data collected from operational (transactional) systems in ways that provide the fastest possible time-to-insight, so they can quickly make decisions that better serve their customers and benefit their business.

Examples of modern Web and mobile applications that need to turn collected data quickly into information, both to improve a customer's experience and to assist in making business decisions, include:

- Fraud detection systems that quickly detect identity theft and prevent loss to customers and to the business.
- Media and entertainment applications that track a customer's viewing and listening preferences and make on-target recommendations that increase the customer's enjoyment of the service and result in additional purchases for the business.
- Home utility and appliance sensor applications that continuously ingest and analyze usage information, resulting in lower energy costs and better use of the product for the customer.

These and other types of Internet economy systems depend upon a data management platform that is architected from the ground up to consume operational data and analyze it in a way that enables fast decision making, to the benefit of both the customer and the underlying business.

This paper explores how DataStax Enterprise supplies these analytic capabilities to today's Web and mobile applications, which extend around the globe and must always be available for customer use.

"The data gathered by NREL comes in different formats, at different rates, from a wide variety of sensors, meters, and control networks. DataStax aligns it within one scalable database." (Keith Searight, NREL)

The Evolution of Analytics

A survey of how analytics is performed today on data collected through operational systems reveals that some IT practices of the past remain intact, while new trends are emerging for Web and mobile applications.

Operational versus Data Warehouse Analytics

For many decades, a separation has existed between operational (online) databases and data warehouses, a separation characterized by the different types of workloads and applications each type of database serves. Operational or line-of-business (LOB) systems typically support transactions and queries that are short in duration, are both write and read intensive, and handle data in real time. By contrast, data warehouses are typified by long-running queries against very large data volumes that have been collected from multiple operational systems and are used for analysis and decision making.

Even though a data warehouse's primary purpose is to enable analysis of collected data, analytics does not reside only in the domain of the data warehouse. In fact, traditional RDBMSs such as Oracle and Microsoft SQL Server have all included analytic functions (e.g., windowing and partition by) that allow analysis to be run on operational data.

The evolution of today's business into an Internet economy has not altered this paradigm. However, because of scaling and data distribution needs, the types of databases and data platforms being used have definitely changed to support the needs of modern online applications. As a result, legacy operational and data warehouse engines such as Oracle and Teradata have begun to lose ground to NoSQL databases that handle distributed line-of-business applications and to Hadoop, which services data warehouses or data lakes.

Figure 1 Contrasting legacy and Internet Enterprise platforms for operational and data warehousing.

As with legacy RDBMS operational and data warehouse applications, modern online systems using NoSQL need to perform analytics on transactional data and also to integrate that data with data warehouses/data lakes that use Hadoop.

The Emergence of Transactional Analytics

Many of today's online applications have outgrown the traditional, basic ACID (atomic, consistent, isolated, durable) transaction of the relational era and have broadened it so that it can (1) be used across a widely distributed system; and (2) be more of an interaction, in which the transaction may include analysis that is real/near time and possibly even historical. Once completed, the transaction is then used to trigger other events and make decisions that affect literally the next transaction the user makes, or internal activities such as business intelligence decision-making processes.

Examples of applications that are increasingly becoming transactional-analytic include fraud detection systems that field incoming purchase requests and analyze many specifics of each request, such as purchase location, frequency, amount, and much more. Other application use cases tailor-made for transactional analytics are online recommendation engines that constantly consume and analyze user activity and then quickly turn around recommendations on other suggested items to purchase, additional news stories to read, and more.

Figure 2 Transactional-analytical processing application.
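
The fraud detection case above, where analysis of purchase location, frequency, and amount happens as part of handling the transaction itself, can be sketched in a few lines. This is only an illustration of the transactional-analytic pattern in plain Python, not DataStax Enterprise code: the `Purchase` and `FraudScreen` names, the three rules, and all thresholds are hypothetical, and a real system would keep the history in the operational database rather than in process memory.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    account: str
    amount: float
    location: str
    timestamp: float  # seconds since an arbitrary epoch


class FraudScreen:
    """Checks each purchase against the account's recent history (hypothetical rules)."""

    def __init__(self, window_seconds=3600.0, max_purchases=5, amount_factor=10.0):
        self.window_seconds = window_seconds
        self.max_purchases = max_purchases
        self.amount_factor = amount_factor
        self._recent = defaultdict(deque)  # account -> purchases inside the window

    def check(self, purchase):
        """Return the names of the rules this purchase trips, then record it."""
        history = self._recent[purchase.account]
        # Drop purchases that have aged out of the analysis window.
        while history and purchase.timestamp - history[0].timestamp > self.window_seconds:
            history.popleft()

        reasons = []
        if len(history) >= self.max_purchases:
            reasons.append("frequency")
        if history:
            average = sum(p.amount for p in history) / len(history)
            if purchase.amount > self.amount_factor * average:
                reasons.append("amount")
            if history[-1].location != purchase.location:
                reasons.append("location")

        # The transaction becomes part of the history used for the next decision.
        history.append(purchase)
        return reasons


screen = FraudScreen()
screen.check(Purchase("acct-1", 20.0, "Austin", 0.0))
screen.check(Purchase("acct-1", 25.0, "Austin", 600.0))
flags = screen.check(Purchase("acct-1", 900.0, "Lisbon", 900.0))
print(flags)  # ['amount', 'location']
```

Each call both analyzes the incoming request and updates the data that the next request will be judged against, which is the "transaction affects the next transaction" behavior described above.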

Analyst groups such as Gartner Group classify this broadening of legacy transactions as hybrid transaction/analytical processing, or HTAP. Additionally, Gartner states that the analytics required in many of these applications will be of varied tempos, meaning that the analysis will sometimes need to be carried out in real/near time, while other situations will best be handled by analytics that take longer to run.

Requirements for Running Analytics on Online Applications

Given the heightened priority of making fast and accurate decisions from data collected by online applications, what are the key requirements for supporting analytic functionality in a modern operational database? While each application is different, the following can serve as a general must-have checklist for today's operational databases:

- High-speed data consumption: the database should support fast-data use cases in which data flows rapidly into the system from user transactions, sensor inputs, and other similar feeds.
- Heterogeneous data type support: the system should support all types of data, including structured, semi-structured, and unstructured.
- Continuous availability: because analytics on operational data is not optional, the same uptime requirements used for OLTP operations apply to analytic workloads.
- Location independence: analytics on operational data must be capable of being run in any location that the underlying application serves.
- Performance at scale: the database should be able to run analytic operations that meet performance SLAs regardless of the underlying data volumes.
- Multi-workload support with isolation: analytic workloads performed on OLTP data should not impact OLTP operations. In other words, there should be a way to support both OLTP and analytic workloads with isolation between the two, so that no competition exists for either compute or data resources.
- Minimization of data movement: the need to ETL (extract-transform-load) data to separate databases for analysis should be minimal, because constant data movement costs time.
- Multi-analytic tempo support: the database should support multiple analytic tempos for applications that need more than one speed of analytics (e.g., both near/real time and long running/batch).
- Integration with data warehouses/lakes: easy back-and-forth integration with external data warehouses/lakes should be possible, going beyond simple ETL so that the data warehouse may access data directly in the operational data store and run analytic tasks remotely.

A New Approach: Analytics with DataStax Enterprise

Today's Internet Enterprises that use modern Web and mobile applications to engage and interact with their customers will find that DataStax Enterprise makes it easy to run analytics on their operational data. DataStax Enterprise is the leading distributed database for today's digital world of always-on, connected-everywhere applications.

At the core of DataStax Enterprise is Apache Cassandra, the #1 open source massively scalable NoSQL database, used by many Internet Enterprises today to power their online applications. Cassandra has an always-on, continuously available architecture that future-proofs the success of business applications by providing linear scale performance against ever-increasing data volumes. The masterless ring architecture and distributed nature of Cassandra allow a business to support its customers no matter where they are geographically located, and provide hybrid application support for systems that run partly in private data centers and partly on public cloud providers.

Figure 3 The distributed, masterless architecture of Cassandra makes distributing data anywhere in the world fast and easy.

DataStax Enterprise provides a production-ready version of Cassandra along with other important features that modernize traditional businesses into Internet Enterprises:

- Enterprise-class security that ensures data is safe and protected.
- Integrated analytics support on Cassandra data (more on this below).
- Integrated enterprise search capabilities on Cassandra data.
- Workload isolation and data replication that ensure OLTP, analytics, and search workloads do not compete with each other for data or compute resources.
- In-memory database option for both OLTP and analytic workloads.
- Automatic management services that transparently automate numerous database maintenance and performance monitoring tasks.
- Visual management and monitoring of all database clusters from any device (laptop, tablet, smart phone).
- Around-the-clock expert support.

Figure 4 DataStax Enterprise components.

When it comes to supporting analytic workloads on operational data, DataStax Enterprise provides three different options, any one or all of which may be used in a database cluster.

Real-Time Analytics

For applications needing real-time analytics support, DataStax Enterprise provides the ability to run fast analytic operations on Cassandra data either in an application-based manner (i.e., developed in an application with a language like Java) or via ad-hoc queries executed through bundled database utilities or BI tools such as Tableau.

When creating a new database cluster, an architect or administrator simply specifies that some or all nodes in the new cluster be analytics enabled. After that, analytics can be run on any incoming data housed on those nodes. A number of deployment scenarios may be used, such as combining OLTP and analytics on the same nodes or segregating OLTP and analytics on different nodes. The latter accomplishes workload isolation, so that OLTP and analytics workloads do not compete with each other for data or compute resources. Enabling this capability is DataStax Enterprise's built-in replication, which automatically replicates data from OLTP nodes to analytic nodes, where analytic operations may be carried out.

Figure 5 Deploying a new cluster with segregated OLTP and analytics nodes.

For real-time analytics, DataStax Enterprise uses Spark, which provides in-memory as well as disk-based support for running fast analytics across a distributed, shared-nothing architecture.
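
The segregated deployment described above, in which data written on OLTP nodes is replicated automatically to analytics nodes, is commonly expressed in Cassandra through a keyspace's replication settings, with each group of nodes treated as its own datacenter. The following is a minimal sketch under that assumption: `NetworkTopologyStrategy` is Cassandra's per-datacenter replication strategy, but the helper function, the keyspace name `retail`, and the datacenter names `oltp` and `analytics` are hypothetical and would need to match the names configured in an actual cluster.

```python
def keyspace_cql(name, replicas_per_datacenter):
    """Build a CREATE KEYSPACE statement with per-datacenter replica counts."""
    if not replicas_per_datacenter:
        raise ValueError("at least one datacenter is required")
    options = ["'class': 'NetworkTopologyStrategy'"]
    for datacenter, replicas in replicas_per_datacenter.items():
        options.append(f"'{datacenter}': {int(replicas)}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{{', '.join(options)}}};"
    )


# Hypothetical layout: three replicas on the OLTP nodes, two on the analytics nodes.
statement = keyspace_cql("retail", {"oltp": 3, "analytics": 2})
print(statement)
```

With a layout like this, writes accepted by the OLTP nodes are copied to the analytics nodes by the database itself, so analytic jobs read their own replicas instead of competing with transactional traffic.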

Analytic applications may be developed in languages such as Java, Scala, and Python, while ad-hoc queries are supported in three ways: (1) Spark SQL, which supports a subset of SQL-92 syntax and allows SQL-style queries to be run against Cassandra data; (2) Shark, a Hadoop Hive-compatible utility that allows Hive-style queries to be run against Cassandra data; and (3) BI tools such as Tableau, which are enabled through a free ODBC driver that connects directly to a DataStax Enterprise cluster.

Further, DataStax Enterprise also enables streaming analytics on high-velocity, in-flight data streams via support for Spark Streaming. This shortens the time between a transaction and its impact on analytical insight, which is especially important for use cases such as Internet of Things (IoT) applications.

A primary benefit of DataStax Enterprise real/near-time analytics is very fast response times, made possible by various technology enablers including in-memory processing. It should be noted that DataStax Enterprise's OLTP in-memory option may be used in conjunction with in-memory analytics. The combination delivers a full in-memory solution for transactional-analytic workloads and fast turnaround times for use cases such as recommendation engines, online retail re-pricing, fraud detection, and others.
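
To make the streaming analytics idea above more concrete, the sketch below computes per-sensor averages over fixed, non-overlapping ("tumbling") time windows, the kind of aggregation an IoT application might run on in-flight readings. It is plain Python over an in-memory list and does not use the Spark Streaming API; the function name, the tuple layout, and the sample readings are assumptions made for illustration. A streaming engine would apply the same grouping incrementally as small batches of data arrive.

```python
from collections import defaultdict


def tumbling_window_averages(readings, window_seconds):
    """Average readings per sensor within fixed, non-overlapping time windows.

    `readings` is an iterable of (timestamp_seconds, sensor_id, value) tuples.
    Returns a dict mapping (window_start, sensor_id) to the average value.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for timestamp, sensor_id, value in readings:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        bucket = totals[(window_start, sensor_id)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical meter readings: (timestamp, sensor, kilowatts).
sample = [(0, "m1", 2.0), (5, "m1", 4.0), (12, "m1", 10.0), (3, "m2", 1.0)]
averages = tumbling_window_averages(sample, window_seconds=10)
```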

Integrated Batch Analytics

For situations where analytics use cases on operational data are batch-oriented (or longer in duration), DataStax Enterprise provides built-in batch analytics capabilities that allow longer-running analytic tasks to be executed directly on Cassandra data. As with real/near-time analytics, nodes in a DataStax Enterprise cluster may be specifically marked out for such operations.

Figure 6 Specifying that a node in a cluster be devoted to batch analytics.

Analytic tasks may be run internally and directly on Cassandra data in a DataStax Enterprise cluster with MapReduce, Hive, Pig, and Mahout functions. Enabling both real/near-time and batch analytics in a cluster provides full support for the multiple analytic tempos required by many of today's online applications.

The standard use case for integrated batch analytics in DataStax Enterprise involves a need to perform longer-running analytic tasks on Cassandra data that may include numerous computations and be programmatic in nature (e.g., a health-care company that analyzes patient procedures for billing).

It is important to note that the integrated batch analytics feature should not be used as a replacement for a Hadoop data warehouse/lake, and is not meant to handle the very large data warehouse workloads that are better served by standalone Hadoop implementations. Instead, integration between DataStax Enterprise and such deployments is made available for linking hot and cold/historical data together.

External Batch Analytics and Integration with Data Warehouses

Because there are situations where operational and historical data must be combined for decision-making purposes, DataStax Enterprise supports integration with Hadoop data warehouses/lakes such as those offered by Cloudera and Hortonworks. The integration allows three things:

1. Components from an external Hadoop vendor (e.g., Hive, Pig) can be installed directly on nodes in a DataStax Enterprise cluster and execute directly on Cassandra data.
2. Cassandra tables may be linked with external Hadoop objects (e.g., a Hive table) and queried/joined together.
3. Results from analytic tasks may be sent back to a Hadoop data warehouse.

To enable integration, Hadoop task trackers and other desired components are installed and configured on specified nodes in a DataStax Enterprise cluster. Once these are running, analytic tasks can be run against Cassandra data, optionally linking Cassandra and external Hadoop objects together, with output results being sent back to a Hadoop deployment.

Figure 7 Integration with external Hadoop data warehouses is easily handled with DataStax Enterprise.
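
Linking hot operational data with cold/historical data, as described in this section, amounts to a join between two sources. The sketch below shows only that join logic in plain Python, as a stand-in for what would in practice be a query joining a Cassandra table with an external Hadoop object such as a Hive table. The function name, the record shapes, and the sample values are hypothetical.

```python
def compare_with_history(recent_orders, historical_weekly_average):
    """Join this week's spend per customer (hot) with a historical average (cold).

    recent_orders: iterable of (customer_id, amount) from the operational store.
    historical_weekly_average: dict of customer_id -> average weekly spend,
        as computed in the data warehouse.
    Returns {customer_id: (spend_this_week, historical_average_or_None)}.
    """
    spend = {}
    for customer_id, amount in recent_orders:
        spend[customer_id] = spend.get(customer_id, 0.0) + amount
    # Customers with no warehouse history are kept, with None as their average.
    return {
        customer_id: (total, historical_weekly_average.get(customer_id))
        for customer_id, total in spend.items()
    }


# Hypothetical inputs: recent orders (hot) and warehouse averages (cold).
hot = [("c1", 10.0), ("c2", 5.0), ("c1", 2.5)]
cold = {"c1": 8.0}
combined = compare_with_history(hot, cold)
```

Keeping unmatched customers in the result mirrors a left outer join, so new customers without any warehouse history still appear in the combined view.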

Evaluating DataStax Enterprise for Modern Analytics

The following table describes how DataStax Enterprise meets the analytic requirements of today's online applications.

REQUIREMENT: COMMENTS
High-speed Data Consumption: One of Cassandra's hallmarks is being the fastest write engine of any database, RDBMS or NoSQL
Modern Data Type Support: Supports all data types
Continuous Availability: Has no single point of failure and provides capabilities for no downtime
Location Independence: Best multi-datacenter and cloud support of any database, allowing data to be read, written, and analyzed anywhere
Performance at Scale: Only database to provide true linear scale performance; nodes are added online to increase performance
Minimization of Data Movement: Built-in replication removes the need to move data to different systems for real-time analysis and search
Integration with Data Warehouses: Easily integrates with external Hadoop data warehouses

Conclusion

DataStax Enterprise makes it easy for Internet Enterprises to run operational analytics on data stored in Cassandra, as well as to integrate that data with historical Hadoop data warehouses/lakes, so that online applications can better serve both the needs of the target customer and the internal decision-making requirements of the business.

For downloads of DataStax Enterprise, online documentation, tutorials, client drivers, getting started materials, and more, visit www.datastax.com.

About DataStax

DataStax, the leading distributed database management system, delivers Apache Cassandra to the world's most innovative enterprises. DataStax is built to be agile, always-on, and predictably scalable to any size. DataStax has more than 500 customers in 45 countries, including leaders such as Netflix, Rackspace, and Pearson Education, and spans verticals including web, financial services, telecommunications, logistics, and government.
Based in Santa Clara, Calif., DataStax is backed by industry-leading investors including Lightspeed Venture Partners, Meritech Capital, and Crosslink Capital. For more information, visit DataStax.com or follow us @DataStax and @DataStaxEU.