Composite Data Virtualization
Data Virtualization Platform Maturity Model
Composite Software
September 2010
TABLE OF CONTENTS
INTRODUCTION
EVOLVING NEEDS, EVOLVING SOLUTIONS
HOW TO MEASURE DATA VIRTUALIZATION PLATFORM MATURITY
MATURITY DIMENSION
FUNCTIONALITY DIMENSION
COMBINING DIMENSIONS
QUERY PROCESSING
CACHING
DATA ACCESS (FROM SOURCES)
TRANSFORMATION (INCLUDES DATA QUALITY)
DATA DELIVERY (TO CONSUMERS)
SECURITY
MODELING AND METADATA MANAGEMENT
ENTERPRISE-SCALE OPERATION
CONCLUSION
INTRODUCTION

At an ever-accelerating pace, enterprises and government agencies are discovering innovative ways to leverage information to meet progressively more challenging financial and service-level objectives. Yet, to fulfill this explosive demand, data professionals face increasingly difficult data integration challenges, including:
- Constant business change necessitating immediate and evolving IT response;
- Growing data volumes and complexity that increase business risk and reduce agility; and
- Operational and financial constraints necessitating easy-to-adopt, cost-effective IT solutions that leverage prior investments.

Traditional approaches such as data consolidation and replication alone have not kept pace. As a result, data virtualization, an integration method that leverages modern virtualization principles, has evolved to complement these earlier investments and fill the business and IT gap.

In an environment of ever-evolving needs and solutions, enterprises and government agencies must select the right data virtualization offering to meet their needs. To provide a systematic assessment approach, Composite Software has developed a Data Virtualization Platform Maturity Model. This Data Virtualization Leadership Series white paper describes the model and how it can be used for both initial evaluation and ongoing optimization of data virtualization platforms.
EVOLVING NEEDS, EVOLVING SOLUTIONS

Originally deployed to meet light data federation requirements in BI environments, today's data virtualization use cases span a range of consuming applications, including customer experience management, risk management and compliance, supply chain management, mergers and acquisitions support, and more. Further, the range of data supported has grown beyond relational to include semi-structured XML, dimensional MDX, and the new NoSQL data types. Along the way, adoption has evolved from initial project-level deployments to enterprise-scale data virtualization layers that share data from multiple sources across multiple applications and uses.

At the same time, the data virtualization offerings themselves have evolved. From a vendor point of view, many of the early Enterprise Information Integration (EII) companies that entered the market in the early 2000s have been acquired or have exited the market, leaving a short list of suppliers able to meet today's more advanced data virtualization requirements. To fill this gap between supply and demand, new entrants from adjacent markets such as BI and Extract-Transform-Load (ETL) have recently announced data virtualization products that leverage those vendors' existing offerings. Finally, within this vendor landscape, the functionality of the offerings has also evolved dramatically, across a range of functional categories with various levels of capability from entry-level to mature.
HOW TO MEASURE DATA VIRTUALIZATION PLATFORM MATURITY

In an environment of ever-evolving needs and solutions, enterprises and government agencies find determining the right data virtualization offering to meet their needs a significant challenge. To provide a systematic assessment approach, we developed a Data Virtualization Platform Maturity Model.

This model has two critical dimensions. The first uses a five-stage maturity timeline to provide a common framework for measuring the phases typical of software innovation. The second looks at key functionality categories that, when successfully combined, create viable data virtualization platforms.

Once a data virtualization offering has been selected, the comprehensive detail within the Data Virtualization Platform Maturity Model continues to provide value to IT strategists and enterprise architects during ongoing deployment. It can be applied:
- When developing a data virtualization capabilities adoption roadmap;
- When aligning staff development, release deployment, and related plans with the adoption roadmap; and
- When measuring the viability of the selected data virtualization offering over time.
MATURITY DIMENSION

The first dimension in the Data Virtualization Maturity Model measures the five stages of product maturity as follows:
- Entry Level: A first product release that implements a minimal set of functionality to credibly enter the market.
- Limited: Follow-on product release(s) aimed at satisfying initial customer demands within narrow (often vertical-market) use cases.
- Intermediate: Product releases in which functionality expands rapidly based on traction in the marketplace. Feature rich, these releases address a growing market and an expanding set of use cases.
- Advanced: Product releases addressing more complex use cases as well as supporting large-scale, enterprise-wide infrastructure requirements.
- Mature: Product releases that increase functional depth and expand market penetration, often incorporating functionality from adjacent areas.

The relation between data virtualization platform maturity and time can be seen in Figure 1.

Figure 1. Product Maturity over Time (product maturity, from Entry Level through Mature, plotted against time in years)
FUNCTIONALITY DIMENSION

The second dimension in the Data Virtualization Maturity Model is functionality. Derived from Composite Software's millions of hours of operational deployment at Global 2000 enterprises and government agencies, and from hundreds of man-years of R&D, the following eight functional categories combine to form a viable enterprise-level data virtualization platform:
- Query Processing
- Caching
- Data Access (from Sources)
- Transformation (includes Data Quality)
- Data Delivery (to Consumers)
- Security
- Modeling and Metadata Management
- Enterprise-scale Operation
COMBINING DIMENSIONS

By overlaying the five stages of the maturity dimension across the eight categories of the functionality dimension, enterprises and government agencies can use the Data Virtualization Platform Maturity Model to gain a comprehensive understanding of key capabilities by stage.

Query Processing

At its core, data virtualization's primary purpose is on-demand query of widely dispersed enterprise data. Consequently, data virtualization platforms must ensure these queries are efficient and responsive. If the high-performance query processing engine is immature or poorly architected, the rest of the functionality is of little consequence. Maturity is typically measured by the breadth and efficiency of optimization algorithms (a simplified predicate-pushdown sketch follows the capability list below).

Representative capabilities across the five maturity stages (Entry Level through Mature) include:
- Process mainstream queries correctly
- Limited join algorithms
- Projection pruning
- Full implementation of relational algebra semantics
- Multi-threaded and parallel query processing
- Complete set of relational join algorithms
- Automatic rule-based optimizations
- Push predicates down to underlying data sources
- Data-source-specific optimizations
- Dynamic memory management for large data set support
- Single-source updates
- Complete standard SQL support
- Complete federated query plan support, including plan caching
- Support for multiple data shapes (i.e., scalar, tabular, hierarchical), including limited XML support
- Advanced join techniques (e.g., distributed semi-join)
- User-provided query hints
- Limited cost-based optimizations
- Transformations between data shapes
- Complete standard XML support, including XSLT and XQuery manipulation
- Advanced cost-based optimizations based on statistics (including query plan rearrangement)
- Exotic use-case-specific optimizations (e.g., automatic UNION-JOIN inversion)
- Multi-source updates, including support for transactions
- Scripting environment for procedural logic
- Platform-specific query pass-through
- Star- and snowflake-schema optimizations
- Adaptive optimizations driven by learned patterns
- User-defined functions
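To make the predicate-pushdown capability above concrete, the following is a minimal, hand-rolled sketch in Java: single-source predicates are folded into the SQL sent to each underlying database so that filtering (and projection pruning) happens at the source, and only the reduced result sets are joined in the virtualization layer. The source tables, columns, and predicates are hypothetical, and the sketch is not Composite's optimizer; it only illustrates the rule.

```java
// Illustrative only: a hand-rolled sketch of rule-based predicate pushdown.
// Source names, columns, and predicates are hypothetical, not a product API.
import java.util.*;

public class PushdownSketch {

    /** A single-source predicate, e.g. "region = 'EMEA'", owned by one source table. */
    record Predicate(String sourceTable, String sqlCondition) {}

    /**
     * Builds the per-source SQL the virtualization layer would send to each underlying
     * database. Predicates are "pushed down" into the source that owns the referenced
     * table, so each source filters its own rows before returning them.
     */
    static Map<String, String> buildSourceQueries(Map<String, List<String>> sourceColumns,
                                                  List<Predicate> predicates) {
        Map<String, String> sql = new LinkedHashMap<>();
        for (var entry : sourceColumns.entrySet()) {
            String table = entry.getKey();
            StringBuilder q = new StringBuilder("SELECT ")
                    .append(String.join(", ", entry.getValue()))   // projection pruning
                    .append(" FROM ").append(table);
            List<String> pushed = predicates.stream()
                    .filter(p -> p.sourceTable().equals(table))
                    .map(Predicate::sqlCondition)
                    .toList();
            if (!pushed.isEmpty()) {
                q.append(" WHERE ").append(String.join(" AND ", pushed)); // predicate pushdown
            }
            sql.put(table, q.toString());
        }
        return sql;
    }

    public static void main(String[] args) {
        // Hypothetical federated view: customers in a CRM joined to invoices in billing.
        Map<String, List<String>> sources = Map.of(
                "crm.customers", List.of("customer_id", "name", "region"),
                "billing.invoices", List.of("customer_id", "amount", "status"));
        List<Predicate> predicates = List.of(
                new Predicate("crm.customers", "region = 'EMEA'"),
                new Predicate("billing.invoices", "status = 'OPEN'"));

        // Each source does its own filtering; only the reduced result sets are shipped
        // back and joined in the data virtualization layer.
        buildSourceQueries(sources, predicates).forEach((t, q) -> System.out.println(t + " -> " + q));
    }
}
```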
Caching

Traditional data integration solutions periodically consolidate data in physical stores. In contrast, data virtualization platforms dynamically combine data in-memory, on demand. Caching addresses the middle ground between these two approaches by enabling optional pre-materialization of query result sets. This flexibility can improve query performance, work around unavailable sources, reduce source system loads, and more. Maturity is measured by the breadth of caching options across factors such as triggering, storage, distribution, and update (a minimal refresh-policy sketch follows the capability list below).

Representative capabilities across the five maturity stages (Entry Level through Mature) include:
- Materialization of tabular data sets
- Local storage
- Basic cache refresh policies (i.e., periodic)
- Consider caches in optimization decisions
- Multiple cache refresh policies (periodic, aging, external events)
- Relational database cache storage, including DDL support
- Procedure result caching, including web service result caching
- Incremental cache refresh (leveraging change data capture)
- Cluster-shared caches
- Multi-cluster edge caching
- Adaptive dynamic caching based on learned patterns
- Distributed cache storage
- In-memory cache storage
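As a sketch of the simplest refresh policy above (periodic refresh of a materialized result set), the small class below caches a virtual view's rows in memory and re-runs the federated query only when a time-to-live expires. The generic row type, the query supplier, and the in-memory store are hypothetical stand-ins rather than a product caching API.

```java
// Illustrative only: a minimal periodic-refresh cache for a virtual view's result set.
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Supplier;

public class ViewResultCache<Row> {
    private final Supplier<List<Row>> federatedQuery; // runs the real federated query
    private final Duration ttl;                        // periodic refresh interval
    private volatile List<Row> materialized;           // cached (pre-materialized) rows
    private volatile Instant loadedAt = Instant.MIN;

    public ViewResultCache(Supplier<List<Row>> federatedQuery, Duration ttl) {
        this.federatedQuery = federatedQuery;
        this.ttl = ttl;
    }

    /** Serves rows from the cache while fresh; otherwise re-materializes from the sources. */
    public synchronized List<Row> rows() {
        if (materialized == null || Instant.now().isAfter(loadedAt.plus(ttl))) {
            materialized = federatedQuery.get();  // hit the underlying sources
            loadedAt = Instant.now();
        }
        return materialized;                      // otherwise, avoid source-system load
    }
}
```

A consumer might, for example, wrap the federated query behind a customer view in a ViewResultCache with a fifteen-minute TTL so that repeated dashboard refreshes hit the cache rather than the source systems.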
Data Access (from Sources)

There are a wide variety of structured and semi-structured data sources in a typical large enterprise. Data virtualization platforms must reach and extract data efficiently from all of them. Further, they must include methods to programmatically extend data source access to handle unique, non-standard data sources. Maturity is measured by the breadth of data source formats and protocols supported.

Representative capabilities across the five maturity stages (Entry Level through Mature) include:
- Limited set of relational databases
- Tabular files
- Expanded set of relational databases
- Basic web services over HTTP
- Excel spreadsheets
- XML files
- Data-source-specific query pass-through
- Packaged application data access (e.g., SAP, Siebel)
- Data warehouse support
- LDAP data source support
- Stored procedure support in relational databases
- Message-based web service access (e.g., JMS)
- Multi-dimensional data source access (including MDX code generation)
- NoSQL data source integration (e.g., Hadoop/HBase)
- Cloud-based data source integration
- Industry-specific data structure support (e.g., geospatial, molecular)
- Pass-through authentication to underlying data sources
- Complete web service support, including REST
- Legacy mainframe support (e.g., VSAM)
- Custom-developed data source drivers leveraging native APIs

Transformation (includes Data Quality)

Because source data is rarely a 100 percent match with data consumer needs, data virtualization platforms must transform and improve data, typically abstracting disparate source data into standardized canonical models for easier sharing by multiple consumers. Maturity is measured by the ease of use, breadth, flexibility, and extensibility of transformation functions (a minimal canonical-model sketch follows the capability list below).

Representative capabilities across the five maturity stages (Entry Level through Mature) include:
- Basic SQL functions
- Aliasing
- All standard SQL functions
- Derived values
- Value standardization*
- Value enrichment*
- Support for multiple data shapes (scalar, tabular, and hierarchical)
- Complete tabular and hierarchical data type representation
- Transformations between data shapes
- Scripting environment for procedural logic
- SQL/XML 2007 support
- Third-party validation*
- Third-party enrichment*
- User-defined functions
- Universal data type conversion

* Denotes functionality traditionally associated with Data Quality.
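The sketch below illustrates the canonical-model idea from the Transformation discussion above: two differently shaped source records are mapped into a single canonical customer record, with a derived display name and country values standardized to ISO codes. The field names and mapping rules are invented for illustration; a data virtualization platform would typically express the same logic declaratively in views rather than in application code.

```java
// Illustrative only: abstracting two differently shaped source records into one
// canonical customer model, with simple value standardization and a derived value.
import java.util.Map;

public class CanonicalMappingSketch {

    /** Canonical model shared by all consumers, regardless of which source a row came from. */
    record CanonicalCustomer(String id, String displayName, String countryIso) {}

    // Value standardization: map inconsistent source spellings to one standard code.
    private static final Map<String, String> COUNTRY_ISO = Map.of(
            "United States", "US", "USA", "US", "Deutschland", "DE", "Germany", "DE");

    /** Source shape #1: a CRM record with separate name fields and a spelled-out country. */
    static CanonicalCustomer fromCrm(String crmId, String first, String last, String country) {
        return new CanonicalCustomer("CRM-" + crmId,
                (first + " " + last).trim(),                     // derived value
                COUNTRY_ISO.getOrDefault(country, "UNKNOWN"));   // standardization
    }

    /** Source shape #2: a billing record with a single name field and a free-text country. */
    static CanonicalCustomer fromBilling(String acctNo, String fullName, String countryText) {
        return new CanonicalCustomer("BILL-" + acctNo,
                fullName.trim(),
                COUNTRY_ISO.getOrDefault(countryText, "UNKNOWN"));
    }

    public static void main(String[] args) {
        System.out.println(fromCrm("1001", "Ada", "Lovelace", "United States"));
        System.out.println(fromBilling("A-77", "Grace Hopper", "Deutschland"));
    }
}
```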
Data Delivery (to Consumers)

Enterprise end users consume data using a wide variety of applications, visualization tools, and analytics. Data virtualization platforms must deliver data to these consumers using the standards-based data access mechanisms they require. Further, they must enable delivery of common data to different consumers via different methods, for example as an XML document via SOAP and as a relational view via ODBC. Maturity is measured by the breadth of data consumer formats and protocols supported (a minimal JDBC consumer sketch follows the capability list below).

Representative capabilities across the five maturity stages (Entry Level through Mature) include:
- Basic ODBC or JDBC connectivity
- Full ODBC and JDBC standard support
- Full web services support, including REST
- Contract-first web service implementation
- Message-based data delivery
- Lightweight solutions, embedded in the client
- Basic web services support
- ADO.NET support
- Prepared statement support
- Analytical functions
- Scheduled queries with e-mailed results
- Result set pagination
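As a data delivery example, the sketch below consumes a published virtual view over plain JDBC, the entry-level delivery mechanism in the list above. The JDBC URL, credentials, and view name (customer_360) are hypothetical placeholders; the point is that the consuming application uses nothing beyond the standard java.sql API, exactly as it would against an ordinary relational database.

```java
// Illustrative only: a minimal JDBC consumer of a published virtual view.
// URL, credentials, and view name are hypothetical placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class VirtualViewJdbcClient {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:dvserver://dv-host:9401/shared_views";   // hypothetical URL
        try (Connection conn = DriverManager.getConnection(url, "report_user", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     // The consumer sees one logical view; the federation behind it is invisible.
                     "SELECT customer_id, display_name, open_invoice_total " +
                     "FROM customer_360 WHERE country_iso = ?")) {
            stmt.setString(1, "US");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s  %s  %.2f%n",
                            rs.getString("customer_id"),
                            rs.getString("display_name"),
                            rs.getDouble("open_invoice_total"));
                }
            }
        }
    }
}
```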
Security

Data virtualization platforms must secure the data that passes through them. Deploying data virtualization should not force reinvention of existing, well-developed security policies; it should leverage the standards and security frameworks already implemented in the enterprise. Maturity is measured by the breadth of authentication, authorization, and encryption standards supported, as well as by a high degree of transparency.

Representative capabilities across the five maturity stages (Entry Level through Mature) include:
- Built-in user authentication
- Basic access privileges
- Support groups and/or roles
- Support standard CRUD privileges for groups and individuals
- Leverage external LDAP authentication systems (e.g., Active Directory)
- Support GRANT privilege model
- Pass-through user credentials to underlying data sources
- Support web service security standards (e.g., SSL, WS-Security)
- Token-based authentication, including SSO, Kerberos, and NTLM
- Data encryption in wire protocols
- Policy-based authentication
- Column-level security
- Row-level security (i.e., redaction and masking)

Modeling and Metadata Management

Modeling and development productivity, with its concomitant faster time to solution, is one of data virtualization's biggest benefits. To ensure data modeler and developer adoption, the tools must be intuitive to use and standards-based. Further, they must automate key work steps, including data discovery, code generation, in-line testing, and more, and they must provide tight links to source control systems, metadata repositories, and related infrastructure. Maturity is measured by the degree to which the data virtualization platform makes easy things easy and hard things possible.

Representative capabilities across the five maturity stages (Entry Level through Mature) include:
- Basic drag-and-drop query editor
- Interactive testing and debugging
- Model import and export
- Graphical query editor with support for all major SQL constructs
- Textual editors for hierarchical (XML) data transformations
- Graphical tools to examine data lineage
- Interactive data source metadata introspection
- Metadata export/import
- Metadata migration utilities
- Metadata search and query
- Third-party metadata repositories
- Query plan visualizer with live monitoring
- Hierarchical (XML) to tabular graphical transformation editor
- Tabular to hierarchical (XML) graphical transformation editor
- Graphical editor to combine data of multiple shapes
- Integration with source code control systems
- Rule-based triggers
- Multi-user resource management
- Metadata management API
- Graphical transformation editor for multi-dimensional data
- Graphical editors for complex data types (e.g., XML schemas)
- Data profiling tools with multi-source relationship discovery (i.e., inter-silo schema discovery)
- Integration with adjacent data manipulation tools
- Any-to-any graphical transformation editor
- Scripting debugger

Enterprise-scale Operation

Because data virtualization serves critical business needs 24x7x365, operational support is a core requirement in enterprise data virtualization deployments. Data virtualization platforms must be highly deployable, reliable, available, scalable, manageable, and maintainable. Maturity is measured by the breadth and depth of operational support capabilities (a minimal running-query monitoring sketch follows the capability list below).

Representative capabilities across the five maturity stages (Entry Level through Mature) include:
- Basic logging
- Basic management consoles
- Support for all major enterprise operating system platforms (Windows, Solaris, Linux, AIX, HP-UX)
- Logging of all major activity
- Management consoles for all major functionality
- Unicode and internationalization
- Clustering for horizontal scaling, failover, and disaster recovery
- Automatic and dynamic synchronization of metadata across the cluster
- Complete management consoles
- Integration with SNMP monitoring systems
- Support for 64-bit processor architectures
- Cluster-wide visibility and monitoring consoles
- Cluster-shared caching
- Real-time monitoring of running queries
- Integration with third-party NOC tools and infrastructure (including DAM solutions)
- Geographically distributed cooperating clusters
- Integration with adjacent data management infrastructures (e.g., data governance solutions)
- Management and administration API
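To illustrate one of the operational capabilities above, real-time monitoring of running queries, the sketch below keeps a thread-safe registry of in-flight queries that a management console could poll. The class and its methods are hypothetical and stand in for whatever hooks a given platform's administration API actually exposes.

```java
// Illustrative only: a minimal registry for real-time monitoring of running queries.
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class RunningQueryRegistry {

    public record RunningQuery(long id, String user, String sql, Instant startedAt) {
        Duration elapsed() { return Duration.between(startedAt, Instant.now()); }
    }

    private final ConcurrentHashMap<Long, RunningQuery> active = new ConcurrentHashMap<>();

    /** Called when the engine begins executing a federated query. */
    public void started(long id, String user, String sql) {
        active.put(id, new RunningQuery(id, user, sql, Instant.now()));
    }

    /** Called when the query completes, fails, or is cancelled. */
    public void finished(long id) {
        active.remove(id);
    }

    /** What an operations console would poll: the longest-running queries first. */
    public List<RunningQuery> snapshot() {
        return active.values().stream()
                .sorted((a, b) -> b.elapsed().compareTo(a.elapsed()))
                .toList();
    }
}
```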
CONCLUSION

Data virtualization platform functionality has evolved to meet changing IT demands. The Data Virtualization Platform Maturity Model described in this paper provides a detailed, systematic approach that supports the initial evaluation of a data virtualization platform. The model may also be applied during the development of a data virtualization adoption roadmap, during the alignment work of executing that roadmap, and to measure the viability of the selected data virtualization offering over time.
ABOUT COMPOSITE SOFTWARE

Composite Software, Inc. is the only company that focuses solely on data virtualization. Global organizations faced with disparate, complex data environments, including ten of the top 20 banks, six of the top ten pharmaceutical companies, four of the top five energy firms, major media and technology organizations, and government agencies, have chosen Composite's proven data virtualization platform to fulfill critical information needs faster and with fewer resources. Scaling from project to enterprise, Composite's middleware enables data federation, data warehouse extension, enterprise data sharing, and real-time and cloud computing data integration. Founded in 2002, Composite Software is a privately held, venture-funded corporation based in Silicon Valley. For more information, please visit www.compositesw.com.