A Next-Generation Analytics Ecosystem for Big Data

Colin White, BI Research
September 2012
Sponsored by ParAccel
BIG DATA IS BIG NEWS

Big data is garnering a significant amount of industry attention, but unfortunately much of this attention is focused erroneously on handling large volumes of multi-structured data. This leads to a skewed and narrow perspective about the value of big data to organizations. Big data involves not just multi-structured data, but any type of electronic data that exists in, or can be acquired by, an organization. Also, big data is not so much about the volume or type of data, but more about how you extract business value from that data: it is about the business analytics that can be generated from big data to help improve business efficiency and competitiveness.

Confusion about big data and its business value also exists because big data involves a set of overlapping data management and analytic technologies, including relational DBMSs and non-relational systems such as Hadoop. When combined, these technologies represent a new generation of innovation, and the business value of big data comes from extending the existing analytics ecosystem to incorporate these innovative data management and analytic advances.

Where Are We Today?

Big data represents a continuum of high-performance technologies that have evolved over the past five decades to support business transaction and business analytics workloads that stretch the limits of hardware and software capabilities. The first business transaction processing systems, introduced in the early 1960s, were custom built and optimized to handle the workloads involved; early airline reservation systems are an example here.
In turn, the introduction and use of point-of-sale terminals, automated teller machines, mobile phones and the Internet have also led to workloads that often require optimized hardware and software systems to support performance needs.

The picture is similar with business analytics and data warehousing, where there have also been dramatic increases in workload requirements. The first multi-terabyte data warehouse, for example, was deployed in the early 1990s, but today this size of warehouse is common, and many companies now have data warehouses that handle multiple petabytes of data. Again, optimized systems are often required to handle the workloads involved. This is especially the case when the processing involves complex analytical models and algorithms.

Some organizations have business analytics workloads that do not require optimized systems, and in this situation a single platform may be adequate for supporting most workloads. However, as the need grows to handle new types of data and support more sophisticated analyses, even these organizations are faced with either cutting back on the amount of data that can be managed and the types of analyses that can be run, or extending their business analytics ecosystems to add optimized hardware and software platforms for handling high-performance workloads.

Copyright 2012 BI Research, All Rights Reserved.
WHAT THEN IS BIG DATA?

Big data represents data management and analytic solutions that could not previously be supported because of technology performance limitations, the high costs involved, or limited information. Big data solutions allow organizations to build optimized systems that improve performance, reduce costs, and allow new types of data to be captured for analysis. Big data involves two important new data management technologies:

- Analytic relational systems that are optimized for supporting complex analytic processing against both structured and multi-structured data. These systems are evolving to support not only relational data, but also other types of data structures. They may be offered as software-only solutions or as custom hardware/software appliances.

- Non-relational systems (such as the Hadoop distributed computing environment) that are particularly well suited to processing large amounts of multi-structured data. There are many different types of non-relational systems, including distributed file systems, document management systems, and database and analytic systems for handling complex data such as graph data.

When combined, these big data technologies can support the management and analysis of the many types of electronic data that exist in organizations, regardless of volume, variety, or volatility. They are used in conjunction with four important advances in business analytics:

- New and improved analytic techniques and algorithms that increase the sophistication of existing analytic models and results and allow the creation of new types of analytic applications.

- Enhanced data visualization techniques that make large volumes of data easier to explore and understand.
- Analytics-driven business processes that improve the speed of decision making and enable close to real-time business agility. These processes involve new rules-based applications that use a combination of both transaction and analytic processing.

- Stream processing systems that filter and analyze data in motion (sensor data, for example) as it flows through IT systems and across IT networks.

These four advances enable users to make more accurate and faster decisions, and to answer questions that were previously not possible to address for cost reasons or because of technology limitations. They also enable data scientists to investigate big data to look for new data patterns and identify new business opportunities. This investigation work is usually done using a separate investigative computing platform, or sandbox. The results from investigative computing may lead to new and improved analytic models and analyses, or to new built-for-purpose analytic applications and systems.
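The stream-processing idea above can be sketched in a few lines of Python. This is a minimal illustration only: the sensor feed, the field layout and the alert threshold are invented for the example, and real stream processors add windowing, state management and fault tolerance on top of the same filter-as-data-flows-by pattern.

```python
# A minimal sketch of stream processing: readings are filtered and
# flagged "in motion", as they flow past, rather than being stored
# first and analyzed later. The sensor feed and the 95-degree alert
# threshold are hypothetical illustrations.

def sensor_feed():
    """Simulated stream of (sensor_id, temperature) readings."""
    for reading in [("s1", 21.5), ("s2", 98.4), ("s1", 22.0), ("s3", 105.2)]:
        yield reading

def alerts(stream, threshold=95.0):
    """Pass through only the readings that exceed the threshold."""
    for sensor_id, value in stream:
        if value > threshold:
            yield sensor_id, value

print(list(alerts(sensor_feed())))  # [('s2', 98.4), ('s3', 105.2)]
```

Because both stages are generators, nothing is materialized: each reading is examined once as it arrives, which is what allows such systems to keep up with data that is too fast or too voluminous to land in a data store first.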
Figure 1. Analytics ecosystem for big data

We can see then that big data involves a number of different components that work together to provide a richer and more powerful analytics ecosystem. Such an ecosystem is illustrated in Figure 1.

BARRIERS TO SUCCESS

Given the complexity and number of components involved in big data, there are several barriers that have to be overcome in order to be successful in deploying and gaining business value from new data management and business analytics technologies:

- Educating IT and the business about the use cases and business benefits of big data. The use of big data is expanding rapidly and there are an increasing number of use cases that cover a wide range of applications in various industries. At present, most big data solutions are geared to addressing specific line-of-business (LOB) needs, rather than being deployed enterprise-wide.

- Understanding and selecting the components that are required to build and support a big data ecosystem. The objective of this paper is to help here by reviewing this ecosystem, identifying its key components, and explaining how these components are likely to evolve over time. Although an organization should develop an overall big data plan and ecosystem, the components that make up the ecosystem will be implemented in a phased manner to support the various workloads associated with the use cases that apply to that organization.

- The amount of data integration and level of data movement required in a big data environment. As illustrated in Figure 1, the analytics ecosystem for big data may involve several different software and hardware systems.
The availability of open data and metadata interfaces, well-designed data integration tools, and high-speed data connectors between these distributed systems is a critical success factor in implementing big data solutions. Data management components that are currently distributed across these multiple systems are likely to be consolidated over time onto a single platform to reduce data movement.
For example, a relational DBMS and Hadoop server could co-exist on the same system, or, in the future, their capabilities could be combined into a single integrated DBMS environment. This consolidation is discussed further in the section An Analytics Ecosystem for the Future below. Another possible future option to reduce data movement is an intelligent workload facility that reroutes certain parts of a workload (a query, for example) to the system where the data resides.

- Developing data governance and data quality management processes to support big data. Given the growth in the amount of data that organizations will be analyzing, it will become impossible to guarantee the quality of every piece of information delivered to business users. Instead, it will become necessary to segment data based on quality and security requirements, and to handle governance accordingly. It will also be important that users are made fully aware of the quality level of any given piece of information delivered to them.

- The immaturity of new non-relational systems and the level of IT development and administration resources and skills required to support them. Although technologies such as Hadoop are immature, vendors are nevertheless rapidly enhancing these systems to improve their analytical processing capabilities, add development and administration tools, reduce implementation and administration effort, and improve system reliability, availability and security.
It must be noted, however, that existing relational DBMS products have undergone significant development effort to support an enterprise-level analytical processing environment, and new non-relational systems will require similar resources to achieve the same level of maturity. However, not all big data use cases require an enterprise-quality analytical processing environment, and this is why most organizations are likely to use multiple products.

- Providing business users with a single and seamless user interface. Given the distributed nature of the ecosystem illustrated in Figure 1, business users could be faced with having to use a multitude of analytic tools and interfaces to access and analyze the data they need to do their jobs. The ecosystem must therefore evolve to provide an integrated toolset and seamless interface to the big data environment. This is discussed in more detail in the section An Analytics Ecosystem for the Future below.

- Lack of skills for enabling data science and investigative computing projects. From an analytics perspective, this is likely to be one of the biggest barriers to success. While vendors are making their advanced analytics capabilities more approachable, there is still a limit to the extent to which such features can be made easily accessible to less skilled users. Rather than looking for individuals who have a complete set of data science skills, organizations should instead build data science teams where the team as a unit has the required skills. For less experienced users, keeping the level of new skills required to a minimum is also important.
AN ANALYTICS ECOSYSTEM FOR THE FUTURE

The ecosystem illustrated in Figure 1 shows how the various components of a big data environment coexist in a distributed environment consisting of multiple interconnected systems. As outlined in the section Barriers to Success above, this distributed environment can involve a number of issues that need to be considered. A key issue concerns the level of data and metadata integration and data movement involved in a big data ecosystem. For business analytics, one of the main issues is providing business users with an integrated toolset and seamless interface to the ecosystem. Vendors are working on both short-term and long-term solutions to help solve these issues.

Short-Term Needs

In the short term, the data portability, connector and interface features supplied by vendors will help address many of these issues. The quality and performance of these features will be key distinguishing factors between products. Three key requirements here are:

- The ability to ingest and transform any type of data into and between the data management components of the ecosystem. In some cases this may be done using batch processes, but where low-latency data is required, there will be a need to stream data continuously, or in micro-batches, into and between the data management components. Transformation power and performance will distinguish products here.

- A common metadata interface to identify the data managed by the analytics ecosystem. This may be achieved by using data virtualization techniques to access the metadata where it is physically stored and managed.
In the longer term, a single repository that documents the data, its location and physical definition, and its business meaning is required. However, such a single repository has always been an elusive IT objective because of the number of data stores and vendor products used by most organizations.

- A common set of interfaces to the complete set of data managed by the ecosystem, no matter where it resides. These interfaces should make the location of the data transparent to users and applications. Current approaches to achieving this include data virtualization and data abstraction layers in existing business analytics tools. However, such capabilities need to be moved into the data management environment to provide a set of optimized services for accessing and processing all of the data in the ecosystem.

Given that big data extends existing analytical capabilities, these common interfaces should support current analytic toolsets and data languages. The most common data language used today is SQL, and SQL support should be the starting point for any given interface. Many non-relational products, however, do not support SQL, or support only a limited subset of SQL. To overcome this issue, vendors will need to support other languages such as R or Pig, or even programming languages such as Java. In some cases, applications may use these languages directly. In other situations, to reduce the development effort involved, the interfaces will need to translate SQL into one of those languages, or invoke
pre-built routines and functions written using these languages. Compatibility and performance will distinguish products here.

Longer-Term Goals

In the longer term, to support more sophisticated and complex workloads, it will be necessary to consolidate key components of the ecosystem onto a single system to eliminate the overheads caused by network interactions. This consolidated system may consist simply of the various components co-located on the same hardware, or may go further and integrate the components into a single hybrid data management platform optimized for analytic processing. Providing a single data management and analytic platform will of course require major enhancements to an existing relational DBMS product. The advantage of this approach is that not only can new big data capabilities be leveraged for new types of data, but existing analytic toolsets and data languages will also continue to operate without change.

The requirements for a single platform can best be explained by examining the architecture of a relational DBMS and reviewing the enhancements required to extend it to support big data.

The Logical View of Data

In a relational DBMS, users and applications see data in the form of tables. These tables are defined and accessed using SQL. Three important SQL extensibility features provide the underpinnings for analyzing big data: data-type extensions, analytic-function extensions, and external tables.

Over the years, vendors have steadily increased the range of data types that can be managed by a relational DBMS, including, for example, XML documents, geospatial data, text and multi-media.
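The first two of these extensibility features, data-type extensions and analytic-function extensions, can be sketched in miniature with SQLite's Python bindings. The Point type and the norm() function below are invented for illustration, and production analytic DBMSs provide far richer versions of both mechanisms, executed in parallel inside the engine; the shape of the idea, however, is the same.

```python
# A small sketch of data-type and analytic-function extensions using
# SQLite's Python bindings. The Point type and norm() function are
# hypothetical examples, not part of any product discussed in the paper.
import math
import sqlite3

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# Data-type extension: tell the engine how to store and rebuild Points.
sqlite3.register_adapter(Point, lambda p: f"{p.x};{p.y}")
sqlite3.register_converter(
    "POINT", lambda s: Point(*map(float, s.decode().split(";")))
)

con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)

# Analytic-function extension: a new function callable from SQL.
def norm(stored):
    x, y = map(float, stored.split(";"))
    return math.hypot(x, y)

con.create_function("norm", 1, norm)

con.execute("CREATE TABLE locations (pos POINT)")
con.execute("INSERT INTO locations VALUES (?)", (Point(3.0, 4.0),))

# The custom type round-trips, and the custom function runs inside SQL.
p = con.execute("SELECT pos FROM locations").fetchone()[0]
print(p.x, p.y)                                                      # 3.0 4.0
print(con.execute("SELECT norm(pos) FROM locations").fetchone()[0])  # 5.0
```

The point of embedding the function in the engine, as the paper goes on to note, is that the complex processing then runs where the data lives and can exploit the system's parallelism, rather than requiring the data to be extracted into an application first.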
Some products also provide a user-defined type capability, which allows both vendors and customers to define their own data types to the system. This is useful for adding support for complex multi-structured data formats. The essential requirement here, of course, is that these new types of data can be analyzed using SQL. This is where analytic-function extensions come into play.

Vendors have increased the number of pre-defined analytic functions they include in their products, and in some cases have added support for third-party function libraries. Certain products also include a user-defined function capability, which allows new functions to be developed for manipulating new types of data. Both vendor-supplied and user-defined functions are used to encapsulate complex analytic processing, which makes this processing accessible to a broader set of users. These functions are embedded in the DBMS, which means they can exploit the parallel processing capabilities of the system. Analytic-function extensibility is also sometimes used to implement the Hadoop MapReduce distributed programming model in relational DBMS products.

An external table capability allows data that is external to the relational DBMS to be accessed using SQL. This capability is useful for accessing data that is managed by an external file system. It can also be used to modify and create these external data files. In a Hadoop environment, data managed by the Hadoop Distributed File
System (HDFS) could therefore be accessed and created using the external table capability of a relational DBMS.

The Physical Management of Data

The key to success in extending a relational DBMS to support big data lies, of course, in how the underlying physical architecture implements the logical view of data discussed above. Performance is of the utmost importance here.

The DBMS query optimizer maps the logical view of data to the underlying physical storage structures used to manage the data. The optimizer's job is to provide physical data independence and to determine the most appropriate way to physically access and process data. The optimizer plays a major role in the performance of the system, and significant amounts of research and development have gone into designing efficient optimizer technology and into integrating this technology at execution time with DBMS workload management. Products vary considerably in optimizer quality, in their ability to extend the optimizer to efficiently manage user-defined types and functions, and in handling complex workloads. Optimizer quality and extensibility, together with sophisticated workload management, are essential if a product is to support big data with good performance.

The way data is physically stored and managed in data blocks and data files also plays a significant role in performance. Initially, relational DBMSs stored data in a row-based format in data files on disk. Indexes were then created to provide direct access to the data where required. Over time, more advanced mechanisms and options have been introduced.
Examples include various column-based data block formats, improved data compression and encryption, enhanced buffer management, new indexing techniques, in-memory tables, hybrid storage (where data can be stored in optimized file systems and on different-speed devices based on usage), and parallel processing architectures with high-speed data interconnects.

All of the above physical storage options apply equally to big data. Certain types of big data, however, may require enhanced or separate physical storage options. This is nothing new. Some relational DBMSs, for example, have separate storage managers and optimized data stores for XML-formatted data and multi-media. Each of these storage managers works in conjunction with the query optimizer (see Figure 2 on the next page). This approach could be used to support multi-structured data such as graph data, document data, or data imported from systems such as Hadoop. These new big data storage managers, when coupled with external tables, give organizations the flexibility to use SQL to transparently access both data that has been moved into the relational DBMS and data that resides in external data systems.

We can see from the above discussion that supporting big data involves extending a relational DBMS not only to support new types of data and new analytical techniques, but also to provide a sophisticated physical architecture that can deliver the required performance.
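The access pattern that external tables enable, ordinary SQL over data that lives outside the DBMS, can be sketched as follows. Note the limits of the sketch: a true external table reads the file in place (in HDFS, for example) with no copy, whereas this illustration stages the rows into an in-memory table; the file contents and table name are invented for the example.

```python
# A sketch of the external-table access pattern: a file managed by an
# external system (standing in here for an HDFS export) becomes
# queryable with ordinary SQL. A real external table reads the file
# where it lives; this illustration stages the rows to show the idea.
import csv
import io
import sqlite3

# Stand-in for a data file that lives outside the relational DBMS.
external_file = io.StringIO("region,sales\neast,100\nwest,250\n")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ext_sales (region TEXT, sales INTEGER)")
con.executemany(
    "INSERT INTO ext_sales VALUES (?, ?)",
    ((row["region"], int(row["sales"])) for row in csv.DictReader(external_file)),
)

# The external data is now transparently accessible through SQL.
total = con.execute("SELECT SUM(sales) FROM ext_sales").fetchone()[0]
print(total)  # 350
```

From the user's perspective this is the whole point: once the external data is reachable through a table, every existing SQL tool and skill applies to it unchanged, whether the bytes sit inside the DBMS or in an external file system.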
Figure 2. Next-generation analytics system for big data

Summary

In the short term, big data can be supported by providing efficient data interchange connectors and common interfaces to the various systems shown in Figure 1. In the longer term, for certain types of applications, it can be seen from the discussion above that there are several advantages to providing a single system for supporting big data. There are two ways of building such a system. The first is to co-locate critical components on the same hardware platform and then use a common set of application interfaces to those components and the data they manage. The second approach is to extend a relational DBMS to support big data using a single data management and analytics platform. This second approach allows the full power of the relational DBMS environment to be applied to big data solutions, while at the same time supporting existing applications.

In summary, key requirements for a next-generation analytics ecosystem for big data include:

- An integrated set of open interfaces to all of the processes, data and metadata in the analytics ecosystem.
- Support for all mainstream programming languages.
- SQL access to external data sources and systems (such as Hadoop) via high-performance connectors.
- Data-type extensions for big data.
- Analytic-function extensions and a development kit for business analytics.
- Support for third-party and open source analytic function libraries.
- A query optimizer that understands SQL extensibility for big data (data types, analytic functions and external tables) and the underlying physical data management and storage architecture.
- A workload manager for handling complex, mixed and interactive workloads.
- Customized storage managers for handling certain types of complex data.
- A parallel-processing architecture with high-speed interconnect between nodes.

None of the short-term or long-term options outlined in this paper are mutually exclusive. Organizations will need to understand the benefits and costs of each option and choose the appropriate solutions that meet their business needs, while at the same time supporting the workloads involved with high performance.

About BI Research

BI Research is a research and consulting company whose goal is to help organizations understand and exploit new developments in business intelligence, data integration, and data management.