Why Big Data in the Cloud?
Colin White, BI Research
January 2014
Sponsored by Treasure Data
TABLE OF CONTENTS
Introduction
The Importance of Big Data
The Role of Cloud Computing
Using Big Data in the Cloud
Use Cases for Big Data in the Cloud
The Challenges of Big Data in the Cloud
Vendor Example: Treasure Data
Data Acquisition
Data Storage
Data Analysis
Treasure Data Value Proposition
Conclusion

Copyright 2014 BI Research, All Rights Reserved.
INTRODUCTION

Big data and cloud computing are top initiatives for IT, and when used together they promise significant benefits for both the business and IT. Big data helps create competitive advantage, increase revenues and identify new business opportunities, while cloud computing offers the potential to reduce IT costs and provide faster time to value for IT investments. Both technologies are evolving rapidly, and an increasing number of vendors are developing and delivering products and services for enabling big data solutions in a cloud-computing environment.

Although all organizations should evaluate the use of cloud computing for their big data projects, it is also important to realize that big data in the cloud is not a one-size-fits-all approach. There are many different cloud services on the market, and it is essential that you match business and technology requirements to the most appropriate offering. Also, not all big data projects are ideally suited to a cloud computing approach, and it is important to clearly identify those projects that do and do not lend themselves to a cloud environment.

The objectives of this paper are to provide an overview of key industry trends in the use of big data in the cloud and to help you identify the use cases that best fit a big data cloud computing environment. Along the way, as an example, it will also take a look at Treasure Data (the sponsor of this paper) and its end-to-end cloud solution for big data.

THE IMPORTANCE OF BIG DATA

Big data projects initially focused on processing business information extracted from Internet and Web data sources, for example, e-mail, Web pages and logs, and social computing sites (Facebook, Twitter, etc.).
More recently, the use of big data has grown to include other sources, especially data generated by sensors installed on a wide range of equipment such as mobile devices, smart utility meters, motor vehicles, aircraft engines, security equipment, telephone switches, RFID readers, and so forth. In fact, machine-generated data is one of the fastest growing sources of big data.

As companies began to exploit big data, it quickly became apparent that traditional approaches to data warehousing and analytic processing could handle neither the data volumes and data rates involved nor the diverse set of data sources and varieties of data required by big data projects. Clearly, a more efficient and cost-effective infrastructure was required. Solutions here vary from reducing the cost and improving the capabilities of relational database products to providing alternative data management technologies such as the Hadoop distributed computing environment.

The value of big data, however, is not in the raw data itself, but in the business insights that can be gained by extracting and analyzing the business information embedded in the data. This is why vendors are focusing not only on providing products that help manage big data, but also on solutions that can extract and analyze the business information in that data. The result is that several vendors now offer end-to-end solutions that provide data acquisition and integration, data management, data analysis and data visualization
capabilities for the processing of big data. To speed deployment and improve time to value, these solutions are frequently offered as prepackaged hardware and software appliances and/or as a set of services for rapid deployment in a cloud-computing environment.

THE ROLE OF CLOUD COMPUTING

Cloud computing services promise pay-as-you-go, on-demand and elastic scalability for developing and deploying many IT projects. Compared with an on-premises IT environment, cloud computing reduces upfront IT costs and enables organizations to scale their IT resources as required, while paying only for the resources they use. The cloud is therefore an ideal environment for big data projects, given the large data volumes and unpredictable nature of the analytic workloads involved. This is one of the reasons why the industry is seeing a sudden and significant jump in the use of cloud computing. Another reason for this sudden growth is that cloud technologies are maturing and organizations are overcoming their data security issues and concerns.

Barriers still remain to successful cloud adoption, however. Chief among these is the complexity of integrating cloud and on-premises data and the inability of many cloud services to efficiently and rapidly move data into and out of the cloud environment; this topic is discussed in more detail below.

USING BIG DATA IN THE CLOUD

Most traditional data warehousing and business analytics projects to date have involved managing and analyzing data extracted from on-premises business transaction systems.[1] In some situations, cloud services have been used for developing analytics on business transaction data stored in a cloud computing system such as salesforce.com, but these have been piecemeal and standalone projects. In fact, one of the risks of cloud computing is that it has made it easier for business groups and business users to bypass IT and purchase their own cloud-based IT services.
This is why it is important for IT to partner and collaborate with the business in deploying and using cloud services to reduce risk, avoid poor technology selection, and manage data governance and data security issues.

For the foreseeable future, it is unlikely that many organizations will move their existing business transaction systems or sensitive transaction data for analysis purposes to a public cloud environment.[2] However, cloud adoption for business transaction processing is increasing, especially for new projects and projects involving packaged application solutions, and so in the longer term this will lead to more traditional business transaction processing and associated analytic processing being done in the cloud.

The biggest potential for cloud computing is the processing of data that already exists in the cloud. This data includes the large volumes of data on internal and public web servers, and also data generated by third-party providers. It also includes externally generated data (certain types of machine sensor data, for example) that can as easily be delivered to a cloud environment as it can to an in-house environment. These large volumes of Web and sensor data can be captured, filtered and transformed in the cloud and then delivered to an in-house system for analysis. In many cases, the data can also be analyzed in the cloud and the results delivered to internal and external business users. One of the key requirements here is to keep data movement to a minimum and to process data where it resides.

As noted at the beginning of this paper, it is important to realize that big data in the cloud is not a one-size-fits-all solution. It pays to make use of cloud services where it makes sense from the perspective of satisfying business needs, reducing costs, achieving faster time to value, and providing flexibility and scalability.

[1] The exceptions are newer and Web-focused companies whose sole business is oriented towards Internet commerce. These companies have few legacy systems and it is therefore easier for them to move to a cloud-computing environment.
[2] Many companies are, however, beginning to deploy private clouds and virtualized environments for in-house use, but this topic is beyond the scope of this paper.

USE CASES FOR BIG DATA IN THE CLOUD

When examining how organizations use cloud computing for big data projects, three main use cases become apparent; these are outlined below.

Standalone reporting and analysis of Web, social media or sensor data: A cloud-based reporting and analysis system is a cost-effective way of capturing, storing and analyzing high-volume web log/clickstream, social media (from Twitter, for example) or sensor (from telemetric devices, for example) data. In this use case, data from each source is uploaded in small batches or streamed directly to the cloud service for reporting and analysis.

Data analysis and visualization of e-commerce data: Many organizations (web retailers, on-line gaming companies, etc.) run their entire businesses on the web. For these companies, monitoring business operations, analyzing customer and user behavior and tracking marketing programs are top priorities.
The applications involved in e-commerce are often deployed on hundreds of servers and handle requests from millions of users and a variety of devices. They also generate terabytes of data every day. A cloud-based system is ideally suited to collecting, analyzing and visualizing all of this data to help business managers track and analyze overall business operations and performance.

Data warehouse augmentation: A cloud-based data refinery or data hub is a cost-effective way of capturing, storing, transforming and archiving high-volume data while also providing connectivity to a data warehouse for transferring data. In this use case, the data warehouse remains the primary source of analytics for business users, but direct reporting and analysis of cloud-based data may also be provided as required.

THE CHALLENGES OF BIG DATA IN THE CLOUD

The main tasks in any big data project involve acquiring and integrating the raw source data, managing that data, processing the data, and finally delivering the results to the
systems and users that require the processed data. Processing may involve transforming and filtering the data and also possibly analyzing it.

As in an on-premises environment, cloud users have the choice of integrating various cloud products and services themselves or using an integrated end-to-end solution. In the same way that an integrated hardware and software appliance simplifies development, deployment and administration for an on-premises project, an integrated end-to-end cloud solution for big data offers similar benefits to an appliance approach. A cloud solution also provides flexible scalability and a pay-as-you-go pricing model.

As mentioned earlier, one of the biggest barriers to cloud deployment is data integration and data movement. Ideally, the data should be processed where it resides, but even when the source data already resides in the cloud, it may still have to be moved to a different cloud system for processing, in the same way that data is moved from business transaction systems to a data warehouse in an on-premises environment.

An added complication with big data is that a project may also involve a mixture of cloud and on-premises data. In this case, the on-premises data may be accessed dynamically by a cloud application or staged from the on-premises environment to the cloud for use by the cloud application. Again, this is the same as in an on-premises environment, where big data projects are increasingly using data from a variety of data sources in addition to a data warehouse. The key difference in a cloud environment is that data movement occurs across an Internet connection, which has security, performance and cost implications. It is therefore very important in a cloud environment to look for big data solutions that not only simplify development, deployment and administration, but that also provide solid and well-performing data integration and data movement capabilities.
VENDOR EXAMPLE: TREASURE DATA

Treasure Data was founded in 2011 and is based in Mountain View, California. The company offers a managed cloud service that provides an end-to-end solution for the acquisition, storage and analysis of big data. At the time of writing, Treasure Data had some 90 customers, including several Fortune 500 companies. These customers come from a variety of industries, but most of their big data projects fit into one of the three use cases outlined earlier.

Data Acquisition

Data is uploaded to the Treasure Data service using a parallel bulk data import tool or real-time data collection agents that run on the customer's local systems. The bulk data import tool is typically used to import data from relational databases, flat files (Microsoft Excel, comma delimited, etc.) and application systems (ERP, CRM, etc.). Data collection agents are designed to capture data in real time from Web and application logs, sensors, mobile systems, and so forth. Since near-real-time data is critical for the majority of customers, most data comes into the Treasure Data system using data collection agents.
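The collection-agent model lends itself to a short sketch. The Python below is an illustrative toy under assumptions of this paper's description only; the `CollectionAgent` class and its pluggable `sink` callable are invented names, not Treasure Data's Fluentd-based implementation. It buffers timestamped events and ships a batch when either a count or an age threshold is crossed, serializing to JSON lines where the real agents use compressed MessagePack.

```python
import json
import time

class CollectionAgent:
    """Toy log-collection agent: buffers events and flushes a batch
    when either a buffer-size or a buffer-age limit is reached."""

    def __init__(self, sink, max_events=100, max_seconds=5.0):
        self.sink = sink                # callable that ships one serialized batch
        self.max_events = max_events    # flush threshold by event count
        self.max_seconds = max_seconds  # flush threshold by buffer age
        self.buffer = []
        self.oldest = None              # arrival time of oldest buffered event

    def emit(self, record):
        # Each record is timestamped on arrival.
        event = {"time": time.time(), **record}
        if self.oldest is None:
            self.oldest = event["time"]
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_events
                or event["time"] - self.oldest >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            # A production agent would send compressed MessagePack over the
            # network; JSON lines keep this sketch dependency-free.
            self.sink("\n".join(json.dumps(e) for e in self.buffer))
            self.buffer = []
            self.oldest = None

batches = []
agent = CollectionAgent(batches.append, max_events=3)
for status in (200, 200, 404):
    agent.emit({"path": "/checkout", "status": status})
# The third event crossed max_events, so exactly one batch was shipped.
```

A real agent adds the concerns the text describes: parallel transmission, retry and acknowledgment to avoid loss or duplication, and tunable buffer sizes.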
Data collection agents filter, transform and/or aggregate data before it is transmitted to the cloud service. All data is transmitted in a binary format known as MessagePack.[3] The agent technology has been designed to be lightweight, extensible and reliable. It also employs parallelization, buffering and compression mechanisms to maximize performance, minimize network traffic, and ensure no data loss or duplication during transmission. Buffer sizes can be tuned based on timing and data size. One of Treasure Data's customers, for example, uses data collection agents to transmit over a terabyte of compressed log data per day to the service for customer billing purposes. Another collects and transmits real-time gaming data from 3,500 servers for analysis on the Treasure Data service.

The agent technology comes in two versions: an open source version (Fluentd) and an enhanced version supported by Treasure Data (Treasure Agent). The Fluentd open source community has some 2,000 members who have developed and contributed over 200 data capture plug-ins for use on-premises and in the cloud (including the Treasure Data cloud service). Treasure Data supplies a range of enhanced enterprise-ready plug-ins that provide improved compatibility and performance. Treasure Data also offers a monitoring and alerting service for Treasure Agent users.

Data Storage

The Treasure Data cloud service currently employs Amazon Web Services and the Amazon S3 object storage layer, but Treasure Data claims it can easily port to other platforms as customer needs dictate. All data flowing into the Treasure Data cloud service is time stamped, transformed into a compressed columnar MessagePack format, and stored in a proprietary columnar database known as Plazma. This database can then be queried using an enhanced Hadoop processing environment. Data is first kept in real-time files and then moved into archive files at regular intervals.
This latter process enables time-based partitioning and larger data files for more efficient processing. The process is completely transparent to applications.

A Web-based management console is provided for monitoring resources, managing access controls, and raising support tickets. Treasure Data is working on expanding this console to provide Treasure Viewer, an interface to query and visualize data. This interface is currently in beta testing.

Treasure Data uses a flat-rate tiered pricing model that is based on the number of data rows imported into the service annually, guaranteed processing capacity, service-level requirements, and the level of support needed. The Treasure Data service provides a multi-tenant environment where additional machine resources expand to meet customer needs and where customers can use up to four times the guaranteed capacity at no extra cost if that capacity is available.

[3] MessagePack is an efficient binary serialization format used for exchanging data between systems. It is similar to the JSON data format, but is faster and more compact than JSON. MessagePack encodes small integers into a single byte, and short strings typically incur only one byte of overhead when encoded.

Data Analysis

Applications access and analyze data in a Treasure Data environment using Hadoop Hive or Treasure Query Accelerator queries coded in SQL syntax. Some Treasure Data customers are happy with Hive, while others require a more interactive and higher-performance interface than that supported by the MapReduce batch jobs generated by Hive. To meet this need, Treasure Data offers the Treasure Query Accelerator, which extends the SQL interface to support an enhanced version of Cloudera Impala. The Treasure Data platform separates the query engine from the storage layer to make it easier to add other SQL interfaces as other open source products mature. Both ODBC and JDBC drivers are available for the query interface, which enables many popular BI and analytics tools to be used with the service. Several customers, for example, use Tableau to access and analyze data managed by Treasure Data.

Treasure Data Value Proposition

The objective of Treasure Data is to provide an end-to-end cloud service for big data projects that is fast and easy to deploy, economical, and well supported. Its managed service model makes it attractive to companies with limited technical resources. The company receives high marks from its customers for fast implementation times and the support it provides. Another objective of Treasure Data's cloud service is to overcome the data integration and data movement issues outlined in this paper by providing optimized data collection agents that support a wide range of data sources.
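To give a flavor of the SQL interface described under Data Analysis, the sketch below computes daily unique visitors per page, the kind of behavioral metric the e-commerce use case calls for. The table and column names are invented for illustration, and SQLite (via Python's standard library) merely stands in for the Hive and Treasure Query Accelerator engines; in the actual service a time UDF would typically take the place of SQLite's date().

```python
import sqlite3

# Hypothetical event table standing in for data collected by the agents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (time INTEGER, path TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?, ?)",
    [(1388534400, "/home", "u1"),
     (1388534401, "/home", "u2"),
     (1388534402, "/checkout", "u1")],
)

# Daily unique visitors per page. Events carry the Unix timestamp the
# service attaches on ingestion; date(time, 'unixepoch') buckets them
# by day, exploiting the same idea as the time-based partitioning above.
rows = conn.execute("""
    SELECT date(time, 'unixepoch') AS day,
           path,
           COUNT(DISTINCT user_id) AS unique_visitors
    FROM pageviews
    GROUP BY day, path
    ORDER BY unique_visitors DESC
""").fetchall()
# → [('2014-01-01', '/home', 2), ('2014-01-01', '/checkout', 1)]
```

Because the query is plain SQL, the same aggregation could equally be issued from a BI tool such as Tableau over the ODBC/JDBC drivers mentioned above.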
The Treasure Data service is especially well suited to organizations that have data in the cloud and/or externally generated sensor data, are open to a cloud-based approach, wish to use a managed big data service rather than a set of complex platform services, and do not have the skills or desire to manage a big data platform.

CONCLUSION

Big data is currently receiving significant industry attention and there is considerable hype associated with the topic. At the same time, however, more and more companies are beginning to see the business value of big data projects, and as this field matures the rate of adoption will accelerate. There is also considerable interest in cloud computing for reducing up-front IT costs, providing elastic scalability and enabling the rapid deployment of new projects. As a result, an increasing number of companies will deploy their big data projects in the cloud.

Both big data and cloud computing require a new set of skills, and organizations need to ensure these skills are in place before embarking on big data in the cloud. Vendors can help organizations get up to speed in this area, and this is why choosing the right cloud vendor is important. A companion paper to this one looks at how organizations should prepare and get started on big data projects in the cloud and also offers suggestions for the features an organization should look for in selecting a cloud services vendor.