Why Big Data in the Cloud?




Colin White, BI Research, January 2014. Sponsored by Treasure Data.

TABLE OF CONTENTS

Introduction
The Importance of Big Data
The Role of Cloud Computing
Using Big Data in the Cloud
Use Cases for Big Data in the Cloud
The Challenges of Big Data in the Cloud
Vendor Example: Treasure Data
  Data Acquisition
  Data Storage
  Data Analysis
  Treasure Data Value Proposition
Conclusion

Copyright 2014 BI Research, All Rights Reserved.

INTRODUCTION

Big data and cloud computing are top initiatives for IT, and when used together they promise significant benefits for both the business and IT. Big data helps create competitive advantage, increase revenues and identify new business opportunities, while cloud computing offers the potential to reduce IT costs and provide faster time to value for IT investments. Both technologies are evolving rapidly, and an increasing number of vendors are developing and delivering products and services for enabling big data solutions in a cloud computing environment.

Although all organizations should evaluate the use of cloud computing for their big data projects, it is also important to realize that big data in the cloud is not a one-size-fits-all approach. There are many different cloud services on the market, and it is essential that you match business and technology requirements to the most appropriate offering. Also, not all big data projects are ideally suited to a cloud computing approach, and it is important to clearly identify those projects that do and do not lend themselves to a cloud environment.

The objectives of this paper are to provide an overview of key industry trends in the use of big data in the cloud and to help you identify the use cases that best fit a big data cloud computing environment. Along the way, as an example, it will also take a look at Treasure Data (the sponsor of this paper) and its end-to-end cloud solution for big data.

THE IMPORTANCE OF BIG DATA

Big data projects initially focused on processing business information extracted from Internet and Web data sources, for example, e-mail, Web pages and logs, and social computing sites (Facebook, Twitter, etc.).
More recently, the use of big data has grown to include other sources, especially data generated by sensors installed on a wide range of equipment such as mobile devices, smart utility meters, motor vehicles, aircraft engines, security equipment, telephone switches, RFID readers, and so forth. In fact, machine-generated data is one of the fastest growing sources of big data.

As companies began to exploit big data, it quickly became apparent that traditional approaches to data warehousing and analytic processing were unable to handle not only the data volumes and data rates involved, but also the diverse set of data sources and varieties of data required by big data projects. Clearly, a more efficient and cost-effective infrastructure was required. Solutions here vary from reducing the cost and improving the capabilities of relational database products to providing alternative data management technologies such as the Hadoop distributed computing environment.

The value of big data, however, is not in the raw data itself, but in the business insights that can be gained by extracting and analyzing the business information embedded in the data. This is why vendors are focusing not only on providing products that help manage big data, but also on solutions that can extract and analyze the business information in that data. The result is that several vendors now offer end-to-end solutions that provide data acquisition and integration, data management, data analysis and data visualization

capabilities for the processing of big data. To speed deployment and improve time to value, these solutions are frequently offered as prepackaged hardware and software appliances and/or as a set of services for rapid deployment in a cloud computing environment.

THE ROLE OF CLOUD COMPUTING

Cloud computing services promise pay-as-you-go, on-demand and elastic scalability for developing and deploying many IT projects. Compared with an on-premises IT environment, cloud computing reduces up-front IT costs and enables organizations to scale their IT resources as required, while paying only for the resources they use. The cloud is therefore an ideal environment for big data projects, given the large data volumes and unpredictable nature of the analytic workloads involved. This is one of the reasons why the industry is seeing a sudden and significant jump in the use of cloud computing. Another reason for this sudden growth is that cloud technologies are maturing and organizations are overcoming their data security issues and concerns.

Barriers still remain to successful cloud adoption, however. Chief among these are the complexity of integrating cloud and on-premises data and the inability of many cloud services to efficiently and rapidly move data into and out of the cloud environment; this topic is discussed in more detail below.

USING BIG DATA IN THE CLOUD

Most traditional data warehousing and business analytics projects to date have involved managing and analyzing data extracted from on-premises business transaction systems. [1] In some situations, cloud services have been used for developing analytics on business transaction data stored in a cloud computing system such as salesforce.com, but these have been piecemeal and standalone projects. In fact, one of the risks of cloud computing is that it has made it easier for business groups and business users to bypass IT and purchase their own cloud-based IT services.
This is why it is important for IT to partner and collaborate with the business in deploying and using cloud services to reduce risk, avoid poor technology selection, and manage data governance and data security issues.

For the foreseeable future, it is unlikely that many organizations will move their existing business transaction systems or sensitive transaction data for analysis purposes to a public cloud environment. [2] However, cloud adoption for business transaction processing is increasing, especially for new projects and projects involving packaged application solutions, and so in the longer term this will lead to more traditional business transaction processing and associated analytic processing being done in the cloud. The biggest potential for cloud computing is the processing of data that already exists in

[1] The exceptions are newer and Web-focused companies whose sole business is oriented towards Internet commerce. These companies have few legacy systems, and it is therefore easier for them to move to a cloud computing environment.
[2] Many companies are, however, beginning to deploy private clouds and virtualized environments for in-house use, but this topic is beyond the scope of this paper.

the cloud. This data includes the large volumes of data on internal and public web servers, and also data generated by third-party providers. It also includes externally generated data (certain types of machine sensor data, for example) that can as easily be delivered to a cloud environment as it can to an in-house environment. These large volumes of Web and sensor data can be captured, filtered and transformed in the cloud and then delivered to an in-house system for analysis. In many cases, the data can also be analyzed in the cloud and the results delivered to internal and external business users. One of the key requirements here is to keep data movement to a minimum and to process data where it resides.

As noted at the beginning of this paper, it is important to realize that big data in the cloud is not a one-size-fits-all solution. It pays to make use of cloud services where it makes sense from the perspective of satisfying business needs, reducing costs, achieving faster time to value, and providing flexibility and scalability.

USE CASES FOR BIG DATA IN THE CLOUD

When examining how organizations use cloud computing for big data projects, three main use cases become apparent; these are outlined below.

Standalone reporting and analysis of Web, social media or sensor data: A cloud-based reporting and analysis system is a cost-effective way of capturing, storing and analyzing high-volume web log/clickstream, social media (from Twitter, for example) or sensor (from telemetric devices, for example) data. In this use case, data from each source is uploaded in small batches or streamed directly to the cloud service for reporting and analysis.

Data analysis and visualization of e-commerce data: Many organizations (web retailers, on-line gaming companies, etc.) run their entire businesses on the web. For these companies, monitoring business operations, analyzing customer and user behavior and tracking marketing programs are top priorities.
The applications involved in e-commerce are often deployed on hundreds of servers and handle requests from millions of users and a variety of devices. They also generate terabytes of data every day. A cloud-based system is ideally suited to collecting, analyzing and visualizing all of this data to help business managers track and analyze overall business operations and performance.

Data warehouse augmentation: A cloud-based data refinery or data hub is a cost-effective way of capturing, storing, transforming and archiving high-volume data while also providing connectivity to a data warehouse for transferring data. In this use case, the data warehouse remains the primary source of analytics for business users, but direct reporting and analysis of cloud-based data may also be provided as required.

THE CHALLENGES OF BIG DATA IN THE CLOUD

The main tasks in any big data project involve acquiring and integrating the raw source data, managing that data, processing the data, and finally delivering the results to the

systems and users that require the processed data. Processing may involve transforming and filtering the data and also possibly analyzing the data. As in an on-premises environment, cloud users have the choice of integrating various cloud products and services themselves or using an integrated end-to-end solution. In the same way that an integrated hardware and software appliance simplifies development, deployment and administration for an on-premises project, an integrated end-to-end cloud solution for big data offers similar benefits to an appliance approach. A cloud solution also provides flexible scalability and a pay-as-you-go pricing model.

As mentioned earlier, one of the biggest barriers to cloud deployment is data integration and data movement. Ideally, the data should be processed where it resides, but even when the source data already resides in the cloud it may still have to be moved to a different cloud system for processing, in the same way that data is moved from business transaction systems to a data warehouse in an on-premises environment.

An added complication with big data is that the project may also involve a mixture of data in the cloud and on-premises data. In this case, the on-premises data may be accessed dynamically by a cloud application or staged from the on-premises environment to the cloud for use by the cloud application. Again, this is the same as in an on-premises environment where big data projects are increasingly using data from a variety of data sources in addition to a data warehouse. The key difference in a cloud environment is that data movement occurs across an Internet connection, which has security, performance and cost implications. It is therefore very important in a cloud environment to look for big data solutions that not only simplify development, deployment and administration, but that also provide solid and well-performing data integration and data movement capabilities.
VENDOR EXAMPLE: TREASURE DATA

Treasure Data was founded in 2011 and is based in Mountain View, California. The company offers a managed cloud service that provides an end-to-end solution for the acquisition, storage and analysis of big data. At the time of writing, Treasure Data had some 90 customers, including several Fortune 500 companies. These customers come from a variety of industries, but most of their big data projects fit into one of the three use cases outlined earlier.

Data Acquisition

Data is uploaded to the Treasure Data service using a parallel bulk data import tool or real-time data collection agents that run on the customer's local systems. The bulk data import tool is typically used to import data from relational databases, flat files (Microsoft Excel, comma delimited, etc.) and application systems (ERP, CRM, etc.). Data collection agents are designed to capture data in real time from Web and application logs, sensors, mobile systems, and so forth. Since near-real-time data is critical for the majority of customers, most data comes into the Treasure Data system using data collection agents.
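The core behavior of this kind of collection agent, buffering events locally and forwarding them in compressed batches, can be sketched in a few lines. This is a hypothetical illustration, not Treasure Data's actual Fluentd or Treasure Agent code; the class and parameter names are invented, and JSON plus zlib stand in for the real agent's serialization and compression.

```python
import json
import time
import zlib


class CollectionAgent:
    """Toy log-collection agent: buffers events, then flushes a compressed
    batch when a size or age threshold is reached. (Illustrative only --
    not Treasure Data's Fluentd/Treasure Agent implementation.)"""

    def __init__(self, max_events=1000, max_age_seconds=5.0, sender=None):
        self.max_events = max_events            # flush when buffer reaches this size...
        self.max_age_seconds = max_age_seconds  # ...or when the oldest event is this old
        self.sender = sender or (lambda payload: None)  # transport to the cloud service
        self.buffer = []
        self.first_event_time = None

    def emit(self, record):
        """Add one event (a dict) to the buffer, tagging it with a timestamp."""
        if not self.buffer:
            self.first_event_time = time.time()
        self.buffer.append({"time": time.time(), **record})
        if (len(self.buffer) >= self.max_events or
                time.time() - self.first_event_time >= self.max_age_seconds):
            self.flush()

    def flush(self):
        """Serialize and compress the buffered batch, then hand it to the sender."""
        if not self.buffer:
            return 0
        payload = zlib.compress(json.dumps(self.buffer).encode("utf-8"))
        self.sender(payload)
        count = len(self.buffer)
        self.buffer = []
        return count
```

A production agent would also retry failed sends, run flushing on a background thread, and use a binary serialization format, but the size-versus-age trade-off shown here is exactly the kind of buffer tuning the paper describes below.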

Data collection agents filter, transform and/or aggregate data before it is transmitted to the cloud service. All data is transmitted in a binary format known as MessagePack. [3] The agent technology has been designed to be lightweight, extensible and reliable. It also employs parallelization, buffering and compression mechanisms to maximize performance, minimize network traffic, and ensure no data loss or duplication during transmission. Buffer sizes can be tuned based on timing and data size. One of Treasure Data's customers, for example, uses data collection agents to transmit over a terabyte of compressed log data per day to the service for customer billing purposes. Another collects and transmits real-time gaming data from 3,500 servers for analysis on the Treasure Data service.

The agent technology comes in two versions: an open source version (Fluentd) and an enhanced version supported by Treasure Data (Treasure Agent). The Fluentd open source community has some 2,000 members who have developed and contributed over 200 data capture plug-ins for use on-premises and in the cloud (including the Treasure Data cloud service). Treasure Data supplies a range of enhanced enterprise-ready plug-ins that provide improved compatibility and performance. Treasure Data also offers a monitoring and alerting service for Treasure Agent users.

Data Storage

The Treasure Data cloud service currently employs Amazon Web Services and the Amazon S3 object storage layer, but Treasure Data claims it can easily port to other platforms as customer needs dictate. All data flowing into the Treasure Data cloud service is time stamped, transformed into a compressed columnar MessagePack format, and stored in a proprietary columnar database known as Plazma. This database can then be queried using an enhanced Hadoop processing environment. Data is first kept in real-time files and then moved into archive files at regular intervals.
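The compactness that makes MessagePack attractive as a transmission and storage format can be illustrated with a minimal encoder for two of the simplest cases in the published MessagePack specification. This is a sketch for illustration only; it is not Treasure Data code, and a real application would use an existing msgpack library.

```python
def pack_small_int(n):
    """MessagePack 'positive fixint': integers 0..127 encode as one byte."""
    if not 0 <= n <= 127:
        raise ValueError("only the positive fixint range is shown in this sketch")
    return bytes([n])


def pack_short_str(s):
    """MessagePack 'fixstr': strings up to 31 UTF-8 bytes carry a single
    header byte (0xA0 | length), i.e. one byte of overhead."""
    data = s.encode("utf-8")
    if len(data) > 31:
        raise ValueError("only the fixstr range is shown in this sketch")
    return bytes([0xA0 | len(data)]) + data
```

For comparison, `pack_small_int(42)` is one byte where the JSON text `42` takes two, and `pack_short_str("page_view")` is ten bytes where the quoted JSON string takes eleven; over billions of log events, that difference adds up.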
Moving data into archive files enables time-based partitioning and larger data files for more efficient processing. The archiving process is completely transparent to applications. A Web-based management console is provided for monitoring resources, managing access controls, and raising support tickets. Treasure Data is working on expanding this console to provide Treasure Viewer, an interface to query and visualize data. This interface is currently in beta testing.

Treasure Data uses a flat-rate tiered pricing model that is based on the number of data rows imported into the service annually, guaranteed processing capacity, service-level requirements, and the level of support needed. The Treasure Data service provides a multi-tenant environment where additional machine resources expand to meet customer

[3] MessagePack is an efficient binary serialization format used for exchanging data between systems. It is similar to the JSON data format, but is faster and more compact than JSON. For example, MessagePack encodes small integers into a single byte, and short strings typically incur only one byte of overhead when encoded.

needs, and where customers can use up to four times the guaranteed capacity at no extra cost if that capacity is available.

Data Analysis

Applications access and analyze data in a Treasure Data environment using Hadoop Hive or Treasure Query Accelerator queries coded in SQL syntax. Some Treasure Data customers are happy with Hive, while others require a more interactive and high-performance interface than that supported by the MapReduce batch jobs generated by Hive. To meet this customer need, Treasure Data offers the Treasure Query Accelerator, which extends the SQL interface to support an enhanced version of Cloudera Impala. The Treasure Data platform separates the query engine from the storage layer to make it easier to add other SQL interfaces as other open source products mature. Both ODBC and JDBC drivers are available for the query interface, which enables many popular BI and analytics tools to be used with the service. Several customers, for example, use Tableau to access and analyze data managed by Treasure Data.

Treasure Data Value Proposition

The objective of Treasure Data is to provide an end-to-end cloud service for big data projects that is fast and easy to deploy, economical, and well supported. Its managed service model makes it attractive to companies with limited technical resources. The company receives high marks from its customers for fast implementation times and the support it provides. Another objective of Treasure Data's cloud service is to overcome the data integration and data movement issues outlined in this paper by providing optimized data collection agents that support a wide range of data sources.
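The SQL access path described under Data Analysis above amounts to running time-bucketed aggregations over time-stamped event data through a standard database driver. The sketch below illustrates that pattern; the schema, data and query are invented for the example, and Python's built-in sqlite3 stands in for an ODBC/JDBC connection to the cloud service.

```python
import sqlite3

# In-memory SQLite stands in for an ODBC/JDBC connection to a cloud query
# service; the events table and its contents are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, path TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1000, "/home"), (1030, "/buy"), (4200, "/home"), (4300, "/home")],
)

# A typical big data reporting query: count events per hour-sized time
# bucket -- the kind of time-partitioned aggregation that time-stamped,
# partitioned storage is organized to serve efficiently.
rows = conn.execute(
    """
    SELECT ts / 3600 AS hour_bucket, COUNT(*) AS events
    FROM events
    GROUP BY hour_bucket
    ORDER BY hour_bucket
    """
).fetchall()
```

The same shape of query, issued over ODBC or JDBC, is what lets off-the-shelf BI tools such as Tableau report against the service without any special integration code.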
The Treasure Data service is especially well suited to those organizations that have data in the cloud and/or externally generated sensor data, are open to a cloud-based approach, wish to use a managed big data service rather than a set of complex platform services, and do not have the skills or the desire to manage a big data platform.

CONCLUSION

Big data is currently receiving significant industry attention and there is considerable hype associated with the topic. At the same time, however, more and more companies are beginning to see the business value of big data projects, and as the field matures the rate of adoption will accelerate. There is also considerable interest in cloud computing for reducing up-front IT costs, providing elastic scalability and enabling the rapid deployment of new projects. As a result, an increasing number of companies will deploy their big data projects in the cloud.

Both big data and cloud computing require a new set of skills, and organizations need to ensure these skills are in place before embarking on big data in the cloud. Vendors can help organizations get up to speed in this area, and this is why choosing the right cloud vendor is important. A companion paper to this one looks at how organizations should prepare and get started on big data projects in the cloud, and also offers suggestions for the features an organization should look for in selecting a cloud services vendor.