Proact whitepaper on Big Data




Summary

Big Data is not a well-defined term. Even if it sounds like just another buzzword, it represents some interesting opportunities for organisations with the skill, resources and need to analyse humongous amounts of data. The challenge is twofold: (1) collect and access the data, and (2) analyse the data. Technically, this means that it is not enough to store data and then manage it: storage, management and availability of data is one unified challenge. Our current vendors EMC, NetApp and VMware are well positioned in the Big Data marketplace. Proact is no newcomer in the field of Big Data storage; we have delivered systems for humongous amounts of data with hard-to-meet I/O demands since the 1990s.

Jakob Fredriksson, Proact IT Group AB, 2013-01-30, version C

Contents

1 What is Big Data?
  1.1 Big Data Storage
  1.2 Big Data Analytics
    1.2.1 RDBMS vs. Big Data
    1.2.2 Who needs Big Data Analytics?
2 Key technologies that enable Big Data
3 Current development segments in Big Data Analytics
  3.1 MapReduce and Hadoop
  3.2 Scalable database
  3.3 Stream
  3.4 Appliance
4 Big Data offerings explained
  4.1 EMC: Isilon and Greenplum
  4.2 NetApp
  4.3 VMware
5 Turn-key solutions for Hadoop
  5.1 Hortonworks
  5.2 Cloudera
  5.3 MapR
6 Why use Proact's solutions for your Big Data challenges?

1 What is Big Data?

The term Big Data originates from the open source community, which tried to develop analytical processes that could incorporate unstructured data and were faster and more scalable than the data warehousing solutions of the time. The term has since developed, and as a buzzword it now covers a lot more. Big Data is nowadays used to refer to a number of advanced data storage, access and analytics technologies aimed at handling high-volume and/or fast-moving data in a variety of scenarios. Recent years' discussions around, and development of, Big Data solutions can be divided into two categories:

- Big Data Storage
- Big Data Analytics

1.1 Big Data Storage

Big Data Storage is the storage that enables Big Data Analytics to be performed. Big Data Storage is all about the regular technological discussions that we in Proact are well familiar with:

- I/O performance and latency
- direct-attached storage vs. SAN or NAS
- data management on the infrastructure layer, and how it interfaces/interacts with the applications

1.2 Big Data Analytics

Big Data Analytics comes from web applications but is now entering segments far away from the web companies. It is fair to say that there are opportunities for Big Data Analytics in all major industry segments. The reason for considering Big Data Analytics is the opportunity to gain business insight from analysing the humongous amounts of data that relational database management systems (RDBMS) cannot handle (more on this in the next section).

1.2.1 RDBMS vs. Big Data

In the previous section RDBMS was discarded as old and insufficient. This section explains why. It is not enough to state that RDBMS is focused on Online Transaction Processing (OLTP).
The real reason RDBMS fails for the data covered by the term Big Data is that such data is difficult to store, manage and analyse using RDBMS technologies, because RDBMS:

- only handles tabular data
- does not parallelise well enough to accommodate commodity hardware clusters
- does not benefit from network speed improvements, which have come much faster than improvements in the seek time of physical storage
- has difficulty scaling out efficiently: clustering beyond a few nodes is hard
- does not integrate well with non-relational sources of data
- does not handle large enough databases, at least not yet

As the list above shows, RDBMS struggles to benefit from the advances in networking and scale-out capabilities, and has a hard time coping with disparate data types and the sheer size and growth of data.

Big Data solutions do not compete with RDBMS; they complement each other. However, new means of doing the actual analytics are being developed so that you can bring results from both sources together.

1.2.2 Who needs Big Data Analytics?

Recent technology trends and the growth of the Internet have generated an immense wave of complex data. As a result, companies that seek a competitive advantage must find effective ways of analysing new sources of information. Which industries have large enough data, or want to gain business insights from data sources, in ways that cannot be achieved with RDBMS? Who has the need to analyse the new sources of information? The usual suspects are Web 2.0, media, biochemical, oil & gas, etc. These industries are mainly related to research, but they are not the ones in focus for the Big Data hype. The focus is instead on markets that are developing a need to improve competitive advantage and responsiveness, for example:

- Telco: prevent churn¹ by taking customers' social relationships into account
- Medical: computer-aided diagnostics
- Bank: risk management
- Retail: optimisation of offerings and discounts

2 Key technologies that enable Big Data

There are a number of technologies and solutions on the market that enable the analysis of humongous amounts of data. However, some key technologies stand out:

- Solid State Disk (SSD): the seek time on hard disk drives is too long.
- Scale-out technologies: ensure economics in Full Stack deployments², covering processing facilities, storage implementations, and management of the systems performing the analytics and storing the data. They also ensure scalability and the ability to parallelise computing.
- New database technologies (NoSQL, StreamSQL, ...)
- MapReduce/Hadoop: the imperative component for storing, accessing and processing the amounts and types of data that are targeted.

3 Current development segments in Big Data Analytics

This section explains the four most important areas of development in Big Data Analytics:

- MapReduce/Hadoop
- Scalable database
- Stream
- Appliance

3.1 MapReduce and Hadoop

MapReduce is a parallel programming model (originating from Google) that is suitable for processing Big Data. MapReduce works as follows, it:

a) splits input data into distributable chunks
b) defines the steps to process those chunks
c) runs that process in parallel on the chunks

Thus MapReduce provides the model for breaking the task of processing a large amount of data into pieces that can be distributed to large numbers of compute nodes, which work on the jobs in parallel. MapReduce allows scaling through adding more compute nodes. This does not imply that all scale-out systems fit this model, since each compute node must have access to all of its data for this to work. Note that MapReduce is not an implementation; it is the blueprint. Apache Hadoop³ is the most common platform that implements MapReduce.

Performance optimisation is one important driver for Big Data, and Hadoop does it with a different twist: moving the software code to the data rather than the data to the software. The design work of Hadoop started from the idea that it is inefficient to move large amounts of data around a cluster; it is much more efficient to deliver the programs to where the data is. This is made possible by HDFS (Hadoop Distributed File System). With MapReduce and HDFS, Hadoop can deliver the programs to where the data is instead of moving the data to the programs.

Hadoop also faces challenges. Hadoop deployments that utilize a traditional infrastructure approach, especially direct-attached storage (DAS), often lead to a number of significant issues. Data stored in Hadoop is typically replicated multiple times and distributed across the Hadoop cluster to optimize performance and reliability.
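The split/define/run steps of the MapReduce model described in section 3.1 can be sketched as a minimal, single-process Python program. This is purely illustrative; the function names are our own, and a real Hadoop job distributes the map and reduce phases across cluster nodes rather than running them in one process.

```python
from collections import defaultdict
from itertools import chain

def map_reduce(chunks, mapper, reducer):
    """Minimal in-process sketch of the MapReduce model:
    (a) input is already split into chunks, (b) a mapper turns each
    chunk into key/value pairs, (c) pairs are grouped by key and
    each group is reduced to a result."""
    # Map phase: on a real cluster, each chunk runs on its own node.
    mapped = chain.from_iterable(mapper(chunk) for chunk in chunks)
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: combine each key's values into a final result.
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1), the reducer sums.
chunks = ["big data is big", "data moves to code"]
counts = map_reduce(
    chunks,
    mapper=lambda text: [(word, 1) for word in text.split()],
    reducer=lambda word, ones: sum(ones),
)
# counts["big"] is 2, counts["data"] is 2.
```

The same word-count job written against real Hadoop would express the mapper and reducer as separate classes and let HDFS and the framework handle chunking, distribution and the shuffle.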
Traditional Hadoop deployments run on a cluster of commodity servers for computation (MapReduce), utilizing DAS for data storage (HDFS), connected together through a network. A single server, referred to as the NameNode, stores all of the metadata for the files stored in HDFS. This traditional approach results in a number of challenges for enterprises:

- Storage utilization of DAS is inefficient, a problem made even worse by HDFS replication. In addition, managing large pools of locally attached disks is complex and expensive.
- The Hadoop data staging and loading processes involving DAS are complex and highly inefficient.

- The NameNode is a single point of failure and thereby introduces significant risk in traditional Hadoop deployments. In addition, data backup and disaster recovery processes are often missed in Hadoop environments using DAS.

Additional challenges include:

- technical complexities of Hadoop implementation and management
- inability to independently increase computing or storage capacity
- complex integrations across multiple open source projects

Vendors have of course developed approaches to address these issues (see section 4).

3.2 Scalable database

The structured database vendors and communities are not going to lie flat, and there is much going on in the structured data communities. Oracle is of course building on their next version, and Teradata has acquired a company to add a SQL-MapReduce product to their portfolio. Other initiatives are NoSQL, MongoDB, Terrastore and SciDB.

3.3 Stream

StreamSQL has been around since 2003 and has the ability to do real-time analytics on data streams. Until now it has only been used in some niche markets: financial services, surveillance and telecommunications network monitoring. The main vendors in this area are IBM (InfoSphere Streams) and StreamBase Systems.

3.4 Appliance

When vendors aim for the enterprise data centre, we usually see a lot of appliances being pushed by the main vendors. Big Data is no exception. Vendors in this field are EMC Greenplum, IBM/Netezza, Oracle and Teradata.
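Returning to the DAS challenges in section 3.1: the storage inefficiency amplified by HDFS replication is easy to quantify. The sketch below is a rough, illustrative sizing calculation; the threefold replication is the standard HDFS default, while the 25% headroom for temporary and shuffle data is our own assumed example, not a figure from this whitepaper.

```python
def hdfs_raw_capacity_needed(logical_tb, replication=3, headroom=0.25):
    """Raw DAS capacity needed to hold `logical_tb` of data in HDFS.

    HDFS stores `replication` full copies of every block (default 3),
    and clusters typically reserve extra headroom for temporary and
    shuffle data. The headroom fraction here is an assumed example.
    """
    return logical_tb * replication * (1 + headroom)

# 100 TB of logical data needs 375 TB of raw disk under these assumptions.
raw_tb = hdfs_raw_capacity_needed(100)
```

This multiplication effect is the reason shared-storage approaches (section 4) advertise better storage efficiency than triple-replicated DAS.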

4 Big Data offerings explained

This section is a brief introduction to the Big Data offerings Proact covers.

4.1 EMC: Isilon and Greenplum

EMC has a different approach to solving the issues of traditional Hadoop deployments than most of the competition: EMC avoids the DAS-related issues by avoiding DAS. They can do this because Isilon integrates natively with the HDFS layer. This integration enables EMC to combine the scale-out capabilities and resource economics of Isilon with the analytics power of Greenplum HD, creating a system that is not as complex to deploy and manage, that uses storage capacity effectively, and where computing power and storage capacity can be scaled independently. Greenplum can be offered both as software only and as an appliance.

EMC has been clever about this architecture, since you can also use Isilon and its HDFS protocol interface separately, integrating any flavour of Hadoop. We expect this to fit most deployments.

4.2 NetApp

NetApp's story⁴ differs from other vendors' stories in that they use the term Big Data in a wider meaning. They do not only focus on the new sources of data and the challenges facing current analysis solutions. NetApp also lets Big Data cover the exploding size and management challenges of file systems and databases; bandwidth-intense use cases such as Full Motion Video, Media Content Management and Seismic Processing; and HPC systems. Analytics solutions for Hadoop deployments are delivered with partners; the two major ones are Cloudera and Hortonworks, which seem to be the two market leaders in this segment. NetApp-based solutions can be described as follows.

Big Analytics: providing efficient analytics for extremely large datasets. Solutions (in partnership with Hortonworks): NetApp Open Solution for Hadoop and NetApp Open Solution for Hadoop Rack.

Big Bandwidth: obtaining better performance for very fast workloads. Solution: E-Series for storage.

Big Content: delivering boundless, secure, scalable data storage. Solutions differ depending on the challenge:

- File services: ONTAP 8
- Enterprise content repositories: ONTAP 8, and especially the Infinite Volume
- Distributed content repositories: StorageGRID software running on virtualized server infrastructure, with E-Series storage systems for storing the data

4.3 VMware

In June 2012 VMware launched a new open-source project, called Serengeti, that aims to let the Hadoop data-processing platform run on the virtualization leader's vSphere hypervisor. The reason was explained in the press release⁵: by decoupling Apache Hadoop nodes from the underlying physical infrastructure, VMware can bring the benefits of cloud infrastructure (rapid deployment, high availability, optimal resource utilization, elasticity, and secure multi-tenancy) to Hadoop.

However, there are challenges when running Hadoop in a virtual environment⁶, and VMware does not pin all their hopes on Serengeti. Serengeti is just one attempt to secure a piece of the Hadoop market. Some other initiatives:

- In February they launched the Spring Hadoop project to help developers write big data applications using the Spring Java Framework.
- In April they bought Big Data start-up Cetas Software, which potentially provides the means to analyse large amounts of data within their virtual or cloud environments.
- In June they presented a reference architecture (along with Hortonworks) for making Hadoop highly available by running the NameNode and JobTracker services on VMs.

5 Turn-key solutions for Hadoop

5.1 Hortonworks⁷

Hortonworks Data Platform (HDP) is a 100% open source data platform based on Apache Hadoop. Hortonworks relies on external vendors for complete deployments. Hortonworks was formed by Yahoo! and Benchmark Capital in June 2011 in order to accelerate the development and adoption of Apache Hadoop.

5.2 Cloudera⁸

Cloudera Enterprise is a turn-key solution for running Hadoop. Cloudera relies on external vendors for complete deployments.

5.3 MapR⁹

The MapR distribution for Apache Hadoop is a turn-key solution for running Hadoop. MapR relies on external vendors for complete deployments.

6 Why use Proact's solutions for your Big Data challenges?

Big Data is really not a fair term. The term aims at astronomically humongous amounts of data that are to be ingested, analysed and reported on in real time. The challenges when doing Big Data business are many, but I have made the overview below.

- Leverage. Challenge: the ability to understand how the information impacts the business, in order to turn it into actions; change. Solution: model information in current operations with potential strategy impact; leverage technology to adapt.
- Store. Challenge: volume doubles every 18 months; more than 80% of the data is unstructured; sources of data are disparate. Solution: a unified information/content storage methodology that enables management of the volume, type and source of information.
- Manage. Challenge: increased complexity driven by change. Solution: tools and services to manage the volume, types and sources of information.
- Analyse. Challenge: current solutions are limited to structured data and are too slow. Solution: buy/build real-time analytical solutions for the new sources of information.
- Use. Challenge: multiple access is needed to serve diverse use cases. Solution: share, collaborate and act on insights anywhere, anytime and on any device.

One important challenge is not listed above, since it is more of a prerequisite for every item but the first: availability of data. The observant reader has already discovered what has been Proact's message and strength for some time now: it is not enough to store data and then manage it; storage, management and availability of data is one unified challenge, for which there are good solutions!
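The Store challenge above, volume doubling every 18 months, compounds quickly. A small illustrative calculation (the function name and the starting figure are hypothetical examples of ours, not figures from this whitepaper):

```python
def projected_volume_tb(initial_tb, months, doubling_period_months=18):
    """Project data volume, assuming it doubles every
    `doubling_period_months` months (the whitepaper's Store challenge)."""
    return initial_tb * 2 ** (months / doubling_period_months)

# 10 TB today grows fourfold in three years (two doubling periods): 40 TB.
in_three_years = projected_volume_tb(10, 36)
```

At that rate a storage methodology sized only for today's volume is exhausted within a single hardware refresh cycle, which is why the Store solution calls for a unified, scalable methodology rather than point fixes.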

NOTES

1 Churn = the rate at which consumers switch service providers for their home TV, phone, Internet and bundled services. Customers' churn habits exhibit patterns of when in their contract life cycle they consider switching providers. In addition, only a handful of motivators (satisfaction with cost, quality of service and customer care) have a major influence on churn.

2 For example the UCS-based solutions that we sell: FlexPod, Vblock, etc. Can also be called Converged Infrastructure.

3 More information can be found at https://hadoop.apache.org/

4 The story is called Big Data ABC, http://www.netapp.com/us/company/leadership/big-data/

5 http://www.vmware.com/company/news/releases/vmw-serengeti-hadoop-06-13-12.html

6 Refer to https://wiki.apache.org/hadoop/virtual%20hadoop for more information on running Hadoop in a virtual environment and using cloud services for Hadoop.

7 http://hortonworks.com/

8 http://www.cloudera.com/

9 http://www.mapr.com/