Big Data: Maximizing the Flow
Technology Insight Paper
By John Webster
August 15, 2012


The Case for Big Data Apps

It is commonly believed that Big Data, and the interest among enterprises in comprehensive data analytics platforms such as Hadoop, is a recent phenomenon. However, we believe that the pursuit of applications that deliver new insight from the convergence of multiple data sources has in fact been building over the last seven to eight years. In their 2005 book Inescapable Data: Harnessing the Power of Convergence, Chris Stakutis and John Webster speak of converging new information sources, possibly in real time, by networking existing systems with pervasive digital sensory and wireless technologies including RFID, GPS, CCD, cellular, and others. While researching the book in 2004, they conducted 50 interviews with business and technology leaders from a broad spectrum of industry segments. Their objective at the time was to see whether these business leaders were, first, aware of the potential to network pervasive information sources with existing systems and, second, whether they had plans to move forward with specific projects. Many senior executives not only were aware of and understood the underlying technologies; in some cases they were already moving ahead with projects. Some examples:

Healthcare: Human Genome Modeling to Lower Medical Costs

A CEO in healthcare had become aware that, because a patient's genomic makeup can be used as a predictor for certain diseases, correlating it through analytics with therapies tailored to the individual genome could prove more effective in treating those diseases. Two things needed to happen: first, more data was needed to prove the conjecture and develop individualized therapies, and second, the cost of generating a patient's specific genome had to fall below $1,000. Today, the cost of generating a personal genome is less than $1,000, and Personalized Medicine, or Genetically Informed Medicine, is now a reality. Genomic data can now be used by patients and healthcare providers not only to improve outcomes when patients are sick, but also to make positive adjustments in lifestyle, to inform patients of the genetic relationship to a particular disease they may have, and to suggest further consultation and education.

Retail: Maximizing Revenue with Analytics

The CEO of a large metropolitan retail chain described a project that used wireless digital video cameras to track how shoppers moved around a retail floor. The system's primary application was to track customer movements and count the number of customers as they entered and left a particular sales area. By applying analytics to this data, the CEO could better understand in-store traffic patterns, optimize product placement and staffing levels throughout the day, and measure the impact of advertising and special promotions.

Today, retailers use data from web-based social media, their own transactional databases, and other sources to uncover market trends and customer buying behaviors. As a result, they can improve sales through up-selling and cross-selling. They can also use the information derived from data analysis to optimize floor planning, reduce advertising costs by targeting promotions to the most likely buyers, and make more profitable purchasing decisions.

Public Safety: Real-Time Analytics to Improve Response Times

There were two national political conventions in 2004. One made extensive use of digital video, digital infrared devices, and radar. Security officials located onsite and at remote locations could combine these feeds in real time to identify abnormal and potentially dangerous activity. At the time, this was an advanced example of large-scale surveillance that showed the power of converging data sources in real time for public safety applications based on analytics.

Today, New York City's Real Time Crime Center (RTCC) is a model for demonstrating the effectiveness of this vision. It can access hundreds of millions of NYC criminal complaints, arrests, and 911 call records dating back more than 15 years. It has access to more than five million criminal records and parole files maintained by the State of New York. And it can search the more than 31 million records of crimes committed nationwide. It can then transmit data, including images, to handheld and tablet devices used by officers in the field. The RTCC's processing capabilities include a data warehouse, a real-time analytics engine, and a Data Wall. The data warehouse collects and converges, via a reconciliation engine, the data from the various data silos (squads, precincts, motor vehicle records, etc.) that exist within the department and statewide.
The data analytics engine can run queries against this data and correlate the results with satellite images, for example, to display query results on a map and identify nearby landmarks. Real-time analytics capabilities include the ability to map the origin of 911 calls as they come in and an event notification system that alerts officers to criminal activity as it unfolds.

Big Data Apps Now

When researching Inescapable Data, the authors found other examples in business, government, agriculture, and entertainment. However, they also found that, while there was demand to combine and leverage a new multiplicity of data sources, systems were not yet ready and available to deliver on these visions. Traditional data warehousing systems could not stand up to the challenge, and Google and Yahoo were just beginning to apply distributed computing and massively parallel processing techniques to the problem. In 2012, the needed systems (Hadoop, MySQL clusters, StreamSQL processing, etc.) are here now, and CEOs and CIOs are realizing their visions and building competitive differentiation as outlined above.

Today, the Web has become an immense source of data that can be leveraged in real time to gain a competitive business advantage. And, as immense as it is, it is only one of a number of data sources that can be leveraged for business advantage. Additional sources include the traditional information silos where data is buried in spreadsheets and departmental applications, and emerging technologies such as smartphones and wireless sensors. At this moment, businesses are using real-time analytics processes powered by a growing range of data sources to create stronger and more profitable relationships with their customers and to harvest information they could only dream of a few years ago. They are part of the emerging Big Data movement.

Maximizing Data Flow

It should be obvious from the application examples above that moving data, and large volumes of it, has become critical to the success of the systems that underlie data analytics processes. However, data movement in this context is multi-dimensional:

1. Data has to get from the many sources to the storage devices within the analytics system in order for the convergence to occur.
2. Data has to flow between the storage and compute layers during the analytics process.
3. Results (again in the form of data) have to be delivered to information users.
4. Source data and the results of analytics processes are likely to be shared with other systems.

The prerequisite sources for Big Data apps are many and varied (structured, unstructured, Web, machine, and mobile). In fact, these sources are more likely to live outside the corporate data center. EGI research shows that between 50 and 80 percent of the data needed will have to be moved into and stored within the data center. A robust network infrastructure will be required to:

- Get large volumes of data into and out of the data center
- Move data around within the data center to the apps that serve information consumers
- Be responsive to the needs of real-time information users when real-time information availability is required
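As a rough illustration of why ingest bandwidth matters, the transfer time for a given data volume can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, not a measurement: the data volume, link speeds, and the assumed 70 percent payload efficiency are all hypothetical.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate hours to move data_tb terabytes over a link of link_gbps
    gigabits per second, assuming a given fraction of the raw line rate
    is usable for payload (protocol overhead, contention, etc.)."""
    bits = data_tb * 1e12 * 8                     # terabytes -> bits
    effective_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    return bits / effective_bps / 3600

# Hypothetical example: moving 10 TB of source data into the data center.
print(f"1GbE : {transfer_hours(10, 1):.1f} h")    # ~31.7 hours
print(f"10GbE: {transfer_hours(10, 10):.1f} h")   # ~3.2 hours
```

Under this model, transfer time scales inversely with link speed, which is why the choice between 1GbE and 10GbE dominates any realistic ingest schedule.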

Moving the Elephant through the Pipes

Network infrastructure for Big Data apps should be bandwidth-capable, adaptable, and cost-efficient in order to handle the impact of ingesting large volumes of data and delivering it to Big Data analytics systems. In particular, the administrators of distributed-computing analytics systems such as Hadoop have to be critically aware of internal system network performance, because the impacts of network performance, both positive and negative, are experienced at the level of analytics application users. Network-related factors impacting analytics system performance can be summed up in one word: bandwidth.

I/O Bandwidth

The rate at which data flows between storage and processors within a Hadoop cluster has a direct effect on cluster performance and scalability. Indeed, the cluster's design parameters and underlying architectural assumptions relating to query response times are gated by a number of factors, one of which is I/O performance between storage and servers within the cluster.[1] This is true for Hadoop and for other types of analytics clusters running SQL variants (MySQL, NoSQL, NewSQL, etc.).

Internal Cluster Network Bandwidth

Hadoop clusters are generally built using open source software and commodity server hardware interconnected by commodity 1Gb Ethernet (1GbE) networking gear. The general objective of the early users of Hadoop, who were supporting web-facing Big Data applications, was to do as much processing as possible with a minimum of capital resources. However, as Hadoop clusters add nodes to support more data and/or more users, the performance of the internal network at 1GbE invariably becomes a bottleneck beyond a certain scale point. The internal Hadoop cluster network not only has to handle internal cluster communications and data transfers; it also has to distribute multiple copies of data (usually three) among cluster nodes.
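The replication overhead just described can be sketched numerically. The sketch below assumes HDFS-style behavior in which the first replica is written on the ingesting node and only the remaining copies cross the network; the replication factor of 3 mirrors the "usually three" copies noted above, and the specific byte counts are illustrative only.

```python
def replication_network_bytes(ingest_bytes: int, replication: int = 3,
                              first_copy_local: bool = True) -> int:
    """Bytes that cross the internal cluster network when ingest_bytes
    are written with the given replication factor. If the first replica
    lands on the writing node, only the remaining copies are shipped."""
    remote_copies = replication - 1 if first_copy_local else replication
    return ingest_bytes * remote_copies

one_tb = 10**12
print(replication_network_bytes(one_tb))                          # 2 TB on the wire
print(replication_network_bytes(one_tb, first_copy_local=False))  # 3 TB on the wire
```

In other words, every terabyte ingested generates roughly two to three terabytes of internal network traffic before any analytics work begins, which is one reason internal bandwidth becomes a bottleneck as the cluster grows.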
A phrase often used to characterize the Hadoop cluster's hardware architecture is "cheap and deep." As the phrase implies, cluster administrators generally add more nodes when more performance is needed. Commodity servers are relatively cheap, and simply adding them can yield some gain in performance, depending on a number of factors. However, growing the cluster in this way can introduce other cost-related issues:

[1] Readers interested in this issue are encouraged to investigate the CAP theorem, which deals with the limitations of distributed computing clusters and what system architects must consider when trying to overcome them.

- Staff time is devoted to manually balancing and rebalancing the workload across the cluster.
- Adding more nodes increases the probability that more staff time will be spent recovering from node failures, which can lead to cluster failure.
- As enterprises become more dependent on Hadoop-based applications to identify income-generating opportunities, increasing instances of cluster failure will have a negative impact on revenue.

Another way to manage cluster scaling is to increase internal network bandwidth when more performance is required. As noted, internal cluster networks typically use 1GbE. However, Emulex has recently shown that internal networks based on 10Gb Ethernet (10GbE) can allow a Hadoop cluster to handle increased demand for throughput without adding cluster nodes (see Figures 1 and 2 below).

Figure 1

Figure 2

The graphics above demonstrate that throughput can be scaled upward by increasing network bandwidth from 1GbE to 10GbE while holding the size of the cluster constant, avoiding the need to add Hadoop cluster nodes.

External Network Bandwidth

Getting data into and out of Hadoop clusters is also a major concern for large Hadoop users, particularly service providers offering analytics as a service to clients who are not yet willing to build and support their own clusters. External network bandwidth also limits how quickly query results are delivered to users, which matters most for those needing real-time or near-real-time query response. Again, 10GbE network connectivity can be used to reduce the time needed to move data into and out of the cluster. Emulex has also measured the performance of 1GbE versus 10GbE networking when performing data ingest operations (see Figure 3 below).
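The trade-off between adding nodes and adding bandwidth can be put in simple terms with a toy model in which ingest is purely network-bound. Real Hadoop throughput also depends on disks, CPUs, and the workload itself, and the node counts, data volume, and 70 percent link efficiency here are hypothetical.

```python
def ingest_minutes(data_gb: float, nodes: int, link_gbps: float,
                   efficiency: float = 0.7) -> float:
    """Minutes to ingest data_gb gigabytes when each node contributes one
    network link and ingest is purely network-bound (toy model)."""
    total_bps = nodes * link_gbps * 1e9 * efficiency  # aggregate usable bandwidth
    return data_gb * 8e9 / total_bps / 60             # gigabytes -> bits -> minutes

# Hypothetical 500 GB ingest: the same 10-node cluster at 1GbE vs 10GbE.
print(f"10 nodes @ 1GbE : {ingest_minutes(500, 10, 1):.1f} min")   # ~9.5 min
print(f"10 nodes @ 10GbE: {ingest_minutes(500, 10, 10):.1f} min")  # ~1.0 min
```

In this model, 10 nodes on 10GbE deliver the same aggregate ingest bandwidth as 100 nodes on 1GbE, which is the essence of scaling by bandwidth rather than by node count.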

Figure 3

Conclusion: Sustainable Scalability and Why I/O Is Strategic

We believe that pent-up demand for the tangible business benefits of Big Data analytics, including Hadoop, now exists within the enterprise, demand that has been building at the level of CEOs and CIOs as well as business line managers and marketing and sales managers. The Big Data phenomenon is not a case of a solution looking for a problem: the technology is now available, and the demand will be unleashed. As a result, analytics applications based on new distributed computing clusters, as well as revamped data warehouses, will grow in the amount of data they ingest, the size of their supporting infrastructure, and their business criticality.

To meet these demands, clusters will be scaled in multiple dimensions. More nodes will be added for more processing power and storage. Faster processors will also be adopted as they become more ubiquitous. Both of these factors, and others, will place increasing stress on internal and external networks. Better network performance will yield lower costs relative to the size of analytics clusters, along with greater productivity. Therefore, networking infrastructure cannot remain static as the rest of the infrastructure scales upward. It must be allowed to grow in bandwidth capacity to meet the service-level expectations of the business users who will come to depend on analytics applications.

About Evaluator Group

Evaluator Group Inc. is dedicated to helping IT professionals and vendors create and implement strategies that make the most of the value of their storage and digital information. Evaluator Group services deliver in-depth, unbiased analysis of storage architectures, infrastructures, and management for IT professionals. Since 1997 Evaluator Group has provided services for thousands of end users and vendor professionals through product and market evaluations, competitive analysis, and education. www.evaluatorgroup.com. Follow us on Twitter @evaluator_group.

Copyright 2012 Evaluator Group, Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or stored in a database or retrieval system for any purpose without the express written consent of Evaluator Group Inc. The information contained in this document is subject to change without notice. Evaluator Group assumes no responsibility for errors or omissions. Evaluator Group makes no expressed or implied warranties in this document relating to the use or operation of the products described herein. In no event shall Evaluator Group be liable for any indirect, special, consequential, or incidental damages arising out of or associated with any aspect of this publication, even if advised of the possibility of such damages. The Evaluator Series is a trademark of Evaluator Group, Inc. All other trademarks are the property of their respective companies.