Technology Insight Paper Big Data Maximizing the Flow By John Webster August 15, 2012 Enabling you to make the best technology decisions
The Case for Big Data Apps

It is commonly believed that Big Data, and the interest among enterprises in comprehensive data analytics platforms such as Hadoop, is a recent phenomenon. However, we believe that the pursuit of applications that deliver new insight from the convergence of multiple data sources has, in fact, been building over the last seven to eight years. In their 2005 book Inescapable Data: Harnessing the Power of Convergence, co-authors Chris Stakutis and John Webster speak of converging new information sources, possibly in real time, by networking existing systems with pervasive digital sensory and wireless technologies including RFID, GPS, CCD, cell, and others.

In researching the book in 2004, they conducted 50 interviews with business and technology leaders from a broad spectrum of industry segments. Their objective at the time was to see, first, whether these business leaders were aware of the potential to network pervasive information sources with existing systems, and second, whether they had plans to move forward with specific projects. Many senior executives not only were aware of and understood the underlying technologies, they were, in some cases, already moving ahead with projects. Some examples:

Healthcare: Human Genome Modeling to Lower Medical Costs

A CEO in healthcare had become aware that, because a patient's genomic makeup could be used as a predictor for certain diseases, correlating it through analytics with therapies tailored to the individual genome could prove more effective in treating those diseases. Two things needed to happen: first, more data was needed to prove the conjecture and develop individualized therapies, and second, the cost of generating a patient's specific genome had to fall below $1,000. Today, the cost of generating a personal genome is less than $1,000, and Personalized Medicine, or Genetically Informed Medicine, is now a reality.
Genomic data can now be used by patients and healthcare providers not only to improve outcomes when patients are sick, but also to make positive lifestyle adjustments, to understand the genetic relationship to a particular disease a patient may have, and to guide further consultation and education.

Retail: Maximizing Revenue with Analytics

The CEO of a large metropolitan retail chain described a project that used wireless digital video cameras to track how shoppers moved around a retail floor. The system's primary application was to track customer movements and count the number of customers as they entered and left a particular sales area. By applying analytics to this data, the CEO could better understand in-store traffic patterns and
optimize product placement and staffing levels throughout the day, and measure the impact of advertising and special promotions. Today, retailers use data from web-based social media, their own transactional databases, and other sources to uncover market trends and customer buying behaviors. As a result, they can improve sales through up-selling and cross-selling. They can also use the information derived from data analysis to optimize floor planning, reduce advertising costs by targeting promotions to the most likely buyers, and make more profitable purchasing decisions.

Public Safety: Real-Time Analytics to Improve Response Times

There were two national political conventions in 2004. One made extensive use of digital video, digital infrared devices, and radar. Security officials located onsite and at remote locations could combine these feeds in real time to identify abnormal and potentially dangerous activity. At the time, this was an advanced example of large-scale surveillance that showed the power of converging data sources in real time for public safety applications based on analytics.

Today, New York City's Real Time Crime Center (RTCC) is a model for demonstrating the effectiveness of this vision. It can access hundreds of millions of NYC criminal complaints, arrests, and 911 call records dating back more than 15 years. It has access to more than five million criminal records and parole files maintained by the State of New York. And it can search the more than 31 million records of crimes committed nationwide. It can then transmit data, including images, to handheld and tablet devices used by officers in the field.

The RTCC's processing capabilities include a data warehouse, a real-time analytics engine, and a Data Wall. The data warehouse collects and converges, via a reconciliation engine, the data from the various data silos (squads, precincts, motor vehicle records, etc.) that exist within the department and statewide.
The data analytics engine can run queries against this data as well as correlate the results with satellite images, for example, to display query results on a map and identify nearby landmarks. Real-time analytics capabilities include the ability to map the origin of 911 calls as they come in, and an event notification system that alerts officers to criminal activity as it unfolds.

Big Data Apps Now

When researching Inescapable Data, the authors found other examples in business, government, agriculture, and entertainment. However, they also found that, while there was demand to leverage a new multiplicity of data sources, systems were not yet ready and available to deliver on these visions. Traditional data warehousing systems could not stand up to the challenge, and
Big Data Maximizing the Flow 4 Google and Yahoo were just beginning to apply distributed computing and massively parallel processing techniques to this problem. In 2012, the needed systems (Hadoop, MySQL clusters, StreamSQL processing, etc.) are here and now. CEOs and CIOs are realizing their visions and building competitive differentiation as outlined above. Today, the Web has become an immense source of data that can be leveraged in real-time to gain a competitive business advantage. And, as immense as it is, it is only one of a number of data sources that can be leveraged for business advantage. Additional sources include the traditional information silos where data is buried in spreadsheets and departmental applications, and emerging technologies like smart phones and wireless sensors. At this moment, businesses are using real-time analytics processes powered by a growing range of data sources to create stronger and more profitable relationships with their customers and harvest information they could only dream of a few years ago. They are part of the emerging Big Data movement. Maximizing Data Flow It should be obvious from the above application examples that moving data and large volumes of it has become critical to the success of systems that underlie data analytics processes. However, data movement in this context has multi-dimensional aspects: 1. Data has to get from the many sources to the storage devices within the analytics system in order for the convergence to occur 2. Data has to flow between storage and compute layer during the analytics process 3. Results (again in the form of data) have to be delivered to information users 4. Source data and the results of analytics processes are likely to be shared with other systems The prerequisite sources for Big Data apps are many and varied (structured, unstructured, Web, machine, and mobile). In fact, these sources are more likely to live outside the corporate data center. 
EGI research shows that between 50 and 80 percent of the data needed will have to be moved into and stored within the data center. A robust network infrastructure will be required to:

- Get large volumes of data into and out of the data center
- Move data around within the data center to the apps that will power information consumers
- Be responsive to the needs of real-time information users when real-time information availability is required
Moving the Elephant through the Pipes

Network infrastructure for Big Data apps should be bandwidth-capable, adaptable, and cost-efficient in order to handle the impact of ingesting large volumes of data and delivering it to Big Data analytics systems. In particular, the administrators of distributed computing-based analytics systems such as Hadoop have to be critically aware of internal system network performance, because the impacts of network performance, both positive and negative, are experienced at the level of analytics application users. Network-related factors impacting analytics system performance can be summed up in one word: bandwidth.

I/O Bandwidth

The rate at which data flows between storage and processors within a Hadoop cluster has a direct effect on cluster performance and scalability. Indeed, the cluster's design parameters and underlying architectural assumptions relating to query response times are gated by a number of factors, one of which is I/O performance between storage and servers within the cluster [1]. This is true for Hadoop and other types of analytics clusters running SQL variants (MySQL, NoSQL, NewSQL, etc.).

Internal Cluster Network Bandwidth

Hadoop clusters are generally built using open source software and commodity server hardware interconnected by commodity 1Gb Ethernet (1GbE) networking gear. The general objective of the early users of Hadoop, who were supporting web-facing Big Data applications, was to do as much processing as possible with a minimum of capital resources. However, as Hadoop clusters add nodes to support more data and/or more users, the performance of the internal network at 1GbE invariably becomes a bottleneck beyond a certain scale point. The internal Hadoop cluster network not only has to handle internal cluster communications and data transfers, it also has to distribute multiple copies of data (usually three) among cluster nodes.
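To illustrate why replication multiplies internal traffic, the following back-of-envelope sketch (the figures are illustrative assumptions, not measurements from this paper) estimates the inter-node traffic generated by ingest under HDFS-style three-way replication, assuming the first replica is written locally and each additional replica crosses the internal network once:

```python
def cluster_network_traffic_tb(ingest_tb, replicas=3):
    """Rough estimate of internal cluster traffic caused by ingest.

    Assumes the first replica lands on the receiving node's local disk,
    so only the remaining (replicas - 1) copies traverse the network.
    Ignores shuffle traffic, re-replication after failures, and protocol
    overhead -- all of which add to the real total.
    """
    return ingest_tb * (replicas - 1)

# Ingesting 10 TB with the default replication factor of three pushes
# roughly 20 TB across the internal cluster network -- twice the volume
# that arrived from outside.
print(cluster_network_traffic_tb(10))  # 20
```

Even under these simplified assumptions, every terabyte accepted from outside the cluster generates a multiple of itself on the internal network, which is why internal bandwidth becomes the bottleneck as the cluster scales.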
A phrase often used to characterize the Hadoop cluster's hardware architecture is "cheap and deep." As this phrase implies, cluster administrators generally add more nodes when more performance is needed. Commodity servers are relatively cheap, and simply adding them can yield some gain in performance, depending on a number of factors. However, growing the cluster in this way can introduce other cost-related issues:

[1] Readers interested in this issue are encouraged to investigate the CAP theorem, which deals with the limitations of distributed computing clusters and what system architects must consider when trying to overcome them.
- Staff time is devoted to manually balancing and rebalancing the workload across the cluster
- Adding more nodes increases the probability that more staff time will be spent recovering from node failures that could result in cluster failure
- As enterprises become more dependent on Hadoop-based applications to identify income-generating opportunities, increasing instances of cluster failure will have a negative impact on revenue

Another way to manage cluster scaling is to increase internal network bandwidth when more performance is required. As noted, internal cluster networks typically use 1GbE. However, Emulex has shown recently that using internal networks based on 10Gb Ethernet (10GbE) can allow a Hadoop cluster to handle increased demand for throughput without adding cluster nodes (see figures 1 and 2 below).

Figure 1
Figure 2

The above graphics demonstrate that throughput can be scaled upward by increasing network bandwidth from 1GbE to 10GbE while holding the size of the cluster constant, avoiding the need to add Hadoop cluster nodes.

External Network Bandwidth

Getting data into and out of Hadoop clusters is also a major concern for large Hadoop users, particularly service providers offering analytics as a service to clients who are not yet willing to build and support their own clusters. External network bandwidth also limits how quickly query results are delivered to users, impacting those needing real-time or near-real-time query response. Again, 10GbE network connectivity can be used to reduce the time needed to move data into and out of the cluster. Emulex has also measured the performance of 1GbE vs. 10GbE networking when performing data ingest operations (see figure 3 below).
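The scale of the 1GbE-to-10GbE difference for ingest can be sketched with simple arithmetic. The data volume and efficiency factor below are illustrative assumptions, not Emulex's measured results:

```python
def transfer_hours(data_tb, link_gbps, efficiency=0.7):
    """Back-of-envelope wall-clock time to move data over a single link.

    `efficiency` is an assumed fraction of line rate actually achieved
    after protocol and framing overhead -- a rough placeholder, since
    real throughput depends on the workload and the network stack.
    """
    bits = data_tb * 1e12 * 8               # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# Moving 10 TB of source data into the cluster:
print(f"1GbE:  {transfer_hours(10, 1):.1f} h")    # ~31.7 h
print(f"10GbE: {transfer_hours(10, 10):.1f} h")   # ~3.2 h
```

Under these assumptions, an ingest window that consumes most of a working day over 1GbE shrinks to a few hours over 10GbE, which is the kind of reduction that matters to service providers loading client data on a schedule.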
Figure 3

Conclusion - Sustainable Scalability and Why I/O is Strategic

We believe that a pent-up demand for the tangible business benefits of Big Data analytics, including Hadoop, now exists within the enterprise: demand that has been building at the levels of CEO and CIO as well as among business line managers and marketing/sales managers. The Big Data phenomenon is not a case of a technology looking for a problem. The technology is now available, and the demand will be unleashed.

As a result, analytics applications based on new distributed computing clusters as well as revamped data warehouses will grow in the amount of data they ingest, the size of the supporting infrastructure, and their business criticality. To meet these demands, clusters will be scaled in multiple dimensions. More nodes will be added for more processing power and storage. Faster processors will also be adopted as they become more ubiquitous. Both of these factors, and others, will place increasing stress on internal and external networks. Better network performance will yield lower costs relative to the size of analytics clusters, along with greater productivity. Therefore, networking infrastructure cannot remain static as the rest of the infrastructure scales upward. It must be allowed to grow in bandwidth capacity to meet the service level expectations of the business users who will come to depend on analytics applications.
### About Evaluator Group

Evaluator Group Inc. is dedicated to helping IT professionals and vendors create and implement strategies that make the most of the value of their storage and digital information. Evaluator Group services deliver in-depth, unbiased analysis on storage architectures, infrastructures and management for IT professionals. Since 1997 Evaluator Group has provided services for thousands of end users and vendor professionals through product and market evaluations, competitive analysis and education. www.evaluatorgroup.com Follow us on Twitter @evaluator_group

Copyright 2012 Evaluator Group, Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or stored in a database or retrieval system for any purpose without the express written consent of Evaluator Group Inc. The information contained in this document is subject to change without notice. Evaluator Group assumes no responsibility for errors or omissions. Evaluator Group makes no expressed or implied warranties in this document relating to the use or operation of the products described herein. In no event shall Evaluator Group be liable for any indirect, special, consequential or incidental damages arising out of or associated with any aspect of this publication, even if advised of the possibility of such damages. The Evaluator Series is a trademark of Evaluator Group, Inc. All other trademarks are the property of their respective companies.