OnX Big Data Reference Architecture




Knowledge is Power when it comes to Business Strategy

The business landscape of decision-making is converging during a period in which:

> Data is considered by most to be an organization's most important asset.
> Data volume is increasing at a remarkable rate, with total volume expected to reach 35 zettabytes by 2020, 44 times the total volume of data handled in 2009.
> Unstructured and non-traditional data sources, such as customer behavior in social media networks and data from connected machines, are contributing the majority of this data growth.

Business Analytics, the science of turning this data into usable information, is playing a major role in supporting business decisions. In successful organizations, analytics are no longer controlled by technical teams. Rather, business users are driving analytics, with technical teams focusing primarily on timely data delivery and the underlying infrastructure to support the analytics.

The Concept of Big Data

There is no standard definition for Big Data. Typically, an organization considers a data volume "Big" if it is bigger than anything that has historically been managed within that organization. Data is exploding for every industry, business and customer, and traditional infrastructure and data management architecture cannot meet the demands of today's data growth. In addition, most incoming data arrives in an unstructured format from many kinds of sources, including customer behavior on the web, social networks, geographic location, syndicated data, government data and machine data. Organizations must get a handle on managing the data volume, and on understanding and leveraging all the data to make informed, timely business decisions, or risk falling behind.

Big Data Characteristics

Big Data can be characterized by four V's: Volume, Variety, Velocity and Veracity.

> Volume: As discussed earlier, data volume is growing at an exponential rate and will continue to do so. Technology must be in place to enable business users to make use of this surge in data volume.
> Variety: Also discussed earlier, data no longer arrives solely from internal applications with well-known, rigid structures. It is now generated from a variety of sources, including unstructured ones, and business users must make sense of it to enhance their decision making. There are two aspects of Variety to consider: syntax and semantics. In the past, these determined the extent to which data could be reliably Extracted, Transformed, and Loaded (ETL) into a database and then analyzed. While modern ETL tools are very capable of dealing with data that arrives in virtually any syntax, they are less capable of dealing with semantically rich data, such as free text. Because of this, most organizations have been restricted to analyzing a narrower range of data. The ability of Big Data technology to include data of all syntaxes and semantics is perhaps one of its major appeals.
> Velocity: Data increasingly has to be received and acted upon in real time. Delays in execution will inevitably limit the effectiveness of campaigns, limit interventions or lead to sub-optimal processes. For example, a discount offer to a customer based on their web browsing at a public terminal will not be successful once they log out of that terminal.
> Veracity: Big Data is largely unstructured and arrives from various sources, some of questionable reliability. If you can't trust the data, you have a veracity problem. Data needs to be clean and intact in order to be leveraged accurately. Sometimes it is important to supplement the incoming data with knowledge gathered through other sources, to get a complete and accurate understanding of the data rather than depending on incoming data alone.

Big Data Technology and Infrastructure

Big Data: The Rebel with a (MapReduce) Cause

The Big Data architecture doesn't follow some of the basic principles of data architecture and data management. One of the fundamental principles of data architecture is to understand how data will be used before it is captured. With Big Data technologies, business users are often interested in capturing data at the point of occurrence, with the expectation that its utility will be understood at a future point in time.

Data is also no longer brought to a single processing engine and held in one place. Big Data technology breaks the entire workload down into small, manageable chunks and spreads them across multiple servers, with the result sets accumulated afterwards to provide an answer. This operation is called Hadoop MapReduce: the first step maps a job across all servers, and a later step reduces the result sets to the desired number of targets.

Hadoop provides a significant improvement in Velocity (time-to-market) and Veracity (reliability) compared to traditional data architecture. Data is typically replicated three times across the platform, so the potential for data loss is minimal.

The Hadoop Alley-Oop: Cost Benefits

Big Data technologies provide a very appealing cost advantage over traditional data infrastructure. Hadoop, the open-source software framework for large-scale data processing, costs merely hundreds of dollars per TB of data, compared to thousands of dollars per TB in traditional storage architecture. Because its processing power is spread across nodes that each handle smaller chunks of data on less costly commodity platforms, Hadoop provides exceptional value to both business and IT.
Storage cost comparison (what $1M buys):

Storage type     Cost per gigabyte   Capacity for $1M   IOPS         Throughput
SAN Storage      $2-$10              0.5 petabytes      200,000      1 GB/sec
SAN Storage      $1-$5               1 petabyte         400,000      2 GB/sec
Local Storage    $0.05               20 petabytes       10,000,000   800 GB/sec
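The map and reduce steps described in this section can be illustrated with a small, single-process sketch. This is not Hadoop code and does not use any Hadoop API. It is a plain Python imitation of the pattern, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical. Each input string stands in for a chunk of data that a real cluster would hold on a different server.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently; on a cluster this runs in parallel.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    chunks = ["big data big value", "data drives value", "big clusters"]
    print(word_count(chunks))
```

Because every chunk is mapped without reference to the others, the map work can be spread over as many servers as there are chunks. Only the grouped intermediate results need to be brought together for the reduce step.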

Reference Architecture

OnX has existing partnerships with most major names in the industry and has built and delivered Hadoop platforms on various infrastructure architectures, including IBM, HP and Oracle, among others. With the growing number of vendors in the Big Data space and increasing industry acceptance of the Hadoop platform, it was important for OnX to focus on a Big Data reference architecture.

The OnX Big Data reference architecture is built on Cisco UCS servers, with StackIQ as the cluster manager and the MapR Hadoop distribution. Using this reference architecture, OnX implemented a Hadoop cluster for a major financial customer, supporting its customer loyalty program and product recommendations. The cluster was scaled up to over 1,200 servers to meet the customer's requirements, and the customer saved millions of dollars in infrastructure costs through the implementation.

Reference Architecture Details

The OnX Big Data Reference Architecture has been configured to support two types of business requirements: High Performance and High Storage Capacity.

Hadoop Distribution with MapR

The Apache Hadoop solution delivered by MapR Technologies introduces a new way of handling big data. Unlike traditional databases that store structured data, Hadoop enables distribution and analysis of both structured and unstructured data on a single data infrastructure.

MapR has the strongest foundation of the available Hadoop distributions. Although it supports Apache Hadoop standards and the underlying technology, it has significant performance advantages over other distributions. The fundamental difference of the MapR file system is shown below.

[Figure: storage stacks compared, top to bottom. Other distributions: HBase, JVM, DFS, JVM, ext3, disks. MapR M5 Edition: HBase, JVM, MapR, disks. MapR unified edition: unified layer, disks.]

MapR has now unified tables and files into a single data platform, so there is no separate HBase infrastructure. Eliminating the redundant components makes the environment much simpler to manage. A uniform data management layer across files and tables provides a consistent data protection layer. Additionally, recovery from node failures takes seconds, there is 100% data locality, and HBase can read directly from snapshots. Furthermore, files and tables share the same namespace, volumes and directories.

The underlying clusters are managed by StackIQ, which is based on the Rocks open-source cluster management software. It provides a fully integrated Big Data platform that is easy to manage. The StackIQ Hadoop provisioning tool installs and configures all the required software and services, taking bare metal to a working Hadoop cluster. The installation can be completed in minutes, even for large Hadoop deployments involving hundreds of nodes. The cluster management software provides a GUI for managing clusters, including adding and removing nodes, upgrading individual nodes, and applying patches to the entire cluster or to individual nodes.

[Figure: The Complete Big Data Management Platform. Components shown: MapReduce, MapR-FS, monitoring, network management, disk management, operating system and UCS Manager, across the configure, deploy, manage and scale phases.]

Summary

The reference architecture enables organizations to implement a starter package for Big Data, with a highly flexible architecture that can later be scaled to meet an organization's data and analytics needs. Both the High Performance and High Capacity setups can be architected with fewer servers, as small or medium packages with 4 and 8 nodes respectively, and scaled up to the full 16-node configuration later when there is demand for additional storage and/or performance.

Configuration      Small                  Medium                 Large
High Performance   4 nodes                8 nodes                16 nodes
                   UCS C240 (2.9 GHz)     UCS C240 (2.9 GHz)     UCS C240 (2.9 GHz)
                   21 TB per node         21 TB per node         21 TB per node
                   84 TB total capacity   168 TB total capacity  336 TB total capacity
High Capacity      4 nodes                8 nodes                16 nodes
                   UCS C240 (2.6 GHz)     UCS C240 (2.6 GHz)     UCS C240 (2.6 GHz)
                   36 TB per node         36 TB per node         36 TB per node
                   144 TB total capacity  288 TB total capacity  576 TB total capacity

Ready to learn more? Contact your local OnX Account Executive, or call 1-866-906-4669. www.onx.com
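The total capacities in the summary follow from node count multiplied by per-node storage. The short sketch below reproduces that arithmetic. The names (`PROFILES`, `SIZES`, `total_capacity_tb`, `usable_capacity_tb`) are hypothetical. The usable-capacity estimate rests on two assumptions the source does not confirm: that the listed figures are raw capacity, and that the three-way replication mentioned earlier applies to them.

```python
# Per-node storage in TB for each configuration, taken from the summary.
PROFILES = {"High Performance": 21, "High Capacity": 36}

# Node counts for each package size, taken from the summary.
SIZES = {"Small": 4, "Medium": 8, "Large": 16}


def total_capacity_tb(nodes, tb_per_node):
    """Total cluster capacity: number of nodes times storage per node."""
    return nodes * tb_per_node


def usable_capacity_tb(raw_tb, replication_factor=3):
    """Rough usable capacity, ASSUMING raw_tb is raw storage and every
    block is stored replication_factor times (an assumption, see above)."""
    return raw_tb / replication_factor


def sizing_table():
    """Map each (configuration, size) pair to its total capacity in TB."""
    return {
        (profile, size): total_capacity_tb(nodes, tb_per_node)
        for profile, tb_per_node in PROFILES.items()
        for size, nodes in SIZES.items()
    }


if __name__ == "__main__":
    for (profile, size), total in sizing_table().items():
        print(f"{profile:17} {size:7} {total:4} TB total, "
              f"~{usable_capacity_tb(total):.0f} TB usable at 3x replication")
```

Under those assumptions, the large High Capacity package would hold roughly a third of its 576 TB as unique data. Actual usable space depends on the replication settings and overheads of the deployed cluster.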