Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Similar documents
Dell Reference Configuration for Hortonworks Data Platform

HiBench Introduction. Carson Wang Software & Services Group

HiBench Installation. Sunil Raiyani, Jayam Modi

Red Hat Network Satellite Management and automation of your Red Hat Enterprise Linux environment

Red Hat Satellite Management and automation of your Red Hat Enterprise Linux environment

Cost-Effective Business Intelligence with Red Hat and Open Source

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Understanding Hadoop Performance on Lustre

Implement Hadoop jobs to extract business value from large and varied data sets

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms

Deploying Hadoop on SUSE Linux Enterprise Server

Red Hat enterprise virtualization 3.0 feature comparison

CSE-E5430 Scalable Cloud Computing Lecture 2

Linux Performance Optimizations for Big Data Environments

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Scaling the Deployment of Multiple Hadoop Workloads on a Virtualized Infrastructure

RED HAT ENTERPRISE LINUX 7

Introduction. Various user groups requiring Hadoop, each with its own diverse needs, include:

Best Practices for Monitoring Databases on VMware. Dean Richards Senior DBA, Confio Software

WebLogic on Oracle Database Appliance: Combining High Availability and Simplicity

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved. Insert Information Protection Policy Classification from Slide 7

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

Benchmarking Sahara-based Big-Data-as-a-Service Solutions. Zhidong Yu, Weiting Chen (Intel) Matthew Farrellee (Red Hat) May 2015

Increasing Hadoop Performance with SanDisk Solid State Drives (SSDs)

How To Run Apa Hadoop 1.0 On Vsphere Tmt On A Hyperconverged Network On A Virtualized Cluster On A Vspplace Tmter (Vmware) Vspheon Tm (

Virtualization Performance on SGI UV 2000 using Red Hat Enterprise Linux 6.3 KVM

RED HAT ENTERPRISE VIRTUALIZATION FOR SERVERS: COMPETITIVE FEATURES

<Insert Picture Here> Introducing Oracle VM: Oracle s Virtualization Product Strategy

Automating Big Data Benchmarking for Different Architectures with ALOJA

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

Intel Distribution for Apache Hadoop* Software: Optimization and Tuning Guide

A Study of Data Management Technology for Handling Big Data

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Business, Big Data, Industrialized Workload

Performance brief for IBM WebSphere Application Server 7.0 with VMware ESX 4.0 on HP ProLiant DL380 G6 server

Duke University

Chapter 7. Using Hadoop Cluster and MapReduce

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

RED HAT ENTERPRISE VIRTUALIZATION PERFORMANCE: SPECVIRT BENCHMARK

BIG DATA TECHNOLOGY ON RED HAT ENTERPRISE LINUX: OPENJDK VS. ORACLE JDK

Skyscape Cloud Services Deploys Hadoop in the Cloud on VMware vsphere

BIG DATA USING HADOOP

Dell Reference Configuration for DataStax Enterprise powered by Apache Cassandra

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers

Hadoop on OpenStack Cloud. Dmitry Mescheryakov Software

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment

CloudSpeed SATA SSDs Support Faster Hadoop Performance and TCO Savings

MapReduce Evaluator: User Guide

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Managing your Red Hat Enterprise Linux guests with RHN Satellite

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

RED HAT ENTERPRISE VIRTUALIZATION FOR SERVERS: PRICING & LICENSING GUIDE

System and Storage Virtualization For ios (AS/400) Environment

QuickSpecs. HP Integrity Virtual Machines (Integrity VM) Overview. Currently shipping versions:

Experiences with Lustre* and Hadoop*

GraySort on Apache Spark by Databricks

Develop a process for applying updates to systems, including verifying properties of the update. Create File Systems

HPSA Agent Characterization

Cloud Computing through Virtualization and HPC technologies

A Brief Introduction to Apache Tez

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

Platfora Big Data Analytics

Hadoop Architecture. Part 1

HDP Hadoop From concept to deployment.

CloudRank-D:A Benchmark Suite for Private Cloud Systems

INTRODUCTION TO CLOUD MANAGEMENT

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

An Oracle White Paper June Consolidating Oracle Siebel CRM Environments with High Availability on Sun SPARC Enterprise Servers

Configuring and Managing a Private Cloud with Enterprise Manager 12c

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

FOR SERVERS 2.2: FEATURE matrix

Java Garbage Collection Characteristics and Tuning Guidelines for Apache Hadoop TeraSort Workload

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Big Data. Value, use cases and architectures. Petar Torre Lead Architect Service Provider Group. Dubrovnik, Croatia, South East Europe May, 2013

HPCHadoop: MapReduce on Cray X-series

Business white paper. HP Process Automation. Version 7.0. Server performance

Oracle Big Data SQL Technical Update

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms

Red Hat Enterprise Linux is open, scalable, and flexible

Week Overview. Installing Linux Linux on your Desktop Virtualization Basic Linux system administration

Hadoop and Map-Reduce. Swati Gore

Installing Hadoop over Ceph, Using High Performance Networking

Deploying Fusion Middleware in a 100% Virtual Environment Using OVM ANDY WEAVER FISHBOWL SOLUTIONS, INC.

Virtualizing Apache Hadoop. June, 2012

IOS110. Virtualization 5/27/2014 1

ORACLE OPS CENTER: PROVISIONING AND PATCH AUTOMATION PACK

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Extending Hadoop beyond MapReduce

IOmark- VDI. HP HP ConvergedSystem 242- HC StoreVirtual Test Report: VDI- HC b Test Report Date: 27, April

SQL Server Virtualization 101. David Klee, Group Principal and Practice Lead. SQL PASS Virtualization VC,

A Performance Analysis of Distributed Indexing using Terrier

<Insert Picture Here> Private Cloud with Fusion Middleware

Big Data Analytics OverOnline Transactional Data Set

IRON Big Data Appliance Platform for Hadoop

Cloud Operating Systems for Servers

Transcription:

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager, Hortonworks

Agenda Introduction to Red Hat and Hortonworks Alliance Trusted open source technologies Architecting an enterprise Big Data solution Physical system deployments Setting up the infrastructure and configuring the data platform Exploring virtualization and consolidation use cases Installation validation and performance assessment Summary 2

A Deepened Strategic Alliance 3 Page 3

Infrastructure overview Red Hat Enterprise Linux 6 and OpenJDK 7

Trusted Enterprise Platform: RHEL and OpenJDK Red Hat Enterprise Linux Trusted by 90% of FORTUNE 500 companies According to a recent user survey* RHEL is used in: Infrastructure (70% of respondents) Big Data analytics, BI and data visualization (28% of respondents) Big Data processing: Hadoop, MapReduce (21% of respondents) * Red Hat survey of 187 users of Red Hat Enterprise Linux www.techvalidate.com/product-research/red-hat-enterprise-linux Note: this is a multiple-choice question response percentages may not add up to 100. 5

Trusted Enterprise Platform: RHEL and OpenJDK OpenJDK Open source implementation of the Java SE specification Red Hat has a leadership role in the OpenJDK project World Record* SPECjbb2013 -Composite result for CriticalJOPS running on Red Hat Enterprise Linux Same or better performance compared to Oracle JDK** Hadoop infrastructure (Sort and Terasort) Machine Learning (Bayesian classification and k-means) * As of April 3, 2014. SPEC and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation. For more information about SPEC and it's benchmarks see www.spec.org ** BIG DATA TECHNOLOGY ON RED HAT ENTERPRISE LINUX: OPENJDK VS. ORACLE JDK report 6

Hadoop overview Hortonworks Data Platform (HDP) 2.0

Hortonworks Data Platform 2.0 (HDP 2.0) 8

HDP 2.0 What s changed 9

Architecting an enterprise Big Data solution Physical system deployments

Physical and Logical Configurations Two master nodes Four data nodes 2 x Intel Xeon X5670; 96GB RAM; 6 HDDs 2 x Intel Xeon X5670; 48GB RAM; 8 HDDs 11

Setup and Configuration Workflow summary 12

Setup and Configuration OS Installation Use PXE, Kickstart or physical media Generally, defaults are acceptable to start Post-install Synchronize clocks to the same NTP server Configure the file systems on each server Disable unnecessary services (cups, autofs, postfix etc., etc., etc.) Install the latest version of OpenJDK 7 and make it the default Install tuned and set it to enterprise-storage profile 13

Setup and Configuration Hortonworks Data Platform 2.0 installation Set up Repositories Ambari repo HDP Stack repo HDP Utils repo Options to Deploy Ambari Automated and GUI driven Scripted RPM: Script driven 14 Configure Modify default XML configs Optimize for Hardware, cluster layout and workloads

HDFS Configuration Best Practices For Master nodes, utilize redundant hardware For Slave nodes, use JBOD and commodity hardware Separate Operation System partition for logs Use Rack Awareness for fault tolerance and performance across racks 15

YARN Configuration Best Practices Configure for optimized allocation of distributed resources across multi-workload processing. Max. Node RAM allocation Min. Container RAM allocation 16

YARN Configuration Best Practices Property Name yarn.scheduler.minimum-allocationmb Description Smallest Container Allowed in MB All Containers must be an multiple of the minimum container size ie- 1024 allows for 1024, 2048, 3072, 4096, etc. yarn.scheduler.maximum-allocation- Largest Container Allowed. A Multiple of the minimummb allocation-mb above Depending on your setup you may want to allow the entire node for MR, or restrict it to smaller then a node to prevent potential malicious actions. mapreduce.map.memory.mb mapreduce.map.java.opts The size of the container for the Mapper task The java opts for the Mapper J, make sure that the max heap is less than the size of the container. mapreduce.reduce.memory.mb mapreduce.reduce.java.opts The size of the container for the Reducer task The java opts for the Reducer J, make sure that the max heap is less than the size of the container. 17

Architecting an enterprise Big Data solution Exploring virtualization and consolidation use cases

Flexible Deployments Virtualization (as a trivial case of consolidation) Easy transition to the updated hardware or the cloud Power savings and simplified datacenter management Well-defined s are easily provisioned and destroyed Consolidation Higher utilization of server resources, low overhead with K More than adequate performance for several workload types Hadoop infrastructure (Sort) and Machine Learning (Naive Bayes) 19

Virtualized Setup and Configuration Workflow summary 20

Physical and Logical Configurations Virtualization Two master nodes 2 x Intel Xeon X5670; 96GB RAM; 6 HDDs Four virtualized data nodes (guests) 24 vcpus; 44GB vram; 1 x 20GB system disk; 6 x 300GB data disks Four physical servers 2 x Intel Xeon X5670; 48GB RAM; 8 HDDs 21

Setup and Configuration Virtualization OS Installation on Hypervisor Use PXE, Kickstart or physical media Post-install Install K: Virtualization, Platform and Client Synchronize clocks to the same NTP server Disable unnecessary services (cups, autofs, postfix etc etc etc) Install tuned and set it to virtual-host profile Configure disk storage for virtual guests Configure virtual networking 22

Setup and Configuration Virtualization OS Installation on Guest Use PXE, Kickstart or physical media Post-install Synchronize clocks to the same NTP server Disable unnecessary services (cups, autofs, postfix etc etc etc) Install tuned and set it to virtual-guest profile Install the latest version of OpenJDK 7 and make it default Clone this guest as many times as needed Move images to the respective hypervisor and attached disks Start each guest and configure their networking and file system 23

Physical and Logical Configurations Yan Consolidation Two master nodes: 2 x Intel Xeon X5670; 96GB RAM; 6 HDDs Four virtualized data nodes (guests): 12 vcpus; 48GB vram; 1 x 20GB system disk; 8 x 73GB data disks - One physical server: 2 x Intel Xeon E5-2697 v2; 384GB RAM; 24 HDDs 24

Setup and Configuration Consolidation OS installation on Hypervisor Follow the same installation instructions as before Follow same instructions for post-install, plus Configure data disks for guests to use OS installation on Guests Follow the same installation instructions as before Divide the total number of cores on the system equally among all guests and create first Clone this guest as many times as needed 25

Best Practices Rack Awareness for Virtualized Infrastructures Data Center Rack1 Rack2 NodeG 1 Host1 NodeG 2 Host2 Host3 NodeG 3 Host5 Host4 NodeG 4 Host6 Host7 Host8 26

Post-installation validation Physical and virtual

Post-installation Validation Use several common workloads from Intel Hibench suite Select tests with most balanced CPU, IO, and network profiles Observe OpenJDK performance and compare to Oracle JDK Test Hadoop infrastructure Sort - micro benchmark performing sort operations, a critical feature of many MapReduce jobs Terasort - is a Big Data version of Sort. It sorts 10 billion 100-byte records produced by the TeraGen generator program 28

Post-installation Validation Test real-world workloads (Machine learning) Naïve Bayesian Classification - machine learning and classification implementation; used for finding patterns and assigning data sets to classes K-Means - implement highly used and well-understood KMeans clustering algorithm operating on a large set of randomly generated numerical multidimensional vectors with specific statistical distributions 29

Post-installation validation Findings Virtualization Consolidation Time in seconds (lower is better) Bare-metal Low-level Data Manipulation: Sort Data Classification: Naïve Bayes Data Manipulation: TeraSort 30 Data Clustering: K-Means

Post-installation Validation Findings OpenJDK Oracle JDK Time in seconds (lower is better) 2500 2000 1500 1000 500 0 Sort TeraSort Bayes classifier 31 K-means

Summary The Hortonworks Data Platform is a completely open source production-ready distribution of Apache Hadoop Result prove that Red Hat Enterprise Linux with OpenJDK provide a solid foundation for enterprise deployments of HDP OpenJDK performs as well as Oracle JDK when running HDP This infrastructure works well in both physical and virtual domains This powerful infrastructure platform for Hadoop deployments can support your organization s needs today and well into the future 32

Resources Get the Hortonworks Data Platform Exploring the next generation of Big Data solutions with Hadoop 2 Big Data Technology on Red Hat Enterprise Linux: OpenJDK vs Oracle JDK

Questions? Contact: Yan Fisher yfisher@redhat.com Rohit Bakhshi rohit@hortonworks.com