Benchmarking Sahara-based Big-Data-as-a-Service Solutions. Zhidong Yu, Weiting Chen (Intel) Matthew Farrellee (Red Hat) May 2015



Similar documents
Big Data Analytics on Object Storage -- Hadoop over Ceph* Object Storage with SSD Cache

Hadoop on OpenStack Cloud. Dmitry Mescheryakov Software

Savanna Hadoop on. OpenStack. Savanna Technical Lead

Addressing Storage Management Challenges using Open Source SDS Controller

Using SUSE Cloud to Orchestrate Multiple Hypervisors and Storage at ADP

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Hadoop & Spark Using Amazon EMR

Building Storage as a Service with OpenStack. Greg Elkinbard Senior Technical Director

RED HAT STORAGE PORTFOLIO OVERVIEW

Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk

Big Data Trends and HDFS Evolution

Modernizing Servers and Software

Red Hat Enterprise Linux OpenStack Platform 7 OpenStack Data Processing

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

SUSE Cloud 2.0. Pete Chadwick. Douglas Jarvis. Senior Product Manager Product Marketing Manager

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

TUT5605: Deploying an elastic Hadoop cluster Alejandro Bonilla

Openstack. Cloud computing with Openstack. Saverio Proto

OpenStack Manila Shared File Services for the Cloud

Ubuntu OpenStack on VMware vsphere: A reference architecture for deploying OpenStack while limiting changes to existing infrastructure

A Brief Introduction to Apache Tez

Big Fast Data Hadoop acceleration with Flash. June 2013

THE FUTURE OF STORAGE IS SOFTWARE DEFINED. Jasper Geraerts Business Manager Storage Benelux/Red Hat

One-click Hadoop Cluster Deployment on OpenPOWER Systems Pradeep K Surisetty IBM. #OpenPOWERSummit

Dell Reference Configuration for Hortonworks Data Platform

Red Hat Enterprise Linux OpenStack Platform Update February 17, 2016

Experiences with Lustre* and Hadoop*

Intel Service Assurance Administrator. Product Overview

Big Data Performance Growth on the Rise

The Virtualization Practice

High Performance Computing OpenStack Options. September 22, 2015

Protecting Big Data Data Protection Solutions for the Business Data Lake

Scalable Architecture on Amazon AWS Cloud

Mark Bennett. Search and the Virtual Machine

Accelerating Enterprise Big Data Success. Tim Stevens, VP of Business and Corporate Development Cloudera

Next-Gen Big Data Analytics using the Spark stack

An Intro to OpenStack. Ian Lawson Senior Solution Architect, Red Hat

Fast Lane OpenStack Overview Red Hat Enterprise Linux OpenStack Platform

version 7.0 Planning Guide

Introduction. Various user groups requiring Hadoop, each with its own diverse needs, include:

HP OpenStack & Automation

Moving From Hadoop to Spark

Modern Application Architecture for the Enterprise

A Complete Open Cloud Storage, Virt, IaaS, PaaS. Dave Neary Open Source and Standards, Red Hat

HPC ON WALL ST OPENSTACK AND BIG DATA. Brent Holden Chief Field Architect, Eastern US April 2014

owncloud Architecture Overview

Ubuntu OpenStack Fundamentals Training

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Best Practices for Increasing Ceph Performance with SSD

Storage Architectures for Big Data in the Cloud

The path to the cloud training

Mambo Running Analytics on Enterprise Storage

A STUDY OF ADOPTING BIG DATA TO CLOUD COMPUTING

Cisco UCS CPA Workflows

Building an AWS-Compatible Hybrid Cloud with OpenStack

Cloud Computing Architecture with OpenNebula HPC Cloud Use Cases

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Modern App Architecture for the Enterprise Delivering agility, portability and control with Docker Containers as a Service (CaaS)

Use of Hadoop File System for Nuclear Physics Analyses in STAR

Virtualizing Apache Hadoop. June, 2012

Automating Big Data Benchmarking for Different Architectures with ALOJA

Boas Betzler. Planet. Globally Distributed IaaS Platform Examples AWS and SoftLayer. November 9, IBM Corporation

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

SMB in the Cloud David Disseldorp

With Red Hat Enterprise Virtualization, you can: Take advantage of existing people skills and investments

Clodoaldo Barrera Chief Technical Strategist IBM System Storage. Making a successful transition to Software Defined Storage

Cloud computing - Architecting in the cloud

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

Business transformation with Hybrid Cloud

Cray XC30 Hadoop Platform Jonathan (Bill) Sparks Howard Pritchard Martha Dumler

Introduction to Cloud Computing

Enabling High performance Big Data platform with RDMA

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Platfora Big Data Analytics

Data movement for globally deployed Big Data Hadoop architectures

Developing High-Performance, Scalable, cost effective storage solutions with Intel Cloud Edition Lustre* and Amazon Web Services

Déployer son propre cloud avec OpenStack. GULL François Deppierraz

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab

Lenovo ThinkServer Solution For Apache Hadoop: Cloudera Installation Guide

OpenStack Manila File Storage Bob Callaway, PhD Cloud Solutions Group,

Wojciech Furmankiewicz Senior Solution Architect Red Hat CEE

Successfully Deploying Alternative Storage Architectures for Hadoop Gus Horn Iyer Venkatesan NetApp

Data Security in Hadoop

Big Data Too Big To Ignore

Getting Started with Database As a Service on OpenStack

State of the Art Cloud Infrastructure

SUSE Linux uutuudet - kuulumiset SUSECon:sta

Building & Optimizing Enterprise-class Hadoop with Open Architectures Prem Jain NetApp

In Memory Accelerator for MongoDB

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Bright Cluster Manager

Building Docker Cloud Services with Virtuozzo

Red Hat Enterprise Linux OpenStack Platform. Rhys Oxenham Principal Product Manager, OpenStack

Transcription:

Benchmarking Sahara-based Big-Data-as-a-Service Solutions Zhidong Yu, Weiting Chen (Intel) Matthew Farrellee (Red Hat) May 2015

Agenda o Why Sahara o Sahara introduction o Deployment considerations o Performance testing and results o Future envisioning o Summary and Call to Action 2

Why Sahara: Cloud features o You or someone at your company is using AWS, Azure, or Google o You re probably doing it for easy access to OS instances, but also the modern application features, e.g. AWS EMR or RDS or Storage o [expecting anyone to choose openstack infra for their workloads means providing app level services, e.g. Sahara & Trove] o [app writers apps are complex enough without having to manage the supporting infra. examples outside cloud in mobile (feedhenry, parse, kinvey)] 3

Why Sahara: Data analysis o [all this true for database provisioning, and that s a known quantity] o [all this especially true for data processing, which many developers are only recently (compared to RDBMS) integrating into their applications] o [data processing workflow, show huge etl effort, typical workflow does not even take into account infra to run it] o [flow into key features of sahara] 4

Agenda o Why Sahara o Sahara introduction o Deployment considerations o Performance testing and results o Future envisioning o Summary and Call to Action 5

Sahara features o Repeatable cluster provisioning and management operations o Data processing workflows (EDP) o Cluster scaling (elasticity), Storage integration (Swift, Cinder, HCFS) o Network and security group (firewall) integration o Service anti-affinity (fault domains & efficiency) 6

Sahara architecture 7

Sahara plugins o Users get choice of integrated data progressing engines o Vendors get a way to integrate with OpenStack and access users o Upstream - Apache Hadoop (Vanilla), Hortonworks, Cloudera, MapR, Apache Spark, Apache Storm o Downstream - depends on your OpenStack vendor 8

Agenda o Why Sahara o Sahara introduction o Deployment considerations o Performance testing and results o Future envisioning o Summary and Call to Action 9

o o Storage Architecture Tenant provisioned (in ) o in the same s of computing tasks vs. in the different s o Ephemeral disk vs. Cinder volume Admin provided o Logically disaggregated from computing tasks o Physical collocation is a matter of deployment o For network remote storage, Neutron DVR is very useful feature #1 Computing Task #2 #3 #4 Computing Task Computing Task Computing Task Computing Task o 10 A disaggregated (and centralized) storage system has significant values o No data silos, more business opportunities o Could leverage Manila service o Allow to create advanced solutions (.e.g. in-memory overlayer) o More vendor specific optimization opportunities Legacy NFS GlusterFS Ceph External Scenario #1: computing and data service collocate in the s Scenario #2: data service locates in the host world Scenario #3: data service locates in a separate world Scenario #4: data service locates in the remote network Swift

Compute Engine Pros Best support in OpenStack Strong security Cons Slow to provision Relatively high runtime performance overhead Container Bare-Metal o Light-weight, fast provisioning Better runtime performance than Best performance and QoS Best security isolation Container seems to be promising but still need better support Nova-docker readiness Cinder volume support is not ready yet Weaker security than Not the ideal way to use container Ironic readiness Worst efficiency (e.g. consolidation of workloads with different behaviors) Worst flexibility (e.g. migration) Worst elasticity due to slow provisioning o Determining the appropriate cluster size is always a challenge to tenants o e.g. small flavor with more nodes or large flavor with less nodes 11

Data Processing API o o o Direct Cluster Operations o Sahara is used as a provisioning engine o Tenants expect to have direct access to the virtual cluster o e.g. directly SSH into the s o May use whatever APIs comes with the distro o e.g. Oozie EDP approach o Sahara s EDP is designed to be an abstraction layer for tenants to consume the services o Ideally should be vendor neutral and plugin agnostic o Limited job types are supported at present 3rd party abstraction APIs o Not supported yet o e.g. Cask CDAP

Deployment Considerations Matrix Data Processing API Legacy EDP (Sahara native) 3rd party APIs Distro/Plugin Vanilla Spark Storm CDH HDP MapR Compute Container Bare-metal Performance results in the next section Storage Tenant vs. Admin provisioned Disaggregated vs. Collocated vs. other options

Agenda o Why Sahara o Sahara introduction o Deployment considerations o Performance testing and results o Future envisioning o Summary and Call to Action 14

Testing Environment OS: CentOS7 Guest OS: CentOS7 Hadoop 2.6.0 4-Nodes Cluster Baremetal OpenStack using K qemu-kvm v1.5 OpenStack using Docker Docker v1.6 15

Ephemeral Disk Performance.. Lower is better over RAID5 brings extra 10% performance overhead vs. Nova Inst. Store.. RAID TO BE UPDATED Two factors bring??% overhead access pattern change accounts for 10% virtualization overhead accounts for??%

Collocated Performance.. TO BE Replaced by real data o To be added.

Swift Performance Swift.. Swift TO BE Replaced by real data o To be added.

Bare-metal vs. Container vs. Higher is better TO BE UPDATED Docker provide similar disk write result with K ~15% performance loss for both K and Docker Docker use less resources than K

Agenda o Why Sahara o Sahara introduction o Deployment considerations o Performance testing and results o Future envisioning o Summary and Call to Action 20

Future of Sahara [To be removed] This page is for what s in our vision. Don t have to be consensus of the community or already planned in roadmap. o Architectured for disaggregated computing and storage o Supporting more storage backend o Integration with Manila o Better support for container and bare-metal (Nova-docker and Ironic) o Murano as an alternative? o EDP as a PaaS like layer for Sahara core provisioning engine o Data connector abstraction o Binary/job management o Workflow management o Policy engine for resource and SLA management o Auto-scale, auto-tune o Sahara needs to be open to all the vendors in the big data ecosystem o A complete big data stack may have many options at each layer o e.g. acceleration libraries, analytics, developer oriented application framework (e.g. CDAP) o Requires a more generic plugin/driver framework to support it 21

Summary and Call-to-Action [To be removed] Zhidong s draft o Great improvement in Sahara Kilo release. Production ready with real customer deployments. o A complete Big-Data-as-a-Service solution requires more considerations than simply adding a Sahara service to the existing OpenStack deployment o Preliminary benchmark results show. o Many features could be added to enhance Sahara. Opportunities exist for various types of vendors. Join in the Sahara community and make it even more vibrant! 22

BACKUP

DD Testing Result Test Case with Multiple Disks Container with Multiple Disks in RAID5 in RAID5 Container in RAID5 Throughput 100 MB/s x 4 = 400 MB/s 90 MB/s x 4 = 360 MB/s 320 MB/s 270 MB/s 270 MB/s Multiple Disks Configuration: 1 x 1TB SATA HDD for System, 4 x 1TB SATA HDDs for Data, RAID 5 Configuration: 5 x 1TB SATA HDD dd command: dd if=/dev/zero of=/mnt/test1 bs=1m count=8192 (4096, 8192, 16384, 24576) conv=fdatasync

Sort IO Profiling 2-phases in Sort running period for disk write Shuffle Map-Reduce Data -> Use temp folder to store intermediate data(40%total throughput) Write Output -> Write(60%total throughput) Shuffling data using temp folder Disk IO Peak Write output to /External Storage

Storage Suggestion in Computing Node Dedicate storage disks to spread disk io for better performance A system disk for operating system Several data disks for tmp folder and Comput-ing ing Task Task System Disk Data Disk Data Disk For operating system, it can be used to allocate boot, root, and swap partition in this disk. RAID is also available in this disk for better failover. For intermediate data, it is used to assign a volume for intermediate data in mapreduce configuration. For, it is used to process. It can also be replaced with any kind of external storage like swift.

Storage Strategy in Sahara #2 Computing Task Computing Task #3 #4 Computing Task #1 #3 Computing Task Computing Task Swift Transient Cluster Use external storage(swift, External ) to persist data. Pros persist data in external storage terminate computing node after finishing the jobs Cons lost performance from external storage Long Run Cluster Use ephemeral storage/cinder storage for better performance. Pros better performance using internal storage Cons may still need backup from external storage

DVR enhance network performance o Additional external network in computing node o Using iperf and 1Gb in internal and external network o DVR provide better performance from Instance to 941Mb Instance Instance Instance 941Mb 941Mb Instance w/o DVR:941Mb w/ DVR: 14Gb Instance communicates with different hosts. The scenario usually uses in control path from host to instances. Instances communicate between hosts. The scenario usually uses in Internal data communication. Instance communicates with the same host. The scenario usually uses in control path from host to instances. Put datanode in the host with data locality may bring a new concept for data persistent Instance 14Gb Instance Two instances in the same host.internal data communication.

Disclaimer Intel technologies features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at [intel.com]. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Include as footnote in appropriate slide where performance data is shown: o o Configurations: [describe config + what test used + who did testing] For more information go to http://www.intel.com/performance. Intel, the Intel logo, {List the Intel trademarks in your document} are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 2015 Intel Corporation. 29