HOPS: Hadoop Open Platform-as-a-Service

Similar documents
Savanna Hadoop on. OpenStack. Savanna Technical Lead

Amazon Elastic Beanstalk

Hadoop on OpenStack Cloud. Dmitry Mescheryakov Software

TUT5605: Deploying an elastic Hadoop cluster Alejandro Bonilla

One click Hadoop clusters - anywhere

A Complete Open Cloud Storage, Virt, IaaS, PaaS. Dave Neary Open Source and Standards, Red Hat

HP OpenStack & Automation

Cloudera Manager Training: Hands-On Exercises

KT ucloud storage. Two Years of Life with OpenStack Swift / Jaesuk Ahn, Cloud OS Dev. Team, Korea Telecom

Automated Configuration of Open Stack Instances at Boot Time

AVLOR SERVER CLOUD RECOVERY

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

cloud functionality: advantages and Disadvantages

OpenStack Introduction. November 4, 2015

Sahara. Release rc2. OpenStack Foundation

Apache Hadoop. Alexandru Costan

Project Documentation

Hadoop & Spark Using Amazon EMR

Technology and Cost Considerations for Cloud Deployment: Amazon Elastic Compute Cloud (EC2) Case Study

SUSE OpenStack Cloud 4 Private Cloud Platform based on OpenStack. Gábor Nyers Sales gnyers@suse.com

OpenStack IaaS. Rhys Oxenham OSEC.pl BarCamp, Warsaw, Poland November 2013

An Introduction to Cloud Computing Concepts

OpenStack. Orgad Kimchi. Principal Software Engineer. Oracle ISV Engineering. 1 Copyright 2013, Oracle and/or its affiliates. All rights reserved.

How an Open Source Cloud Will Help Keep Your Cloud Strategy Options Open

Installation Runbook for Avni Software Defined Cloud

Big Data Technology Core Hadoop: HDFS-YARN Internals

Introduction to Cloud Computing

Service Catalogue. virtual services, real results

Cloud Computing. Adam Barker

CloudStack and Big Data. Sebastien May 22nd 2013 LinuxTag, Berlin

Cloud Platform Comparison: CloudStack, Eucalyptus, vcloud Director and OpenStack

McAfee Public Cloud Server Security Suite

Introduction to OpenStack

Sistemi Operativi e Reti. Cloud Computing

CLOUDFORMS Open Hybrid Cloud

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Comparing Open Source Private Cloud (IaaS) Platforms

Virtualizing Apache Hadoop. June, 2012

Comparing Ganeti to other Private Cloud Platforms. Lance Albertson

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Deploying Database clusters in the Cloud

A Cost-Evaluation of MapReduce Applications in the Cloud

CloudCenter Full Lifecycle Management. An application-defined approach to deploying and managing applications in any datacenter or cloud environment

HYPER-CONVERGED INFRASTRUCTURE STRATEGIES

APP DEVELOPMENT ON THE CLOUD MADE EASY WITH PAAS

Using The Hortonworks Virtual Sandbox

Enterprise PaaS Evaluation Guide

The Total Newbie s Introduction to Heat Orchestration in OpenStack

Introduction to Hadoop

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Business transformation with Hybrid Cloud

How To Compare Cloud Computing To Cloud Platforms And Cloud Computing

Building Out Your Cloud-Ready Solutions. Clark D. Richey, Jr., Principal Technologist, DoD

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Mark Bennett. Search and the Virtual Machine

Oracle Virtualization Strategy and Roadmap

RED HAT STORAGE SERVER TECHNICAL OVERVIEW

Availability Digest. HPE Helion Private Cloud and Cloud Broker Services February 2016

Proactively Secure Your Cloud Computing Platform

ArcGIS for Server: In the Cloud

2) Xen Hypervisor 3) UEC

Iron Chef: Bare Metal OpenStack

Deploying Hadoop with Manager

Cloudify and OpenStack Heat

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud

How To Use Openstack On Your Laptop

WHITE PAPER: Egenera Cloud Suite

SUSE Cloud 2.0. Pete Chadwick. Douglas Jarvis. Senior Product Manager Product Marketing Manager

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

OpenStack Alberto Molina Coballes

Implementing Microsoft Windows Server Failover Clustering (WSFC) and SQL Server 2012 AlwaysOn Availability Groups in the AWS Cloud

Benchmarking Sahara-based Big-Data-as-a-Service Solutions. Zhidong Yu, Weiting Chen (Intel) Matthew Farrellee (Red Hat) May 2015

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

HDFS Cluster Installation Automation for TupleWare

OpenTOSCA Release v1.1. Contact: Documentation Version: March 11, 2014 Current version:

Scale Cloud Across the Enterprise

Monitor Open stack environments from the bottom up and front to back. Roger Ruttimann VP Engineering, GroundWork OpenSource November 17, 2015

Amazon EC2 Product Details Page 1 of 5

A STUDY OF ADOPTING BIG DATA TO CLOUD COMPUTING

T Mobile Cloud Computing Private Cloud & Assignment

Assignment # 1 (Cloud Computing Security)

Postgres Plus Cloud Database!

Change the Game with HP Helion

EMC ENTERPRISE HYBRID CLOUD 2.5 FEDERATION SOFTWARE- DEFINED DATA CENTER EDITION

Transcription:

HOPS: Hadoop Open Platform-as-a-Service Alberto Lorente, Hamid Afzali, Salman Niazi, Mahmoud Ismail, Kamal Hakimazadeh, Hooman Piero, Jim Dowling jdowling@sics.se Scale Research Laboratory

What is Hadoop? Open-source software for storing enormous data sets across distributed commodity servers and then running distributed analysis applications in parallel over the data. Hadoop 1.x Hadoop 2.x MapReduce (data processing) Others (spark, mpi, giraph, etc) MapReduce (resource mgmt, job scheduler, data processing) HDFS (distributed storage) YARN (resource mgmt, job scheduler) HDFS (distributed storage)

Why Hadoop on Virtualized Infrastructures*? Scheduling and resource utilization - Elastic scalability on demand Datacenter efficiency - Share your computing resources Ease-of-Deployment - Lower barrier to adoption http://wiki.apache.org/hadoop/virtual%20hadoop*

Hadoop on the Cloud Cloud Computing traditionally separates storage and computation. Amazon OpenStack Web Services Nova (Compute) EC2 Swift Elastic (Object Block Storage) Glance (VM S3 Images)

Hadoop is designed around Data Locality

Can Hadoop be deployed in virtual infrastructures?

Yes, But Cloud hardware configurations should support data locality Hadoop s original topology awareness breaks Placement of >1 VM containing block replicas for the same file on the same physical host increases correlated failures Need VMWare s NodeGroup aware topology HADOOP-8468

Data Locality complicates Hadoop Lifecycle Shouldn t upgrade Hadoop clusters by replacing VMs - Assuming we re not using instances with external storage volumes, we need to upgrade software on existing instances Create VM from pre-built image Install/upgrade software Configure and restart services

HOPS and HDFS

HDFS in Hadoop 1.0 SpoF Client read /home/jd/mydna.bam NN mgmt & reports DN DN DN DN

High Availability in HDFS 2.0 ZK ZK ZK Heartbeats FailoverController Active JN JN JN FailoverController Standby NN Active Shared NN log stored in NN quorum of journal nodes NN Standby DN DN DN DN

HDFS in HOPS Up to 48 nodes NDB 72 million 100-btye transactional reads/sec* NN NN NN Master DN DN DN DN *http://mikaelronstrom.blogspot.se/2012/05/mysql-cluster-72-achieves-43bn-reads.html

HOPS HDFS Features Increased read and write throughput Supports TBs of metadata (vs GBs in vanilla HDFS) Easy to customize metadata - Block-level indexing NDB provides online backups, geographic replication, online add-node Major challenge for users: Installing, configuring, and managing HOPS

HOPS has typically four different stacks* ResourceMgr NodeMgr MYSQLD NN DN NDBD MGMD ssh, agent, chef, collectd ssh, agent, chef, collectd ssh, agent, chef, collectd ssh, agent, chef, collectd *Configured stacks include apps, dependencies, and firewalls.

Plus a Dashboard Stack REST APIs Web Application JClouds Glassfish ssh, chef, collectd-server

How do we deploy our PaaS? Data Center NDBD NDBD MGMD DashB NN NN DN DN DN

You could bootstrap the Dashboard, old style..

Cloudera Manager Cloud Express Wizard* Abridged EC2-specific installation instructions* Go to EC2 in AWS web console and select Instances Use the default N. Virginia (us-east-1) region. Click on Launch Instance On the next page, pick the Ubuntu Server 12.04 LTS 64-bit image. select Create a new Key Pair. click Create and Download your key pair. save this file or you won t be able to SSH into the instance we re about to launch. $ wget http://archive.cloudera.com/cm4/installer/latest/clouderamanager-installer.bin $ chmod +x cloudera-manager-installer.bin $ sudo./cloudera-manager-installer.bin *http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/

Oh yes, I forgot it s an Open PaaS By Open PaaS, we mean that we should be able to deploy our PaaS on different platforms with no code changes: - Public Clouds: AWS, Rackspace, etc - Private Clouds: OpenStack, Docker, etc - Bare-Metal No vendor lock-in.

Direct deployment of PaaS on AWS from a Portal Public Cloud (AWS) NDBD NDBD MGMD Website DashB NN NN DN DN DN

Launch via Dashboard when few public IPs available Private Cloud (OpenStack) NDBD NDBD MGMD Website DashB NN NN DN DN DN

How do we install the software on the VMs?

The Dashboard is typically a custom VMI AWS - Provides APIs to automate the creation and loading of custom VM images (VMIs)* DashB OpenStack - Glance is OpenStack s HTTP-based image server (slow due to serialized access)** *http://techblog.netflix.com/2013/03/ami-creation-with-aminator.html **http://blog.gridcentric.com/bid/318277/boosting-openstack-s-parallel-performance

Other nodes are configured using Chef

Chef to Install software on Nodes Dashboard Recipes Chef Recipes are infrastructure as code: idempotent & composable ResourceMgr NodeMgr chef chef chef MYSQLD NN DN MGMD ssh, agent, chef, collectd ssh, agent, chef, collectd ssh, agent, chef, collectd

But Chef and VMIs are not enough...

HOPS Open PaaS HOPS PaaS API (YAML) Chef VMIs JClouds Create VMs BitTorrent Reduce Install Times ssh Amazon Webservices OpenStack Bare Metal

Define the cluster in single file

PaaS Orchestration Language --- name: HopsTest environment: dev recipes: [ssh, chefclient, collectd] authorizeports: [8080, 8181, 3306, 4343, 3321] global scope (all nodes) provider: Define the cloud you will deploy on, and default node types name: aws-ec2 instancetype: m1.large login: ubuntu image: eu-west-1/ami-35667941 region: eu-west-1 zones: [eu-west-1a]

PaaS Orchestration Language nodes: - ndbd number: 2 - mgmd number: 1 recipes: [mysqld] - namenode number: 2 repository: http://myrepo.git - datanode recipes: [nodemgr] number: 3 - dashboard recipes: [resourcemgr] number: 1 Should be the name of a chef recipe Explicitly add chef recipes to this node Use a custom recipe for the NameNode # chefattributes: {..} # authorizeports: [..] # roles: [..] # image:... # bittorrent: true other parameters per node-group

Language Design Decisions Declarative language Chef-friendly default behaviors Configure nodes by adding recipes or roles - Users can fork our default recipes and customize them. - Per node-group parameters supported. Language implementation can be changed to use a different configuration mgmt system, if needed.

Jclouds/Chef in the Wild Node provisioning can fail for a variety of reasons. Stragglers will appear as clusters grow in size. Low default number of parallel operations can be issued to AWS, OpenStack, etc. Benchmark VMs to identify poorly performing ones.

Provisioning Scheduler Respawn & provision failed and slow VMs. Scheduler executes Chef recipes as a series of phases - Chef recipes are decomposed into the following phases: init, bittorrent, install, configure BitTorrent phase Useful for larger software packages and private clouds. Standard package repos used in the event of failure in the install phase.

Related Work Virtualization and Hadoop - Project Serengeti (VMWare) - Project Savanna (Hortonworks & OpenStack) - Elastic MapReduce (Amazon Web Services) Administration of Hadoop Clusters - Cloudera Manager with Puppet - Hortonworks Ambari with Puppet Open PaaS - Cloudify provide a general OpenPaaS middleware based on a groovy DSL, with additional support for chef.

Tack för mig!

VM S LAUNCHING VM LAUNCHING! [XKCD Comic 303]