Chef Patterns at Bloomberg Scale
Hadoop Infrastructure Team
https://github.com/bloomberg/chef-bach
Freenode: #chef-bach




BLOOMBERG CLUSTERS
- Application specific: Hadoop, Kafka
- Environment specific: networking, storage
- Built regularly
- Dedicated bootstrap server (a virtual machine)
- Dedicated Chef server

WHY A VM?
- Lightweight prerequisite: low memory/storage requirements
- Rapid deployment: Vagrant for bring-up, Vagrant for re-configuration
- Easy release management
- Multiple VMs per hypervisor: multiple clusters
- Easy relocation

SERVICES OFFERED
- Repositories: APT, Ruby gems, static files (Chef!)
- Chef server
- Kerberos KDC
- PXE server: DHCP/TFTP server, Cobbler (https://github.com/bloomberg/cobbler-cookbook)
- Bridged networking (for test VMs)
- Strong isolation

BUILDING BOOTSTRAP
- Chef and Vagrant: generic image (Jenkins)
- Network configuration
- Correcting knife.rb
- Chef server reconfiguration
- Clean-up (Chef REST API)
- Convert bootstrap to be an admin client: secrets/keys

BUILDING BOOTSTRAP
Chef-solo provisioner:

    # Chef provisioning
    bootstrap.vm.provision "chef_solo" do |chef|
      chef.environments_path = [[:vm, ""]]
      chef.environment = env_name
      chef.cookbooks_path = [[:vm, ""]]
      chef.roles_path = [[:vm, ""]]
      chef.add_recipe("bcpc::bootstrap_network")
      chef.log_level = "debug"
      chef.verbose_logging = true
      chef.provisioning_path = "/home/vagrant/chef-bcpc/"
    end

Chef server reconfiguration (Nginx, Solr, RabbitMQ):

    # Reconfigure chef-server
    bootstrap.vm.provision :shell, :inline => "chef-server-ctl reconfigure"

BUILDING BOOTSTRAP
Clean up (REST API):

    ruby_block "cleanup-old-environment-databag" do
      block do
        rest = Chef::REST.new(node[:chef_client][:server_url], "admin", \
                              "/etc/chef-server/admin.pem")
        rest.delete("/environments/generic")
        rest.delete("/data/configs/generic")
      end
    end

    ruby_block "cleanup-old-clients" do
      block do
        system_clients = ["chef-validator", "chef-webui"]
        rest = Chef::REST.new(node[:chef_client][:server_url], "admin", \
                              "/etc/chef-server/admin.pem")
        rest.get_rest("/clients").each do |client|
          if !system_clients.include?(client.first)
            rest.delete("/clients/#{client.first}")
          end
        end
      end
    end

BUILDING BOOTSTRAP
Convert to admin (bootstrap_config.rb):

    ruby_block "convert-bootstrap-to-admin" do
      block do
        rest = Chef::REST.new(node[:chef_client][:server_url], "admin",
                              "/etc/chef-server/admin.pem")
        rest.put_rest("/clients/#{node[:hostname]}", { :admin => true })
        rest.put_rest("/nodes/#{node[:hostname]}",
                      { :name => node[:hostname],
                        :run_list => ['role[bcpc-bootstrap]'] })
      end
    end

CLUSTER USABILITY
- Code deployment
- Application cookbooks
- Ruby gems: Zookeeper, WebHDFS
- Clusters are not a single machine: which machine to deploy to; idempotency; races

DEPLOY TO HDFS
- Use the Chef directory resource
- Use a custom provider: https://github.com/bloomberg/chef-bach/blob/master/cookbooks/bcpc-hadoop/libraries/hdfs_directory.rb

    directory "/projects/myapp" do
      mode 0755
      owner "foo"
      recursive true
      provider BCPC::HdfsDirectory
    end

DEPLOY KAFKA TOPIC
- Use an LWRP: dynamic topic; right Zookeeper
- Provider code available at https://github.com/mthssdrbrg/kafka-cookbook/pull/49

    # Kafka topic resource
    actions :create, :update
    attribute :name, :kind_of => String, :name_attribute => true
    attribute :partitions, :kind_of => Integer, :default => 1
    attribute :replication, :kind_of => Integer, :default => 1
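A provider behind a resource like this ultimately has to drive Kafka's own tooling. As a rough illustration (not the actual provider from the pull request above), the sketch below builds the `kafka-topics.sh` invocation a `:create` action might shell out to; the helper name and the Zookeeper address are invented for this example.

```ruby
# Hypothetical helper: assemble the kafka-topics.sh command line a
# :create action could run. Flag names follow the stock Kafka CLI.
def create_topic_command(name, partitions: 1, replication: 1,
                         zookeeper: "localhost:2181")
  ["kafka-topics.sh",
   "--create",
   "--zookeeper #{zookeeper}",
   "--topic #{name}",
   "--partitions #{partitions}",
   "--replication-factor #{replication}"].join(" ")
end

cmd = create_topic_command("events", partitions: 6, replication: 3,
                           zookeeper: "zk1.example.com:2181")
puts cmd
```

Defaulting partitions and replication to 1 mirrors the resource attributes on the slide, so a bare `kafka_topic "events"` would still produce a valid command.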

KERBEROS
- Keytabs: per service/host; up to 10 keytabs per host
- What about multi-homed hosts? Hadoop imputes _HOST
- Providers: WebHDFS uses SPNEGO
- System role accounts
- Tenant role accounts
- Available at https://github.com/bloomberg/chef-bach/tree/kerberos

LOGIC INJECTION
Statutory warning: code snippets are edited to fit the slides, which may have resulted in logic incoherence, bugs and unreadability. Reader's discretion requested.
Complete code can be found at:
- Community cookbook: https://github.com/mthssdrbrg/kafka-cookbook#controlling-restart-of-kafka-brokers-in-a-cluster
- Wrapper custom recipe: https://github.com/bloomberg/chef-bach/blob/rolling_restart/cookbooks/kafka-bcpc/recipes/coordinate.rb

LOGIC INJECTION
- We use community cookbooks: they take care of the standard install, enable and start of services
- We need to add logic to cookbook recipes:
  - Take action on a service only when conditions are satisfied
  - Take action on a service based on a dependent service's state

LOGIC INJECTION
Vanilla community cookbook:

    template ::File.join(node.kafka.config_dir, 'server.properties') do
      source 'server.properties.erb'
      ...
      helpers(Kafka::Configuration)
      if restart_on_configuration_change?
        notifies :restart, 'service[kafka]', :delayed
      end
    end

    service 'kafka' do
      provider kafka_init_opts[:provider]
      supports start: true, stop: true, restart: true, status: true
      action kafka_service_actions
    end

LOGIC INJECTION
Vanilla community cookbook:

    template ::File.join(node.kafka.config_dir, 'server.properties') do
      source 'server.properties.erb'
      ...
      helpers(Kafka::Configuration)
      if restart_on_configuration_change?
        notifies :restart, 'service[kafka]', :delayed
      end
    end

    #----- Remove -----#
    service 'kafka' do
      provider kafka_init_opts[:provider]
      supports start: true, stop: true, restart: true, status: true
      action kafka_service_actions
    end
    #----- Remove -----#

LOGIC INJECTION
Vanilla community cookbook 2.0:

    template ::File.join(node.kafka.config_dir, 'server.properties') do
      source 'server.properties.erb'
      ...
      helpers(Kafka::Configuration)
      if restart_on_configuration_change?
        notifies :create, 'ruby_block[pre-shim]', :immediately
      end
    end

    #----- Replace -----#
    include_recipe node["kafka"]["start_coordination"]["recipe"]
    #----- Replace -----#

LOGIC INJECTION
Cookbook coordinator recipe:

    ruby_block 'pre-shim' do
      # pre-restart no-op
      notifies :restart, 'service[kafka]', :delayed
    end

    service 'kafka' do
      provider kafka_init_opts[:provider]
      supports start: true, stop: true, restart: true, status: true
      action kafka_service_actions
    end

LOGIC INJECTION
Wrapper coordinator recipe:

    ruby_block 'pre-shim' do
      # pre-restart done here
      notifies :restart, 'service[kafka]', :delayed
    end

    service 'kafka' do
      provider kafka_init_opts[:provider]
      supports start: true, stop: true, restart: true, status: true
      action kafka_service_actions
      notifies :create, 'ruby_block[post-shim]', :immediately
    end

    ruby_block 'post-shim' do
      # clean-up done here
    end

SERVICE ON DEMAND
- A common service which can be requested
- Copies log files from applications into a centralized location
- Single location for users to review logs; helps with security
- Service available on all the nodes
- Applications can request the service dynamically

SERVICE ON DEMAND
Node attribute to store service requests:

    default['bcpc']['hadoop']['copylog'] = {}

Data structure to make service requests:

    {
      'app_id' => {
        'logfile' => "/path/file_name_of_log_file",
        'docopy' => true (or false)
      },
      ...
    }
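In plain Ruby, outside Chef, the service side of this contract reduces to selecting the entries whose 'docopy' flag is set. A minimal sketch, with an ordinary hash standing in for the node attribute and made-up app ids:

```ruby
# Simulated node['bcpc']['hadoop']['copylog']: two requests, one opted out.
copylog = {
  'hbase_master' => { 'logfile' => '/var/log/hbase/hbase-master-node1.log',
                      'docopy'  => true },
  'some_app'     => { 'logfile' => '/var/log/some_app/some_app.log',
                      'docopy'  => false }
}

# Only entries with docopy => true get a config file and an agent instance.
active  = copylog.select { |_id, f| f['docopy'] }
configs = active.keys.map { |id| "/etc/flume/conf/flume-#{id}.conf" }
puts configs
```

Keying the hash by app id keeps the requests idempotent: a recipe can set its entry on every run without duplicating work.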

SERVICE ON DEMAND
Application recipes make service requests:

    # Updating node attributes to copy HBase master log files to HDFS
    node.default['bcpc']['hadoop']['copylog']['hbase_master'] = {
      'logfile' => "/var/log/hbase/hbase-master-#{node.hostname}.log",
      'docopy' => true
    }
    node.default['bcpc']['hadoop']['copylog']['hbase_master_out'] = {
      'logfile' => "/var/log/hbase/hbase-master-#{node.hostname}.out",
      'docopy' => true
    }

SERVICE ON DEMAND
Recipe for the common service:

    node['bcpc']['hadoop']['copylog'].each do |id, f|
      if f['docopy']
        template "/etc/flume/conf/flume-#{id}.conf" do
          source "flume_flume-conf.erb"
          action :create
          ...
          variables(:agent_name => "#{id}",
                    :log_location => "#{f['logfile']}")
          notifies :restart, "service[flume-agent-multi-#{id}]", :delayed
        end

        service "flume-agent-multi-#{id}" do
          supports :status => true, :restart => true, :reload => false
          service_name "flume-agent-multi"
          action :start
          start_command "service flume-agent-multi start #{id}"
          restart_command "service flume-agent-multi restart #{id}"
          status_command "service flume-agent-multi status #{id}"
        end
      end
    end

PLUGGABLE ALERTS
- Single source for monitored stats
- Allows users to visualize stats across different parameters
- Didn't want to duplicate stats collection in the alerting system
- Need to feed data to the alerting system to generate alerts

PLUGGABLE ALERTS
Attribute where users can define alerts:

    default["bcpc"]["hadoop"]["graphite"]["queries"] = {
      'hbase_master' => [
        {
          # Query to pull stats from the data source
          'type' => "jmx",
          'query' => "memory.nonheapmemoryusage_committed",
          'key' => "hbasenonheapmem",
          # Define the alert criteria
          'trigger_val' => "max(61,0)",
          'trigger_cond' => "=0",
          'trigger_name' => "HBaseMasterAvailability",
          'trigger_dep' => ["NameNodeAvailability"],
          'trigger_desc' => "HBase master seems to be down",
          'severity' => 1
        }, {
          'type' => "jmx",
          'query' => "memory.heapmemoryusage_committed",
          'key' => "hbaseheapmem",
          ...
        },
        ...
      ],
      'namenode' => [...],
      ...
    }
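The deck does not show the consumer of this attribute. As a hedged sketch of what one could look like, the snippet below flattens the per-service query lists into one set of trigger definitions for the alerting system; the data is a trimmed stand-in for the attribute above, and the 'namenode' entry is invented for illustration.

```ruby
# Stand-in for node["bcpc"]["hadoop"]["graphite"]["queries"].
queries = {
  'hbase_master' => [
    { 'type' => 'jmx', 'query' => 'memory.nonheapmemoryusage_committed',
      'key' => 'hbasenonheapmem', 'trigger_cond' => '=0',
      'trigger_name' => 'HBaseMasterAvailability', 'severity' => 1 }
  ],
  'namenode' => [
    { 'type' => 'jmx', 'query' => 'memory.heapmemoryusage_committed',
      'key' => 'nnheapmem', 'trigger_cond' => '=0',
      'trigger_name' => 'NameNodeAvailability', 'severity' => 1 }
  ]
}

# Flatten into (service, trigger) records an alert generator can iterate.
triggers = queries.flat_map do |service, defs|
  defs.map do |d|
    { 'service' => service,
      'name'    => d['trigger_name'],
      'severity' => d['severity'] }
  end
end
puts triggers.length
```

Because the queries hash is just a node attribute, any cookbook can append its own alert definitions without touching the stats-collection code, which is the "pluggable" part.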

TEMPLATE PITFALLS
- Library function calls in wrapper cookbooks
- Community cookbook provider accepts a template as an attribute
- Template passed from the wrapper makes a library function call
- Wrapper recipe includes the module of the library function

TEMPLATE PITFALLS
Wrapper recipe:

    ...
    Chef::Resource.send(:include, Bcpc::OSHelper)
    ...
    cobbler_profile "bcpc_host" do
      kickstart "cobbler.bcpc_ubuntu_host.preseed"
      distro "ubuntu-12.04-mini-x86_64"
      ...
    end

Function call in the template:

    ...
    d-i passwd/user-password-crypted password <%="#{get_config(@node, 'cobbler-root-password-salted')}"%>
    d-i passwd/user-uid string
    ...

TEMPLATE PITFALLS
Modified function call in the template:

    ...
    d-i passwd/user-password-crypted password <%="#{Bcpc::OSHelper.get_config(@node, 'cobbler-root-password-salted')}"%>
    d-i passwd/user-uid string
    ...

DYNAMIC RESOURCES
Anti-pattern?

    ruby_block "create namenode directories" do
      block do
        node[:bcpc][:storage][:mounts].each do |d|
          dir = Chef::Resource::Directory.new("#{mount_root}/#{d}/dfs/nn", run_context)
          dir.owner "hdfs"
          dir.group "hdfs"
          dir.mode 0755
          dir.recursive true
          dir.run_action :create

          exe = Chef::Resource::Execute.new("fixup nn owner", run_context)
          exe.command "chown -Rf hdfs:hdfs #{mount_root}/#{d}/dfs"
          exe.only_if { Etc.getpwuid(File.stat("#{mount_root}/#{d}/dfs/").uid).name != "hdfs" }
          exe.run_action :run
        end
      end
    end

DYNAMIC RESOURCES
- System configuration: lengthy configuration of a storage controller
- Setting attributes at converge time
- Compile-time actions?
- Must wrap in ruby_blocks: does not update the resource collection
- Lazy's everywhere. Guards: not_if { lazy { node[...] }.call.map { ... } }
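The compile/converge split behind the "lazy's everywhere" remark can be shown with plain Ruby closures, no Chef required: an eagerly captured value sees compile-time state, while a lambda (which is what Chef's `lazy` wraps) is only evaluated at call time and sees converge-time state. The attribute name here is made up.

```ruby
node_attr = nil               # not yet set, as at Chef "compile time"

eager    = node_attr          # evaluated now, like a plain guard expression
deferred = -> { node_attr }   # evaluated on .call, like lazy { node[...] }

node_attr = "/dev/sdb"        # set later, as at "converge time"

puts eager.inspect            # still nil: captured before the attribute existed
puts deferred.call            # sees the converge-time value
```

This is why guards and attributes that depend on values computed inside a ruby_block must be wrapped in lazy (or a proc) rather than evaluated inline.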

SERVICE RESTART
- We use JMXtrans to monitor JMX stats
- The service to be monitored varies with the node
- There can be more than one service to be monitored
- A monitored-service restart requires JMXtrans to be restarted**

SERVICE RESTART
Data structure in roles to define the services:

    "default_attributes": {
      "jmxtrans": {
        "servers": [
          {
            "type": "datanode",
            "service": "hadoop-hdfs-datanode",
            "service_cmd": "org.apache.hadoop.hdfs.server.datanode.DataNode"
          },
          {
            "type": "hbase_rs",
            "service": "hbase-regionserver",
            "service_cmd": "org.apache.hadoop.hbase.regionserver.HRegionServer"
          }
        ]
      }
      ...
    }

Here "service" is the dependent service name and "service_cmd" is a string that uniquely identifies the service process.

SERVICE RESTART
JMXtrans service restart logic, built dynamically:

    # Store the dependent service names and process identifiers in local variables
    jmx_services = Array.new
    jmx_srvc_cmds = Hash.new
    node['jmxtrans']['servers'].each do |server|
      jmx_services.push(server['service'])
      jmx_srvc_cmds[server['service']] = server['service_cmd']
    end

    service "restart jmxtrans on dependent service" do
      service_name "jmxtrans"
      supports :restart => true, :status => true, :reload => true
      action :restart
      # Subscribes to all dependent services
      jmx_services.each do |jmx_dep_service|
        subscribes :restart, "service[#{jmx_dep_service}]", :delayed
      end
      # What if a process is re/started externally?
      only_if { process_require_restart?("jmxtrans", "jmxtrans-all.jar", jmx_srvc_cmds) }
    end

SERVICE RESTART

    def process_require_restart?(process_name, process_cmd, dep_cmds)
      # Start time of the target service process
      tgt_process_pid = `pgrep -f #{process_cmd}`
      ...
      tgt_process_stime = `ps --no-header -o start_time #{tgt_process_pid}`
      ...
      ret = false
      restarted_processes = Array.new
      dep_cmds.each do |dep_process, dep_cmd|
        # Start times of all the service processes it depends on
        dep_pids = `pgrep -f #{dep_cmd}`
        if dep_pids != ""
          dep_pids_arr = dep_pids.split("\n")
          dep_pids_arr.each do |dep_pid|
            dep_process_stime = `ps --no-header -o start_time #{dep_pid}`
            # Compare the start times
            if DateTime.parse(tgt_process_stime) < DateTime.parse(dep_process_stime)
              restarted_processes.push(dep_process)
              ret = true
            end
          end
        end
      end
      ...
    end
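Stripped of the pgrep/ps shell-outs, the core decision above is just a start-time comparison. A minimal standalone version, with the timestamps passed in rather than scraped from ps (the function name is ours, not chef-bach's):

```ruby
require 'date'

# Does the target process (e.g. jmxtrans) predate any dependent process?
# If a dependent service was (re)started after the target started, the
# target is monitoring stale PIDs and must be restarted too.
def restart_needed?(target_stime, dep_stimes)
  t = DateTime.parse(target_stime)
  dep_stimes.any? { |s| t < DateTime.parse(s) }
end

puts restart_needed?("2015-06-01 10:00", ["2015-06-01 09:55"])  # false
puts restart_needed?("2015-06-01 10:00", ["2015-06-01 10:05"])  # true
```

Comparing start times rather than tracking Chef notifications is what catches the "process re/started externally" case flagged on the previous slide.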

ROLLING RESTART
- Automatic convergence
- Availability: how?
  - High availability
  - Toxic-configuration check
  - Check masters for slave status
  - Synchronous communication
  - Locking

ROLLING RESTART
Flagging:
- Negative flagging: flag when a service is down
- Positive flagging: flag when a service is reconfiguring
- Deadlock avoidance

Contention:
- Poll and wait
- Fail the run
- Simply skip the service restart and go on
- Store the need for restart: breaks assumptions of procedural Chef runs
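One way to picture "store the need for restart": on contention a node skips its restart but records a flag, and a later run honors it. A toy in-memory sketch of that pattern follows; the real chef-bach implementation coordinates through Zookeeper, and this class and its names are purely illustrative.

```ruby
require 'set'

# Toy coordinator: at most one node may restart at a time; losers record
# that they still owe a restart and can retry on a later Chef run.
class RestartCoordinator
  def initialize
    @holder = nil
    @pending = Set.new
  end

  # Returns true if the "lock" was acquired and the restart may proceed.
  def try_restart(node)
    if @holder.nil?
      @holder = node
      @pending.delete(node)
      true
    else
      @pending << node   # contention: skip now, remember for later
      false
    end
  end

  def release(node)
    @holder = nil if @holder == node
  end

  def pending
    @pending.to_a
  end
end

coord = RestartCoordinator.new
coord.try_restart("node1")   # acquires the lock; node1 restarts
coord.try_restart("node2")   # contention; deferred
coord.release("node1")
puts coord.pending.inspect   # node2 still owes a restart
```

The stored flag is exactly what "breaks assumptions of procedural Chef runs": the restart happens on some later converge, not at the point in the run where the configuration changed.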

ROLLING RESTART
Service definition:

    hadoop_service "zookeeper-server" do
      dependencies ["template[/etc/zookeeper/conf/zoo.cfg]",
                    "template[/usr/lib/zookeeper/bin/zkServer.sh]",
                    "template[/etc/default/zookeeper-server]"]
      process_identifier "org.apache.zookeeper... QuorumPeerMain"
    end

ROLLING RESTART
- Synch state store: Zookeeper
- Service restart (Kafka)
- Validation check:
  - Based on the Jenkins pattern for wait_until_ready!
  - Verifies that the service is up to an acceptable level
  - Passes or stops the Chef run
- Future directions: topology-aware deployment; data-aware deployment

WE ARE HIRING
jobs.bloomberg.com:
- Hadoop Infrastructure Engineer
- DevOps Engineer
- Search Infrastructure

https://github.com/bloomberg/chef-bach
Freenode: #chef-bach