Troubleshooting Your SUSE OpenStack Cloud (TUT19873). Adam Spiers, SUSE Cloud/HA Senior Software Engineer




Troubleshooting Your SUSE OpenStack Cloud (TUT19873). Adam Spiers, SUSE Cloud/HA Senior Software Engineer. Dirk Müller, SUSE OpenStack Senior Software Engineer

SUSE OpenStack Cloud...

SUSE OpenStack Cloud

SUSE OpenStack Cloud 4653 Parameters

SUSE OpenStack Cloud 14 Components

SUSE OpenStack Cloud 2 Hours

SUSE OpenStack Cloud Troubleshooting <1 Hour

SUSE OpenStack Cloud architecture (block diagram): Cloud Dashboard (Horizon) and Cloud APIs on top; OpenStack services: Compute (Nova) on Xen, KVM, VMware or Hyper-V hypervisors; Auth (Keystone); Images (Glance); Object (Swift / RadosGW / Rados); Network (Neutron) with adapters; Block (Cinder) with adapters and RBD; Heat; Telemetry; required services: RabbitMQ message queue and PostgreSQL database; SUSE management tools (SUSE Manager, SUSE Studio, management/billing portal, app monitoring, security & performance) alongside partner solutions; everything runs on SUSE Linux Enterprise Server 11 SP3 over physical infrastructure (x86-64, switches, storage)

Non-HA SUSE OpenStack Cloud

HA SUSE OpenStack Cloud

Crowbar and Chef

Generic SLES Troubleshooting All nodes in SUSE OpenStack Cloud are SLES based. Watch out for typical issues: dmesg for hardware-related errors, OOM kills, and other interesting kernel messages; the usual syslog targets, e.g. /var/log/messages. Check general node health via top, vmstat, uptime, pstree, free; look for core files, zombies, etc. #
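As a minimal sketch of that kind of log triage (the sample kernel lines below are made up for illustration; on a real node you would pipe in dmesg or /var/log/messages):

```shell
# Scan syslog-style lines for OOM-killer events.
# Sample input is hypothetical; replace with `dmesg` or /var/log/messages.
printf '%s\n' \
  'kernel: Out of memory: Kill process 1234 (nova-compute) score 987' \
  'kernel: eth0: link is up' \
| grep -c 'Out of memory'
```

A non-zero count tells you the kernel has been killing processes, which often explains mysteriously vanished OpenStack services.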

Supportconfig supportconfig can be run on any cloud node supportutils-plugin-susecloud.rpm installed on all SUSE OpenStack Cloud nodes automatically collects precious cloud-specific information for further analysis #

Admin Node: Crowbar UI A useful Export page is available in the Crowbar UI for exporting various log files #

Cloud Install screen install-suse-cloud --verbose /var/log/crowbar/install.log /var/log/crowbar/barclamp_install/*.log

SUSE OpenStack Cloud Admin SUSE Cloud Addon Crowbar UI Crowbar Services Chef/Rabbit Repo Mirror Install logs: /var/log/crowbar/install.log Chef/Rabbit: /var/log/rabbitmq/*.log /var/log/chef/server.log /var/log/couchdb/couchdb.log Crowbar repo server: /var/log/apache2/provisioner*log Crowbar: /var/log/crowbar/production.{out, log} SLES 11 SP3

Chef Cloud uses Chef for almost everything: All Cloud and SLES non-core packages All config files are overwritten All daemons are started Database tables are initialized http://docs.getchef.com/chef_quick_overview.html #

Admin Node: Using Chef knife node list knife node show <nodeid> knife search node "*:*"

SUSE OpenStack Cloud Admin Populate ~root/.ssh/authorized_keys prior to install. Barclamp install logs: /var/log/crowbar/barclamp_install Node discovery logs: /var/log/crowbar/sledgehammer/d<macid>.<domain>.log Syslog of crowbar-installed nodes is sent via rsyslog to: /var/log/nodes/d<macid>.log #

Useful Tricks Root login to the Cloud-installed nodes should be possible from the admin node (even in the discovery stage). If the admin network is reachable, add to ~/.ssh/config:

Host 192.168.124.*
    StrictHostKeyChecking no
    User root
#

SUSE OpenStack Cloud Admin If a proposal is applied, chef-client logs are at: /var/log/crowbar/chef-client/<macid>.<domain>.log Useful crowbar commands: crowbar machines help crowbar transition <node> <state> crowbar <barclamp> proposal list / show <name> crowbar <barclamp> proposal delete default crowbar_reset_nodes crowbar_reset_proposal <barclamp> default #

Admin Node: Crowbar Services Nodes are deployed via PXE boot: /srv/tftpboot/discovery/pxelinux.cfg/* Installed via AutoYaST; the profile is generated to: /srv/tftpboot/nodes/d<mac>.<domain>/autoyast.xml You can delete it and rerun chef-client on the admin node to regenerate it. Useful settings to add to autoyast.xml: <confirm config:type="boolean">true</confirm> (don't forget to chattr +i the file) #

Admin Node: Crowbar UI Raw settings in barclamp proposals allow access to "expert" (hidden) options Most interesting are: debug: true verbose: true #

Admin Node: Crowbar Gotchas

Admin Node: Crowbar Gotchas Be patient Only transition one node at a time Only apply one proposal at a time Cloud nodes should boot from: 1. Network 2. First disk #

SUSE OpenStack Cloud Node SUSE Cloud Addon Node-specific services, all managed via Chef: /var/log/chef/client.log rcchef-client status chef-client can be invoked manually SLES 11 SP3

SUSE OpenStack Cloud Control Node SUSE Cloud Addon OpenStack API services. Just like any other cloud node: /var/log/chef/client.log rcchef-client status chef-client Chef overwrites all config files it touches; chattr +i is your friend SLES 11 SP3

High Availability

What is High Availability? Availability = Uptime / Total Time 99.99% ("4 nines") == ~53 minutes downtime / year 99.999% ("5 nines") == ~5 minutes downtime / year High Availability (HA) typically accounts for mild / moderate failure scenarios, e.g. hardware failures and recoverable software errors, with automated recovery by restarting / migrating services HA != Disaster Recovery (DR): cross-site failover, partially or fully automated HA != Fault Tolerance
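The downtime figures above follow directly from the definition and can be checked with a one-liner (plain arithmetic, assuming a 365-day year):

```shell
# Downtime budget per year for an availability target:
# minutes/year = (1 - availability) * 365 * 24 * 60
awk 'BEGIN {
  printf "4 nines: %.1f min/year\n", (1 - 0.9999)  * 365 * 24 * 60
  printf "5 nines: %.1f min/year\n", (1 - 0.99999) * 365 * 24 * 60
}'
```

This prints roughly 52.6 and 5.3 minutes per year, matching the ~53 and ~5 minute figures on the slide.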

Internal architecture

Resource Agents Executables which start / stop / monitor resources RA types: LSB init scripts OCF scripts (~ LSB + meta-data + monitor action +...) /usr/lib/ocf/resource.d/ Legacy Heartbeat RAs (ancient, irrelevant) systemd services (in HA for SLE12+)

Results of resource failures If the fail counter is exceeded, clean-up is required: crm resource cleanup $resource Failures are expected: when a node dies when storage or network failures occur Failures are not expected during normal operation: applying a proposal starting or cleanly stopping resources or nodes Unexpected failures usually indicate a bug! Do not get into the habit of cleaning up and ignoring!

Before diagnosis Understand initial state / context crm configure graph is awesome! crm_mon Which fencing devices are in use? What's the network topology? What was done leading up to the failure? Look for the first (relevant) failure Failures can cascade, so don't confuse cause and effect Watch out for STONITH

crm configure graph FTW!

Diagnosis What failed? Resource? Node? Orchestration via Crowbar / chef-client? cross-cluster ordering Pacemaker config? (e.g. incorrect constraints) Corosync / cluster communications? chef-client logs are usually a good place to start More on logging later

Symptoms of resource failures Failures reported via Pacemaker UIs Failed actions: neutron-ha-tool_start_0 on d52-54-01-77-77-01 'unknown error' (1): call=281, status=complete, last-rc-change='Thu Jun 4 16:15:14 2015', queued=0ms, exec=1734ms neutron-ha-tool_start_0 on d52-54-02-77-77-02 'unknown error' (1): call=259, status=complete, last-rc-change='Thu Jun 4 16:17:50 2015', queued=0ms, exec=392ms Services become temporarily or permanently unavailable Services migrate to another cluster node

Symptoms of node failures Services become temporarily or permanently unavailable, or migrate to another cluster node Node is unexpectedly rebooted (STONITH) Crowbar web UI may show a red bubble icon next to a controller node Hawk web UI stops responding on one of the controller nodes (you should still be able to use the others) ssh connection to a cluster node freezes

Symptoms of orchestration failures Proposal / chef-client failed Synchronization time-outs are common and obvious: INFO: Processing crowbar-pacemaker_sync_mark[wait-keystone_database] action guess (keystone::server line 232) INFO: Checking if cluster founder has set keystone_database to 5... FATAL: Cluster founder didn't set keystone_database to 5! Find the synchronization mark in the recipe: crowbar_pacemaker_sync_mark "wait-keystone_database" # Create the Keystone Database database "create #{node[:keystone][:db][:database]} database" do... So the node timed out waiting for the cluster founder to create the keystone database, i.e. you're looking at the wrong log! So... root@crowbar:~ # knife search node founder:true -i
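One way to pull the failing mark's name out of such a FATAL line (a sketch; the sed pattern is an assumption based on the message format shown above):

```shell
# Extract the synchronization mark name from a Crowbar sync-mark failure.
# The sample line is copied from the slide; the pattern assumes that format.
line="FATAL: Cluster founder didn't set keystone_database to 5!"
echo "$line" | sed -n "s/.*didn't set \([a-z_]*\) to.*/\1/p"
```

The printed mark name tells you which recipe section to search for with crowbar_pacemaker_sync_mark, and hence which node's log actually matters.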

Logging All changes to cluster configuration are driven by chef-client, either from application of a barclamp proposal admin node: /var/log/crowbar/chef-client/$node.log or run by the chef-client daemon every 900 seconds /var/log/chef/client.log on each node Remember chef-client often runs in parallel across nodes All HAE components log to /var/log/messages on each cluster node Nothing Pacemaker-related on the admin node

HAE logs Which nodes' log files to look at? Node failures: /var/log/messages from the DC Resource failures: /var/log/messages from the DC and the node with the failed resource but remember the DC can move around (elections) Use hb_report or crm history or Hawk to assemble a chronological cross-cluster log Saves a lot of pain strongly recommended!
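hb_report is the right tool for this, but as a quick illustration of what a chronological cross-node merge looks like (the two sample log lines are hypothetical stand-ins for per-node /var/log/messages copies):

```shell
# Sketch: merge per-node syslog copies into one chronological stream.
# hb_report / crm history do this properly; this is the idea in miniature.
printf 'Jun  4 16:17:50 node2 crmd: notice: ...\n'     > /tmp/node2.log
printf 'Jun  4 16:15:14 node1 pengine: warning: ...\n' > /tmp/node1.log
# Sort by month name, then day, then timestamp.
sort -k1M -k2n -k3 /tmp/node1.log /tmp/node2.log
```

The node1 event correctly sorts before the node2 event even though the files were given in the other order; on a real cluster this ordering is what lets you separate cause from effect.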

Syslog messages to look out for Fencing going wrong: pengine[16374]: warning: cluster_status: We do not have quorum - fencing and resource management disabled pengine[16374]: warning: stage6: Node d52-54-08-77-77-08 is unclean! pengine[16374]: warning: stage6: Node d52-54-0a-77-77-0a is unclean! pengine[16374]: notice: stage6: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore) Fencing going right: crmd[16376]: notice: te_fence_node: Executing reboot fencing operation (66) on d52-54-0a-77-77-0a (timeout=60000) stonith-ng[16371]: notice: handle_request: Client crmd.16376.f6100750 wants to fence (reboot) 'd52-54-0a-77-77-0a' with device '(any)' The reason for fencing is almost always earlier in the log. Don't forget all the possible reasons for fencing! Lots more get used to reading /var/log/messages!
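To pull the fenced node's name out of a te_fence_node message like the one above (a sketch; the sed pattern is an assumption based on that log format):

```shell
# Who got fenced? Extract the node name from a crmd fencing message.
# Sample line copied from the slide; pattern assumes the "on <node>" form.
msg="crmd[16376]: notice: te_fence_node: Executing reboot fencing operation (66) on d52-54-0a-77-77-0a (timeout=60000)"
echo "$msg" | sed -n 's/.* on \(d[0-9a-f-]*\) .*/\1/p'
```

Once you have the node name, search earlier in /var/log/messages on the DC for why that node was declared unclean.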

Stabilising / recovering a cluster Start with a single node Stop all others: rcchef-client stop rcopenais stop rccrowbar_join stop Clean up any failures with crm resource cleanup (crm_resource -C is buggy): crm_resource -o | \ awk '/\t(Stopped|Timed Out)/ { print $1 }' | \ xargs -n1 crm resource cleanup Make sure chef-client is happy

Stabilising / recovering a cluster (cont.) Add another node in: rm /var/spool/corosync/block_automatic_start service openais start service crowbar_join start Ensure nothing gets fenced Ensure no resource failures If fencing happens, check /var/log/messages to find out why, then rectify the cause Repeat until all nodes are in the cluster

Degrade Cluster for Debugging crm configure location fixup-cl-apache cl-apache \ rule -inf: '#uname' eq $HOSTNAME Allows you to degrade an Active/Active resource to only one instance per cluster Useful for tracing requests #

TL;DR: Just Enough HA crm resource list crm_mon crm resource restart <X> crm resource cleanup <X> #

Now Coming to OpenStack

OpenStack Architecture Diagram

OpenStack block diagram (annotations: "accesses almost everything"; "Keystone: SPOF")

OpenStack Architecture Typically each OpenStack component provides: an API daemon / service one or more backend daemons that do the actual work an openstack / <project> command line client to access the API a <project>-manage client for admin-only functionality a dashboard ("Horizon") Admin tab for a graphical view of the service uses an SQL database for storing state #

OpenStack Packaging Basics Packages are usually named: openstack-<codename> usually a subpackage for each service (-api, -scheduler, etc) log to /var/log/<codename>/<service>.log each service has an init script: dde-ad-be-ff-00-01:~# rcopenstack-glance-api status Checking for service glance-api...running #

OpenStack Debugging Basics Log files often lack useful information without verbose enabled TRACEs of processes are not logged without verbose Many reasons for API error messages are not logged unless debug is turned on Debug is very verbose (>10GB per hour) https://ask.openstack.org/ http://docs.openstack.org/icehouse/ #

OpenStack architecture diagram (annotations: "accesses almost everything"; "Keystone: SPOF")

OpenStack Dashboard: Horizon /var/log/apache2/openstack-dashboard-error_log Get the exact URL it tries to access! Enable debug in the Horizon barclamp Test components individually #

OpenStack Identity: Keystone Needed to access all services Needed by all services for checking authorization Use keystone token-get to validate credentials and test service availability #
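keystone token-get needs the usual OS_* environment variables to be set; a minimal openrc sketch follows (every value, including the auth URL on the admin network, is a placeholder for your deployment, not a real endpoint):

```shell
# openrc sketch -- all values below are placeholders for your environment.
# Source this file, then run `keystone token-get`; getting a token table
# back confirms both your credentials and Keystone's availability.
export OS_USERNAME=admin
export OS_PASSWORD=changeme
export OS_TENANT_NAME=admin
export OS_AUTH_URL=http://192.168.124.81:5000/v2.0
```

If token-get fails here, fix Keystone before chasing errors in any other service, since everything else depends on it.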

OpenStack Imaging: Glance To validate service availability: glance image-list glance image-download <id> > /dev/null glance image-show <id> Don't forget the hypervisor_type property! #

OpenStack Compute: Nova nova-manage service list nova-manage logs errors nova show <id> shows compute node virsh list, virsh dumpxml

Nova Overview (diagram): API, Scheduler, Conductor, and multiple Compute services. "Launches" go to the Scheduler; the rest go to the Conductor

Nova Booting VM Workflow

Nova: Scheduling a VM Nova scheduler tries to select a matching compute node for the VM #

Nova Scheduler Typical errors: No suitable compute node can be found All suitable compute nodes failed to launch the VM with the required settings nova-manage logs errors INFO nova.filters [req-299bb909-49bc-4124-8b88-732797250cf5 c24689acd6294eb8bbd14121f68d5b44 acea50152da04249a047a52e6b02a2ef] Filter RamFilter returned 0 hosts #
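The filter that eliminated all hosts can be pulled straight out of such a line (a sketch; the sed pattern is an assumption based on the nova.filters message format shown above, with the request id shortened):

```shell
# Which scheduler filter returned 0 hosts?
# Sample line adapted from the slide; the pattern assumes that format.
line='INFO nova.filters [req-299bb909] Filter RamFilter returned 0 hosts'
echo "$line" | sed -n 's/.*Filter \([A-Za-z]*\) returned 0 hosts.*/\1/p'
```

Knowing the filter (RamFilter here) tells you what resource constraint to investigate, e.g. not enough free RAM on any compute node for the requested flavor.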

OpenStack Volumes: Cinder (diagram): API, Scheduler, and multiple Volume services #

OpenStack Cinder: Volumes Similar syntax to Nova: cinder-manage service list cinder-manage logs errors cinder-manage host list cinder list / show #

OpenStack Networking: Neutron Swiss Army knife for SDN neutron agent-list neutron net-list neutron port-list neutron router-list There's no neutron-manage #

Basic Network Layout

Networking with OVS: Compute Node http://docs.openstack.org/havana/config-reference/content/under_the_hood_openvswitch.html

Networking with LB: Compute Node

Neutron Troubleshooting Neutron uses IP network namespaces on the network node for routing overlapping networks: neutron net-list ip netns list ip netns exec qrouter-<id> bash then ping / arping / ip route / curl ... #

Q&A http://ask.openstack.org/ http://docs.openstack.org/ https://www.suse.com/documentation/suse-cloud4/ Thank you #

Bonus Material

OpenStack Orchestration: Heat

OpenStack Orchestration: Heat Uses Nova, Cinder, and Neutron to assemble complete stacks of resources: heat stack-list heat resource-list / resource-show <stack> heat event-list / event-show <stack> Usually it is necessary to query the actual OpenStack service for further information #

OpenStack Imaging: Glance Usually issues are in the configured glance backend itself (e.g. RBD, swift, filesystem), so debugging concentrates on those Filesystem: /var/lib/glance/images RBD: ceph -w rbd -p <pool> ls #

SUSE OpenStack Cloud #

Unpublished Work of SUSE. All Rights Reserved. This work is an unpublished work and contains confidential, proprietary, and trade secret information of SUSE. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability. General Disclaimer This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.