Performance Tuning best practices and performance monitoring with Zabbix


Performance Tuning best practices and performance monitoring with Zabbix
Andrew Nelson, Senior Linux Consultant
May 28, 2015, NLUUG Conf, Utrecht, Netherlands

Overview
Introduction
Performance tuning is Science!
A little Law and some things to monitor
Let's find peak performance
Conclusion
Source code availability
Test environment information

$ whoami
Andrew Nelson, anelson@redhat.com
Senior Linux Consultant with Red Hat North America
Active in the Zabbix community for approximately 10 years
Known as nelsonab in forums and IRC
Author of the Zabbix API Ruby library zbxapi

Performance Tuning and SCIENCE!

Performance tuning and the Scientific Method
Performance tuning is similar to the scientific method:
Define the problem
State a hypothesis
Prepare experiments to test the hypothesis
Analyze the results
Generate a conclusion

Understanding the problem
Performance tuning often involves a multitude of components
Identifying problem areas is often challenging
Poorly defined problems can be worse than no problem at all
These are not (necessarily) the solutions you want.

Understanding the problem
Why? Better utilization of resources; capacity planning and scaling
For tuning to work, you must define your problem
But don't be defined by the problem.
You can't navigate somewhere when you don't know where you're going.

Defining the problem
Often best when phrased as a declaration with a reference
Poor examples:
The disks are too slow
It takes too long to log in
It's broken!
Good examples:
Writes for files ranging in size from X to Y must take less than N seconds.
Customer logins must take no longer than 0.5 seconds.
The computer monitor is dark and does not wake up when moving the mouse.

Define your tests
Define your tests and ensure they are repeatable
Poor example (manually run tests):
$ time cp one /test_dir
$ time cp two /test_dir
Good example (automated tests with parsable output):
$ run_test.sh
Subsystem A write tests
Run Size Time (seconds)
1 100KB 0.05
2 500KB 0.24
3 1MB 0.47
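A repeatable driver in the spirit of run_test.sh can be a few lines of shell. This is a hypothetical sketch, not the script from the talk; the test directory, file sizes, and output columns are assumptions chosen to resemble the sample output above.

```shell
#!/bin/sh
# Hypothetical sketch of a repeatable write-test driver (not the actual
# run_test.sh from the talk). Writes files of fixed sizes and prints a
# parsable table: Run Size Time(seconds).
TEST_DIR="${TEST_DIR:-/tmp/test_dir}"
mkdir -p "$TEST_DIR"

echo "Subsystem A write tests"
echo "Run Size Time(seconds)"
run=0
for kb in 100 500 1024; do
  run=$((run + 1))
  start=$(date +%s.%N)                  # nanosecond resolution (GNU date)
  dd if=/dev/zero of="$TEST_DIR/file_$run" bs=1024 count="$kb" 2>/dev/null
  end=$(date +%s.%N)
  elapsed=$(awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f", e - s }')
  echo "$run ${kb}KB $elapsed"
done
```

Because every run prints the same columns, results from different tuning iterations can be diffed or graphed directly.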

Define your tests
A good test comprises two main components:
a) It is representative of the problem
b) It has easy-to-collate and easy-to-process output
Be aware of external factors:
Department A owns application B, which is used by group C but managed by department D.
Department D may feel that application B is too difficult to support and may not lend much assistance, placing department A in a difficult position.

Perform your tests
Once the tests have been agreed upon, get a set of baseline data
Log all performance tuning changes and annotate all tests with the changes made
If the data is diverging from the goal, stop and look closer:
Was the goal appropriate?
Were the tests appropriate?
Were the optimizations appropriate?
Are there any external factors impacting the effort?

Perform your tests and DOCUMENT!
When the goal is reached, stop
Is there a need to go on?
Was the goal reasonable?
Were the tests appropriate?
Were there any external issues not accounted for or foreseen?
DOCUMENT DOCUMENT DOCUMENT
If someone ran a test on a server but did not log it, did it really happen?

When testing, don't forget to... DOCUMENT!

Story time!
Client was migrating from Unix running on x86 to RHEL 5 running on x86
Client claimed the middleware stack they were using was slower on RHEL
Some of the problems encountered:
The problem was not clearly defined
Tests were not representative and were only mildly consistent
The end goal/performance metric evolved over time
There were some external challenges observed:
Physical CPU clock speed was approximately 10% slower on the newer systems

More story time!
Client was migrating an application from z/OS to RHEL 6 with GFS2
Things were slow, but there was no consistent quantification of slow.
Raw testing showed GFS2 to be far superior to NFS, but developers claimed NFS was faster.
Eventually GFS2 was migrated to faster storage, developers became more educated about performance, and overall things improved.
Developers are learning to quantify the need for something before asking for it.

A little Law and some things to monitor

Little's Law
L = λh
L = queue length
λ = arrival rate
h = time to service a request
Networking provides some good examples of Little's Law in action.
MTU (Maximum Transmission Unit) and speed can be analogous to λ.
The Bandwidth Delay Product (BDP) is akin to L, the queue length.
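As a quick sanity check of the formula, suppose requests arrive at λ = 800 per second and each takes h = 5 ms to service (both numbers are illustrative assumptions, not measurements from the talk):

```shell
# Little's Law: L = lambda * h
# lambda = 800 requests/s, h = 0.005 s  (illustrative assumptions)
L=$(awk 'BEGIN { printf "%.1f", 800 * 0.005 }')
echo "Average requests in flight: $L"
```

So on average four requests are in flight; a queue depth persistently above that suggests the service time, the arrival rate, or both have changed.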

Little's Law
BDP is defined as: Bandwidth * End_To_End_Delay (or latency)
Example: 1Gb/s link with 2.24ms Round Trip Time (RTT)
1Gb/s * 2.24ms = 0.27MB
Thus, a buffer of at least 0.27MB is required to buffer all of the data on the wire.
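The slide's arithmetic can be reproduced directly; this one-liner is a sketch using the link speed and RTT from the example above:

```shell
# Bandwidth Delay Product: bandwidth (bits/s) * RTT (s), converted to MB.
# 1 Gb/s link, 2.24 ms round trip time, as in the example above.
BDP_MB=$(awk 'BEGIN { printf "%.2f", (1e9 * 0.00224 / 8) / (1024 * 1024) }')
echo "Required buffer: ${BDP_MB} MB"
```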

Little's Law
What happens when we alter the MTU?
(Graphs of inbound and outbound packet rates)
MTU 9000: roughly 6,000 packets per second at 939.5Mb/s inbound, 898.5Mb/s outbound
MTU 1500: roughly 22,000 packets per second at 493Mb/s

Little's law in action
There are numerous ways to utilize Little's Law in monitoring:
IO requests in flight for disks
Network buffer status
Network packets per second
Processor load
Time to service a request

Little's law in action
Apache is the foundation for many enterprise and SaaS products, so how can we monitor its performance in Zabbix?
Normal approaches involve parsing log files or parsing the status page.
The normal ways don't tend to work well with Zabbix; however, we can use a script to parse the logs in realtime and a file socket for data output that Zabbix can read.

Little's law in action
Two pieces are involved in pumping data from Apache into Zabbix.
First we build a running counter via a log pipe to a script:
# YYYYMMDD-HHMMSS Path BytesReceived BytesSent TimeSpent MicrosecondsSpent
LogFormat "%{%Y%m%d-%H%M%S}t %U %I %O %T %D" zabbix-log
CustomLog "|$/var/lib/zabbix/apache-log.rb >> /var/lib/zabbix/errors" zabbix-log
This creates a file socket:
$ cat /var/lib/zabbix/apache-data-out
Count Received Sent total_time total_microseconds
4150693 573701315 9831930078 0 335509340
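The actual apache-log.rb is in the GitHub repository linked at the end of the talk; as a rough illustration of the idea, a shell equivalent of its counter loop might look like the following (the sample log lines and output path are made up):

```shell
#!/bin/sh
# Illustrative stand-in for apache-log.rb: read piped log lines in the
# LogFormat above (timestamp path bytes_in bytes_out seconds microseconds)
# and keep running totals, rewriting the output file after each request.
OUT="${OUT:-/tmp/apache-data-out}"
count=0; received=0; sent=0; total_time=0; total_us=0

# ts and path are read but unused; only the byte and time counters matter.
while read -r ts path bytes_in bytes_out secs us; do
  count=$((count + 1))
  received=$((received + bytes_in))
  sent=$((sent + bytes_out))
  total_time=$((total_time + secs))
  total_us=$((total_us + us))
  printf 'Count Received Sent total_time total_microseconds\n%s %s %s %s %s\n' \
    "$count" "$received" "$sent" "$total_time" "$total_us" > "$OUT"
done <<'EOF'
20150528-101500 /index.html 300 5120 0 1200
20150528-101501 /index.html 280 5120 0 1100
EOF
```

In production the loop reads stdin from the Apache log pipe rather than the here-document used here for demonstration.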

Little's law in action
Next we push that data via a client-side script using zabbix_sender:
$ crontab -e
*/1 * * * * /var/lib/zabbix/zabbix_sender.sh
And import the template
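The zabbix_sender.sh in the crontab is also in the repository. A minimal sketch of the idea, assuming hypothetical item keys, hostname, and file path (none of these are from the talk), reads the totals file and emits zabbix_sender batch lines ("host key value"), which in production would be piped to zabbix_sender -z <server> -i -:

```shell
#!/bin/sh
# Sketch of a zabbix_sender feeder (hostname, item keys, path, and the demo
# data below are assumptions, not taken from the talk's actual script).
DATA="${DATA:-/tmp/apache-demo-data}"
HOST="webserver01"

# Create demo data if the aggregator's file is absent, so the sketch runs.
[ -f "$DATA" ] || printf 'Count Received Sent total_time total_microseconds\n4 1160 20480 0 4600\n' > "$DATA"

# The second line of the file holds the running totals.
set -- $(sed -n '2p' "$DATA")
BATCH="$HOST apache.requests.count $1
$HOST apache.bytes.received $2
$HOST apache.bytes.sent $3
$HOST apache.time.seconds $4
$HOST apache.time.microseconds $5"

printf '%s\n' "$BATCH"
# Production use (not executed here):
#   printf '%s\n' "$BATCH" | zabbix_sender -z zabbix-server -i -
```

Zabbix can then store the deltas of these ever-growing counters (the "Delta (speed per second)" item setting), which gives the per-second rates that Little's Law works with.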

Let's see if we can find the peak performance with Zabbix

The test environment
(Diagram) Components: Hypervisor 1 (Terry), Hypervisor 2 (Sherri), physical system (desktop), storage server, router/firewall, Zabbix server. Links: GigE, 100Mbit, Infiniband.
NOTE: See last slides for more details

What are we looking for
It is normal to be somewhat unsure initially; investigative testing will help shape this.
Some form of saturation will be reached, hopefully on the server.
Saturation will take one or both of the following forms:
Increased time to service: request queues (or buffers) are full, meaning overall increased time to service the queue.
Failure to service: the queue is full and the request will not be serviced. The server will issue an error, or the client will time out.

Finding Peak Performance, initial test
Tests were run from the system Desktop
Apache reports 800 connections per second.
Processor load is light.
(Graph: test window highlighted)

Finding Peak Performance, initial test
Network shows a plateau, but not saturation on the client.
The plateau is smooth in appearance.
Neither of the two cores appears very busy.
(Graph: test window highlighted)

Finding Peak Performance, initial test
The Apache server seems to report that it responds faster with more connections
Zabbix web tests show increased latency
(Graph: test window highlighted)

Finding Peak Performance, initial test
The actual data from JMeter
Appearance of smooth steps and plateau

Finding Peak Performance, Initial analysis
Reduced response latency may be due to processor cache: connections are repetitive, potentially leading to greater cache efficiency.
The network appears to be the bottleneck: the network layer is throttling connections.
During tests, some Zabbix checks to the test server and other systems behind the firewall/router were timing out.
The router showed very high CPU utilization.
JMeter does not show many connection errors.

Finding Peak Performance, Initial analysis
More testing needed
Tests need to come from a system on the same VLAN and switch as the server, and not traverse the router.
A wise man once said: I need a little more cowbell (aka testing)

Finding Peak Performance, second test
Testing from another VM with full 1Gb links to the test server
Based on the concurrent connection count, it seems an upper limit has again been found.
The graph is not smooth at the plateau.
CPU exhibits greater load, but overall still low.
(Graph: test window highlighted)

Finding Peak Performance, second test
The network no longer appears to be the bottleneck.
A rough saw-tooth plateau is noted.
CPU utilization follows the picture of load, but it would seem there is still CPU capacity left.
(Graph: test window highlighted)

Finding Peak Performance, second test
Apache again appears to respond faster under load than when idle.
Reduced latency shows a smooth appearance.
Zabbix tests do not show any change in Apache performance.
The router is back to normal load.
(Graph: test window highlighted)

Finding Peak Performance, second test
Steps are smooth and uneventful below 1200 TPS.
The wild TPS graph above 1200 is due to connection errors.
The JMeter graph above 1200 TPS does not appear coherent with the Zabbix graphs.

Finding Peak Performance, Second analysis
It would appear reduced response latency is likely due to processor cache, as noted before. The increased rate of repetitiveness reduced latency further.
The network did not appear to be the bottleneck.
Connection errors were noted in JMeter tests, as would be expected for a saturated server.
Based on JMeter and Zabbix data, peak performance for this server with the test web page is about 1,200 pages per second.
What if we couldn't max out performance? Are there other ways to find it?

Conclusion

One more story...
The host was a member of a 16-node GFS2 cluster.
Java containers that preallocated memory were running on the host.
vm.swappiness was set to 0
The OS had about 200MB of memory available for itself and appeared to spend 100% of one core's time in IO wait.

One more story... (graph)

One more story... (graph)

Conclusion
Clearly define the problem
Understand what the tests are before testing
It is possible to use techniques similar to tuning for long-term monitoring
Sometimes the results you get are not what you expected.
Software developers are bad at exposing performance metrics for use by external software.
DOCUMENT, DOCUMENT, DOCUMENT!

Questions
jautājumi Fragen vragen питання pytania вопросы 質問 otázky preguntas spørgsmål domande kysymykset

Source Code
Scripts and the template used are available on GitHub:
https://github.com/red-tux/zbx-apache

The test environment (More details)
(Diagram) Components: Hypervisor 1 (Terry), Hypervisor 2 (Sherri), physical system (desktop), storage server, router/firewall, Zabbix server. Links: GigE, 100Mbit, Infiniband.

The test environment (More details)
Hypervisors are Red Hat Enterprise Virtualization 3.3 (RHEV)
Guests are RHEL 6
The test server is configured with 2GB of RAM and 2 CPU cores
Storage for guests is via iSCSI over Infiniband
The switch and firewall are small enterprise-grade Juniper equipment.
The main router/firewall has 100Mbit interfaces
All networks are VLANed
Hypervisors are LACP-bonded to the internal network

The test environment (More details)
The test page is a simple Hello World with a small embedded graphic.
Two connections equal one page load.
Apache was configured to use the aforementioned logging script.
JMeter was used to generate the client test loads.
Zabbix was configured to perform a web test as well, to track response times from the Zabbix server.