Grid Computing Analytics with Splunk. Finnbar Cunningham




Copyright 2014 Splunk Inc.
Grid Computing Analytics with Splunk
Finnbar Cunningham, Head of Grid Computing Operations & Support, Credit Suisse

Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

Setting the Context

- Integrated bank: Private Banking & Wealth Management, Investment Banking
- Founded in 1856
- Operations in over 50 countries
- 530 offices and branches
- 45,100 employees
- Client-focused business approach

A Bit About Me
- Long-term interest in Distributed Computing
- Computer Science degree
- 9 years at Credit Suisse working on the evolving Grid Platform
- Heavy SQL user pre-Splunk
- Discovered Splunk in 2011
- Splunk evangelist ever since!

Agenda
- Introduction to Grid Computing @ Credit Suisse
- User Dashboards:
  - Search scripts to interact with the Grid
  - Application Usage Stats
  - Interaction between Applications
  - Grid-wide CPU Utilisation
  - CPU Usage by Grid App
  - Application Efficiency Tuning (Stitching Grid & OS Metrics together)
  - Cost Transparency
- Grid Team Dashboards
  - System Health Checks
- Splunk as Component of Software Deployment System

[Figure: Compute Grid mock-up]

Grid Computing at Credit Suisse
- Purpose: Performs complex risk & pricing calculations for financial products
- How? Work is divided into tasks which are executed on multiple hosts in parallel
- Scale:
  - >100 years of CPU time used daily
  - ~1 billion tasks processed daily
  - 1000s of dedicated servers
  - 1000s of workstations join the grid when idle
  - 100s of applications sharing the grid
  - Applications are guaranteed certain capacity but can borrow more if it's available

My Job: Head of Grid Operations & Support
Responsible for:
- System health
- Incident resolution & problem investigation
- Operations: software deployment etc.
- Efficiency & capacity management
- Driving system evolution

Grid Glossary
Grid
- Collection of hosts working together to process work
- Typically 1 Production Grid per region
Resource Group
- Grouping of hosts, e.g. Servers, Workstations etc.
Resource Plan
- Defines % of Resource Group guaranteed to each application
Slot
- Subdivision of a compute host, e.g. 1 slot per CPU / GB of RAM
Allocation
- Number of slots currently allocated to an application
Consumer
- Grouping of grid applications, generally by business line

Our Splunk Topology

Splunk Topology
- Global Search Head: Searching, Reporting, Alerting, Dashboards
- Regional Indexers
- 1000s of Forwarders

Data Sources
- Logs from: Grid Daemons, Grid Applications
- OS Performance Counters: CPU, Memory, Network Activity etc.
- Scripted Inputs: Grid API calls, DB Queries
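A scripted input is simply a program that a forwarder runs on a schedule and whose standard output gets indexed. The following is a minimal sketch of that pattern, not the presenter's actual script: `fetch_consumer_demand` is a hypothetical stand-in for a Grid API call, and the sample values are invented. The field names are borrowed from the searches later in the deck.

```python
import time


def fetch_consumer_demand():
    """Hypothetical stand-in for a Grid API call; returns one record per consumer."""
    return [
        {"CLUSTER_NAME": "ea_dev", "CONSUMER_NAME": "/ConsumerA", "used": 120, "max_requested": 200},
        {"CLUSTER_NAME": "ea_dev", "CONSUMER_NAME": "/ConsumerB", "used": 40, "max_requested": 40},
    ]


def format_event(record, now=None):
    """Render one record as a timestamped line of key="value" pairs."""
    timestamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    fields = " ".join(f'{key}="{value}"' for key, value in record.items())
    return f"{timestamp} {fields}"


def main():
    # One line per record on stdout; the forwarder would index each line as an event.
    for record in fetch_consumer_demand():
        print(format_event(record))


if __name__ == "__main__":
    main()
```

Writing events as key="value" pairs lets Splunk extract the fields automatically at search time, which is one reason this layout is a common choice for scripted inputs.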

User Dashboards

User Dashboards for Interacting with Grid
| script soamclient /grid:ea_dev /ViewSessions /AppName:MC_bladefarm_dev_tuscan /SessionState:Open
| script soamclient /grid:ea_dev /AppName:MC_bladefarm_dev_griffith /SessionID:239507 /Terminate

sourcetype="grid-Symphony:consumer_demand" CLUSTER_NAME="$cluster$" CONSUMER_NAME="$consumer$"
| dedup _time, CONSUMER_NAME
| join type=outer CONSUMER_NAME
    [ search earliest=-24hr latest=now sourcetype=grid-Symphony:egoclient_ViewResourcePlan CLUSTER_NAME="$cluster$" CONSUMER_NAME="$consumer$"
    | eval Guarantee=IF(SHARE_LIMIT<PLANNED_OWN,SHARE_LIMIT,PLANNED_OWN)
    | stats sum(Guarantee) as Guarantee by CONSUMER_NAME ]
| timechart span=5min sum(used) as Allocation, sum(max_requested) as Demand, sum(Guarantee) as Guarantee

- Grid Scheduler metrics, retrieved via frequent API call (script) executed by Splunk Forwarder
- Subsearch to retrieve Guarantee from resource plan data
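The subsearch logic can be restated outside Splunk. This is a rough Python sketch, assuming each resource plan row carries `CONSUMER_NAME`, `SHARE_LIMIT` and `PLANNED_OWN` as in the search above; the function names are illustrative and not part of any Splunk or grid API. The guarantee per row is the smaller of the two limits, summed per consumer, then attached to the demand rows in the manner of an outer join.

```python
from collections import defaultdict


def guarantee_by_consumer(plan_rows):
    """Sum min(SHARE_LIMIT, PLANNED_OWN) per consumer, as the eval/stats pair does."""
    totals = defaultdict(int)
    for row in plan_rows:
        totals[row["CONSUMER_NAME"]] += min(row["SHARE_LIMIT"], row["PLANNED_OWN"])
    return dict(totals)


def attach_guarantee(demand_rows, guarantees):
    """Outer-join behaviour: every demand row is kept; Guarantee is None if no plan matched."""
    return [
        {**row, "Guarantee": guarantees.get(row["CONSUMER_NAME"])}
        for row in demand_rows
    ]


plan = [
    {"CONSUMER_NAME": "/A", "SHARE_LIMIT": 10, "PLANNED_OWN": 20},
    {"CONSUMER_NAME": "/A", "SHARE_LIMIT": 30, "PLANNED_OWN": 5},
]
demand = [{"CONSUMER_NAME": "/A", "used": 12}, {"CONSUMER_NAME": "/B", "used": 3}]
joined = attach_guarantee(demand, guarantee_by_consumer(plan))
```

Plotting Allocation, Demand and Guarantee on one chart then shows at a glance whether a consumer is borrowing capacity beyond its guarantee or asking for more than it receives.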

sourcetype="grid-Symphony:consumer_resource_allocation" CLUSTER_NAME="$cluster$" RESOURCE_GROUP="$resourcegroup$" CONSUMER_NAME="$consumer$"
| timechart span=5m max(allocated) as Allocation by CONSUMER_NAME

- Grid Scheduler metrics, retrieved via frequent API call (script) executed by Splunk Forwarder

sourcetype="wmi:cputime"
| lookup grid_inventory host OUTPUT Grid ResourceGroup
| where Grid="LON.PROD" AND ResourceGroup="ComputeHostsCSW5"
| timechart avg(percentprocessortime)

- OS metrics, retrieved via Splunk Forwarder WMI input
- 100s of hosts, ~100,000 events, runs in <10s

sourcetype=grid-ProcMetrics CPUUtil<=100 APP_NAME=$appname$
| eval host=upper(host)
| lookup grid_inventory host OUTPUT Grid Resource-Group as ResourceGroup
| where Grid="$grid$" AND ResourceGroup="$resourcegroup$"
| bucket _time span=5min
| eval NumHosts=[ | inputlookup grid_inventory | rename Resource-Group as ResourceGroup | where Grid="$grid$" AND ResourceGroup="$resourcegroup$" | stats count as query ]
| stats sum(eval(CPUUtil/NumHosts)) as CPUUtil by _time,host,APP_NAME
| timechart limit=12 span=5min sum(CPUUtil) as CPUUtil by APP_NAME

- Process metrics, retrieved via Splunk Forwarder scripted input running custom exe.
- 100s of hosts, 10s of apps, ~2,000,000 events, runs in ~30s.
- Subsearch to get number of hosts for Resource Group, used as denominator.
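The normalisation step is easier to see in isolation. Below is a small sketch under the assumption that each process sample reports `CPUUtil` as a percentage of one host and that there is one sample per process per 5-minute bucket; the event layout and the function name are illustrative. Dividing each sample by the number of hosts in the resource group turns per-host percentages into a share of the whole group.

```python
from collections import defaultdict


def cpu_share_by_app(events, num_hosts, span_seconds=300):
    """Return {(bucket_start, app): percent of the whole resource group's CPU}."""
    if num_hosts <= 0:
        raise ValueError("num_hosts must be positive")
    shares = defaultdict(float)
    for event in events:
        if event["CPUUtil"] > 100:
            continue  # same sanity filter as CPUUtil<=100 in the search
        bucket_start = event["time"] - event["time"] % span_seconds
        shares[(bucket_start, event["APP_NAME"])] += event["CPUUtil"] / num_hosts
    return dict(shares)


samples = [
    {"time": 600, "host": "HOST1", "APP_NAME": "AppA", "CPUUtil": 100},
    {"time": 660, "host": "HOST2", "APP_NAME": "AppA", "CPUUtil": 50},
    {"time": 660, "host": "HOST2", "APP_NAME": "AppB", "CPUUtil": 250},
]
result = cpu_share_by_app(samples, num_hosts=2)
```

With two hosts, an application using 100% of one and 50% of the other accounts for 75% of the group, which is the figure a stacked chart by application would display.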

sourcetype=grid-ProcMetrics CPUUtil<=100 APP_NAME=*$appname$
| eval host=upper(host)
| lookup grid_inventory host OUTPUT Grid Resource-Group as ResourceGroup
| where Grid="$grid$" AND ResourceGroup="$resourcegroup$"
| eval Grid_ResourceGroup=Grid."_".ResourceGroup
| bucket _time span=5min
| eval NumHosts=[ | inputlookup grid_inventory | rename Resource-Group as ResourceGroup | where Grid="$grid$" AND ResourceGroup="$resourcegroup$" | stats count as query ]
| stats sum(eval(CPUUtil/NumHosts)) as CPUUtil by _time, Grid_ResourceGroup
| join _time, Grid_ResourceGroup
    [ search sourcetype="grid-Symphony:egoclient_ViewAllocations"
        [ | inputlookup cluster2grid WHERE GridName="$grid$" | return CLUSTER_NAME ]
      RESOURCE_GROUP="$resourcegroup$" APP_NAME="*$appname$"
    | lookup cluster2grid CLUSTER_NAME OUTPUT GridName as Grid
    | eval Grid_ResourceGroup=Grid."_".RESOURCE_GROUP
    | eval AllAlloc=(unassigned + USED)
    | join type=outer Grid_ResourceGroup
        [ search earliest=-24hr latest=now sourcetype=grid-Symphony:egoclient_ViewResourcePlan RESOURCE_GROUP="$resourcegroup$" CONSUMER_NAME="/"
        | eval numslots=PLANNED_OWN*-1
        | lookup AdminGridName2InvGridName AdminGridName as CLUSTER_ALIAS OUTPUT InvGridName
        | eval Grid_ResourceGroup=InvGridName."_".RESOURCE_GROUP
        | stats first(numslots) as TotalSlots by Grid_ResourceGroup ]
    | eval Allocated=(AllAlloc/TotalSlots)*100
    | eval Used=(USED/TotalSlots)*100
    | bucket _time span=5m
    | stats avg(Allocated) as Allocated, avg(Used) as Used by _time, Grid_ResourceGroup ]
| timechart span=5mins avg(Allocated) as Allocated, avg(Used) as Used, avg(CPUUtil) as CPUUtil

- Join between Process Metrics & Grid Scheduler Metrics
- Subsearches to retrieve details from Hardware Inventory & Resource Plan
- Extremely valuable, as it is used to analyse & improve Grid App efficiency
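The point of the join is to put three numbers on the same percentage scale: slots allocated, slots in use, and CPU actually burned. The arithmetic for the first two can be sketched as follows; the function name and sample figures are illustrative assumptions, and the formulas follow the `eval` statements in the search.

```python
def slot_percentages(used, unassigned, total_slots):
    """Express allocated and used slots as a percentage of the resource group's slots."""
    if total_slots <= 0:
        raise ValueError("total_slots must be positive")
    all_alloc = used + unassigned  # AllAlloc in the search
    return {
        "Allocated": 100.0 * all_alloc / total_slots,
        "Used": 100.0 * used / total_slots,
    }


percentages = slot_percentages(used=250, unassigned=250, total_slots=1000)
```

In this invented example half of the resource group is allocated to the application but only a quarter is in use. Charted next to CPU utilisation, a gap of that kind is what points to an application holding slots it is not using efficiently.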

Cost Transparency
- System Usage Metrics converted into $ via unit cost lookup table
- Helps business managers identify inefficiencies
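As an illustration of the unit cost lookup, here is a minimal sketch. The unit of usage (CPU hours), the cost table and every figure in it are hypothetical; the deck does not say which metric or rates are used.

```python
from collections import defaultdict

# Hypothetical unit costs in $ per CPU hour, keyed by resource group.
UNIT_COST_USD = {"Servers": 0.5, "Workstations": 0.25}


def cost_by_consumer(usage_rows, unit_cost):
    """Convert usage into dollars per consumer via a unit cost lookup."""
    totals = defaultdict(float)
    for row in usage_rows:
        totals[row["consumer"]] += row["cpu_hours"] * unit_cost[row["resource_group"]]
    return dict(totals)


usage = [
    {"consumer": "/A", "resource_group": "Servers", "cpu_hours": 100},
    {"consumer": "/A", "resource_group": "Workstations", "cpu_hours": 40},
    {"consumer": "/B", "resource_group": "Servers", "cpu_hours": 10},
]
costs = cost_by_consumer(usage, UNIT_COST_USD)
```

Keeping the rates in a lookup table rather than in the search itself means they can be revised without touching the dashboard.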

Grid Team Dashboards

Software Consistency Checks
- Scripted inputs on forwarders run regular md5sum checks of key folders
- Lookup against a table which maps known md5sums to versions
- Crucial to ensure we're running the grid software we think we are!
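One way such a check could be written is sketched below. It assumes a single MD5 digest per folder, computed over relative file paths and contents in sorted order, and a hypothetical table mapping known digests to version labels. The deck does not describe how the real script combines per-file checksums.

```python
import hashlib
from pathlib import Path


def folder_md5(folder):
    """Return one MD5 hex digest covering every file under the folder."""
    root = Path(folder)
    digest = hashlib.md5()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renamed or moved files change the digest.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
    return digest.hexdigest()


def resolve_version(checksum, known_versions):
    """Map a checksum to a version label; unknown checksums are flagged."""
    return known_versions.get(checksum, "UNKNOWN")
```

A host whose folder resolves to `UNKNOWN`, or to a version other than the one expected, is the case a consistency dashboard would highlight.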

Infrastructure Health
- Based on indexed data and reconciliation against Inventory lookup table
- Dashboard checked every morning
- Helps us maintain health of 1000s of hosts spread around the globe
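Reconciliation against an inventory amounts to a set comparison. The sketch below assumes host names are the join key and normalises them to upper case, as the searches earlier in the deck do with `upper(host)`; the function name and output keys are illustrative.

```python
def reconcile(inventory_hosts, reporting_hosts):
    """Compare the inventory with the hosts that have recently sent data."""
    inventory = {host.upper() for host in inventory_hosts}
    reporting = {host.upper() for host in reporting_hosts}
    return {
        "missing": sorted(inventory - reporting),      # in inventory, but silent
        "unexpected": sorted(reporting - inventory),   # sending data, not in inventory
    }


report = reconcile(["hostA", "hostB", "hostC"], ["HOSTA", "hostc", "hostZ"])
```

Hosts in the `missing` list are candidates for a health follow-up, while `unexpected` hosts suggest the inventory table itself needs updating.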

Data Centre Temperatures
- HW temperatures collected by Splunk forwarders using WMI
- Lookup against server location table to add Data Centre and Cabinet info
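The enrichment step works like a dictionary lookup keyed on host. In this small sketch the field names `DataCentre`, `Cabinet` and `temperature_c`, along with all sample values, are assumptions made for illustration.

```python
def enrich_with_location(events, locations):
    """Add DataCentre and Cabinet to each event from a host-keyed location table."""
    enriched = []
    for event in events:
        location = locations.get(event["host"].upper(), {})
        enriched.append({
            **event,
            "DataCentre": location.get("DataCentre", "unknown"),
            "Cabinet": location.get("Cabinet", "unknown"),
        })
    return enriched


# Hypothetical location table and temperature readings.
locations = {"HOST1": {"DataCentre": "DC-1", "Cabinet": "C07"}}
readings = [{"host": "host1", "temperature_c": 41}, {"host": "host9", "temperature_c": 38}]
enriched = enrich_with_location(readings, locations)
```

Once each reading carries its cabinet, temperatures can be aggregated by physical location rather than by host name.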

Component of Software Deployment System

Grid Package Deployment
- Many Grid Apps => many Software Packages deployed
- Needed Self Service & Fully Automatic Software Deployment
- App Teams deliver packages to drop-off folder
- Automated Change Control review
- Fully automated deployment to 1000s of hosts globally
- >20TB per week

Architecture
- Deployment triggered using Multicast message
- Agents report status to Splunk via UDP
- Deployment System Daemon queries Splunk using REST API
- Proceeds to next step when 80% of hosts have downloaded package
[Diagram: Package Manager Daemon; REST API: Runs Search Macro; Multicast Trigger; Grid Hosts (1000s); UDP Message]
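The 80% gate can be sketched as a simple polling loop. In this sketch `get_downloaded_count` is a hypothetical callable standing in for the daemon's REST query to Splunk, and the polling interval and retry limit are invented defaults; none of this is taken from the actual deployment system.

```python
import time


def ready_for_next_step(downloaded, total_hosts, threshold_percent=80):
    """True once at least threshold_percent of hosts report the package downloaded."""
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    # Integer arithmetic avoids floating-point surprises at the exact threshold.
    return downloaded * 100 >= threshold_percent * total_hosts


def wait_for_downloads(get_downloaded_count, total_hosts,
                       max_polls=10, poll_seconds=30, sleep=time.sleep):
    """Poll until the threshold is met; return False if max_polls is exhausted."""
    for _ in range(max_polls):
        if ready_for_next_step(get_downloaded_count(), total_hosts):
            return True
        sleep(poll_seconds)
    return False
```

Proceeding at a threshold below 100% keeps a handful of slow or offline hosts from stalling a rollout across thousands of machines.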

Advice

Advice
- Use lookups to enrich your data
- Graph data together
  - Get your data into the same units, for example %
  - Dual Axis
  - Chart Overlay
- Make good use of Scripted Inputs
- Make good use of Custom Search Scripts
- Think outside the box
  - You're only limited by your own imagination!

THANK YOU