Copyright 2014 Splunk Inc.
Grid Computing Analytics with Splunk
Finnbar Cunningham
Head of Grid Computing Operations & Support, Credit Suisse
Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
Setting the Context
- Integrated bank:
  - Private Banking & Wealth Management
  - Investment Banking
- Founded in 1856
- Operations in over 50 countries
- 530 offices and branches
- 45,100 employees
- Client-focused business approach
A Bit About Me
- Long-term interest in Distributed Computing
- Computer Science degree
- 9 years at Credit Suisse working on the evolving Grid Platform
- Heavy SQL user pre-Splunk
- Discovered Splunk in 2011
- Splunk evangelist ever since!
Agenda
- Introduction to Grid Computing @ Credit Suisse
- User Dashboards:
  - Search scripts to interact with the Grid
  - Application Usage Stats
  - Interaction between Applications
  - Grid-wide CPU Utilisation
  - CPU Usage by Grid App
  - Application Efficiency Tuning (Stitching Grid & OS Metrics together)
  - Cost Transparency
- Grid Team Dashboards
  - System Health Checks
- Splunk as Component of Software Deployment System
Compute Grid Mock-up
[Diagram: mock-up of the compute grid]
Grid Computing at Credit Suisse
- Purpose: performs complex risk & pricing calculations for financial products
- How? Work is divided into tasks which are executed on multiple hosts in parallel
- Scale:
  - >100 years of CPU time used daily
  - ~1 billion tasks processed daily
  - 1000s of dedicated servers
  - 1000s of workstations join the grid when idle
  - 100s of applications sharing the grid
  - Applications are guaranteed certain capacity but can borrow more if it's available
My Job: Head of Grid Operations & Support
Responsible for:
- System health
- Incident resolution & problem investigation
- Operations: software deployment etc.
- Efficiency & capacity management
- Driving system evolution
Grid Glossary
Grid
- Collection of hosts working together to process work
- Typically 1 Production Grid per region
Resource Group
- Grouping of hosts, e.g. Servers, Workstations etc.
Resource Plan
- Defines % of Resource Group guaranteed to each application
Slot
- Subdivision of a compute host, e.g. 1 slot per CPU / GB of RAM
Allocation
- Number of slots currently allocated to an application
Consumer
- Grouping of grid applications, generally by business line
Our Splunk Topology
Splunk Topology
- Global Search Head: Searching, Reporting, Alerting, Dashboards
- Regional Indexers
- 1000s of Forwarders
Data Sources
- Logs from:
  - Grid Daemons
  - Grid Applications
- OS Performance Counters:
  - CPU, Memory, Network Activity etc.
- Scripted Inputs:
  - Grid API calls
  - DB Queries
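As an illustration of the scripted-input data source, here is a minimal Python sketch of a script a forwarder could run on a schedule. It writes one timestamped line of key="value" pairs per record to stdout, a shape Splunk can usually field-extract automatically at search time. `fetch_consumer_demand` is a hypothetical stand-in for a real Grid API call, and the sample values are made up; only the field names echo those used in the searches in this deck.

```python
import time


def format_event(fields, timestamp=None):
    """Render one event as a UTC timestamp followed by key="value" pairs."""
    ts = time.time() if timestamp is None else timestamp
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(ts)) + "Z"
    pairs = " ".join('{}="{}"'.format(key, value) for key, value in fields.items())
    return "{} {}".format(stamp, pairs)


def fetch_consumer_demand():
    """Hypothetical stand-in for a Grid API call; returns made-up rows."""
    return [
        {
            "CLUSTER_NAME": "example_cluster",
            "CONSUMER_NAME": "/example/app",
            "used": 40,
            "max_requested": 55,
        }
    ]


if __name__ == "__main__":
    # A scripted input simply prints events; the forwarder collects stdout.
    for row in fetch_consumer_demand():
        print(format_event(row))
```

Keeping the formatting in a pure function makes the output easy to check before the script is wired into a forwarder.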
User Dashboards
User Dashboards for Interacting with Grid

    | script soamclient /grid:ea_dev /ViewSessions /AppName:MC_bladefarm_dev_tuscan /SessionState:Open

    | script soamclient /grid:ea_dev /AppName:MC_bladefarm_dev_griffith /SessionID:239507 /Terminate
    sourcetype="grid-Symphony:consumer_demand" CLUSTER_NAME="$cluster$" CONSUMER_NAME="$consumer$"
    | dedup _time, CONSUMER_NAME
    | join type=outer CONSUMER_NAME
        [ search earliest=-24hr latest=now sourcetype=grid-Symphony:egoclient_ViewResourcePlan CLUSTER_NAME="$cluster$" CONSUMER_NAME="$consumer$"
        | eval Guarantee=IF(SHARE_LIMIT<PLANNED_OWN,SHARE_LIMIT,PLANNED_OWN)
        | stats sum(Guarantee) as Guarantee by CONSUMER_NAME ]
    | timechart span=5min sum(used) as Allocation, sum(max_requested) as Demand, sum(Guarantee) as Guarantee

- Grid Scheduler metrics, retrieved via frequent API call (script) executed by Splunk Forwarder
- Subsearch to retrieve Guarantee from resource plan data
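The Guarantee calculation in the subsearch above takes the smaller of SHARE_LIMIT and PLANNED_OWN for each resource plan row and sums the result per consumer. A minimal Python sketch of the same arithmetic follows; the sample rows are invented for illustration, and only the field names come from the search.

```python
from collections import defaultdict


def guarantee_by_consumer(plan_rows):
    """Sum min(SHARE_LIMIT, PLANNED_OWN) per CONSUMER_NAME."""
    totals = defaultdict(int)
    for row in plan_rows:
        # IF(SHARE_LIMIT<PLANNED_OWN, SHARE_LIMIT, PLANNED_OWN) is min() of the two.
        totals[row["CONSUMER_NAME"]] += min(row["SHARE_LIMIT"], row["PLANNED_OWN"])
    return dict(totals)


# Invented sample rows: one consumer with entries in two resource groups.
sample_plan = [
    {"CONSUMER_NAME": "/example/app", "SHARE_LIMIT": 100, "PLANNED_OWN": 60},
    {"CONSUMER_NAME": "/example/app", "SHARE_LIMIT": 20, "PLANNED_OWN": 50},
    {"CONSUMER_NAME": "/example/other", "SHARE_LIMIT": 10, "PLANNED_OWN": 10},
]
guarantees = guarantee_by_consumer(sample_plan)
print(guarantees)
```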
    sourcetype="grid-Symphony:consumer_resource_allocation" CLUSTER_NAME="$cluster$" RESOURCE_GROUP="$resourcegroup$" CONSUMER_NAME="$consumer$"
    | timechart span=5m max(allocated) as Allocation by CONSUMER_NAME

- Grid Scheduler metrics, retrieved via frequent API call (script) executed by Splunk Forwarder
    sourcetype="wmi:cputime"
    | lookup grid_inventory host OUTPUT Grid ResourceGroup
    | where Grid="LON.PROD" AND ResourceGroup="ComputeHostsCSW5"
    | timechart avg(percentprocessortime)

- OS metrics, retrieved via Splunk Forwarder WMI input
- 100s of hosts, ~100,000 events, runs in <10s
    sourcetype=grid-ProcMetrics CPUUtil<=100 APP_NAME=$appname$
    | eval host=upper(host)
    | lookup grid_inventory host OUTPUT Grid Resource-Group as ResourceGroup
    | where Grid="$grid$" AND ResourceGroup="$resourcegroup$"
    | bucket _time span=5min
    | eval NumHosts=[ | inputlookup grid_inventory
        | rename Resource-Group as ResourceGroup
        | where Grid="$grid$" AND ResourceGroup="$resourcegroup$"
        | stats count as query ]
    | stats sum(eval(CPUUtil/NumHosts)) as CPUUtil by _time,host,APP_NAME
    | timechart limit=12 span=5min sum(CPUUtil) as CPUUtil by APP_NAME

- Process metrics, retrieved via Splunk Forwarder scripted input running custom exe
- 100s of hosts, 10s of apps, ~2,000,000 events, runs in ~30s
- Subsearch to get number of hosts for Resource Group, used as denominator
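To make the denominator step concrete: each process-level CPU reading is divided by the number of hosts in the Resource Group, so the per-application sums read as a share of the whole group. A minimal Python sketch of that normalisation, with invented sample readings and a simplified tuple layout (time bucket, host, app, CPU %):

```python
from collections import defaultdict


def cpu_share_by_app(samples, num_hosts):
    """Sum cpu_util / num_hosts per (time bucket, app), skipping readings above 100."""
    if num_hosts <= 0:
        raise ValueError("num_hosts must be positive")
    totals = defaultdict(float)
    for bucket, host, app, cpu_util in samples:
        if cpu_util <= 100:  # mirrors the CPUUtil<=100 filter in the search
            totals[(bucket, app)] += cpu_util / num_hosts
    return dict(totals)


# Invented readings for a resource group assumed to contain 4 hosts.
sample_readings = [
    ("09:00", "HOST1", "appA", 80.0),
    ("09:00", "HOST2", "appA", 40.0),
    ("09:00", "HOST1", "appB", 20.0),
    ("09:00", "HOST3", "appB", 250.0),  # out-of-range reading, ignored
]
shares = cpu_share_by_app(sample_readings, num_hosts=4)
print(shares)
```

With 4 hosts, appA's two readings contribute 20 and 10 points, so it accounts for 30% of the group's CPU in that bucket.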
    sourcetype=grid-ProcMetrics CPUUtil<=100 APP_NAME=*$appname$
    | eval host=upper(host)
    | lookup grid_inventory host OUTPUT Grid Resource-Group as ResourceGroup
    | where Grid="$grid$" AND ResourceGroup="$resourcegroup$"
    | eval Grid_ResourceGroup=Grid."_".ResourceGroup
    | bucket _time span=5min
    | eval NumHosts=[ | inputlookup grid_inventory
        | rename Resource-Group as ResourceGroup
        | where Grid="$grid$" AND ResourceGroup="$resourcegroup$"
        | stats count as query ]
    | stats sum(eval(CPUUtil/NumHosts)) as CPUUtil by _time,Grid_ResourceGroup
    | join _time, Grid_ResourceGroup
        [ search sourcetype="grid-Symphony:egoclient_ViewAllocations"
            [ | inputlookup cluster2grid | where GridName="$grid$" | return CLUSTER_NAME ]
            RESOURCE_GROUP="$resourcegroup$" APP_NAME="*$appname$"
        | lookup cluster2grid CLUSTER_NAME OUTPUT GridName as Grid
        | eval Grid_ResourceGroup=Grid."_".RESOURCE_GROUP
        | eval AllAlloc=(unassigned + USED)
        | join type=outer Grid_ResourceGroup
            [ search earliest=-24hr latest=now sourcetype=grid-Symphony:egoclient_ViewResourcePlan RESOURCE_GROUP="$resourcegroup$" CONSUMER_NAME="/"
            | eval numslots=planned_own*-1
            | lookup AdminGridName2InvGridName AdminGridName as CLUSTER_ALIAS OUTPUT InvGridName
            | eval Grid_ResourceGroup=InvGridName."_".RESOURCE_GROUP
            | stats first(numslots) as TotalSlots by Grid_ResourceGroup ]
        | eval Allocated=(AllAlloc/TotalSlots)*100
        | eval Used=(USED/TotalSlots)*100
        | bucket _time span=5m
        | stats avg(Allocated) as Allocated, avg(Used) as Used by _time, Grid_ResourceGroup ]
    | timechart span=5mins avg(Allocated) as Allocated, avg(Used) as Used, avg(CPUUtil) as CPUUtil

- Join between Process Metrics & Grid Scheduler Metrics
- Subsearches to retrieve details from Hardware Inventory & Resource Plan
- Extremely valuable, as it is used to analyse & improve Grid App efficiency
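The point of the join above is to put three figures on the same percentage scale: slots allocated, slots used, and CPU actually consumed. A minimal Python sketch of the two slot percentages follows, using the same formulas as the eval steps; the sample numbers are invented.

```python
def slot_percentages(unassigned, used, total_slots):
    """Return (Allocated %, Used %) of a resource group's total slots."""
    if total_slots <= 0:
        raise ValueError("total_slots must be positive")
    all_alloc = unassigned + used              # AllAlloc=(unassigned + USED)
    allocated_pct = (all_alloc / total_slots) * 100
    used_pct = (used / total_slots) * 100
    return allocated_pct, used_pct


# Invented figures: 80 slots in the group, 40 allocated to the app, 30 of them busy.
allocated_pct, used_pct = slot_percentages(unassigned=10, used=30, total_slots=80)
print(allocated_pct, used_pct)
```

An application whose Allocated % sits well above its Used %, or whose Used % sits well above its CPU %, is holding capacity it is not turning into work, which is the kind of gap this chart is meant to expose.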
Cost Transparency
- System Usage Metrics converted into $ via unit cost lookup table
- Helps business managers identify inefficiencies
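A minimal Python sketch of the conversion described above, assuming usage is measured in CPU hours per Resource Group and that the lookup table holds a $ rate per CPU hour. The table contents, the rates and the choice of CPU hours as the unit are all hypothetical; the slides do not say which metrics or rates are used.

```python
from decimal import Decimal

# Hypothetical unit cost lookup table: $ per CPU hour by resource group.
UNIT_COST = {
    "Servers": Decimal("0.05"),
    "Workstations": Decimal("0.01"),
}


def usage_cost(cpu_hours_by_group, unit_cost):
    """Multiply whole CPU hours by the looked-up unit cost for each group."""
    costs = {}
    for group, hours in cpu_hours_by_group.items():
        if group not in unit_cost:
            raise KeyError("no unit cost for resource group: " + group)
        costs[group] = Decimal(hours) * unit_cost[group]
    return costs


# Invented usage figures for one application.
costs = usage_cost({"Servers": 1000, "Workstations": 250}, UNIT_COST)
total = sum(costs.values())
print(costs, total)
```

Raising on a missing rate, instead of silently charging zero, keeps gaps in the lookup table visible.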
Grid Team Dashboards
Software Consistency Checks
- Scripted input on forwarders runs regular md5sum checks of key folders
- Lookup against a table which maps known md5sums to versions
- Crucial to ensure we're running the grid software we think we are!
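A minimal Python sketch of the check described above: compute one MD5 digest for a folder, then look it up in a table of known digests. Hashing relative paths plus file contents in sorted order is an assumption made here to get a stable per-folder digest; the slides do not say how the real md5sum check combines files, and the version label is invented.

```python
import hashlib
import os
import tempfile


def folder_md5(root):
    """Return one MD5 hex digest covering every file under root."""
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so the digest is repeatable
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            digest.update(rel.encode("utf-8"))
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()


def identify_version(digest, known_versions):
    """Map a digest to a version label, or UNKNOWN if it is not in the table."""
    return known_versions.get(digest, "UNKNOWN")


# Demo on a throwaway folder: record a baseline, then modify a file.
with tempfile.TemporaryDirectory() as folder:
    config_path = os.path.join(folder, "grid.cfg")
    with open(config_path, "w") as handle:
        handle.write("version=1\n")
    baseline = folder_md5(folder)
    known = {baseline: "1.0.0"}  # invented version label
    print(identify_version(folder_md5(folder), known))
    with open(config_path, "a") as handle:
        handle.write("unexpected change\n")
    changed = folder_md5(folder)
    print(identify_version(changed, known))
```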
Infrastructure Health
- Based on indexed data and reconciliation against Inventory lookup table
- Dashboard checked every morning
- Helps us maintain health of 1000s of hosts spread around the globe
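Reconciliation against an inventory boils down to comparing two sets of hosts: those the inventory says should exist and those that actually appear in the indexed data. A minimal Python sketch, with invented host names and hypothetical category labels ("silent", "unlisted"):

```python
def reconcile(inventory_hosts, reporting_hosts):
    """Compare expected hosts with hosts seen in indexed data, ignoring case."""
    expected = {host.upper() for host in inventory_hosts}
    seen = {host.upper() for host in reporting_hosts}
    return {
        "silent": sorted(expected - seen),    # in inventory, sending no data
        "unlisted": sorted(seen - expected),  # sending data, not in inventory
    }


# Invented host names.
report = reconcile(
    inventory_hosts=["HOST1", "HOST2", "HOST3"],
    reporting_hosts=["host1", "host3", "host9"],
)
print(report)
```

Upper-casing both sides mirrors the `eval host=upper(host)` step used before inventory lookups elsewhere in this deck.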
Data Centre Temperatures
- HW Temperatures collected by Splunk forwarders using WMI
- Lookup against server location table to add Data Centre and Cabinet info
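A minimal Python sketch of the enrichment step: each temperature reading is joined to a location table keyed by host, after which readings can be grouped by cabinet. The host names, locations, temperatures and the `temp_c` field name are all invented for illustration.

```python
def enrich(readings, locations):
    """Attach data centre and cabinet to each reading via a host lookup."""
    enriched = []
    for reading in readings:
        place = locations.get(reading["host"].upper(), {})
        enriched.append({
            **reading,
            "data_centre": place.get("data_centre", "unknown"),
            "cabinet": place.get("cabinet", "unknown"),
        })
    return enriched


def hottest_by_cabinet(enriched):
    """Return the highest temperature seen per (data centre, cabinet)."""
    hottest = {}
    for row in enriched:
        key = (row["data_centre"], row["cabinet"])
        hottest[key] = max(hottest.get(key, row["temp_c"]), row["temp_c"])
    return hottest


# Invented location table and readings.
location_table = {
    "HOST1": {"data_centre": "DC-A", "cabinet": "C01"},
    "HOST2": {"data_centre": "DC-A", "cabinet": "C01"},
}
sample_temps = [
    {"host": "host1", "temp_c": 41},
    {"host": "host2", "temp_c": 47},
    {"host": "host7", "temp_c": 39},  # not in the location table
]
by_cabinet = hottest_by_cabinet(enrich(sample_temps, location_table))
print(by_cabinet)
```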
Component of Software Deployment System
Grid Package Deployment
- Many Grid Apps => Many Software Packages deployed
- Needed Self Service & Fully Automatic Software Deployment
- App Teams deliver packages to drop-off folder
- Automated Change Control review
- Fully Automated deployment to 1000s of hosts globally
- >20TB per week
Architecture
- Deployment triggered using Multicast message
- Agents report status to Splunk via UDP
- Deployment System Daemon queries Splunk using REST API
- Proceeds to next step when 80% of hosts have downloaded package
[Diagram: Package Manager Daemon sends a Multicast Trigger to Grid Hosts (1000s); hosts send UDP Messages to Splunk; the daemon runs a Search Macro via the REST API]
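The gating rule in the last bullet can be sketched in a few lines of Python. This shows only the decision the daemon makes once it has the per-host statuses; fetching them from Splunk over the REST API is left out. The status strings and host names are hypothetical.

```python
def ready_to_proceed(host_status, total_hosts, threshold_pct=80):
    """True once at least threshold_pct percent of hosts report 'downloaded'."""
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    downloaded = sum(1 for status in host_status.values() if status == "downloaded")
    # Integer comparison avoids floating-point rounding at the exact threshold.
    return downloaded * 100 >= threshold_pct * total_hosts


# Invented statuses for a 10-host grid: hosts that have not reported are absent.
statuses = {"HOST{}".format(i): "downloaded" for i in range(1, 8)}
statuses["HOST8"] = "downloading"
print(ready_to_proceed(statuses, total_hosts=10))  # 7 of 10 have downloaded

statuses["HOST8"] = "downloaded"
print(ready_to_proceed(statuses, total_hosts=10))  # 8 of 10 have downloaded
```

Passing the total host count separately, not inferring it from the status map, means hosts that never reported still count against the threshold.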
Advice
Advice
- Use lookups to enrich your data
- Graph data together
  - Get your data into the same units, for example %
  - Dual Axis Chart Overlay
- Make good use of Scripted Inputs
- Make good use of Custom Search Scripts
- Think outside the box
  - You're only limited by your own imagination!
THANK YOU