Introduction to Splunk Dashboards for Service-Oriented Architecture Monitoring at SurveyMonkey
Michael Sela, Engineering Manager
#splunkconf
Agenda
- Introduction
- Current Applications Architecture and Challenges
- Dashboards
- Summary
Mike Sela, Engineering Manager
- Programming computers for nearly 35 years
- No, never used punch cards
- Enterprise Middleware Specialist turned Data Wrangler turned Manager
- Likes puppies and long walks on the beach
SurveyMonkey At a Glance
- World's leading provider of web-based survey solutions
- Founded in 1999
- Dave Goldberg joined as CEO in 2009
- Freemium business with 15 million+ customers worldwide
- 2 million+ survey responses per day
- 250+ monkeys
- HQ in Palo Alto with offices in Portland, Seattle, Portugal, and Luxembourg
SurveyMonkey Applications Architecture - 2010
[Diagram: .NET, Load Balancer, Cache, SQL Server DB]
SurveyMonkey Log Processing 2010
SurveyMonkey Current Applications Architecture
Why Would We Do This?
- Not easier
- Not faster
BUT
- Allows us to scale as an engineering organization
- Creates a SurveyMonkey platform for partners
SurveyMonkey Log Processing Early 2012
The Log Problem
- Releases were blind and occurred most days
- How do we monitor the health of dozens of components?
- We went from two log files to ~50
- Very few engineers had production access
- Engineering was last to know about problems
- Did not want to code dozens of different solutions
What Tells Me That a Component is Healthy?
- Volume of requests it is handling
- Response time
- Status codes
How do I easily find this information for all my components?
SurveyMonkey Current Architecture
- Most applications were similar and based on the same framework
The Solution
Splunk, obviously
- Gives access to application logs securely (no more blind releases)
- Enables most everyone to do fancy log analysis
Nginx
- Configurable, robust, open-source web server/router
SurveyMonkey Current Applications Architecture
Nginx
- Routes ALL requests, both front-end and back-end
- Can log all sorts of metadata for each request: timestamp, URL, duration, headers, status, referrer, user agent, length, etc.
Nginx.conf Snippet: Splunk-friendly Logging
    http {
        include /etc/nginx/mime.types;

        log_format sm 'time=$time_local,'
                      'rtime=$request_time,status=$status,'
                      'addr=$remote_addr,request=$request';
        access_log /var/log/nginx/access.log sm;
Application Log Sample from Nginx
    time=29/aug/2013:21:17:21-0700, rtime=0.006, status=200, addr=10.10.4.8, request=post /profilesvc/v1/get_user_info HTTP/1.1
    time=29/aug/2013:21:17:22-0700, rtime=0.009, status=200, addr=10.10.4.8, request=post /profilesvc/v1/get_user_info HTTP/1.1
    time=29/aug/2013:21:17:23-0700, rtime=0.023, status=200, addr=10.10.4.8, request=post /profilesvc/v1/update_user_info HTTP/1.1
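To illustrate why this key=value layout is easy for a machine to consume, here is a minimal Python sketch that parses one such line into typed fields. The names `LINE_RE` and `parse_line` are hypothetical and not from the talk; Splunk extracts key=value pairs on its own, so this only mimics the idea outside Splunk.

```python
import re

# Field order follows the log_format shown in the nginx.conf snippet.
# The request part is assumed to be "<method> <path> <protocol>".
LINE_RE = re.compile(
    r"time=(?P<time>[^,]+),\s*"
    r"rtime=(?P<rtime>[\d.]+),\s*"
    r"status=(?P<status>\d{3}),\s*"
    r"addr=(?P<addr>[^,]+),\s*"
    r"request=(?P<method>\S+) (?P<page>\S+) (?P<protocol>\S+)"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["rtime"] = float(record["rtime"])   # request duration in seconds
    record["status"] = int(record["status"])   # HTTP status code
    return record


sample = ("time=29/aug/2013:21:17:21-0700, rtime=0.006, status=200, "
          "addr=10.10.4.8, request=post /profilesvc/v1/get_user_info HTTP/1.1")
record = parse_line(sample)
print(record["page"], record["rtime"], record["status"])
```

Lines that do not follow the format return None instead of raising, which keeps a batch job running when a stray line shows up.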
Typical Daily Splunk Dashboard Content
For each of my web services, a dashboard is built with the following:
- Volume for each page/API (last 24 hours and a week ago)
- Processing time for each page/API (last 24 hours and a week ago)
- Status codes for each page/API (last 24 hours and a week ago)
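The three dashboard panels boil down to three aggregations over the same request records. A rough Python sketch of those aggregations follows; it assumes each record is a dict with `page`, `rtime`, and `status` keys (a hypothetical shape chosen for illustration), and it ignores the time axis that Splunk's timechart adds.

```python
from collections import Counter, defaultdict
from statistics import median


def summarize(records):
    """Compute per-page volume, per-page median duration, and status counts."""
    volume = Counter()             # requests per page
    status_counts = Counter()      # requests per HTTP status code
    durations = defaultdict(list)  # request times per page
    for rec in records:
        volume[rec["page"]] += 1
        status_counts[rec["status"]] += 1
        durations[rec["page"]].append(rec["rtime"])
    median_rtime = {page: median(times) for page, times in durations.items()}
    return volume, median_rtime, status_counts


records = [
    {"page": "/profilesvc/v1/get_user_info", "rtime": 0.006, "status": 200},
    {"page": "/profilesvc/v1/get_user_info", "rtime": 0.009, "status": 200},
    {"page": "/profilesvc/v1/update_user_info", "rtime": 0.023, "status": 200},
]
volume, median_rtime, status_counts = summarize(records)
print(volume.most_common())
print(status_counts)
```

Running the same function over last week's records gives the "a week ago" baseline for comparison.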
Example Daily Dashboard: Volume
Example Daily Dashboard: Request Time
Example Daily Dashboard: Status Codes
Splunk Dashboard Queries
    index="surveymonkey" source="*nginx/jobsvc*" exportjob | rex field=_raw "request=(post|GET) (?<page>.+) " | timechart count by page
    index="surveymonkey" source="*nginx/jobsvc*" exportjob | rex field=_raw "request=(post|GET) (?<page>.+) " | timechart span=30m median(rtime) by page
    index="surveymonkey" source="*nginx/jobsvc*" exportjob | timechart count by status
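For readers unfamiliar with rex and timechart, the first query can be approximated outside Splunk as "extract the page with a regex, then count events per time bucket and page". The Python sketch below is only an analogy under that reading; `extract_page`, `bucket_start`, and `count_by_page` are hypothetical names, events are assumed to arrive as (timestamp, raw line) pairs, and the bucket size is assumed to divide 60 minutes evenly.

```python
import re
from collections import Counter
from datetime import datetime

# Same pattern as the rex command: capture everything between the method
# and the last space of the request as the "page" field.
PAGE_RE = re.compile(r"request=(?:post|GET) (?P<page>.+) ")


def extract_page(raw):
    """Return the page captured from a raw log line, or None if absent."""
    match = PAGE_RE.search(raw)
    return match.group("page") if match else None


def bucket_start(ts, span_minutes=30):
    """Floor a datetime to the start of its span (span must divide 60)."""
    floored = ts.minute - ts.minute % span_minutes
    return ts.replace(minute=floored, second=0, microsecond=0)


def count_by_page(events, span_minutes=30):
    """Count events per (bucket start, page), like 'timechart count by page'."""
    counts = Counter()
    for ts, raw in events:
        page = extract_page(raw)
        if page is not None:
            counts[(bucket_start(ts, span_minutes), page)] += 1
    return counts


events = [
    (datetime(2013, 8, 29, 21, 17, 21),
     "rtime=0.006, status=200, request=post /jobsvc/v1/exportjob HTTP/1.1"),
    (datetime(2013, 8, 29, 21, 47, 2),
     "rtime=0.011, status=200, request=GET /jobsvc/v1/exportjob HTTP/1.1"),
]
for (bucket, page), n in sorted(count_by_page(events).items()):
    print(bucket, page, n)
```

The `/jobsvc/v1/exportjob` path in the usage lines is made up for the example; the talk does not show a jobsvc log line.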
Dashboard XML Snippet: Easy Replication
Splunk> Manager >> User interface >> Views >> DashboardJobSvc
    <?xml version='1.0' encoding='utf-8'?>
    <dashboard>
      <label>jobsvc 24 Hour Dashboard</label>
      <row>
        <chart>
          <searchString>index="surveymonkey" source="*nginx/jobsvc*" exportjob | rex field=_raw "request=(post|GET) (?&lt;page&gt;.+) " | timechart count by page</searchString>
          <title>Call volume by endpoint - Last 24 hours</title>
          <earliestTime>-24h</earliestTime>
          <option name="charting.chart">column</option>
          <option name="charting.chart.stackMode">stacked</option>
          <option name="count">10</option>
          <option name="displayRowNumbers">true</option>
        </chart>
      </row>
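Because every service logs through the same nginx layer, one dashboard definition can be stamped out per service by swapping the service name. The following Python sketch shows one way that replication might be scripted; it is not the speaker's tooling. `build_dashboard` is a hypothetical helper, the camelCase element names (`searchString`, `earliestTime`) are assumed to match Splunk's legacy simple XML, and the jobsvc-specific `exportjob` search term is left out.

```python
import xml.etree.ElementTree as ET


def build_dashboard(service):
    """Return dashboard XML (as a string) with one volume chart for a service."""
    dashboard = ET.Element("dashboard")
    ET.SubElement(dashboard, "label").text = f"{service} 24 Hour Dashboard"
    row = ET.SubElement(dashboard, "row")
    chart = ET.SubElement(row, "chart")

    # ElementTree escapes the "<" and ">" of the named group on output.
    search = (
        f'index="surveymonkey" source="*nginx/{service}*" '
        '| rex field=_raw "request=(post|GET) (?<page>.+) " '
        "| timechart count by page"
    )
    ET.SubElement(chart, "searchString").text = search
    ET.SubElement(chart, "title").text = "Call volume by endpoint - Last 24 hours"
    ET.SubElement(chart, "earliestTime").text = "-24h"
    ET.SubElement(chart, "option", name="charting.chart").text = "column"
    return ET.tostring(dashboard, encoding="unicode")


for service in ["jobsvc", "profilesvc"]:
    print(build_dashboard(service))
```

Building the document with an XML library rather than string concatenation keeps the output well-formed even though the search contains angle brackets.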
But Wait, There's More!
Every nginx log line gets stamped with the machine name:
    time=30/aug/2013:09:19:11-0700, rtime=0.012, status=200, addr=10.10.4.8, request=post /profilesvc/v1/get_user_info HTTP/1.1
    host=sjc-pyweb09 sourcetype=syslog source=/var/log/nginx/profilesvc.access.log
Hardware statistics exist in other files:
    memtotalmb memfreemb memusedmb memfreepct memusedpct pgpageout swapusedpct pgswapout cswitches interrupts forks processes threads loadavg1mi
    32226 14341 17885 44.5 55.5 216778516 0.0 0 2572893770 3952532792 4558405 482 1596 0.02
    host=sjc-pyweb09 sourcetype=vmstat source=vmstat
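The vmstat event is a header row followed by a row of values, so pairing the two by position yields named metrics. A small Python sketch of that pairing, with `parse_vmstat` as a hypothetical helper name and all values converted to floats for simplicity:

```python
def parse_vmstat(header_line, value_line):
    """Pair whitespace-separated column names with their numeric values."""
    names = header_line.split()
    values = value_line.split()
    if len(names) != len(values):
        raise ValueError(
            f"expected {len(names)} values, got {len(values)}"
        )
    return {name: float(value) for name, value in zip(names, values)}


header = ("memtotalmb memfreemb memusedmb memfreepct memusedpct pgpageout "
          "swapusedpct pgswapout cswitches interrupts forks processes "
          "threads loadavg1mi")
values = ("32226 14341 17885 44.5 55.5 216778516 0.0 0 2572893770 "
          "3952532792 4558405 482 1596 0.02")
stats = parse_vmstat(header, values)
print(stats["memusedpct"], stats["loadavg1mi"])
```

The length check guards against a truncated row silently shifting values under the wrong column names.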
Hardware Monitoring by Software Component
Splunk correlates applications with hardware health
- Pick a timeframe
- Pick a component
- View real-time stats: memory (free and used), load average (~CPU), swap, and much more
Hardware Monitoring by Software Component
Splunk correlation in action for jobsvc
Hardware Query by Component XML Snippet
    <chart>
      <title>Load Average</title>
      <option name="charting.chart">line</option>
      <searchTemplate>index=surveymonkey sourcetype=vmstat $time$ [search index="surveymonkey" source="*nginx/$service$*" earliest=-20m | dedup host | table host] | timechart avg(loadavg1mi) by host</searchTemplate>
    </chart>
    </row>
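The bracketed subsearch first finds which hosts recently served the chosen service, and the outer search then charts load average for only those hosts. The Python sketch below imitates that two-step correlation on in-memory events; the function names and the dict shape of the events are assumptions for illustration, and both the 20-minute window and the time bucketing are left out.

```python
from collections import defaultdict


def hosts_for_service(nginx_events, service):
    """Step 1 (the subsearch): distinct hosts whose nginx log belongs to the service."""
    return {
        event["host"]
        for event in nginx_events
        if f"nginx/{service}" in event["source"]
    }


def avg_load_by_host(vmstat_events, hosts):
    """Step 2 (the outer search): mean loadavg1mi per host, limited to those hosts."""
    samples = defaultdict(list)
    for event in vmstat_events:
        if event["host"] in hosts:
            samples[event["host"]].append(event["loadavg1mi"])
    return {host: sum(vals) / len(vals) for host, vals in samples.items()}


nginx_events = [
    {"host": "sjc-pyweb09", "source": "/var/log/nginx/profilesvc.access.log"},
    {"host": "sjc-pyweb10", "source": "/var/log/nginx/jobsvc.access.log"},
]
vmstat_events = [
    {"host": "sjc-pyweb09", "loadavg1mi": 0.02},
    {"host": "sjc-pyweb09", "loadavg1mi": 0.04},
    {"host": "sjc-pyweb10", "loadavg1mi": 1.50},
]
hosts = hosts_for_service(nginx_events, "profilesvc")
print(avg_load_by_host(vmstat_events, hosts))
```

The host `sjc-pyweb10` and its load value are invented for the example; only `sjc-pyweb09` appears in the talk.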
Summary
- Splunk lets me access logs while keeping production machines secure
- Generate Splunk-friendly logs from a common layer of your architecture that sees all requests (e.g. Nginx)
- Use Splunk to correlate across various sources of machine data, including log files, to simplify monitoring and increase visibility
- Generate dashboards that confirm health in seconds
Questions?
The End