Copyright 2014 Splunk Inc. Telemetry: The Customer Experience Simon Warrington Senior Program Manager, Microsoft
Disclaimer During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release. 2
Agenda • Telemetry Defined • The Splunk Journey • Architecture • Impact of Ops on User Experience 3
Microsoft Xbox • Microsoft: $77.8B FY 2013, 99K+ employees • Xbox One: all-in-one entertainment console, 5 MM+ sold • Xbox Entertainment Studio: video-based applications for sports, live events and original narrative content 4
About Me • Senior Program Manager • Microsoft employee for 2 years • IBM Enterprise Architect for 13 years • Lives and works out of Vancouver BC, Canada 5
Xbox Entertainment Studio (XES) Charter • Showcase Xbox capabilities and break new ground • Provide content and experiences only found on Xbox One • Influence sales and console usage 6
Telemetry?
What is Telemetry? • Telemetry is the highly automated communications process by which measurements are made and other data collected at remote or inaccessible points and transmitted to receiving equipment for monitoring 8
Understanding the Customer Experience • Gain last-mile insights in real time • Correlate errors or performance characteristics across Xbox and cloud ecosystems • Gain visibility into occurrence and source of outages The Importance of Telemetry 9
Telemetry: The So What? Test Thin Client Thick Client Monitor Server Logs What are users doing out there??? 10
What Needs to be Monitored? 11
The Splunk Journey
The Splunk Journey (timeline): Console Games (before 2012) • Legacy System (2012) • Splunk Storm (2013) • Splunk Enterprise (2014) 13
Previous BI Solution: Key Challenges • Homegrown, brittle BI solution • Schema-driven, very rigid • Difficult to accommodate changes • Needed more stability and reliability • Difficult to ingest data • Drain on engineering resources • Disjointed picture of the customer experience 14
The Splunk Promise • Telemetry logs: Just fire the hose at Splunk and it has the intelligence to understand what those key-value pairs mean. • Critical information available in real time • Powerful time-series analytics • Access to granular data • Robust alerting 15
Architecture
2012: Legacy System (architecture diagram; labeled components include App0 through App8, a capture layer and capture service, a buffer level, ASP.NET REST endpoints, Azure cloud services and blob storage, live telemetry, database replication, on-site and SQL Azure proxies, staging SQL databases, SSIS, a data mart, Vertica, Omniture, purchasing and dashboard services, a refund tool, a Splunk Storm forwarder, a BI subscriber, report servers, reports and alerts) 17
2013 Architectural Context: Splunk Storm + Apache Storm + Ducksboard + Hadoop • Splunk Storm: limited dashboard access, limited real-time queries • Splunk was unproven: we needed some redundancy • Other systems allowed Splunk to focus on troubleshooting • Diagram components: Service Bus, Azure Storage, Hadoop, Splunk Storm, Partner Feeds, Apache Storm, Azure Cloud Services 18
High Level Architecture Splunk Enterprise in Microsoft Cloud Azure Service Bus Ops Team Azure Cloud Services 19
Cluster Topology Splunk Enterprise in Microsoft Cloud Azure Region 1 Region 2 Ops Team Region 3 Azure Cloud Services Services Team 20
Splunk Specs • Data sources: XBOX 360 telemetry logs, XBOX One telemetry logs, SmartGlass telemetry logs, Win 8 / Win Phone 8, Cloud Services • Indexing: average 75G/day, peak 250G/day • Stakeholders: Engineering, BI, IT Operations, Leadership teams 21
2014 Architectural Context: Splunk Enterprise + Hadoop • Splunk Enterprise: Operations all up • Splunk capabilities proven • Simultaneously supports multiple teams and data perspectives • Access to data near real time • Ever growing: Xbox One, Xbox 360, Win 8, Win Phone 8, Cloud Services and SmartGlass • Infrastructure dramatically simplified • Diagram components: Service Bus, Azure Cloud Services, Hadoop, Splunk 6.0, Partner Feeds 22
Impact of Ops on User Experience
Xbox Telemetry: Splunk-eye-view 2014-09-06 01:34:14.9-0000 / XXXXXXXXXXXXXXXX / nfl_x1 / video_heartbeat / message_id=a4095c7a5e204ffa869e776a7312d924, appsource=nfl_x1, clientip=xx.xx.xx, log_time=2014-09-06 01:34:15.2-0000, event_type=video_heartbeat, video_clip_type=clip, video_name="underrated week 1 matchups", device_type=durangoapp, video_session_id=498c8dee-6f02-4f8c-b55a-bbb0172d8bef, event_name=video_heartbeat, session_id=c0f903b2-ac5b-4cc9-996d-4a00247df6b3, content_version=1.8.1.44986, video_buffer_seconds=59, video_length=182.624, video_progress=162.8571785, video_bitrate=583000, app_channel=nflnow, video_avg_receive_rate=6045924, video_buffer_progress_percent=100, video_min_bitrate=100000, video_receive_rate=3834704, tx_sequence=5454, video_total_dropped_frames=0, primary_video_id=2c42dcd2-c98f-4e2c-9995-e7a1f34dec64, video_playback_speed=1, mode=viewfullscreenlandscape, video_player_state=playing, video_start_bitrate=600000, video_channel=nflnow, heartbeat_type=secondary, video_render_fps=59.9460334777832, authentication=n, video_max_bitrate=600001, date_time=2014-09-06t01:34:17, video_dropped_fps=0, video_cc_enabled=n, video_id=dd49c4bb-b47c-4582-aea6-cc32a6c5e698, video_stream_url="http://fvodhstream-vh.akamaihd.net/i/films/2014/nfl_com/fantasy/reg/01/140905_fantasy_live_wk1_underrated_matchups_\,180k\,320k\,500k\,700k\,1200k\,2000k\,3200k\,5000k\,.mp4.csmil/master.m3u8", primary_video_category=clip, sub_session_id=0, build_version=1.8.0.44559, redzone_authentication=n, PartitionKey="q-s-splunkXXX-64" 24
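To make the key=value layout of the heartbeat event above concrete, here is a minimal Python sketch that splits such a payload into fields. It is an illustration only, not a description of how Splunk performs field extraction; the function name `parse_pairs` and the shortened sample string are assumptions made for this example, with field names and values copied from the event on the slide.

```python
import re

# key=value, where the value is either a double-quoted string (which may
# contain commas and backslash escapes) or a bare token up to the next comma.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')

def parse_pairs(payload: str) -> dict:
    """Split a comma-separated key=value payload into a dict of strings."""
    fields = {}
    for key, raw in PAIR.findall(payload):
        value = raw.strip()
        # Drop the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields

# Shortened sample: a handful of fields taken from the slide's event.
sample = (
    'event_type=video_heartbeat, video_name="underrated week 1 matchups", '
    'video_buffer_seconds=59, video_bitrate=583000, video_player_state=playing'
)
event = parse_pairs(sample)
print(event["video_name"], event["video_buffer_seconds"])
```

All values come back as strings; a caller would convert numeric fields such as `video_buffer_seconds` before charting or thresholding them.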
Xbox Telemetry: Ops-eye-view • Number of users: In Application, Watching Video, Playing Fantasy Football • Error Percentages over CCU • External dependency resolution 25
Same data, different perspectives Different Perspectives for different stakeholders 26
Real-time Operational Insights • Population Perspective: concurrent users/sessions within an application/video event; % of errors by type and by population (trigger threshold alerts) • Individual Perspective: user session-level troubleshooting • System Perspectives: customer issues correlated with cloud service issues 27
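The population perspective (error percentage by type, measured against concurrent users, with a threshold alert) can be sketched in a few lines of Python. This is a hedged illustration rather than the team's actual searches or alerts: the `error_type` field, the event counts, and the 2% threshold are all hypothetical, and the sketch counts error events rather than distinct affected users.

```python
from collections import Counter

def error_percentages(events, concurrent_users):
    """Return {error_type: error events as a percent of concurrent users}."""
    if concurrent_users <= 0:
        return {}
    # Only events that carry an error_type (hypothetical field) are counted.
    counts = Counter(e["error_type"] for e in events if e.get("error_type"))
    return {kind: 100.0 * n / concurrent_users for kind, n in counts.items()}

def breached(percentages, threshold_pct):
    """Return the error types whose percentage meets or exceeds the threshold."""
    return sorted(k for k, pct in percentages.items() if pct >= threshold_pct)

# Hypothetical window of events observed while 1000 users were concurrent.
events = (
    [{"error_type": "buffering_timeout"}] * 30
    + [{"error_type": "dependency_failure"}] * 5
    + [{"event_type": "video_heartbeat"}] * 200
)
percentages = error_percentages(events, concurrent_users=1000)
alerts = breached(percentages, threshold_pct=2.0)
print(percentages, alerts)
```

Normalizing by concurrent users keeps the alert meaningful as the audience grows or shrinks, which a raw error count would not.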
Real-time Operational Verification • Release Management Process: can deploy new applications and watch adoption rates on multiple platforms in real time; can update any existing application (validate telemetry; verify no user drop-off; quickly assess before and after behavior, such as change in error rates) 28
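The before-and-after check on a release can be expressed as a small comparison of error rates. The Python sketch below is one possible way to frame it, under stated assumptions: the function names, the (errors, sessions) inputs, the sample numbers, and the 0.1 percentage-point tolerance are hypothetical and do not come from the presentation.

```python
def error_rate(errors: int, sessions: int) -> float:
    """Errors as a percentage of sessions; 0.0 when there are no sessions."""
    return 100.0 * errors / sessions if sessions else 0.0

def compare_release(before, after, max_increase_pts=0.1):
    """Compare (errors, sessions) pairs from before and after a deployment.

    Flags a regression when the error rate rises by more than
    max_increase_pts percentage points (a hypothetical tolerance).
    """
    rate_before = error_rate(*before)
    rate_after = error_rate(*after)
    delta = rate_after - rate_before
    return {
        "before_pct": rate_before,
        "after_pct": rate_after,
        "delta_pts": delta,
        "regressed": delta > max_increase_pts,
    }

# Hypothetical counts: 12 errors in 4000 sessions before, 30 in 5000 after.
result = compare_release(before=(12, 4000), after=(30, 5000))
print(result)
```

Comparing rates rather than raw counts avoids flagging a release simply because adoption, and therefore session volume, went up.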
Fast Troubleshooting Examples • Partner: problem: Xbox title not working; symptoms: spike in external dependency failure events; issue: ISP outage in NY; resolution: 15 min (old: days) • Customer: problem: poor video quality for UFC PPV; symptoms: video buffering time-out exceptions across user sessions; issue: simultaneous Netflix viewing; resolution: 5 min (old: hours) • Internal: problem: configuration defects; symptoms: Concurrent Video Viewer trend line drops off; issue: misconfigured video end-point; resolution: 5 min (old: hours) 29
Results with Splunk • Proactive Alerting and Troubleshooting: issue notification from 15-20 minutes to seconds; eliminated false positive alerts; faster to determine source of issues (partner or internal) • Better Resource Allocation: maintenance from 16 hours per DAY to 8 hours per month; BI team no longer responsible for troubleshooting; engineers focused on building applications • Operational Insights: creating dashboards from 1 week to hours; capacity forecasting; app usage and trends 30
Key Takeaways • Cloud-based Splunk implementation: Cloud Disk I/O; Universal Forwarder vs other integration options; Load Balancing across the index cluster; Load Balancing Search heads for different stakeholders; Other Limitations 31
THANK YOU