Splunk for Data Science




Copyright 2014 Splunk Inc.
Splunk for Data Science
Tom LaGatta, Data Scientist, Splunk
Olivier de Garrigues, Sr Prof Services Consultant, Splunk

Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

3 Key Takeaways
1. Data Science is about extracting actionable insights from data.
2. Splunk is great for doing Data Science!
3. Splunk complements other tools in the Data Science toolkit.

About Us
- Tom LaGatta, Data Scientist
  Tom joined Splunk in Spring 2014 as a Data Scientist specializing in Probability and Statistics. Tom is an expert on the mathematics of inference, and he enjoys functional programming in languages like Clojure, Haskell & R. At Splunk, Tom is helping to develop our internal and external Data Science program and curriculum. Tom has a Ph.D. in Mathematics from the University of Arizona, and until recently was a Courant Instructor at the Courant Institute at New York University. Tom is based in New York City.
- Olivier de Garrigues, Senior Professional Services Consultant
  Olivier is based in London on the EMEA Professional Services team and has helped out more than 40 customers in 10 countries on various Splunk projects in the past year and a half. Prior to this, he worked as a quantitative analyst with extensive use of MATLAB and R. He developed a keen interest in machine learning and enjoys dreaming about how to make Splunk better for data scientists, and helped develop the R Project App. Olivier holds an MS in Mathematics of Finance from Columbia University.

Splunk for Data Science

What is Data Science?
Data Science is about extracting actionable insights from data.
- Helps people make better decisions
- Can be used for automated decision-making
- Data Science is cross-functional, and blends techniques & theories from:
  - CS / Programming
  - Math and Statistics
  - Machine Learning
  - Data Mining / Databases
  - Data Visualization
  - Substantive / Domain Expertise
  - Social Science
  - Communication and Presentation
  - Accounting, Finance and KPIs
  - Business Analytics
- Don't be afraid of Data Science!

Data Science & Analytics Teams
There is no "one size fits all" data scientist. Data Science & Analytics teams are made up of people with complementary skill sets.
Source: Schutt & O'Neil. Doing Data Science. 2013

Splunk for Data Science
Splunk is great for doing Data Science!
- Integrate, query & visualize all the data:
  - Platform for machine data
  - Connects with any other data source
- Easy-to-use Analytics capabilities
- Powerful algorithms out-of-the-box
- Sharp visualizations and dashboards
- Deliver results to both IT & Business users
- Complements other Data Science tools (next slide)

Splunk and Data Science Tools
Splunk complements other tools in the Data Science toolkit:
- Hadoop: the workhorse of the Data Science world. Using Hunk, you can integrate Hadoop & HDFS seamlessly into Splunk.
- R & Python: the preferred languages of Data Science. Execute R & Python scripts in your Splunk queries using the R Project App & SDK for Python.
- SQL & other RDBMS: valuable stores for customer & product data. Use Splunk's DB Connect App to mash relational data up with machine data.
- External tools: export finalized data from Splunk using the ODBC Driver.
  Tip: do all your data processing in Splunk/Hunk, and export only the final results!
- D3 Custom Visualizations: sharp dashboards & reports using Splunk

Splunk and Data Science Use Cases
Splunk is a powerful tool for lots of Data Science use cases. The slide marks each one as Green (easy out of the box) or Yellow (needs tinkering):
- Trend Forecasting
- D3 Custom Visualizations
- A/B Testing
- Predictive Modeling
- Root Cause Analysis
- Sentiment Analysis
- Anomaly Detection
- Conversion Funnel/Pathing
- Market Segmentation
- More Algorithms via R & Python
- Topic Modeling
- Capacity Planning
- Correlate Data from 2+ Sources
- Data Munging & Normalization
- KPIs & Executive Dashboards

Data Science Use Cases

Use Case: Trend Forecasting
Trend Forecasting: given past & realtime data, predict future values & events.
- Common applications:
  - Forecast revenue & other KPIs
  - Web server traffic & product downloads
  - Customer conversion rates
  - Estimate MTTR & server outages
  - Resource & capacity planning (AWS App)
  - Security threats (Enterprise Security App)
- The true course of events can (and will) take only one of many divergent paths. But which one?
- Be mindful of rare events & black swans!

Splunk Solution: predict
- predict command: forecast future trajectories of time series
- Implements a Kalman filter to identify seasonal trends
- Gives an uncertainty envelope as a buffer around the trend
- Tip: Always run the predict command on LOTS of past data. Capture low-frequency and high-frequency trends
- Remember: the future is always uncertain
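To make the idea of a Kalman filter with an uncertainty envelope concrete, here is a minimal Python sketch of a local-level filter. It is not the implementation behind Splunk's predict command: it ignores seasonality, and the function name, the noise variances and the sample series are all assumptions chosen for illustration.

```python
import math

def local_level_forecast(series, horizon, process_var=1.0, obs_var=4.0):
    """Filter `series` with a local-level Kalman filter, then forecast ahead.

    Returns one (mean, lower95, upper95) tuple per future step.
    process_var and obs_var are assumed noise variances, not fitted values.
    """
    level, var = series[0], obs_var           # initialise from the first observation
    for y in series[1:]:
        var += process_var                    # predict: uncertainty grows between points
        gain = var / (var + obs_var)          # Kalman gain: weight given to the new point
        level += gain * (y - level)           # update the level estimate
        var *= (1.0 - gain)                   # update: uncertainty shrinks after observing
    forecast = []
    for step in range(1, horizon + 1):
        # Forecast variance accumulates process noise for every step ahead.
        step_var = var + step * process_var + obs_var
        half_width = 1.96 * math.sqrt(step_var)
        forecast.append((level, level - half_width, level + half_width))
    return forecast

history = [100, 102, 101, 105, 107, 106, 110, 111]   # hypothetical hourly counts
for mean, low, high in local_level_forecast(history, horizon=3):
    print(f"{mean:.1f}  [{low:.1f}, {high:.1f}]")
```

The envelope widens with every step into the future, which is the numerical form of the reminder that the future is always uncertain.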

Splunk Solution: Predict App
David Carasso's Predict App: forecast future values of individual events.
8 minute walkthrough: https://www.youtube.com/watch?v=rovaqjignfg
- Implements a Naïve Bayes classifier
- You have to train models!
- Train a model to predict any target field using any reference field(s):
    ... | fields ref1, ref2, ..., target | train my_model from target
- Guess target field for incoming events:
    ... | guess my_model into target
- Temporal or non-temporal prediction (include _time among reference fields)
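The train/guess workflow above can be sketched in plain Python. This is a toy categorical Naïve Bayes classifier with add-one smoothing, not the Predict App's code; the function names mirror the app's commands only for readability, and the events and field names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(events, target, ref_fields):
    """Count target labels and (field, value) occurrences per label."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)    # label -> Counter of (field, value)
    vocab = defaultdict(set)               # field -> distinct values seen
    for event in events:
        label = event[target]
        label_counts[label] += 1
        for field in ref_fields:
            value_counts[label][(field, event[field])] += 1
            vocab[field].add(event[field])
    return {"labels": label_counts, "values": value_counts,
            "vocab": vocab, "fields": list(ref_fields)}

def guess(model, event):
    """Return the label with the highest smoothed log-probability."""
    total = sum(model["labels"].values())
    best_label, best_score = None, -math.inf
    for label, n in model["labels"].items():
        score = math.log(n / total)                       # prior
        for field in model["fields"]:
            seen = model["values"][label][(field, event.get(field))]
            # Add-one smoothing, with one extra slot for unseen values.
            score += math.log((seen + 1) / (n + len(model["vocab"][field]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [   # hypothetical labelled events
    {"genre": "comedy", "tag": "funny", "rating": "Like"},
    {"genre": "comedy", "tag": "funny", "rating": "Like"},
    {"genre": "comedy", "tag": "slow",  "rating": "Dislike"},
    {"genre": "horror", "tag": "slow",  "rating": "Dislike"},
    {"genre": "horror", "tag": "scary", "rating": "Like"},
    {"genre": "horror", "tag": "slow",  "rating": "Dislike"},
]
model = train(training, target="rating", ref_fields=["genre", "tag"])
print(guess(model, {"genre": "comedy", "tag": "funny"}))
```

Training only counts co-occurrences, so the quality of the guesses depends entirely on how much labelled data the counts are built from.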

Concept: Supervised Learning & Classification
Supervised learning: use observed training data to classify values of unknown testing data
- predict command (Kalman filter):
  - Training data = timechart of past & realtime values
  - Testing data = time range for future values
- Predict App (Naïve Bayes classifier):
  - Training data = events with reference & target fields
  - Testing data = events with reference fields but not target field
- Tip: only deploy models & algorithms after extensive testing & evaluation
- More powerful learning algorithms using R Project App or SDK for Python

Demo: Predict App
- Train a model to predict movie Rating based on MovieID, UserID, Genre, Tag:
    index=movielens Timestamp < 1199188800 UserID=593*
    | eval original_rating = case(Rating<3,"Dislike", Rating=3,"Neutral", Rating>3,"Like")
    | fields original_rating MovieID UserID Genre Tag
    | train rating_model from original_rating
- Guess Rating for test data based on trained model:
    index=movielens Timestamp > 1199188800 UserID=593*
    | guess rating_model into guessed_rating
    | top original_rating guessed_rating
- Accuracy of model: correct on 97.6% of values
- Tip: always train on LOTS of training data! Evaluate before deploying
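The evaluation step of the demo compares the original label with the guessed label on events the model has not seen. A rough Python equivalent is sketched below; the helper names and the label pairs are hypothetical and are not the MovieLens results quoted on the slide.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count each (original, guessed) combination, like a frequency table."""
    return Counter(pairs)

def accuracy(pairs):
    """Fraction of events whose guessed label equals the original label."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (original, guessed) pair")
    correct = sum(1 for original, guessed in pairs if original == guessed)
    return correct / len(pairs)

# Hypothetical held-out results: (original_rating, guessed_rating)
held_out = [
    ("Like", "Like"),
    ("Like", "Like"),
    ("Dislike", "Dislike"),
    ("Neutral", "Like"),
    ("Dislike", "Dislike"),
]
for (original, guessed), count in confusion_counts(held_out).most_common():
    print(original, guessed, count)
print(f"accuracy = {accuracy(held_out):.1%}")
```

Splitting by timestamp, as the demo does, keeps the test events strictly later than the training events, which is closer to how the model would be used on incoming data.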

Use Case: Sentiment Analysis
Sentiment Analysis: the assignment of emotional labels to textual data
- Can be simple +1 vs. -1, or more sophisticated: happy, angry, sad, etc.
- Analyze tweets, emails, news articles, logs or any other textual data
- Social data correlates with other factors
- Typically done via supervised learning:
  - Train a model on labeled corpus of text
  - Test the model on incoming text data
- Read more about Sentiment Analysis:
  - Chapter 14 of Big Data Analytics Using Splunk (pp. 255-282)
  - Michael Wilde & David Carasso. Social Media & Sentiment Analysis. .conf2012
[Figure: 2011 Irish General Election, r = .79]

Splunk Solution: Sentiment Analysis App
David Carasso's Sentiment Analysis App assigns binary sentiment values to textual data (logs, tweets, email, etc.)
- Naïve Bayes classifier under the hood
- Twitter & IMDB models out of the box
- Can guess language of authorship, and "heat", a measure of emotional charge
- Tip: compare relative sentiment changes across time & groups
- How to train your own models: http://answers.splunk.com/answers/59743

Demo: Sentiment Analysis App

Use Case: Anomaly Detection
- An anomaly (or outlier) is an event which is vastly dissimilar to other events
- Anomaly Detection is one of Splunk's most common use cases. Examples:
  - Transactions which occur faster than humanly possible
  - DDoS attacks from IP address ranges
  - High-value customer purchase patterns
- Quick techniques for finding statistical outliers:
  - Non-average outliers: more than 2*stdev from the avg
  - Non-typical outliers: more than 1.5*IQR above perc75 or below perc25
- Tip: save these as eventtypes for automated outlier detection
- Once anomalies have been found, dig deeper to discover root causes
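The two quick outlier rules above translate directly into code. The sketch below uses Python's standard statistics module; the function names and the response-time sample are assumptions, and the quartiles may differ slightly from Splunk's perc25/perc75 because percentile interpolation methods vary.

```python
import statistics

def stdev_outliers(values, k=2.0):
    """Non-average outliers: more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

def iqr_outliers(values, k=1.5):
    """Non-typical outliers: more than k*IQR above Q3 or below Q1."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

response_ms = [120, 130, 125, 118, 122, 127, 131, 119, 124, 900]  # hypothetical
print(stdev_outliers(response_ms))
print(iqr_outliers(response_ms))
```

The IQR rule is usually the more robust of the two, because a single extreme value inflates the standard deviation but barely moves the quartiles.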

Splunk Solution: cluster
- Anomalies are dissimilar to other events (by definition)
- We can use clustering algorithms to help us detect anomalies:
  - Non-anomalous events typically form a few large clusters
  - Anomalous events typically form lots of small clusters
- Cluster your data, sort ascending:
    ... | cluster showcount=true labelonly=true | sort cluster_count cluster_label
- Remember: there is no "right" way to find all anomalies. Explore your data!
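The "small clusters are suspicious" idea can be illustrated with a simple greedy text clustering in Python. This is not the algorithm used by Splunk's cluster command: it assumes Jaccard similarity on whitespace tokens, and the threshold and log lines are made up for the example.

```python
def tokens(text):
    """Lower-cased set of whitespace-separated tokens."""
    return set(text.lower().split())

def jaccard(a, b):
    """Share of tokens two events have in common (1.0 = identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_events(events, threshold=0.4):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []                               # list of (representative_tokens, members)
    for event in events:
        toks = tokens(event)
        for representative, members in clusters:
            if jaccard(representative, toks) >= threshold:
                members.append(event)
                break
        else:
            clusters.append((toks, [event]))    # no match: start a new cluster
    # Smallest clusters first, so likely anomalies surface at the top.
    return sorted((members for _, members in clusters), key=len)

logs = [   # hypothetical events
    "user alice logged in from 10.0.0.1",
    "user bob logged in from 10.0.0.2",
    "user carol logged in from 10.0.0.3",
    "user dave logged in from 10.0.0.4",
    "disk failure detected on volume sda1",
]
for members in cluster_events(logs):
    print(len(members), members[0])
```

Sorting ascending by cluster size plays the same role as sorting by cluster_count in the search above: the rare event types are listed first.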

Concept: Unsupervised Learning & Clustering
- A clustering algorithm is any process which groups together similar things (events, people, etc.), and separates dissimilar things (events, people, etc.)
- Clustering is unsupervised: choose labels based on patterns in the data
- Clustering is in the eye of the beholder:
  - Lots of different clustering algorithms
  - Lots of different similarity functions
- Do not confuse with:
  - Computer cluster: a group of computers working together as a single system
  - Splunk cluster: a group of Splunk indexers replicating indexes & external data

Demo: cluster

Splunk Solution: Other Commands
- anomalies: Assigns an "unexpectedness" score to each event
- anomalousvalue: Assigns an anomaly score to events with anomalous values
- outlier: Removes or truncates outliers
- kmeans: Powerful clustering algorithm. You choose k = # of clusters
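For readers who have not met k-means before, the following Python sketch shows the assign-then-update loop on 2-D points. It is a teaching version, not Splunk's kmeans command: the initial centroids are simply the first k points (an assumption made to keep the run deterministic), and the data is hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a naive, deterministic initialisation."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster in place
        if new_centroids == centroids:
            break                                    # converged
        centroids = new_centroids
    return labels, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]   # hypothetical
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because you choose k up front, it is worth trying several values and checking whether the resulting clusters are meaningful for your data.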

Splunk Solution: Prelert (Partner App)
- Manages Anomaly Detection directly
  - Pre-built dashboards, alerts, API
  - Use cases: Security, IT Ops / APM, DevOps
  - Godfrey Sullivan: "beautifully adjacent and complementary to what Splunk does"
- Can download from Splunk Apps
  - May save you time with Anomaly Detection
  - Can also be a good source of inspiration for your own Anomaly Detection dashboards
- Keep in mind Prelert is a paid app:
  - Cost: $225/month @ 5GB

Use Case: Market Segmentation
- Market Segmentation: group customers according to common needs and priorities, and develop strategies to target them
  - Market segments are internally homogeneous, and externally heterogeneous
  - i.e., market segments are clusters of customers
- Many reasons for Market Segmentation:
  - Different market segments require different strategies
  - Customers in same segment have similar product preferences. Different segments, different preferences
  - Segments should be reasonably stable, to allow for historical analysis (good for Data Science)
- Use Splunk's clustering algorithms to identify and label market segments!

Data Visualizations

Intro to Data Visualization
- Data Visualization is the creation and study of the visual representation of data, and is a vital part of Data Science
- The goal of data visualization is to communicate information:
  - Visualizations communicate complex ideas with clarity, precision, and efficiency
  - Transmission speed of the optic nerve is about 9Mb/sec: fast image processing, pattern matching, edge detection
  - Visualizations pack lots of information into small spaces. More than text alone!

Telling Stories with Data Visualizations
- We process data in linear narratives: even dashboards go top-to-bottom
- Visualizations help pierce the monotony of text, number & data streams
- Think about the story you're telling:
  - Empathize with the viewer
  - What's their takeaway?
- A good visualization tells its own story: "Island Nation Obtains Favourable Balance of Trade; Goes On To Rule The World"
- Weave multiple visualizations together to tell more effective stories
[Figure: William Playfair (1786)]

Splunk Source: New York Times. May 17, 2012

Splunk Source: New York Times. May 17, 2012

Tips for Effective Data Visualizations
- #1 tip: Plot the most important keys on x & y axes
  - You choose "most important". You might need >1 visualization.
- Manipulate size, color and shape to convey additional information
- Annotate, label and add icons
- Use chart overlay to correlate data sources. Mix histograms & line charts
- Manipulate numerical scale: linear vs. log scales (previous 2 slides)
- Read more about Data Visualization:
  - Tableau's whitepaper, Visual Analysis Best Practices (2013)
  - Edward Tufte's The Visual Display of Quantitative Information (2001)

D3 Custom Visualizations in Splunk
- Splunk now supports D3 visualizations with some minor customization
- Satoshi's talk: "I want that cool viz in Splunk!"
- Resources for Custom Visualizations:
  - Splunk Web Framework Toolkit: https://apps.splunk.com/app/1613/
  - Splunk 6.x Dashboard Examples: https://apps.splunk.com/app/1603/
  - Custom SimpleXML Extensions: http://apps.splunk.com/app/1772/
  - Lots more D3 visualizations: https://github.com/mbostock/d3/wiki/gallery

Demo: Sankey Chart

How-to for Sankey Charts
- Install the Custom SimpleXML Extensions app: http://apps.splunk.com/app/1772/
- Create your own app, and install Sankey chart components:
  - Drop autodiscover.js in $SPLUNK_HOME/etc/apps/<YOURAPP>/appserver/static
  - Copy & paste /sankeychart/ subfolder into $SPLUNK_HOME/etc/apps/<YOURAPP>/appserver/static/components
  - Restart Splunk
- In your dashboard:
  - Include script="autodiscover.js" in <form> or <dashboard> opening tag
  - Insert XML snippet from 2- or 3-node Sankey dashboard example
  - Change 2 instances of custom_simplexml_extensions to <YOURAPP>
  - Update search and data-options parameters (nodes) in XML to reflect your data
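A Sankey chart is driven by link rows, each giving a source node, a target node and a weight. The Python sketch below shows how such rows can be derived from ordered paths; it only illustrates the shape of the data, and the function name, the "source"/"target"/"count" keys and the click paths are assumptions rather than what the Sankey dashboard example expects.

```python
from collections import Counter

def sankey_links(paths):
    """Count transitions between consecutive steps of each path."""
    links = Counter()
    for path in paths:
        for source, target in zip(path, path[1:]):
            links[(source, target)] += 1
    # One row per distinct link, sorted for stable output.
    return [{"source": s, "target": t, "count": n}
            for (s, t), n in sorted(links.items())]

click_paths = [   # hypothetical page sequences per session
    ["home", "search", "product"],
    ["home", "product"],
    ["home", "search", "checkout"],
]
for row in sankey_links(click_paths):
    print(row)
```

Whatever search feeds the chart needs to end in a table of this shape, so it helps to check the aggregated rows before wiring them into the dashboard XML.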

Know Your Audience
- Finally, keep in mind your audience: who are they, what questions do they care about, and how do they want to consume the data?
  - Executive: KPIs, charts, tables with icons
  - Marketing Analyst: KPIs & metrics. Sharp images for their own reports & decks. Tableau
  - Data Scientist: output clean data to organized data stores (Hunk, HDFS, SQL, NoSQL)
  - Sysadmin: sparklines, gauges for activity & MTTR, tables with highlighted anomalies
  - Security Ops: maps with detailed overlays, drill down on anomalous events
- Bring it back to the business problem & use

3 Key Takeaways
1. Data Science is about extracting actionable insights from data.
2. Splunk is great for doing Data Science!
3. Splunk complements other tools in the Data Science toolkit.

List of References
Good books on Data Science:
- Schutt & O'Neil. Doing Data Science. O'Reilly 2013
- Provost & Fawcett. Data Science for Business. O'Reilly 2013
- Max Shron. Thinking With Data. O'Reilly 2014
- Edward Tufte. The Visual Display of Quantitative Information. Graphics Press 2001
- Zumel & Mount. Practical Data Science with R. Manning 2014
- Hastie et al. Elements of Statistical Learning. Springer-Verlag 2009 (free PDF!)
Using Splunk for Data Science:
- Zadrozny, Kodali (and Stout). Big Data Analytics Using Splunk. Apress 2013
- David Carasso. Exploring Splunk. CITO Research 2012
- David Carasso. Data Mining with Splunk. .conf2012
- Michael Wilde & David Carasso. Social Media & Sentiment Analysis. .conf2012
Good free references:
- Tableau. Visual Analysis Best Practices. Tableau 2013
- King & Magoulas. 2013 Data Science Salary Survey. O'Reilly 2013
- DJ Patil. Building Data Science Teams. O'Reilly 2013
- Cathy O'Neil. On Being A Data Skeptic. O'Reilly 2013

THANK YOU