Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology

BACKGROUND

Challenges of log data

SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='crit' AND msg LIKE '%failure%' GROUP BY hostname, hour

[Figure: query results by hour, 11:00 through 18:00]

Log sources (ca. 2000): postgres, httpd, syslog

Log sources (ca. 2016): postgres, Cassandra, httpd, nginx, syslog, Rails, CouchDB, redis, Django, haproxy, k8s

How many services are generating logs in your datacenter today?

Apache Spark

Libraries: Graph, Query, ML, Streaming
Spark core (distributed collections, scheduler)
Cluster managers: ad hoc, Mesos, YARN

DATA INGEST

Collecting log data

collecting: Ingesting live log data via rsyslog, logstash, fluentd
normalizing: Reconciling log record metadata across sources
warehousing: Storing normalized records in ES indices
analysis: Caching warehoused data as Parquet files on a Gluster volume local to the Spark cluster

Schema mediation

timestamp, level, host, IP addresses, message, &c.
rsyslog-style metadata, like app name, facility, &c.

Structured queries at scale

logs.select("level").distinct.as[String].collect

debug, notice, emerg, err, warning, crit, info, severe, alert

logs.groupBy($"level", $"rsyslog.app_name").agg(count("level").as("total")).orderBy($"total".desc)

kubelet 17933574
kube-proxy 10961117
ERR journal 6867921
systemd 5184475

FEATURE ENGINEERING

From log records to vectors

What does it mean for two log records to be similar?

red -> 000
green -> 010
blue -> 100
orange -> 001

pancakes -> 10000
waffles -> 01000
aebleskiver -> 00100
omelets -> 00001
bacon -> 00000
hash browns -> 00010

red pancakes -> 00010000
orange waffles -> 00101000

{level:, hostname: fred, timestamp: "2016-05-06 00:00:01",...} -> 00000100
{level: DEBUG, hostname: barney, timestamp: "2016-05-06 00:00:02",...} -> 00100000
{level:, hostname: fred, timestamp: "2016-05-06 00:00:03",...} -> 00000100
{level: ERR, hostname: barney, timestamp: "2016-05-06 00:00:03",...} -> 00001000
{level: ALERT, hostname: wilma, timestamp: "2016-05-06 00:00:03",...} -> 10000000
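The color and breakfast encodings above can be reproduced with a few lines of plain Scala. This is a minimal sketch, not the code behind the talk: the object name `OneHotSketch` and its methods are hypothetical, and it assumes that one category per factor (red, bacon) is the reference value encoded as all zeros, as the slides suggest.

```scala
object OneHotSketch {
  // Categories in bit order; the reference category (e.g. "red") is
  // deliberately absent from the list and therefore encodes to all zeros.
  val colors: Seq[String] = Seq("blue", "green", "orange")
  val breakfasts: Seq[String] =
    Seq("pancakes", "waffles", "aebleskiver", "hash browns", "omelets")

  // One bit per listed category, set only where the value matches.
  def encode(categories: Seq[String], value: String): String =
    categories.map(c => if (c == value) '1' else '0').mkString

  // A record's vector is the concatenation of its per-factor encodings.
  def encodeRecord(color: String, breakfast: String): String =
    encode(colors, color) + encode(breakfasts, breakfast)
}
```

With these assumed category orders, `OneHotSketch.encodeRecord("red", "pancakes")` yields `00010000` and `OneHotSketch.encodeRecord("orange", "waffles")` yields `00101000`, matching the slides.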

Similarity and distance

Euclidean distance: $\sqrt{(q - p) \cdot (q - p)}$

Manhattan distance: $\sum_{i=1}^{n} |p_i - q_i|$

Cosine similarity: $\frac{p \cdot q}{\|p\| \, \|q\|}$

$\frac{10 - 3}{10} = .7$
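The three measures named above are short enough to write out directly. The following is a sketch in plain Scala rather than anything from the talk; `DistanceSketch` and the `bits` helper are hypothetical names, and vectors are assumed to have equal length.

```scala
object DistanceSketch {
  // Convert a bit string such as "00000100" into a numeric vector.
  def bits(s: String): Seq[Double] =
    s.map(c => if (c == '1') 1.0 else 0.0)

  private def dot(p: Seq[Double], q: Seq[Double]): Double =
    p.zip(q).map { case (a, b) => a * b }.sum

  // sqrt((q - p) . (q - p))
  def euclidean(p: Seq[Double], q: Seq[Double]): Double = {
    val d = q.zip(p).map { case (b, a) => b - a }
    math.sqrt(dot(d, d))
  }

  // sum over i of |p_i - q_i|
  def manhattan(p: Seq[Double], q: Seq[Double]): Double =
    p.zip(q).map { case (a, b) => math.abs(a - b) }.sum

  // (p . q) / (|p| |q|); undefined (NaN) if either vector is all zeros
  def cosine(p: Seq[Double], q: Seq[Double]): Double =
    dot(p, q) / (math.sqrt(dot(p, p)) * math.sqrt(dot(q, q)))
}
```

For example, the encoded records `00000100` and `00001000` from the previous section differ in two positions, so their Manhattan distance is 2 and, since they share no set bit, their cosine similarity is 0.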

Other interesting features

[Figure: sequences of log levels (WARN, DEBUG, ...) emitted over time by host01, host02, and host03]

: Everything is great! Just checking in to let you know I'm OK.
CRIT: No requests in last hour; suspending running app containers.
: Phoenix datacenter is on fire; may not rise from ashes.

NATURAL LANGUAGE and LOG DATA

Introducing word2vec

Madrid, Spain, Paris, France

v("madrid") - v("spain") + v("france") ≈ v("paris")

Preprocessing text

What is a word?
How are log words different from words found in conventional prose?

Apache Oozie, Java HotSpot, systemd, etcd, ZooKeeper, /dev/null, OutOfMemoryError, Kernel::exec, ENOENT, OPEN_MAX, 192.168.0.1, QEMU

// assume messages is an RDD of log message texts, and that
// collapseWhitespace and stripPunctuation are String => String helpers defined elsewhere
val oneLetter = new scala.util.matching.Regex(".*([A-Za-z]).*")

// split a message on spaces, post-process each token, and keep only tokens containing a letter
def tokens(s: String, post: String => String): Seq[String] =
  collapseWhitespace(s)
    .split(" ")
    .map(s => post(stripPunctuation(s)))
    .collect { case token @ oneLetter(_) => token }

val tokenSeqs = messages.map(line => tokens(line, identity[String]))
val w2v = new org.apache.spark.mllib.feature.Word2Vec
val model = w2v.fit(tokenSeqs)
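The slide code relies on two helpers, collapseWhitespace and stripPunctuation, that the deck does not show. The sketch below offers one possible reading of them in plain Scala so the tokenizer can be tried without a Spark cluster. The helper bodies are assumptions, not the talk's definitions: in particular, this version strips only enclosing brackets, quotes, and trailing punctuation so that log "words" such as /dev/null and Kernel::exec survive intact.

```scala
object TokenizerSketch {
  private val oneLetter = ".*[A-Za-z].*".r

  // Assumed behavior: trim and squeeze runs of whitespace to single spaces.
  def collapseWhitespace(s: String): String =
    s.trim.replaceAll("\\s+", " ")

  // Assumed behavior: drop leading brackets/quotes and trailing punctuation,
  // leaving punctuation inside a token (paths, dotted names) untouched.
  def stripPunctuation(s: String): String =
    s.replaceAll("^[(\\[{\"']+", "").replaceAll("[.,;:!?)\\]}\"']+$", "")

  // Same shape as the slide's tokens function, minus the Spark dependency.
  def tokens(s: String, post: String => String = (t: String) => t): Seq[String] =
    collapseWhitespace(s)
      .split(" ")
      .toSeq
      .map(t => post(stripPunctuation(t)))
      .filter(t => oneLetter.pattern.matcher(t).matches())
}
```

Under these assumptions, `TokenizerSketch.tokens("Power supply 2 failed.")` keeps `Power`, `supply`, and `failed` and drops the purely numeric token, which mirrors the one-letter filter in the slide code.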

Insights from log messages

nova: containers, glance, vm, images, instances
password: opened, publickey, accepted, IPMI
sh: AUDIT_SESSION, /usr/bin/bash

VISUALIZING STRUCTURE and FINDING OUTLIERS

Multidimensional data

[4,7]
[2,3,5]
[7,1,6,5,12,8,9,2,2,4,7,11,6,1,5]

A linear approach: PCA 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1

A nonlinear approach: t-SNE

Tree-based approaches

[Figure: decision trees that split records on rules such as "if red", "if orange", "if !red", "if !gray", and "if !orange", each branch leading to yes/no outcomes]

Self-organizing maps

Training SOMs

Finding outliers with SOMs

Outliers in log data

[Figure: records labeled with their best-match scores: 0.95, 0.97, 0.92, 0.37, 0.94, 0.89, 0.91, 0.93, 0.96]

An outlier is any record whose best match was at least 4σ below the mean.
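The 4σ rule can be expressed as a small function over best-match scores. This is a sketch under stated assumptions rather than the talk's implementation: `OutlierSketch` and `outlierIndices` are hypothetical names, the scores are assumed to be already collected into a local sequence, and the population standard deviation is used.

```scala
object OutlierSketch {
  // Return the indices of scores lying at least k standard deviations
  // below the mean (k = 4 reproduces the rule stated on the slide).
  def outlierIndices(scores: Seq[Double], k: Double = 4.0): Seq[Int] = {
    val n = scores.length.toDouble
    val mean = scores.sum / n
    val sigma = math.sqrt(scores.map(s => (s - mean) * (s - mean)).sum / n)
    if (sigma == 0.0) Seq.empty // identical scores: nothing stands out
    else {
      val threshold = mean - k * sigma
      scores.zipWithIndex.collect { case (s, i) if s <= threshold => i }
    }
  }
}
```

Note that a 4σ cutoff only makes sense on a large sample: with a handful of scores, no single value can sit that far below the mean, so the nine values shown on the slide would not by themselves flag 0.37.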

Out of 310 million log records, we identified 0.0012% as outliers.

Thirty most extreme outliers

10 Can not communicate with power supply 2.
9 Power supply 2 failed.
8 Power supply redundancy is lost.
1 Drive A is removed.
1 Can not communicate with power supply 1.
1 Power supply 1 failed.

LESSONS LEARNED and KEY TAKEAWAYS

Spark and Elasticsearch
Data locality is an issue and caching is even more important than when running from local storage.
The best practice is to warehouse ES indices to Parquet files and write Spark jobs against them.

Structured queries in Spark
Always program defensively: mediate schemas, explicitly convert null values, etc.
Use the Dataset API whenever possible to minimize boilerplate and benefit from query planning without (entirely) forsaking type safety.

Natural language tips
Conventional preprocessing pipelines probably won't get you the best results on log message texts.
Consider the sorts of words and word variants you'll want to recognize with your preprocessing pipeline.
Use an approximate whitelist for unusual words.

Memory and partitioning
Large JVM heaps can lead to appalling GC pauses.
You probably won't be able to use all of your memory in a single JVM without executor timeouts.
Consider memory budget at the driver when choosing a partitioning strategy.

Feature engineering
Design your feature engineering pipeline so you can translate feature vectors back to factor values.
Favor feature engineering effort over complex or novel learning algorithms.
Prefer approaches that train interpretable models.
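One way to keep feature vectors translatable back to factor values is to make each factor own both directions of its encoding. The sketch below illustrates the idea in plain Scala; the `Factor` case class and `decodeRecord` are hypothetical names, and the color and breakfast factors with their reference values (red, bacon) are assumptions borrowed from the earlier one-hot example.

```scala
object FeatureDecodeSketch {
  // A categorical factor whose reference value encodes to all zeros.
  final case class Factor(name: String, reference: String, categories: Seq[String]) {
    def width: Int = categories.length

    def encode(value: String): String =
      categories.map(c => if (c == value) '1' else '0').mkString

    // Invert encode: the position of the set bit names the category.
    def decode(bits: String): String = {
      val i = bits.indexOf('1')
      if (i < 0) reference else categories(i)
    }
  }

  // Split a concatenated vector by factor width and decode each slice.
  def decodeRecord(factors: Seq[Factor], bits: String): Map[String, String] = {
    val offsets = factors.scanLeft(0)(_ + _.width)
    factors.zip(offsets).map { case (f, start) =>
      f.name -> f.decode(bits.substring(start, start + f.width))
    }.toMap
  }

  val color = Factor("color", "red", Seq("blue", "green", "orange"))
  val breakfast = Factor("breakfast", "bacon",
    Seq("pancakes", "waffles", "aebleskiver", "hash browns", "omelets"))
}
```

With these definitions, decoding `00101000` against the color and breakfast factors recovers orange and waffles, so an outlier vector can be reported in terms an operator recognizes.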

THANKS! @willb willb@redhat.com https://chapeau.freevariable.com