Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology

Transcription

1 Data science for the datacenter: processing logs with Apache Spark William Benton Red Hat Emerging Technology

2 BACKGROUND

3 Challenges of log data

4 Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='crit' AND msg LIKE '%failure%' GROUP BY hostname, hour

5 Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='crit' AND msg LIKE '%failure%' GROUP BY hostname, hour 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00

6 Challenges of log data postgres WARN CRIT DEBUG httpd GET GET GET POST GET (404) syslog WARN WARN (ca. 2000)

7 Challenges of log data postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT 2016) Django WARN syslog (ca. WARN haproxy k8s WARN DEBUG CRIT WARN WARN

8 Challenges of log data postgres WARN CRIT Cassandra CRIT postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN syslog WARN Rails WARN CouchDB CRIT redis CRIT CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT httpd GET GET (404) POST httpd PUT (500) GET PUT Django WARN syslog WARN Django WARN syslog WARN haproxy postgres httpd syslog k8s WARN CRIT GET GET GET POST WARN WARN DEBUG CRIT WARN WARN Cassandra nginx Rails CRIT GET POST PUT POST WARN haproxy postgres httpd syslog k8s WARN CRIT GET GET GET POST WARN WARN DEBUG CRIT WARN WARN Cassandra nginx Rails CRIT GET POST PUT POST WARN How many services are CouchDB httpd GET CRIT GET (404) POST redis CRIT httpd PUT (500) GET PUT CouchDB httpd GET CRIT GET (404) POST redis CRIT httpd PUT (500) GET PUT generating logs in your Django WARN syslog WARN Django WARN syslog WARN haproxy WARN DEBUG CRIT haproxy WARN DEBUG CRIT k8s WARN WARN k8s WARN WARN datacenter today? postgres WARN CRIT Cassandra CRIT postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN syslog WARN Rails WARN CouchDB CRIT redis CRIT CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT httpd GET GET (404) POST httpd PUT (500) GET PUT Django WARN syslog WARN Django WARN syslog WARN haproxy WARN DEBUG CRIT haproxy WARN DEBUG CRIT k8s WARN WARN k8s WARN WARN

9 Apache Spark Graph Query ML Streaming Spark core (distributed collections, scheduler) ad hoc Mesos YARN

10 DATA INGEST

11 Collecting log data collecting Ingesting live log data via rsyslog, logstash, fluentd

12 Collecting log data collecting normalizing Ingesting live log data via rsyslog, logstash, fluentd Reconciling log record metadata across sources

13 Collecting log data collecting normalizing warehousing Ingesting live log data via rsyslog, logstash, fluentd Reconciling log record metadata across sources Storing normalized records in ES indices

14 Collecting log data warehousing Storing normalized records in ES indices

15 Collecting log data warehousing analysis Storing normalized records in ES indices cache warehoused data as Parquet files on Gluster volume local to Spark cluster

16 Schema mediation

17 Schema mediation

18 Schema mediation

19 Schema mediation timestamp, level, host, IP addresses, message, &c. rsyslog-style metadata, like app name, facility, &c.

20 Structured queries at scale logs.select("level").distinct.as[string].collect

21 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert

22 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert logs.groupby($"level", $"rsyslog.app_name").agg(count("level").as("total")).orderby($"total".desc)

23 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert logs.groupby($"level", $"rsyslog.app_name").agg(count("level").as("total")).orderby($"total".desc) kubelet kube-proxy ERR journal systemd

24 FEATURE ENGINEERING

25 From log records to vectors What does it mean for two log records to be similar?

26 From log records to vectors What does it mean for two log records to be similar? red green blue orange

27 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001

28 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns

29 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > 00010

30 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > red pancakes orange waffles

31 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > red pancakes orange waffles -> >

32 From log records to vectors What does it mean for two log records to be similar?

33 From log records to vectors What does it mean for two log records to be similar? {level:, hostname: fred, timestamp: " :00:01",...} {level: DEBUG, hostname: barney, timestamp: " :00:02",...} {level:, hostname: fred, timestamp: " :00:03",...} {level: ERR, hostname: barney, timestamp: " :00:03",...} {level: ALERT, hostname: wilma, timestamp: " :00:03",...}

34 From log records to vectors What does it mean for two log records to be similar? {level:, hostname: fred, timestamp: " :00:01",...} {level: DEBUG, hostname: barney, timestamp: " :00:02",...} {level:, hostname: fred, timestamp: " :00:03",...} {level: ERR, hostname: barney, timestamp: " :00:03",...} {level: ALERT, hostname: wilma, timestamp: " :00:03",...} -> > > > >

35 Similarity and distance

37 Similarity and distance (q - p) (q - p)

38 Similarity and distance n p i - q i i=1

39 Similarity and distance n p i - q i i=1

40 Similarity and distance p q p q

44 Similarity and distance =.7

45 Other interesting features host01 WARN WAR host02 WARN DEBUG host03 WARN WARN

46 Other interesting features host01 WARN host02 DEBUG WARN DEBUG host03 FO WARN

47 Other interesting features host01 DEBUG WARN BUG host02 WARN host03 WARN WARN WARN WARN

48 Other interesting features : Everything is great! Just checking in to let you know I m OK.

49 Other interesting features : Everything is great! Just checking in to let you know I m OK. CRIT: No requests in last hour; suspending running app containers.

50 Other interesting features : Everything is great! Just checking in to let you know I m OK. CRIT: No requests in last hour; suspending running app containers. : Phoenix datacenter is on fire; may not rise from ashes.

51 NATURAL LANGUAGE and LOG DATA

52 Introducing word2vec Madrid Spain France

57 Introducing word2vec Madrid Spain Paris France v("madrid") - v("spain") + v("france") v("paris")

58 Preprocessing text What is a word? How are log words different from words found in conventional prose?

59 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot

60 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd etcd ZooKeeper

61 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd /dev/null etcd ZooKeeper

62 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd /dev/null OutOfMemoryError Kernel::exec etcd ZooKeeper

63 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT systemd OPEN_MAX /dev/null OutOfMemoryError Kernel::exec etcd ZooKeeper

64 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT systemd /dev/null OPEN_MAX OutOfMemoryError Kernel::exec etcd ZooKeeper

65 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT QEMU systemd /dev/null OPEN_MAX OutOfMemoryError Kernel::exec etcd ZooKeeper

66 // assume messages is an RDD of log message texts val oneletter = new scala.util.matching.regex(".*([a-za-z]).*") def tokens(s: String, post: String=>String): Seq[String] = collapsewhitespace(s).split(" ").map(s => post(strippunctuation(s))).collect { case oneletter(_) => token } val tokenseqs = messages.map(line => tokens(line), identity[string]) val w2v = new org.apache.spark.mllib.feature.word2vec val model = w2v.fit(tokenseqs)

67 Insights from log messages

68 Insights from log messages nova

69 Insights from log messages containers glance nova vm images instances

70 Insights from log messages password containers glance nova vm images instances

71 Insights from log messages opened password publickey accepted containers glance nova IPMI instances vm images

72 Insights from log messages opened password publickey accepted containers glance nova IPMI instances vm images sh

73 Insights from log messages opened password publickey accepted containers glance nova IPMI vm images instances AUDIT_SESSION /usr/bin/bash sh

74 VISUALIZING STRUCTURE and FINDING OUTLIERS

75 Multidimensional data

76 Multidimensional data [4,7]

77 Multidimensional data [4,7]

78 Multidimensional data [4,7] [2,3,5]

79 Multidimensional data [4,7] [2,3,5]

80 Multidimensional data [7,1,6,5,12, [4,7] [2,3,5] 8,9,2,2,4, 7,11,6,1,5]

81 Multidimensional data [7,1,6,5,12, [4,7] [2,3,5] 8,9,2,2,4, 7,11,6,1,5]

82 A linear approach: PCA

83 A linear approach: PCA

84

85 A nonlinear approach: t-sne

86 A nonlinear approach: t-sne p( )

87 A nonlinear approach: t-sne p( ) p( )

88

89 Tree-based approaches

90 Tree-based approaches if red yes if orange if!red no if!gray yes if!orange if!gray no

91 Tree-based approaches yes yes yes yes no no no no no yes no yes if red yes yes no yes no yes yes yes yes if orange if!red no no no no yes no no no yes yes no yes no if!gray yes yes no yes no yes no yes no if!orange no yes no yes yes no yes no no if!gray

92 Tree-based approaches yes yes yes yes no no no no no yes no yes if red yes yes no yes no yes yes yes yes if orange if!red no no no no yes no no no yes yes no yes no if!gray yes yes no yes no yes no yes no if!orange no yes no yes yes no yes no no if!gray

93 Self-organizing maps

95 Training SOMs

96 Training SOMs

97 Training SOMs

98 Training SOMs

108 Finding outliers with SOMs

113 Outliers in log data

114 Outliers in log data 0.95

117 Outliers in log data An outlier is any record whose best match was at least 4σ below the mean

118

119 Out of 310 million log records, we identified % as outliers.

120

121

122 Thirty most extreme outliers 10 Can not communicate with power supply 2. 9 Power supply 2 failed. 8 Power supply redundancy is lost. 1 Drive A is removed. 1 Can not communicate with power supply 1. 1 Power supply 1 failed.

123 LESSONS LEARNED and KEY TAKEAWAYS

124 Spark and ElasticSearch Data locality is an issue and caching is even more important than when running from local storage. The best practice is to warehouse ES indices to Parquet files and write Spark jobs against them.

125 Structured queries in Spark Always program defensively: mediate schemas, explicitly convert null values, etc. Use the Dataset API whenever possible to minimize boilerplate and benefit from query planning without (entirely) forsaking type safety.

126 Natural language tips Conventional preprocessing pipelines probably won t get you the best results on log message texts. Consider the sorts of words and word variants you ll want to recognize with your preprocessing pipeline. Use an approximate whitelist for unusual words.

127 Memory and partitioning Large JVM heaps can lead to appalling GC pauses. You probably won t be able to use all of your memory in a single JVM without executor timeouts. Consider memory budget at the driver when choosing a partitioning strategy.

128 Feature engineering Design your feature engineering pipeline so you can translate feature vectors back to factor values. Favor feature engineering effort over complex or novel learning algorithms. Prefer approaches that train interpretable models.

129