1.5 Million Log Lines per Second: Building and maintaining Flume flows at Conversant

1 1.5 Million Log Lines per Second
Building and maintaining Flume flows at Conversant
Big Data Everywhere Chicago 2014
Mike Keane

2 SLA for Event Driven Logging with Flume
- Quicker insight into production data
- Reduce complexity of administering/managing new servers, data centers, etc.
- Scalable
- No data loss or duplication
- Replace TSV files with Avro objects
- Able to be monitored by Network Operations Center (NOC)
- Able to recover from downtime quickly

3 Simplistic Flume Overview
- A Flume Flow is a series of Flume agents that data follows from origination to final destination
- Data on a Flume Flow is packaged in FlumeEvent Avro objects
- A FlumeEvent is composed of
  - Headers: a map of string key/value pairs
  - Body: a byte array
- A FlumeEvent is an atomic unit of data
- FlumeEvents are sent in batches
- When a batch of FlumeEvents only partially makes it to the next Flume agent in the flow, the entire batch is resent, resulting in duplicates
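
The event structure and the duplicate-on-resend behaviour described on this slide can be illustrated with a small sketch. This is a toy model in Python, not Flume's actual Java classes: the `FlumeEvent` dataclass, the `send_batch` helper, and the `n` header are hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlumeEvent:
    """Toy stand-in for a FlumeEvent: string headers plus a byte-array body."""
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


def send_batch(batch: List[FlumeEvent], received: List[FlumeEvent],
               fail_after: int = -1) -> bool:
    """Deliver events in order; simulate a failure once `fail_after` events got through.

    Returns True only if the whole batch arrived. Events delivered before the
    simulated failure stay in `received`.
    """
    for i, event in enumerate(batch):
        if i == fail_after:
            return False
        received.append(event)
    return True


# Hypothetical batch of four events, each tagged with a sequence number header.
batch = [FlumeEvent({"n": str(i)}, f"line {i}".encode()) for i in range(4)]
received: List[FlumeEvent] = []

# The first attempt fails after two events, so the sender resends the entire batch.
if not send_batch(batch, received, fail_after=2):
    send_batch(batch, received)

# Count how many received events are repeats of an earlier one.
duplicates = len(received) - len({e.headers["n"] for e in received})
```

In this toy run the receiver ends up with six events for a batch of four, which is the kind of duplication the later slides set out to prevent.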

4 Simplistic Flume Overview
- Flume Agent

5 Simplistic Flume Overview
- EmbeddedAgent
- Compressor Agent
- Landing Agent

6 Overview of existing network topology
- 3 data centers divided into 12 lanes participating in the OpenRTB market
  - 6 lanes in the east coast data center
  - 4 lanes in the west coast data center
  - 2 lanes in the European data center
- Each lane has approximately 75 servers handling OpenRTB operations
- 30 different logs
- Over 60,000,000,000 log lines per day
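
As a rough check on these figures, the arithmetic below derives the approximate server count and the average line rate implied by the daily total. It uses only the numbers on this slide. Because the slide says "over" 60 billion and "approximately" 75 servers, the results are estimates, and the slides do not say how the 1.5 million per second figure in the title relates to this daily average.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Lanes per data center, as listed on the slide.
lanes = 6 + 4 + 2

# "Approximately 75 servers" per lane, so this is an estimate.
servers = lanes * 75

# Average rate implied by "over 60,000,000,000 log lines per day".
avg_lines_per_second = 60_000_000_000 / SECONDS_PER_DAY

print(f"{lanes} lanes, about {servers} servers")
print(f"about {avg_lines_per_second:,.0f} log lines per second on average")
```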

7 Overview of existing network topology.

8 P.O.C.: Can Flume handle our log volume reliably?
- 2-server Flume Flow from East Coast (IAD) to Chicago (ORD) with over 250K TSV lines per second
- No data loss
- Failover
- Compression performance

9 P.O.C. Overview

10 P.O.C. passes
- Larger batch sizes helped, but could not reach 250K per second
- Multiple TSV lines per FlumeEvent hits over 360K per second
- Failover passed with duplicates
- Compression passed but needed to parallelize 7X sinks
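
Packing several TSV lines into one event body, the change that got the proof of concept past its target, might look roughly like the sketch below. The slides do not show how the lines were actually packed. The newline delimiter and the helper names `pack_lines` and `unpack_lines` are assumptions for illustration, and the sketch assumes no TSV line contains a newline.

```python
from typing import Iterable, Iterator, List


def pack_lines(lines: Iterable[str], lines_per_event: int) -> Iterator[bytes]:
    """Group TSV lines into event bodies of up to `lines_per_event` lines each."""
    chunk: List[str] = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == lines_per_event:
            yield "\n".join(chunk).encode("utf-8")
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, group.
        yield "\n".join(chunk).encode("utf-8")


def unpack_lines(body: bytes) -> List[str]:
    """Recover the individual TSV lines from one event body."""
    return body.decode("utf-8").split("\n")


# Ten made-up TSV lines packed four per event give bodies of 4, 4, and 2 lines.
tsv_lines = [f"{i}\tclick\t{i * 10}" for i in range(10)]
bodies = list(pack_lines(tsv_lines, lines_per_event=4))
```

Fewer, larger events mean the per-event overhead (headers, transaction bookkeeping) is paid once per group of lines rather than once per line.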

11 Taking Flume to Production
- Embedding the EmbeddedAgent in existing servers
- Modify EmbeddedAgent
- Properties from existing infrastructure
- Implement monitoring
- Create Flume implementation of proprietary logging interface
- Replace POJO to TSV with Avro to AvroDataFile
- Preventing duplicates, not removing
- Add LogType header

12 Taking Flume to Production
- Custom Sink for AvroDataFile body (based on HDFSEventSink)
- Check if UUID header is in HBase
  - Yes: increment duplicate count metric
  - No:
    - Write AvroDataFile body to HDFS using Custom Writer
    - Put UUID to HBase
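
The check-then-write logic on this slide can be sketched as follows. This is a simplified model, not the custom sink itself: an in-memory set stands in for the HBase table, a list stands in for HDFS, and the class name `DedupSink` and the exact header key `"UUID"` are assumptions.

```python
from typing import Dict, List, Set


class DedupSink:
    """Toy model of a sink that skips events whose UUID has been seen before."""

    def __init__(self) -> None:
        self.seen_uuids: Set[str] = set()  # stand-in for the HBase lookup table
        self.written: List[bytes] = []     # stand-in for files written to HDFS
        self.duplicate_count = 0           # stand-in for the duplicate count metric

    def process(self, headers: Dict[str, str], body: bytes) -> bool:
        """Write the body unless its UUID is already recorded; return True if written."""
        uuid = headers["UUID"]
        if uuid in self.seen_uuids:
            self.duplicate_count += 1
            return False
        # Same order as the slide: write the body first, then record the UUID.
        self.written.append(body)
        self.seen_uuids.add(uuid)
        return True


sink = DedupSink()
results = [
    sink.process({"UUID": "a"}, b"first"),
    sink.process({"UUID": "b"}, b"second"),
    sink.process({"UUID": "a"}, b"first"),  # resent event, detected as a duplicate
]
```

Recording the UUID only after the write means a crash between the two steps can let one duplicate through, but never drops an event.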

13 Taking Flume to Production
- Custom Selector based on MultiplexingChannelSelector
- Route FlumeEvents to channels by log type or groups of log types
- Bifurcate to multiple locations, each log and each location with its own percentage of data to bifurcate
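
One way to picture routing by log type with a per-destination percentage is the sketch below. It is not the custom selector from the talk: the routing table, the channel names, and the hash-bucket sampling are all hypothetical, chosen so that the same event always gets the same routing decision.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical routing table: log type -> list of (channel name, percent of events).
ROUTES: Dict[str, List[Tuple[str, int]]] = {
    "bid": [("hdfs_main", 100), ("regression_lane", 10)],
    "click": [("hdfs_main", 100), ("analytics_dev", 50)],
}
DEFAULT_ROUTE: List[Tuple[str, int]] = [("hdfs_main", 100)]


def bucket_for(uuid: str) -> int:
    """Map a UUID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(uuid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def select_channels(headers: Dict[str, str]) -> List[str]:
    """Return every channel this event should be copied to."""
    routes = ROUTES.get(headers.get("LogType", ""), DEFAULT_ROUTE)
    bucket = bucket_for(headers["UUID"])
    # A channel with percent=10 receives the events whose bucket is 0..9.
    return [channel for channel, percent in routes if bucket < percent]


# Route 1000 made-up "bid" events and see where they land.
selected = [select_channels({"LogType": "bid", "UUID": f"id-{i}"}) for i in range(1000)]
to_regression = sum("regression_lane" in channels for channels in selected)
```

Because the bucket depends only on the UUID, a resent event is routed the same way as the original, and the events sent to a 10% destination are a subset of those sent to a 50% one.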

14 Configuring Flume Flows
- Configuring Flume can be tedious, use a templating engine
- In Q2 Conversant expanded from 7 lanes in 2 data centers to 12 lanes in 3 data centers (~400 more servers to configure)
- Static headers useful for tracking flows
- 15 minutes to configure all Q2 expansion

CompressorLane('iad6', [CompressorAgent("dtiad06flm01p"), CompressorAgent("dtiad06flm02p"), CompressorAgent("dtiad06flm03p")])

compressor.list = dtiad06flm01p, dtiad06flm02p, dtiad06flm03p
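
The slide shows a lane definition and the property line generated from it, but not the templating code in between. A minimal sketch of that step follows. The names `CompressorLane` and `CompressorAgent` come from the slide, while their fields and the `render` method are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CompressorAgent:
    hostname: str


@dataclass
class CompressorLane:
    name: str
    agents: List[CompressorAgent]

    def render(self) -> str:
        """Render the lane's compressor hosts as a single property line."""
        hosts = ", ".join(agent.hostname for agent in self.agents)
        return f"compressor.list = {hosts}"


lane = CompressorLane('iad6', [CompressorAgent("dtiad06flm01p"),
                               CompressorAgent("dtiad06flm02p"),
                               CompressorAgent("dtiad06flm03p")])
rendered = lane.render()
print(rendered)
```

Describing each lane once as data and generating the properties from it is what makes adding several hundred servers a matter of a few new lane definitions.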

15 Monitoring the Flume Flows
- Flume metrics are available by JMX or JSON over HTTP
- Metrics to monitor
  - ChannelFillPercentage
  - Rate of change on EventDrainSuccessCount on failover sinks
- FLUME-2307: File channel deletes fail after timeout (fixed in 1.5)
- Publishing metrics to TSDB provides great visual insight
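
The two metrics named on this slide can be derived from successive JSON metric snapshots along these lines. The sketch assumes snapshots shaped like Flume's JSON reporting output (component names such as `CHANNEL.<name>` and `SINK.<name>` mapped to string-valued counters). The component names, sample values, and helper functions are made up for illustration.

```python
from typing import Dict

Metrics = Dict[str, Dict[str, str]]


def full_channels(metrics: Metrics, threshold: float) -> Dict[str, float]:
    """Return the channels whose ChannelFillPercentage is at or above `threshold`."""
    result: Dict[str, float] = {}
    for component, values in metrics.items():
        if component.startswith("CHANNEL."):
            fill = float(values["ChannelFillPercentage"])
            if fill >= threshold:
                result[component] = fill
    return result


def drain_rate(earlier: Metrics, later: Metrics, sink: str, seconds: float) -> float:
    """Events per second drained by `sink` between two snapshots `seconds` apart."""
    before = int(earlier[sink]["EventDrainSuccessCount"])
    after = int(later[sink]["EventDrainSuccessCount"])
    return (after - before) / seconds


# Two illustrative snapshots taken 60 seconds apart (values are invented).
sample_t0: Metrics = {
    "CHANNEL.critical": {"ChannelFillPercentage": "12.5"},
    "SINK.failover": {"EventDrainSuccessCount": "1000"},
}
sample_t60: Metrics = {
    "CHANNEL.critical": {"ChannelFillPercentage": "81.0"},
    "SINK.failover": {"EventDrainSuccessCount": "1600"},
}

alerts = full_channels(sample_t60, threshold=80.0)
rate = drain_rate(sample_t0, sample_t60, "SINK.failover", seconds=60.0)
```

Since the counters are cumulative, the rate has to be computed from the difference between two snapshots rather than read from a single one.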

16 Monitoring the Flume Flows
- ChannelFillPercentage

17 Monitoring the Flume Flows
- Rate of taking events off Critical Logs file channel

18 Monitoring the Flume Flows
- Rate of Flume Events by data center: East Coast, West Coast, Europe

19 Monitoring the Flume Flows
- Monitoring by groups

20 Benefits of migrating to Flume
- Business has insight into data in under 10 minutes
- Configuring expansion trivial
- Failover enables automatic recovery from downtime
- Bifurcation enables scaled constant regression lane(s)
- Subset of data to analytics development cluster

21 Benefits of migrating to Flume
- 5-minute aggregations to business within 10 minutes

22 Gotchas
- Scaling for compression
- Auto reloading of properties inconsistent
- It is recommended (though not required) to use a separate disk for the File Channel checkpoint
  - RAID-6 array, Force Write Back
- Bad configurations not easy to see, not always clear in log file
- NetcatSource: not too useful beyond trivial usage

23 Gotchas
- POM file edits
- JUnits are not deterministic
- Hadoop jars added to classpath by startup script
- IDE
- Avoiding cost of Avro schema evolution

24 What is next
- Upgrade to Flume 1.5
- Bifurcate to micro batch (Storm? Spark?)
- Disable sink switch
