Big Data Pipeline and Analytics Platform

Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Software
Sudhir Tonse (@stonse), Danny Yuan (@g9yuayon)

"Netflix is a log generating company that also happens to stream movies" - Adrian Cockcroft
Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/

Data Is the Most Important Asset at Netflix

If all the data is easily available to all teams, it can be leveraged in new and exciting ways

~1000 Device Types
~500 Apps/Web Services
~100 Billion Events/Day
3.2M messages per second at peak time
3 GB per second at peak time
Dashboard

Types of Events
User Interface Events: Search Event ("Matrix" using PS3), Star Rating Event (HoC: 5 stars, Xbox, US, ...)
Infrastructural Events: RPC Call (API -> Billing Service, /bill/.., 200, ...), Log Errors (NPE, "Movie is null", ...)
Other Events

Making Sense of Billions of Events

http://netflix.github.io + Druid ElasticSearch

A Humble Beginning

Evolution: Scale

[Figure: multiple Application instances]

We Want to Process App Data in Hadoop

Our Hadoop Ecosystem

@NetflixOSS Big Data Tools

Hadoop as a Service

Pig Scripting on Steroids

Pig Married to Clojure

S3MPER
S3mper is a library that provides an additional layer of consistency checking on top of Amazon's S3 index through the use of a consistent secondary index.

Efficient ETL with Cassandra

Offline Analysis

Evolution: Speed

We Want to Aggregate, Index, and Query Data in Real Time

Interactive Exploration

Let's walk through some use cases

* client activity event /name = moviestarts

Pipeline Challenges
App owners: send and forget
Data scientists: validation, ETL, batch processing
DevOps: stream processing, targeted search

Message Routing

We Want to Consume Data Selectively in Different Ways
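Selective consumption can be pictured as a set of filters, one per destination. The following is a minimal sketch under stated assumptions, not the routing layer used in the talk: the class `SelectiveRouter`, the sink names `batch` and `search`, and the map-based event shape are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical router: every name here is illustrative, not a NetflixOSS API.
class SelectiveRouter {
    // A route pairs a destination name with the filter that selects its events.
    record Route(String sink, Predicate<Map<String, String>> filter) {}

    private final List<Route> routes = new ArrayList<>();

    SelectiveRouter addRoute(String sink, Predicate<Map<String, String>> filter) {
        routes.add(new Route(sink, filter));
        return this;
    }

    // Returns every sink whose filter accepts the event, in registration order.
    List<String> sinksFor(Map<String, String> event) {
        List<String> sinks = new ArrayList<>();
        for (Route r : routes) {
            if (r.filter().test(event)) {
                sinks.add(r.sink());
            }
        }
        return sinks;
    }

    public static void main(String[] args) {
        SelectiveRouter router = new SelectiveRouter()
            .addRoute("batch", e -> true)                              // batch sink takes everything
            .addRoute("search", e -> "error".equals(e.get("type")));   // search sink takes errors only
        System.out.println(router.sinksFor(Map.of("type", "error", "msg", "Movie is null")));
        System.out.println(router.sinksFor(Map.of("type", "search")));
    }
}
```

In this sketch an error event is delivered to both sinks, while a search event reaches only the catch-all batch sink. A real deployment would evaluate such filters inside the routing tier rather than in each application.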

Message broker
High-throughput
Persistent and replicated

There Is More

Intelligent Alerts

Guided Debugging in the Right Context

What We Need
Ad-hoc query with different dimensions
Quick aggregations and Top-N queries
Time series with flexible filters
Quick access to raw data using boolean queries
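To make "Top-N queries" with "flexible filters" concrete, here is a small in-memory sketch. It only illustrates the shape of such a query; it is not how Druid or any other store mentioned in this talk executes it. The `Event` record, its fields, and the `topN` helper are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical in-memory Top-N: illustrative only, not a Druid or NetflixOSS API.
class TopNQuery {
    // An event is a timestamp plus free-form dimensions such as device or country.
    record Event(long timestampMillis, Map<String, String> dimensions) {}

    // Counts events per value of one dimension, after filtering, and keeps the n largest.
    static List<Map.Entry<String, Long>> topN(
            List<Event> events, Predicate<Event> filter, String dimension, int n) {
        Map<String, Long> counts = events.stream()
            .filter(filter)
            .filter(e -> e.dimensions().containsKey(dimension))
            .collect(Collectors.groupingBy(
                e -> e.dimensions().get(dimension), Collectors.counting()));
        return counts.entrySet().stream()
            // Highest count first; ties broken alphabetically for stable output.
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event(1L, Map.of("device", "PS3", "country", "US")),
            new Event(2L, Map.of("device", "Xbox", "country", "US")),
            new Event(3L, Map.of("device", "PS3", "country", "CA")),
            new Event(4L, Map.of("device", "PS3", "country", "US")));
        // Top device among US events only.
        System.out.println(topN(events, e -> "US".equals(e.dimensions().get("country")), "device", 1));
    }
}
```

The filter and the grouping dimension are independent parameters, which is what makes the query ad hoc: the same events can be sliced by device, country, or any other dimension without preparing a separate counter for each.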

Druid
Rapid exploration of high dimensional data
Fast ingestion and querying
Time series

Real-time indexing of event streams
Killer feature: boolean search
Great UI: Kibana

The Old Pipeline

The New Pipeline

There Is More

It's Not All About Counters and Time Series

Status:200

RequestId    Parent Id  Node Id  Service Name  Status
4965-4a74    0          123      Edge Service  200
4965-4a74    123        456      Gateway       200
4965-4a74    456        789      Service A     200
4965-4a74e   456        abc      Service B     200

Distributed Tracing
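The parent ID and node ID columns in the request rows above are enough to rebuild the call tree for one request. The sketch below shows one way to do that, under two assumptions that are not stated in the slides: a parent ID of "0" marks the root, and all four sample rows belong to the same request ID. The `Span` record and the `TraceTree` class are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical trace reconstruction: illustrative only, not a NetflixOSS API.
class TraceTree {
    // One row of the trace table: request, parent node, this node, service, status.
    record Span(String requestId, String parentId, String nodeId, String service, int status) {}

    // Groups the spans of one request by parent ID so children can be looked up directly.
    static Map<String, List<Span>> childrenByParent(List<Span> spans, String requestId) {
        Map<String, List<Span>> children = new LinkedHashMap<>();
        for (Span s : spans) {
            if (s.requestId().equals(requestId)) {
                children.computeIfAbsent(s.parentId(), k -> new ArrayList<>()).add(s);
            }
        }
        return children;
    }

    // Depth-first walk that appends one indented line per span.
    static void render(Map<String, List<Span>> children, String parentId, int depth, StringBuilder out) {
        for (Span s : children.getOrDefault(parentId, List.of())) {
            out.append("  ".repeat(depth))
               .append(s.service()).append(" [").append(s.status()).append("]\n");
            render(children, s.nodeId(), depth + 1, out);
        }
    }

    // Renders the call tree of one request, starting from the assumed root parent ID "0".
    static String render(List<Span> spans, String requestId) {
        StringBuilder out = new StringBuilder();
        render(childrenByParent(spans, requestId), "0", 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample rows; the shared request ID is an assumption of this sketch.
        List<Span> spans = List.of(
            new Span("4965-4a74", "0", "123", "Edge Service", 200),
            new Span("4965-4a74", "123", "456", "Gateway", 200),
            new Span("4965-4a74", "456", "789", "Service A", 200),
            new Span("4965-4a74", "456", "abc", "Service B", 200));
        System.out.print(render(spans, "4965-4a74"));
    }
}
```

With these rows the sketch prints Edge Service at the top, Gateway beneath it, and Service A and Service B as siblings under Gateway. Counters and time series cannot express this parent/child structure, which is why tracing needs the IDs carried on every event.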

A System that Supports All These

A Data Pipeline To Glue Them All

Make It Simple

Message Producing
Simple and uniform API: messageBus.publish(event)

Consumption Is Simple Too
    consumer.observe().subscribe(new Subscriber<>() {
        @Override
        public void onNext(Ackable<IncomingMessage> ackable) {
            // Decode the payload, process it, then acknowledge the message.
            process(ackable.getEntity(MyEventType.class));
            ackable.ack();
        }
    });
    consumer.pause();
    consumer.resume();

RxJava
Functional reactive programming model
Powerful streaming API
Separation of logic and threading model

Design Decisions
Top priority: app stability and throughput
Asynchronous operations
Aggressive buffering
Drops messages if necessary
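The combination of asynchronous operation, buffering, and dropping under pressure can be sketched with a bounded in-memory queue. This is a minimal illustration of the idea, not the client library described in the talk: `DroppingPublisher`, its capacity parameter, and the string event type are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical publisher: illustrative only, not a NetflixOSS API.
class DroppingPublisher {
    private final BlockingQueue<String> buffer;
    private final AtomicLong dropped = new AtomicLong();

    DroppingPublisher(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Never blocks the caller: offer() returns false at once when the buffer is full,
    // so a slow or unavailable pipeline cannot stall the application thread.
    boolean publish(String event) {
        boolean accepted = buffer.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // A background sender would call this repeatedly to ship buffered events in batches.
    List<String> drainBatch(int maxEvents) {
        List<String> batch = new ArrayList<>();
        buffer.drainTo(batch, maxEvents);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        DroppingPublisher publisher = new DroppingPublisher(2);
        publisher.publish("e1");
        publisher.publish("e2");
        publisher.publish("e3"); // buffer is full, so this event is dropped
        System.out.println(publisher.drainBatch(10) + " dropped=" + publisher.droppedCount());
    }
}
```

The trade-off is explicit: the application keeps its latency and stability, and the pipeline accepts losing some events when the buffer overflows. Tracking the drop count makes that loss visible rather than silent.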

Anything Can Fail

Cloud Resiliency

Fault Tolerance Features
Write and forward with auto-reattached EBS (Amazon's Elastic Block Store)
Disk-backed queue: big-queue
Customized scaling down

There's More to Do
Contribute to @NetflixOSS
Join us :-)

Summary
http://netflix.github.io + Druid ElasticSearch

You can build your own web-scale data pipeline using open source components

Thank You!
Sudhir Tonse http://www.linkedin.com/in/sudhirtonse Twitter: @stonse
Danny Yuan http://www.linkedin.com/pub/dannyyuan/4/374/862 Twitter: @g9yuayon