Kafka & Redis for Big Data Solutions

Christopher Curtin, Head of Technical Research, @ChrisCurtin

About Me
- 25+ years in technology
- Head of Technical Research at Silverpop, an IBM Company (14+ years at Silverpop)
- Built a SaaS platform before the term SaaS was being used
- Prior to Silverpop: real-time control systems, factory automation and warehouse management
- Always looking for technologies and algorithms to help with our challenges

About Silverpop
- Provider of Internet Marketing Solutions
- 15+ years old, acquired by IBM in May 2014
- Multi-tenant Software as a Service
- Thousands of clients with tens of billions of events per month

Agenda
- Kafka
- Redis
- Bloom Filters

Batch is Bad
- Data egress and ingress are very expensive
- Thousands of customers mean thousands of hourly or daily export and import requests
- Often the same exports are requested to feed different business systems
- Prefer event pushes to subscribed endpoints

Apache Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:
- Persistent messaging with O(1) disk structures that provide constant-time performance even with many TB of stored messages.
- High throughput: even with very modest hardware, Kafka can support hundreds of thousands of messages per second.
- Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
- Support for parallel data load into Hadoop.
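The per-partition ordering guarantee can be illustrated with a small in-memory sketch. This is not Kafka client code: the partition count, the `partition_for` and `publish` helpers, and the use of CRC32 are all assumptions made for the illustration (Kafka's own partitioner uses a different hash). The point is only that messages sharing a key land in one partition and keep their order there.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count for the illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(partitions, key, value):
    """Append to the key's partition and return (partition, offset)."""
    p = partition_for(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1


partitions = defaultdict(list)
for i in range(3):
    publish(partitions, "customer-42", f"event-{i}")
    publish(partitions, "customer-7", f"event-{i}")

# Every event for one key is in a single partition, in publish order.
p = partition_for("customer-42")
customer_42_events = [v for k, v in partitions[p] if k == "customer-42"]
```

Ordering is only promised within a partition, so the choice of key decides which events stay ordered relative to each other.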

Apache Kafka: Why?
- Data Integration

Point to Point Integration (thanks to LinkedIn for slide)

What we'd really like (thanks to LinkedIn for slide)

Apache Kafka
- Kafka changes the messaging paradigm
- Kafka doesn't keep track of who consumed which message
- Kafka keeps all messages until you tell it not to
- A consumer can ask for the same message over and over again

Consumption Management
- Kafka leaves the management of what was consumed to the consumer
- Each message has a unique identifier (within a topic and partition)
- Consumers ask for a specific message, or the next one
- All messages are written to disk, so asking for an old message means finding it on disk and starting to stream from there
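A minimal in-memory sketch of consumer-managed consumption follows. The `PartitionLog` and `Consumer` classes are hypothetical stand-ins, not Kafka APIs; they only show that the log hands out messages by offset and that each consumer keeps, and can rewind, its own position.

```python
class PartitionLog:
    """Append-only log for one topic partition; offsets are list indexes."""

    def __init__(self):
        self._messages = []

    def append(self, message) -> int:
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset: int, max_messages: int = 100):
        """Return messages starting at `offset`; the log is not modified."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks its own position; the log knows nothing about consumers."""

    def __init__(self, log: PartitionLog, start_offset: int = 0):
        self.log = log
        self.offset = start_offset

    def poll(self, max_messages: int = 100):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int):
        self.offset = offset  # rewind (or skip ahead) to any stored message


log = PartitionLog()
for n in range(5):
    log.append(f"event-{n}")

reporting = Consumer(log)
replay = Consumer(log)
first = reporting.poll(max_messages=3)  # events 0..2
rest = reporting.poll()                 # events 3..4
replay.seek(3)
again = replay.poll()                   # events 3..4 again, for another reader
```

Because reading never removes anything, adding a second consumer or replaying old data costs the log nothing beyond the read itself.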

What is a commit log? (thanks to LinkedIn for slide)

Big Data Use Cases
- Data Pipeline
- Buffer between event-driven systems and batch
- Batch Load
- Parallel Streaming
- Event Replay

Data Pipeline (thanks to LinkedIn for slide)

Buffer
- Event-generating systems MAY give you a small downtime buffer before they start losing events
- Most will not
- How do you not lose these events when you need downtime?

Batch Load
- Sometimes you only need to load data once an hour or once a day
- Have a task wake up, stream the events since the last run to a file, mark where it left off, and exit
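The wake-up, stream, mark, exit cycle might look like the sketch below. A plain Python list stands in for a topic partition, and the `run_batch` helper, the file names and the JSON checkpoint format are assumptions for illustration only.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the next offset to read, or 0 on the first run."""
    if not os.path.exists(path):
        return 0
    with open(path) as fh:
        return json.load(fh)["next_offset"]


def run_batch(log, out_path: str, checkpoint_path: str) -> int:
    """Append every event since the checkpoint to out_path; return the count."""
    start = load_checkpoint(checkpoint_path)
    new_events = log[start:]
    with open(out_path, "a") as out:
        for event in new_events:
            out.write(event + "\n")
    # Save the checkpoint only after the events have been written.
    with open(checkpoint_path, "w") as fh:
        json.dump({"next_offset": start + len(new_events)}, fh)
    return len(new_events)


workdir = tempfile.mkdtemp()
out_file = os.path.join(workdir, "events.txt")
ckpt_file = os.path.join(workdir, "checkpoint.json")

log = ["open", "click", "purchase"]   # stand-in for a topic partition
first_run = run_batch(log, out_file, ckpt_file)
log.append("unsubscribe")             # arrives between runs
second_run = run_batch(log, out_file, ckpt_file)
```

Writing the checkpoint last means a crash between the two writes replays some events on the next run rather than skipping them; whether duplicates are acceptable depends on the downstream system.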

Parallel Streaming
- A lot of use cases require different systems to process the same raw data
- Spark, Storm, InfoSphere Streams, Spring XD
- Write to a different Kafka topic and partition
- Batch load into HDFS hourly
- Batch load into a traditional data warehouse or database daily
- Push to external systems daily
- Example: abandoned shopping carts

Event Replay
- What if you could take 'production' data and run it through a debugger?
- Or through a non-optimized, 'SPEW' mode logging to see what it does?
- Or through a new version of your software that has /dev/null sinks?
- Note: developers HAVE NOT made a local copy of the data
- Though with a sanitized copy for QA, 'load testing' is simplified

What about Flume etc.?
- Very similar use cases, very different implementation
- Kafka keeps the events around
- Flume can do much more (almost a CEP system)
- Kafka feeding Flume is a reasonable combination

Big Data Egress
- Big data applications often produce lots of data
- Nightly batch jobs
- Spark Streaming alerts
- Write to Kafka
- The same benefits Kafka brings when feeding big data systems apply to the consumers of our output
- Or...

What about HBase or other 'Hadoop' answers for egress?
- Perfect for many cases
- Real-world concerns though
- Is the YARN cluster five-nines (99.999%) available?
- Is it cost-effective to build a fast-response HBase, HDFS etc. for the needs?
- Will the security teams allow real-time access to YARN clusters from web-tier business applications?

Redis: What is it?
From redis.io: "Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs."

Features
- Unlike typical key-value stores, you can send commands to edit the value on the server vs. reading back to the client, updating and pushing to the server
- Pub/sub
- TTL on keys
- Clustering and automatic fail-over
- Lua scripting
- Client libraries for just about any language you can think of

So why did we start looking at NoSQL?
- For the cost of an Oracle Enterprise license I can give you 64 cores and 3 TB of memory

Redis Basics
- In-memory-only key-value store
- Single threaded. Yes, single threaded
- No paging, no reading from disk
- CS 101 data structures and operations
- Tens of millions of keys isn't a big deal
- How much RAM you have defines how big the store can get

Hashes
- Collection of key-value pairs with a single name
- Useful for storing data under a common name
- Values can only be strings or numeric. No hash of lists
- http://redis.io/commands/hget

Web-time access
- Write the output of a Spark job to Redis so it can be accessed in web-time
- Publish the results of recommendation engines for a specific user to Redis in a Hash by user or current page
- Using a non-persistent storage model, very low cost to load millions of matches into memory
- Something goes wrong? Load it again from the big data outputs
- New information for 1 user or page? Replace the value for that one only
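A sketch of publishing per-user results into a Hash is shown below. The `InMemoryHashStore` class is a dict-backed stand-in written for this example; its `hset`, `hget`, `hgetall` and `delete` methods are shaped like the corresponding Redis commands so that a real client could be passed in instead. The `recs:user:<id>` key layout and the `publish_recommendations` helper are assumptions, not anything the talk specifies.

```python
class InMemoryHashStore:
    """Dict-backed stand-in with hset/hget/hgetall/delete methods."""

    def __init__(self):
        self._data = {}

    def hset(self, name, mapping):
        self._data.setdefault(name, {}).update(mapping)

    def hget(self, name, key):
        return self._data.get(name, {}).get(key)

    def hgetall(self, name):
        return dict(self._data.get(name, {}))

    def delete(self, name):
        self._data.pop(name, None)


def publish_recommendations(client, user_id, recs):
    """Replace one user's Hash; other users' entries are untouched."""
    key = f"recs:user:{user_id}"  # hypothetical key naming scheme
    client.delete(key)
    client.hset(key, mapping=recs)


store = InMemoryHashStore()
publish_recommendations(store, 1001, {"slot1": "sku-9", "slot2": "sku-3"})
publish_recommendations(store, 1002, {"slot1": "sku-5"})
publish_recommendations(store, 1001, {"slot1": "sku-7"})  # refresh one user only
```

The web tier then needs a single Hash read per user or page, and a failed load can simply be repeated from the batch output.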

Spark RDDs
- Over time Spark can generate a lot of RDDs. Where are they? What is in them?
- Some people use a naming convention, but that gets problematic quickly
- Instead, create a Redis Set for each 'key' you will later want to find the RDDs by: date, source system, step in algorithm on date, etc.
- Since a Redis Set holds only a reference to the RDD, you can easily create lots of Sets pointing to the same data (almost like an index)
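The Set-as-index idea can be sketched as follows. `InMemorySetStore` is a stand-in written for this example with `sadd` and `sinter` methods shaped like the Redis Set commands; the `rdd-index:<tag>:<value>` key layout, the `register_rdd` helper and the storage locations are all hypothetical.

```python
class InMemorySetStore:
    """Dict-backed stand-in with sadd/sinter methods."""

    def __init__(self):
        self._sets = {}

    def sadd(self, name, *values):
        self._sets.setdefault(name, set()).update(values)

    def sinter(self, *names):
        sets = [self._sets.get(name, set()) for name in names]
        return set.intersection(*sets) if sets else set()


def register_rdd(client, rdd_location, **tags):
    """Add the RDD's location to one Set per tag (date, source, step...)."""
    for tag, value in tags.items():
        client.sadd(f"rdd-index:{tag}:{value}", rdd_location)


index = InMemorySetStore()
register_rdd(index, "hdfs:///rdds/clicks-0601-parsed",
             date="2015-06-01", source="clicks", step="parsed")
register_rdd(index, "hdfs:///rdds/clicks-0601-scored",
             date="2015-06-01", source="clicks", step="scored")
register_rdd(index, "hdfs:///rdds/opens-0601-parsed",
             date="2015-06-01", source="opens", step="parsed")

# Intersect Sets to find RDDs matching several criteria at once.
parsed_clicks = index.sinter("rdd-index:source:clicks", "rdd-index:step:parsed")
```

Each Set stores only the reference string, so indexing the same RDD under many tags stays cheap.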

Bloom Filters
From Wikipedia (don't tell my kid's teacher!): "A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate"

Hashing
- Apply 'x' hash functions to the key to be stored/queried
- Each function returns a bit to set in the bitset
- Mathematical equations determine how big to make the bitset and how many functions to use for your acceptable error level
- http://hur.st/bloomfilter?n=4&p=1.0e-20
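A minimal Bloom filter sketch ties these points together. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions for n expected items and false-positive rate p. Deriving the k positions from two halves of one SHA-256 digest (double hashing) is an illustrative choice made here, not something the talk prescribes, and the class and method names are hypothetical.

```python
import hashlib
import math


def optimal_size(n: int, p: float):
    """Return (bits, hash_count) for n expected items and false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


class BloomFilter:
    def __init__(self, expected_items: int, false_positive_rate: float):
        self.size, self.hash_count = optimal_size(expected_items, false_positive_rate)
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key: str):
        # Two 64-bit values from one digest generate all k bit positions.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for i in range(100):
    bf.add(f"user-{i}@example.com")
```

A key that was added always tests positive, because exactly the same bit positions are checked that were set on insert.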

Example

False Positives
- Perfect hash functions aren't worth the cost to develop
- Sometimes existing bits for a key are set by many other keys
- Make sure you understand the business impact of a false positive
- Remember, never a false negative
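To judge the business impact, it helps to estimate how the false-positive rate grows as a filter fills. The sketch below uses the standard approximation p = (1 - e^(-kn/m))^k for m bits, k hash functions and n inserted keys; the function name and the example numbers are chosen here for illustration.

```python
import math


def false_positive_rate(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Approximate false-positive probability of a Bloom filter."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# A filter sized for 1,000,000 keys at roughly a 1% false-positive rate.
m, k = 9_585_059, 7
at_design_load = false_positive_rate(m, k, 1_000_000)
at_double_load = false_positive_rate(m, k, 2_000_000)
```

Loading twice the planned number of keys pushes the rate from about 1% to well over 10%, which is why the expected population matters when sizing a filter.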

Why were we interested in Bloom Filters?
- Found a lot of places where we went to the database only to find the data didn't exist
- Found lots of places where we want to know if a user DIDN'T do something

Persistent Bloom Filters
- We needed persistent Bloom Filters for lots of user stories
- Found Orestes-BloomFilter on GitHub, which used Redis as a store, and enhanced it
- Added population filters
- Fixed a few bugs

Benefits
- Filters are stored in Redis
- Only SETBIT/GETBIT calls to the server
- Reads and updates of the filter from a set of application servers
- Persistence has a cost, but a fraction of the RDBMS costs
- Can load a BF created offline and begin using it
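How a filter can live in a shared store and be used from several application servers is sketched below. This is not the Orestes-BloomFilter implementation: `StoredBloomFilter` is a hypothetical class that only needs a client with `setbit(name, offset, value)` and `getbit(name, offset)` methods, which is the shape of the Redis SETBIT and GETBIT commands. `InMemoryBitStore` is a stand-in so the example runs without a server, and the key name is invented.

```python
import hashlib


class InMemoryBitStore:
    """Stand-in with setbit/getbit methods shaped like the Redis commands."""

    def __init__(self):
        self._bits = {}

    def setbit(self, name, offset, value):
        previous = self._bits.get((name, offset), 0)
        self._bits[(name, offset)] = value
        return previous

    def getbit(self, name, offset):
        return self._bits.get((name, offset), 0)


class StoredBloomFilter:
    """Bloom filter whose bits live in a shared store under one key."""

    def __init__(self, client, name, size_bits, hash_count):
        self.client = client
        self.name = name
        self.size = size_bits
        self.hash_count = hash_count

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, key):
        for pos in self._positions(key):
            self.client.setbit(self.name, pos, 1)

    def might_contain(self, key):
        return all(self.client.getbit(self.name, pos)
                   for pos in self._positions(key))


store = InMemoryBitStore()
# Two application servers open the same named filter in the shared store.
server_a = StoredBloomFilter(store, "bf:unsubscribed", size_bits=10_000, hash_count=7)
server_b = StoredBloomFilter(store, "bf:unsubscribed", size_bits=10_000, hash_count=7)
server_a.add("user-123")
seen_by_b = server_b.might_contain("user-123")
```

The application servers hold no filter state of their own, so an update made through one is visible to all of them on the next lookup.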

Remember
- For the cost of an Oracle license: thousands of filters, dozens of Redis instances
- TTL on a Redis key makes cleanup of old filters trivial

Use Cases
- White List/Black List: nightly rebuilds, real-time updates as CEP finds something
- Ad suggestions: does the system already know about this visitor? Maybe? More expensive processing. No? Defaults
- Content Recommendations: batch identification hourly/nightly, real-time updates as 'hot' pages/content change the recommendation

Use Case: Cardinality Estimation
- Client-side joins are really difficult. Hadoop, Spark, MongoDB: how do you know which side to 'drive' from?
- We created a Population Bloom Filter that counts unique occurrences of a key using a Bloom Filter
- Build a filter per 'source' file as the data is being collected (Kafka batches, Spark RDD saves etc.)
- Now query the filter to see if the keys you are looking for are (possibly) in the data
- Then get the count and see which side to drive from
- Not ideal: low population may not mean it is the best driver, but it could be
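One way to read the population filter idea is sketched below: bump a counter only when a key is not already (possibly) in the filter. This is an interpretation for illustration, not the code added to Orestes-BloomFilter; the class, the default sizes and the `pick_driver` helper are hypothetical. Since false positives can hide a genuinely new key, the count is an estimate that can only err on the low side.

```python
import hashlib


class PopulationBloomFilter:
    def __init__(self, size_bits: int = 1 << 16, hash_count: int = 5):
        self.size = size_bits
        self.hash_count = hash_count
        self.bits = 0        # Python int used as a bitset
        self.population = 0  # approximate number of distinct keys added

    def _mask(self, key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        mask = 0
        for i in range(self.hash_count):
            mask |= 1 << ((h1 + i * h2) % self.size)
        return mask

    def add(self, key: str):
        mask = self._mask(key)
        if (self.bits & mask) != mask:  # some bit unset, so the key is new
            self.population += 1
            self.bits |= mask

    def might_contain(self, key: str) -> bool:
        mask = self._mask(key)
        return (self.bits & mask) == mask


def pick_driver(filters: dict) -> str:
    """Choose the source with the smallest estimated population."""
    return min(filters, key=lambda name: filters[name].population)


small, large = PopulationBloomFilter(), PopulationBloomFilter()
for key in ["u1", "u2", "u3", "u1"]:  # "u1" repeats and is counted once
    small.add(key)
for i in range(200):
    large.add(f"u{i}")

driver = pick_driver({"small_source": small, "large_source": large})
```

As the slide notes, the smaller population is only a hint about which side to drive from, not a guarantee.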

Use Case: A/B Testing
- Use different algorithms and produce different result sets
- Load into Redis in different keys (remember 10MM keys is no big deal)
- Have the application tier A/B test the algorithm results
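On the application tier this could be as simple as the sketch below. A dict stands in for Redis, and the `recs:<variant>:<user>` key layout, the `variant_for` assignment rule and the sample values are assumptions for illustration; the talk does not say how users are split between variants.

```python
import hashlib

VARIANTS = ("A", "B")  # one entry per algorithm being compared


def variant_for(user_id: str) -> str:
    """Deterministically assign a user to a variant (stable across requests)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return VARIANTS[digest[0] % len(VARIANTS)]


def results_key(variant: str, user_id: str) -> str:
    return f"recs:{variant}:{user_id}"  # hypothetical key layout


# A dict stands in for Redis here; each algorithm writes under its own keys.
store = {
    results_key("A", "user-1"): "sku-9,sku-3",
    results_key("B", "user-1"): "sku-4,sku-9",
}

chosen = variant_for("user-1")
served = store[results_key(chosen, "user-1")]
```

Hashing the user id keeps a visitor in the same variant on every request without storing the assignment anywhere.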

Conclusion
- Batch exports from SaaS (or any volume system) are bad
- Traditional messaging systems are being stressed by the volume of data
- Redis is a very fast, very simple and very powerful name-value store and data structure server
- Bloom Filters have lots of applications when you want to quickly look up if one of millions of 'things' happened
- Redis-backed Bloom Filters make updatable Bloom Filters trivial to use

References
- Redis.io
- Kafka.apache.org
- https://github.com/baqend/orestes-bloomfilter
- http://www.slideshare.net/chriscurtin
- @ChrisCurtin on Twitter

Questions?