Architectures for massive data management




Architectures for massive data management Apache Kafka, Samza, Storm Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015

Stream Engine Motivation

Digital Universe EMC Digital Universe with Research & Analysis by IDC, "The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things", April 2014

Digital Universe Figure: EMC Digital Universe, 2014

Digital Universe

Memory unit      Size     Binary size
kilobyte (kB)    10^3     2^10
megabyte (MB)    10^6     2^20
gigabyte (GB)    10^9     2^30
terabyte (TB)    10^12    2^40
petabyte (PB)    10^15    2^50
exabyte (EB)     10^18    2^60
zettabyte (ZB)   10^21    2^70
yottabyte (YB)   10^24    2^80

Digital Universe Figure: EMC Digital Universe, 2014

Digital Universe Figure: EMC Digital Universe, 2014

Big Data 6 V's: Volume, Variety, Velocity, Value, Variability, Veracity

Hadoop Figure: Hadoop architecture deals with datasets, not data streams

Requirements

"We should have some ways of coupling programs like garden hose -- screw in another segment when it becomes necessary to massage data in another way. This is the way of IO also. Our loader should be able to do link-loading and controlled establishment. Our library filing scheme should allow for rather general indexing, responsibility, generations, data path switching. It should be possible to get private system components (all routines are system components) for buggering around with."

M. D. McIlroy, 1968

Unix Pipelines

Unix pipelines

cat file.txt | tr -s '[[:punct:][:space:]]' '\n' | sort | uniq -c | sort -rn | head -n 5

Unix Pipelines Figure: Apache Kafka, Samza, and the Unix Philosophy of Distributed Data: Martin Kleppmann, Confluent

Unix Pipelines Figure: Apache Kafka, Samza, and the Unix Philosophy of Distributed Data: Martin Kleppmann, Confluent

Unix Pipelines Figure: Apache Kafka, Samza, and the Unix Philosophy of Distributed Data: Martin Kleppmann, Confluent

Unix Pipelines Figure: Apache Kafka, Samza, and the Unix Philosophy of Distributed Data: Martin Kleppmann, Confluent

Real Time Processing

Jay Kreps, LinkedIn: "The Log: What every software engineer should know about real-time data's unifying abstraction"
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

The Log Figure: Jay Kreps, LinkedIn

The Log Figure: Jay Kreps, LinkedIn

The Log Figure: Jay Kreps, LinkedIn

The Log Figure: Jay Kreps, LinkedIn

Apache Kafka

Apache Kafka from LinkedIn Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.

Apache Kafka from LinkedIn

Components of Apache Kafka:
- topics: categories that Kafka uses to maintain feeds of messages
- producers: processes that publish messages to a Kafka topic
- consumers: processes that subscribe to topics and process the feed of published messages
- broker: a server that is part of the cluster that runs Kafka
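
To make these roles concrete, here is a minimal sketch (not from the slides) of a producer publishing to a broker with the standard Kafka Java client; the topic name "events" and the broker address are assumptions for illustration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExampleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // broker to bootstrap from (assumed address)
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // publish one message to the (hypothetical) "events" topic
        producer.send(new ProducerRecord<>("events", "key", "hello kafka"));
        producer.close();
    }
}

Consumers that share a group id then divide the topic's partitions among themselves and each read their partitions' feeds.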

Apache Kafka from LinkedIn

The Kafka cluster maintains a partitioned log. Each partition is an ordered, immutable sequence of messages that is continually appended to (a commit log). The messages in the partitions are each assigned a sequential id number, called the offset, that uniquely identifies each message within the partition.
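
As a hedged illustration of offsets (not from the slides, and using the newer Java consumer client rather than the SimpleConsumer shown below), a consumer can print the partition and offset of each record it reads; the topic, group id, and broker address are made up.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ExampleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "example-group");            // assumed consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("events"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                // the offset uniquely identifies a message within its partition
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}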

Apache Kafka from LinkedIn Figure: A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups.

Apache Kafka from LinkedIn

Guarantees:
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent.
- A consumer instance sees messages in the order they are stored in the log.
- For a topic with replication factor N, Kafka tolerates up to N-1 server failures without losing any messages committed to the log.

Kafka API

class kafka.javaapi.consumer.SimpleConsumer {
  /**
   * Fetch a set of messages from a topic.
   *
   * @param request specifies the topic name, topic partition, starting byte offset,
   *                and maximum bytes to be fetched.
   * @return a set of fetched messages
   */
  public FetchResponse fetch(kafka.javaapi.FetchRequest request);

  /**
   * Fetch metadata for a sequence of topics.
   *
   * @param request specifies the versionId, clientId, sequence of topics.
   * @return metadata for each topic in the request.
   */
  public kafka.javaapi.TopicMetadataResponse send(kafka.javaapi.TopicMetadataRequest request);

  /**
   * Get a list of valid offsets (up to maxSize) before the given time.
   *
   * @param request a [[kafka.javaapi.OffsetRequest]] object.
   * @return a [[kafka.javaapi.OffsetResponse]] object.
   */
  public kafka.javaapi.OffsetResponse getOffsetsBefore(OffsetRequest request);

  /**
   * Close the SimpleConsumer.
   */
  public void close();
}

Apache Samza

Samza

Samza is a stream processing framework with the following features:
- Simple API: it provides a very simple callback-based process-message API comparable to MapReduce.
- Managed state: Samza manages snapshotting and restoration of a stream processor's state.
- Fault tolerance: whenever a machine fails, Samza works with YARN to transparently migrate your tasks to another machine.
- Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.

Samza

Samza is a stream processing framework with the following features:
- Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
- Pluggable: Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
- Processor isolation: Samza works with Apache YARN.

Apache Samza from LinkedIn

Storm and Samza are fairly similar. Both systems provide:
1. a partitioned stream model,
2. a distributed execution environment,
3. an API for stream processing,
4. fault tolerance,
5. Kafka integration.

Samza

Samza components:
- Streams: a stream is composed of immutable messages of a similar type or category.
- Jobs: code that performs a logical transformation on a set of input streams to append output messages to a set of output streams.

Samza parallel components:
- Partitions: each stream is broken into one or more partitions. Each partition in the stream is a totally ordered sequence of messages.
- Tasks: a job is scaled by breaking it into multiple tasks. The task is the unit of parallelism of the job, just as the partition is to the stream.

Samza Figure: Dataflow Graphs

Samza and YARN

Samza Figure: Samza, YARN, and Kafka integration

Samza API

package com.example.samza;

public class MyTaskClass implements StreamTask {

  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // process message
  }
}
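
The process callback above is left empty on the slide. As a sketch under assumed names (the "kafka" system and the "output-topic" stream are invented for illustration), a task that forwards each incoming message to an output stream might look like this:

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class EchoTask implements StreamTask {
    // output stream in the "kafka" system; both names are assumptions
    private static final SystemStream OUTPUT = new SystemStream("kafka", "output-topic");

    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // forward the incoming message unchanged to the output stream
        collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getMessage()));
    }
}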

Samza API

# Job
job.factory.class=org.apache.samza.job.local.ThreadJobFactory
job.name=hello-world

# Task
task.class=samza.task.example.StreamTask
task.inputs=example-system.example-stream

# Serializers
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory

# Systems
systems.example-system.samza.factory=samza.stream.example.ExampleConsumerFactory
systems.example-system.samza.key.serde=string
systems.example-system.samza.msg.serde=json

Apache Storm

Apache S4 from Yahoo: no longer an active project.

Apache Storm Stream, Spout, Bolt, Topology

Storm

Storm cluster nodes:
- Nimbus node (master node, similar to the Hadoop JobTracker):
  - uploads computations for execution
  - distributes code across the cluster
  - launches workers across the cluster
  - monitors computation and reallocates workers as needed
- ZooKeeper nodes: coordinate the Storm cluster
- Supervisor nodes: communicate with Nimbus through ZooKeeper; start and stop workers according to signals from Nimbus

Storm

Storm abstractions:
- Tuples: an ordered list of elements.
- Streams: an unbounded sequence of tuples.
- Spouts: sources of streams in a computation.
- Bolts: process input streams and produce output streams. They can run functions; filter, aggregate, or join data; or talk to databases.
- Topologies: the overall calculation, represented visually as a network of spouts and bolts.

Storm

Main Storm groupings:
- Shuffle grouping: tuples are randomly distributed, but each bolt is guaranteed to get an equal number of tuples.
- Fields grouping: the stream is partitioned by the fields specified in the grouping.
- Partial Key grouping: the stream is partitioned by the fields specified in the grouping, but tuples are load-balanced between two downstream bolts.
- All grouping: the stream is replicated across all of the bolt's tasks.
- Global grouping: the entire stream goes to the task with the lowest id.

Storm API

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

Config conf = new Config();
StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
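
The RandomSentenceSpout used above is not defined on the slides. A minimal sketch of such a spout, assuming the backtype.storm packages of the Storm 0.9.x era, could look like this:

import java.util.Map;
import java.util.Random;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;
    private static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100);  // avoid busy-looping between emits
        // emit one random sentence as a single-field tuple
        collector.emit(new Values(SENTENCES[random.nextInt(SENTENCES.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}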

Storm API

public static class SplitSentence extends ShellBolt implements IRichBolt {

  public SplitSentence() {
    super("python", "splitsentence.py");
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }

  @Override
  public Map<String, Object> getComponentConfiguration() {
    return null;
  }
}

Storm API

public static class WordCount extends BaseBasicBolt {

  Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null) count = 0;
    count++;
    counts.put(word, count);
    collector.emit(new Values(word, count));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word", "count"));
  }
}

Apache Storm

Storm characteristics for real-time data processing workloads:
1. Fast
2. Scalable
3. Fault-tolerant
4. Reliable
5. Easy to operate

Twitter Heron

Twitter Heron

Heron includes these features:
1. Off-the-shelf scheduler
2. Handling spikes and congestion
3. Easy debugging
4. Compatibility with Storm
5. Scalability and latency

Twitter Heron Figure: Heron Architecture

Twitter Heron Figure: Topology Architecture

Twitter Heron Figure: Throughput with acks enabled

Twitter Heron Figure: Latency with acks enabled

Twitter Heron

Twitter Heron highlights:
1. Able to re-use the code written using Storm
2. Efficient in terms of resource usage
3. 3x reduction in hardware
4. Not open-source