Apache Flink Training. System Overview

Apache Flink
A stream processor with many applications
(Diagram: applications on top of the streaming dataflow runtime)

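1 year of Flink - code
(Figure: the Flink stack in April 2014 and in April 2015)
April 2014: DataSet API (Java/Scala); Flink core; Local, Remote, Yarn
April 2015:
- On the DataSet API (Java/Scala): Hadoop M/R, Python, Gelly, Table, ML, Dataflow, MRQL
- On the DataStream API (Java/Scala): Table, SAMOA, Dataflow
- Flink core
- Local, Remote, Yarn, Tez, Embedded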

Flink Community
In top 5 of Apache's big data projects after one year in the Apache Software Foundation
(Chart: number of unique contributor IDs by git commits, Aug 2010 to Jul 2015)

The Apache Way
- Flink is an Apache top-level project
- Independent, non-profit organization
- Community-driven open source software development approach
- Public communication and open to new contributors

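What is Apache Flink?
(Figure: the Flink stack, top to bottom)
- On the DataSet API: Hadoop M/R, Table, Gelly, ML, Dataflow/Beam, MRQL, Cascading
- On the DataStream API: Table, SAMOA, Dataflow/Beam
- DataSet (Java/Scala/Python), DataStream (Java/Scala)
- Streaming dataflow runtime
- Local, Remote, Yarn, Tez, Embedded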

Native workload support
(Diagram: Flink at the center of the following workloads)
- Heavy batch jobs
- Machine learning at scale
- Streaming topologies
How can an engine natively support all these workloads? And what does "native" mean?

E.g.: Non-native iterations
    for (int i = 0; i < maxIterations; i++) {
        // Execute MapReduce job
    }
(Diagram: the client drives the loop and launches one step after another)

E.g.: Non-native streaming
    // stream discretizer
    while (true) {
        // get next few records
        // issue batch job
    }
(Diagram: the discretizer issues one batch job after another)
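The discretizer loop above can be made concrete with a small, self-contained sketch. It is plain Java and not the API of any real engine: the class name `StreamDiscretizer`, the `batchSize` parameter, and the `batchJob` callback are all assumptions introduced for illustration. A finite iterator stands in for the unbounded stream so that the loop terminates.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration of non-native (mini-batch) streaming. */
final class StreamDiscretizer {

    /**
     * Cuts the incoming records into batches of at most batchSize elements
     * and hands each batch to batchJob. Returns the number of jobs issued.
     */
    static <T> int run(Iterator<T> records, int batchSize, Consumer<List<T>> batchJob) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int jobs = 0;
        // Stands in for `while (true)` on an unbounded stream.
        while (records.hasNext()) {
            List<T> batch = new ArrayList<>(batchSize);
            // "get next few records"
            while (batch.size() < batchSize && records.hasNext()) {
                batch.add(records.next());
            }
            // "issue batch job"
            batchJob.accept(batch);
            jobs++;
        }
        return jobs;
    }
}
```

Every record waits until its batch is full (or the input ends) before any processing starts, which is where the extra latency of this approach comes from.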

Native workload support
(Diagram: Flink at the center of the following workloads)
- Batch processing
- Machine learning at scale
- Stream processing
- Graph analysis
How can an engine natively support all these workloads? And what does "native" mean?

Flink Engine
1. Execute everything as streams
2. Iterative (cyclic) dataflows
3. Mutable state (state + computation)
4. Operate on managed memory
5. Special code paths for batch
(Figure: two execution plans over the sources orders.tbl and lineitem.tbl, built from Map, Filter, hybrid hash Join (build hash table, probe), Combine, and GroupReduce with sort, connected by forward, broadcast, and hash-partition channels)

What is a Flink Program?

Flink stack
- On the DataSet API: Hadoop M/R, Table, Gelly, ML, Dataflow, MRQL, Cascading (WiP)
- On the DataStream API: Table, SAMOA, Dataflow (WiP)
- DataSet (Java/Scala/Python), DataStream (Java/Scala)
- Streaming dataflow runtime
- Local, Remote, Yarn, Tez, Embedded

Basic API Concept
Source -> DataStream -> Operation -> DataStream -> Sink
Source -> DataSet -> Operation -> DataSet -> Sink
How do I write a Flink program?
1. Bootstrap sources
2. Apply operations
3. Output to sink
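The three steps can be sketched as a word count. This is only an analogy written with plain Java collections and `java.util.stream`; it does not use Flink's DataSet or DataStream API, and the class name `ThreeStepSketch` is an assumption made for this example. The input list plays the role of the source, the stream operators play the role of the operations, and the returned map plays the role of the sink.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical analogy for source -> operation -> sink; not Flink code. */
final class ThreeStepSketch {

    static Map<String, Long> wordCount(List<String> source) {
        // 1. Bootstrap a source: here simply an in-memory collection of lines.
        return source.stream()
            // 2. Apply operations: split lines into words (map side) ...
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
            .filter(word -> !word.isEmpty())
            // ... then group by word and count (reduce side).
            // 3. Output to a sink: a sorted map stands in for a file or database.
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }
}
```

Calling `ThreeStepSketch.wordCount(Arrays.asList("to be or", "not to be"))` counts `to` and `be` twice each and `or` and `not` once each.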

Batch & Stream Processing
DataSet API example: Map/Reduce paradigm
(Diagram: input records pass through Map and then Reduce)
DataStream API example: live stock feed
Stock feed (Name, Price): Google 516, Microsoft 124, Apple 235
- Alert if Microsoft > 120
- Write event to database
- Sum every 10 seconds
- Alert if sum > 10000
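The stock feed rules can be sketched in plain Java. This is not Flink's DataStream API: the `Quote` class, the method names, and the use of event timestamps in milliseconds are assumptions for illustration. The slide does not say which value is summed every 10 seconds; the sketch assumes it is the quoted prices, grouped into non-overlapping (tumbling) windows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the live stock feed example; not Flink code. */
final class StockFeedSketch {

    static final class Quote {
        final long timestampMillis;
        final String name;
        final long price;

        Quote(long timestampMillis, String name, long price) {
            this.timestampMillis = timestampMillis;
            this.name = name;
            this.price = price;
        }
    }

    /** Rule "Alert if Microsoft > 120": quotes of one stock above a threshold. */
    static List<Quote> priceAlerts(List<Quote> feed, String name, long threshold) {
        List<Quote> alerts = new ArrayList<>();
        for (Quote q : feed) {
            if (q.name.equals(name) && q.price > threshold) {
                alerts.add(q);
            }
        }
        return alerts;
    }

    /** Rule "Sum every 10 seconds": price sum per tumbling window, keyed by window start. */
    static Map<Long, Long> windowSums(List<Quote> feed, long windowMillis) {
        Map<Long, Long> sums = new TreeMap<>();
        for (Quote q : feed) {
            long windowStart = Math.floorDiv(q.timestampMillis, windowMillis) * windowMillis;
            sums.merge(windowStart, q.price, Long::sum);
        }
        return sums;
    }

    /** Rule "Alert if sum > 10000": start times of windows whose sum exceeds a threshold. */
    static List<Long> sumAlerts(Map<Long, Long> sums, long threshold) {
        List<Long> windows = new ArrayList<>();
        for (Map.Entry<Long, Long> e : sums.entrySet()) {
            if (e.getValue() > threshold) {
                windows.add(e.getKey());
            }
        }
        return windows;
    }
}
```

A streaming engine would evaluate these rules continuously as quotes arrive; the sketch evaluates them over a finite list only to keep the example runnable.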

Streaming & Batch
                 Streaming    Batch
Input            infinite     finite
Data transfer    pipelined    blocking or pipelined
Latency          low          high

Scaling out
(Diagram: the Source -> DataSet -> Operation -> DataSet -> Sink pipeline replicated eight times in parallel)

Scaling up

Sources (selection)
- Collection-based: fromCollection, fromElements
- File-based: TextInputFormat, CsvInputFormat
- Other: SocketInputFormat, KafkaInputFormat, databases

Sinks (selection)
- File-based: TextOutputFormat, CsvOutputFormat, PrintOutput
- Others: SocketOutputFormat, KafkaOutputFormat, databases

Hadoop Integration
Out of the box:
- Access HDFS
- Yarn execution (covered later)
- Reuse data types (Writables)
With a thin wrapper:
- Reuse Hadoop input and output formats
- Reuse functions like Map and Reduce

What's the Lifecycle of a Program?

From Program to Dataflow
    case class Path(from: Long, to: Long)
    val tc = edges.iterate(10) {
      paths: DataSet[Path] =>
        val next = paths
          .join(edges).where("to").equalTo("from") {
            (path, edge) => Path(path.from, edge.to)
          }
          .union(paths)
          .distinct()
        next
    }
(Diagram: Program -> type extraction stack -> optimizer (pre-flight, on the client) -> dataflow graph -> master (deploy operators, task scheduling, dataflow metadata, tracking of intermediate results) -> workers)
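What the Scala program on this slide computes can be spelled out with plain Java sets. The sketch below is not Flink code and says nothing about how Flink executes the iteration; the class name `TransitiveClosureSketch` and the use of `Map.Entry<Long, Long>` as a (from, to) pair are assumptions for illustration. Each pass joins the current paths with the edges where a path's `to` equals an edge's `from`, unions the result with the existing paths, and removes duplicates.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical plain-Java rendering of the iterate/join/union/distinct program. */
final class TransitiveClosureSketch {

    /** A path is a (from, to) pair; key = from, value = to. */
    static Map.Entry<Long, Long> path(long from, long to) {
        return new SimpleImmutableEntry<Long, Long>(from, to);
    }

    static Set<Map.Entry<Long, Long>> closure(Set<Map.Entry<Long, Long>> edges, int maxIterations) {
        // The iteration starts from the edges themselves, as in edges.iterate(...).
        Set<Map.Entry<Long, Long>> paths = new HashSet<>(edges);
        for (int i = 0; i < maxIterations; i++) {
            // Copying into a set covers union(paths) and distinct().
            Set<Map.Entry<Long, Long>> next = new HashSet<>(paths);
            for (Map.Entry<Long, Long> p : paths) {
                for (Map.Entry<Long, Long> e : edges) {
                    // join ... where("to").equalTo("from")
                    if (p.getValue().equals(e.getKey())) {
                        next.add(path(p.getKey(), e.getValue()));
                    }
                }
            }
            paths = next;
        }
        return paths;
    }
}
```

For the edges 1 -> 2, 2 -> 3, and 3 -> 4, the result also contains 1 -> 3, 2 -> 4, and 1 -> 4.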

Architecture Overview
- Client
- Master (Job Manager)
- Worker (Task Manager)
(Diagram: a client connected to the Job Manager, which coordinates three Task Managers)

Client
- Optimize
- Construct job graph
- Pass job graph to job manager
- Retrieve job results
(Diagram: the transitive-closure program shown earlier passes through type extraction and the optimizer on the client, which hands the resulting plan to the Job Manager)

Job Manager
- Parallelization: create execution graph
- Scheduling: assign tasks to task managers
- State: supervise the execution
(Diagram: the Job Manager expands the plan into parallel instances of each operator and distributes them across the Task Managers)

Task Manager
- Operations are split up into tasks depending on the specified parallelism
- Each parallel instance of an operation runs in a separate task slot
- The scheduler may run several tasks from different operators in one task slot
(Diagram: three Task Managers, each divided into slots)
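The relationship between parallelism, tasks, and slots can be illustrated with a toy assignment. The policy below (parallel instance i of every operator goes to slot i) is an assumption chosen because it satisfies the three statements on the slide; it is not taken from Flink's scheduler, and the class name `SlotAssignmentSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model of task-to-slot assignment; not Flink's scheduler. */
final class SlotAssignmentSketch {

    /**
     * operators maps an operator name to its parallelism.
     * Returns, for each slot index, the tasks placed in that slot.
     */
    static List<List<String>> assign(Map<String, Integer> operators, int slots) {
        List<List<String>> bySlot = new ArrayList<>();
        for (int s = 0; s < slots; s++) {
            bySlot.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> op : operators.entrySet()) {
            int parallelism = op.getValue();
            if (parallelism > slots) {
                throw new IllegalArgumentException(
                    "not enough slots for operator " + op.getKey());
            }
            // One task per parallel instance; instances of the same operator
            // land in different slots, while a slot may hold tasks of
            // several different operators.
            for (int i = 0; i < parallelism; i++) {
                bySlot.get(i).add(op.getKey() + "[" + i + "]");
            }
        }
        return bySlot;
    }
}
```

With operators Source (parallelism 2), Map (parallelism 2), and Sink (parallelism 1) on three slots, slot 0 holds `Source[0]`, `Map[0]`, and `Sink[0]`, slot 1 holds `Source[1]` and `Map[1]`, and slot 2 stays empty.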

Execution Setups

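Ways to Run a Flink Program
(Figure: the Flink stack; the bottom layer lists the ways to run a program)
- On the DataSet API: Hadoop M/R, Table, Gelly, ML, Dataflow/Beam, MRQL, Cascading (WiP)
- On the DataStream API: Table, SAMOA, Dataflow/Beam
- DataSet (Java/Scala/Python), DataStream (Java/Scala)
- Streaming dataflow runtime
- Local, Remote, Yarn, Tez, Embedded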

Local Execution
- Starts a local Flink cluster
- All processes run in the same JVM
- Behaves just like a regular cluster
- Very useful for developing and debugging
(Diagram: a Job Manager and four Task Managers inside one JVM)

Embedded Execution
- Runs operators on simple Java collections
- Lower overhead
- Does not use memory management
- Useful for testing and debugging

Remote Execution
- Submit a job remotely
- Monitor the status of a job
(Diagram: the client submits a job to the Job Manager of a cluster with four Task Managers)

YARN Execution
- Multi-user scenario
- Resource sharing
- Uses YARN containers to run a Flink cluster
- Easy to set up
(Diagram: a YARN cluster with a Resource Manager and several Node Managers; the Node Managers host the Flink Job Manager, the Task Managers, and another application; the client connects from outside)

Tez Execution
- Leverages Apache Tez's runtime
- Built on top of YARN
- Good YARN citizen
- Fast path to elastic deployments
- Slower than native Flink

Flink compared to other projects

Batch & Streaming projects
(Figure: project logos grouped into three categories)
- Batch only
- Streaming only
- Hybrid

Batch comparison
(The column headings were logos on the slide; the project names below are inferred from the row contents.)
                    Hadoop MapReduce      Spark                 Flink
API                 low-level             high-level            high-level
Data transfer       batch                 batch                 pipelined & batch
Memory management   disk-based            JVM-managed           active managed
Iterations          file system cached    in-memory cached      streamed
Fault tolerance     task level            task level            job level
Good at             massive scale out     data exploration      heavy backend & iterative jobs
Libraries           many external         built-in & external   evolving built-in & external

Streaming comparison
(The column headings were logos on the slide; the project names below are inferred from the row contents.)
                  Storm               Spark Streaming       Flink
Streaming         true                mini batches          true
API               low-level           high-level            high-level
Fault tolerance   tuple-level ACKs    RDD-based (lineage)   coarse checkpointing
State             not built-in        external              internal
Exactly once      at least once       exactly once          exactly once
Windowing         not built-in        restricted            flexible
Latency           low                 medium                low
Throughput        medium              high                  high

Thank you for listening!