Apache Flink: Fast and Reliable Large-Scale Data Processing
Fabian Hueske, @fhueske
What is Apache Flink?
- Distributed data flow processing system
- Focused on large-scale data analytics
- Real-time stream and batch processing
- Easy and powerful APIs (Java / Scala)
- Robust execution backend
What is Flink good at?
It's a general-purpose data analytics system:
- Real-time stream processing with flexible windows
- Complex and heavy ETL jobs
- Analyzing huge graphs
- Machine learning on large data sets
- ...
Flink in the Hadoop Ecosystem
[Stack diagram:]
- Libraries: Table API, Gelly Library, ML Library, Apache MRQL, Dataflow, Apache SAMOA
- Flink Core: DataSet API (Java/Scala), DataStream API (Java/Scala), Optimizer, Stream Builder, Runtime
- Environments: Embedded (Local), Cluster, YARN, Apache Tez
- Data Sources: HDFS, HCatalog, Hadoop IO, JDBC, Apache HBase, Apache Kafka, Apache Flume, S3, RabbitMQ, ...
Flink in the ASF
- Flink entered the ASF about one year ago: 04/2014 incubation, 12/2014 graduation
- Strongly growing community
[Chart: number of unique git committers (without manual de-duplication), rising steadily from Nov. 2010 to Dec. 2014]
Where is Flink moving?
A "use-case complete" framework to unify batch & stream processing:
- Inputs: data streams (Kafka, RabbitMQ, ...) and historic data (HDFS, JDBC, ...)
- Analytical workloads: ETL, relational processing, graph analysis, machine learning, streaming data analysis
Goal: Treat batch as a finite stream
Programming Model & APIs
HOW TO USE FLINK?
Unified Java & Scala APIs
- Fluent and mirrored APIs in Java and Scala
- Table API for relational expressions
- Batch and streaming APIs almost identical...
- ... with slightly different semantics in some cases
DataSets and Transformations
(Data flow: Input -> filter -> First -> map -> Second)

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> input = env.readTextFile(inputPath);
DataSet<String> first = input.filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first.map(str -> str.toLowerCase());
second.print();

env.execute();
Expressive Transformations
- Element-wise: map, flatMap, filter, project
- Group-wise: groupBy, reduce, reduceGroup, combineGroup, mapPartition, aggregate, distinct
- Binary: join, coGroup, union, cross
- Iterations: iterate, iterateDelta
- Physical re-organization: rebalance, partitionByHash, sortPartition
- Streaming: window, windowMap, coMap, ...
Rich Type System
- Use any Java/Scala classes as a data type
- Tuples, POJOs, and case classes
- Not restricted to key-value pairs
- Define (composite) keys directly on data types: by field expression, by tuple position, or by key selector function
Counting Words in Batch and Stream

case class Word(word: String, frequency: Int)

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

DataStream API (streaming):

val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Count.of(1000)).every(Count.of(100))
  .groupBy("word").sum("frequency")
  .print()
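The same flatMap / groupBy / sum logic can be sketched in plain Java (no Flink involved) to show what the engine parallelizes; the class and method names here are illustrative, not Flink API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the word-count data flow above:
// flatMap (split lines into words), groupBy word, sum frequencies.
public class WordCountSketch {
    public static Map<String, Integer> countWords(String[] lines) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {   // flatMap step
                freq.merge(word, 1, Integer::sum);  // groupBy + sum step
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(countWords(new String[] {"to be or", "not to be"}));
        // prints {to=2, be=2, or=1, not=1}
    }
}
```

Flink distributes exactly this computation: the flatMap runs in parallel on partitions of the input, and the groupBy ships words with the same key to the same worker before summing.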
Table API
- Execute SQL-like expressions on table data
- Tight integration with the Java and Scala APIs
- Available for batch and streaming programs

val orders = env.readCsvFile(...)
  .as('oid, 'odate, 'shipprio)
  .filter('shipprio === 5)

val items = orders.join(lineitems)
  .where('oid === 'id)
  .select('oid, 'odate, 'shipprio,
          'extdprice * (Literal(1.0f) - 'discnt) as 'revenue)

val result = items.groupBy('oid, 'odate, 'shipprio)
  .select('oid, 'revenue.sum, 'odate, 'shipprio)
Libraries are emerging
As part of the Apache Flink project:
- Gelly: graph processing and analysis
- Flink ML: machine-learning pipelines and algorithms
- Libraries are built on the APIs and can be mixed with them
Outside of Apache Flink:
- Apache SAMOA (incubating)
- Apache MRQL (incubating)
- Google DataFlow translator
Processing Engine
WHAT IS HAPPENING INSIDE?
System Architecture
- Client (pre-flight): Flink program, type extraction stack, cost-based optimizer
- Master: recovery metadata, task scheduling, coordination
- Workers: memory manager, data serialization stack, out-of-core algorithms, pipelined or blocking data transfer
Cool technology inside Flink
- Batch and streaming in one system
- Memory-safe execution
- Built-in data flow iterations
- Cost-based data flow optimizer
- Flexible windows on data streams
- Type extraction and serialization utilities
- Static code analysis on user functions
- and much more...
Pipelined Data Transfer
STREAM AND BATCH IN ONE SYSTEM
Stream and Batch in one System
- Most systems are either stream or batch systems
- In the past, Flink focused on batch processing, but Flink's runtime has always done stream processing
- Operators pipeline data forward as soon as it is processed
- Some operators are blocking (such as sort)
- The Stream API and its operators are recent contributions, evolving very quickly under heavy development
Pipelined Data Transfer
Pipelined data transfer has many benefits:
- True stream and batch processing in one stack
- Avoids materialization of large intermediate results
- Better performance for many batch workloads
Flink supports blocking data transfer as well.
Pipelined Data Transfer
Program: Large Input -> map -> intermediate DataSet -> join (with Small Input) -> Result
Pipelined execution:
- Pipeline 1: Small Input -> join (build hash table)
- Pipeline 2: Large Input -> map -> join (probe hash table) -> Result
No intermediate materialization!
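The build/probe pattern above can be sketched in plain Java; this is an illustrative stand-in (the class name, record layout, and the doubling "map" step are invented for the example), not Flink's actual join operator:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the pipelined hash join: the small input is consumed first to
// build an in-memory hash table (pipeline 1); the large input is then
// streamed through a map step and probed record by record (pipeline 2),
// so the mapped intermediate result is never materialized.
public class PipelinedHashJoin {
    // Records are int[2] pairs: {joinKey, value}.
    public static List<String> join(List<int[]> small, List<int[]> large) {
        // Pipeline 1: build the hash table on the join key of the small input.
        Map<Integer, Integer> buildSide = new HashMap<>();
        for (int[] rec : small) buildSide.put(rec[0], rec[1]);

        // Pipeline 2: stream the large input, map, and probe immediately.
        List<String> result = new ArrayList<>();
        for (int[] rec : large) {
            int key = rec[0], value = rec[1] * 2;  // stand-in for map()
            Integer match = buildSide.get(key);     // probe the hash table
            if (match != null) result.add(key + ":" + value + "," + match);
        }
        return result;
    }
}
```

In Flink the two pipelines run on different workers and the probe side flows through the network as it is produced; only the build side must fit in (managed) memory or spill.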
Memory Management and Out-of-Core Algorithms
MEMORY-SAFE EXECUTION
Memory-safe Execution
Challenge of JVM-based data processing systems: OutOfMemoryErrors due to data objects on the heap.
Flink runs complex data flows without memory tuning:
- C++-style memory management
- Robust out-of-core algorithms
Managed Memory
Active memory management:
- Workers allocate 70% of JVM memory as byte arrays
- Algorithms serialize data objects into the byte arrays
- In-memory processing as long as data is small enough
- Otherwise partial destaging to disk
Benefits:
- Safe memory bounds (no OutOfMemoryError)
- Scales to very large JVMs
- Reduced GC pressure
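The core idea of serializing records into pre-allocated byte arrays can be sketched as follows; this is a toy illustration (class and method names invented here), far simpler than Flink's actual memory segments:

```java
import java.nio.ByteBuffer;

// Sketch of managed memory: records are serialized into a pre-allocated
// byte array instead of living as objects on the heap. When the segment is
// full, a real system would destage it to disk; here the caller is just
// told the segment is exhausted.
public class ManagedSegment {
    private final ByteBuffer segment;

    public ManagedSegment(int capacityBytes) {
        segment = ByteBuffer.allocate(capacityBytes); // fixed memory budget
    }

    // Returns false (instead of an OutOfMemoryError) when the budget is
    // exhausted -- the caller would then spill to disk.
    public boolean append(long record) {
        if (segment.remaining() < Long.BYTES) return false;
        segment.putLong(record);
        return true;
    }

    public long read(int index) {
        return segment.getLong(index * Long.BYTES); // deserialize on access
    }
}
```

Because the data lives in a few long-lived byte arrays rather than millions of short-lived objects, the garbage collector has almost nothing to trace, which is where the reduced GC pressure comes from.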
Going out-of-core
[Chart: single-core join of 1 KB Java objects beyond memory (4 GB); blue bars are in-memory runs, orange bars (partially) out-of-core runs]
Native Data Flow Iterations
GRAPH ANALYSIS
Native Data Flow Iterations
- Many graph and ML algorithms require iterations
- Flink features native data flow iterations: loops are not unrolled but executed as cyclic data flows
- Two types of iterations: bulk iterations and delta iterations
- Performance competitive with specialized systems
[Figure: small example graph with weighted edges]
Iterative Data Flows
- Flink runs iterations natively as cyclic data flows
- Operators are scheduled once
- Data is fed back through a backflow channel
- Loop-invariant data is cached
- Operator state is preserved across iterations!
[Figure: cyclic data flow in which each round's intermediate result of the join/reduce step replaces the initial result, joining with other, loop-invariant data sets]
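A bulk iteration can be sketched in plain Java as a step function whose result is fed back until convergence; this is a scalar toy (names invented here), whereas Flink applies the same pattern to whole distributed data sets:

```java
import java.util.function.UnaryOperator;

// Sketch of a bulk iteration: the step function is set up once and its
// intermediate result is fed back round after round, instead of unrolling
// the loop into n independently scheduled jobs.
public class BulkIteration {
    public static double iterate(double initial, UnaryOperator<Double> step,
                                 int maxIterations, double epsilon) {
        double current = initial;
        for (int i = 0; i < maxIterations; i++) {
            double next = step.apply(current);                   // backflow
            if (Math.abs(next - current) < epsilon) return next; // converged
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Newton's method for sqrt(2) as the step function.
        System.out.println(iterate(1.0, x -> (x + 2.0 / x) / 2.0, 100, 1e-12));
    }
}
```

The payoff of running this as one cyclic data flow is that operator setup, scheduling, and caches for loop-invariant inputs are paid once, not once per round.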
Delta Iterations
- A delta iteration computes a delta update of the solution set and a work set for the next iteration
- The work set drives the computation of the next iteration
- The workload of later iterations is significantly reduced, giving fast convergence
- Applicable to certain problem domains, e.g. graph processing
[Chart: number of elements updated per iteration, falling steeply from tens of millions toward zero over roughly 50 iterations]
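The work-set idea can be sketched with connected components in plain Java; this is an illustrative sequential version (names invented here), not Flink's delta-iteration operator:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a delta iteration: connected components where only vertices
// whose component id changed in the last round (the work set) propagate in
// the next round, so later rounds touch far fewer elements than a bulk
// iteration re-processing everything would.
public class DeltaIterationCC {
    // Vertices are 0..n-1; edges are undirected {u, v} pairs.
    public static int[] components(int n, int[][] edges) {
        int[] comp = new int[n];                 // solution set
        Set<Integer> workSet = new HashSet<>();
        for (int v = 0; v < n; v++) { comp[v] = v; workSet.add(v); }

        while (!workSet.isEmpty()) {
            Set<Integer> nextWorkSet = new HashSet<>();
            for (int[] e : edges) {
                // Only propagate from vertices that changed last round.
                for (int[] dir : new int[][] {{e[0], e[1]}, {e[1], e[0]}}) {
                    int src = dir[0], dst = dir[1];
                    if (workSet.contains(src) && comp[src] < comp[dst]) {
                        comp[dst] = comp[src];   // delta update of solution set
                        nextWorkSet.add(dst);    // work set for next iteration
                    }
                }
            }
            workSet = nextWorkSet;               // shrinks as labels settle
        }
        return comp;
    }
}
```

The work set shrinks each round exactly like the chart above: once a vertex's label stops changing, it drops out of the computation entirely.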
Iteration Performance
[Chart: runtime of PageRank on the Twitter follower graph, comparing 30 bulk iterations with 61 delta iterations (run to convergence)]
Roadmap
WHAT IS COMING NEXT?
Flink's Roadmap
Mission: unified stream and batch processing
- Exactly-once streaming semantics with flexible state checkpointing
- Extending the ML library
- Extending the graph library
- Interactive programs
- Integration with Apache Zeppelin (incubating)
- SQL on top of the expression language
- And much more
tl;dr: What's worth remembering?
- Flink is a general-purpose analytics system
- It unifies streaming and batch processing
- Expressive high-level APIs
- Robust and fast execution engine
I Flink, do you? ;-)
If you find this exciting, get involved and start a discussion on Flink's mailing list, or stay tuned by subscribing to news@flink.apache.org or following @ApacheFlink on Twitter.
BACKUP
Data Flow Optimizer
Database-style optimizations for parallel data flows; all batch programs are optimized.
Optimizations:
- Task chaining
- Join algorithms
- Re-use of partitioning and sorting for later operations
- Caching for iterations
Data Flow Optimizer

val orders = ...
val lineitems = ...

val filteredOrders = orders
  .filter(o => dateFormat.parse(o.shipDate).after(date))
  .filter(o => o.shipPrio > 2)

val lineitemsOfOrders = filteredOrders
  .join(lineitems)
  .where("orderId").equalTo("orderId")
  .apply((o, l) => new SelectedItem(o.orderDate, l.extdPrice))

val priceSums = lineitemsOfOrders
  .groupBy("orderDate")
  .sum("extdPrice")
Data Flow Optimizer
[Figure: two alternative execution plans for the query above.
Plan 1: filter orders.tbl, broadcast the filtered orders as the build side of a hybrid hash join with lineitem.tbl (forwarded probe side), partially pre-aggregate with a combiner (partial sort on [0,1]), then hash-partition on [0,1] into the final sorting reduce.
Plan 2: filter orders.tbl, hash-partition both inputs on the join key [0], run the hybrid hash join (build on the filtered orders, probe with lineitems), then sort on [0,1] into the reduce.
The best plan depends on the relative sizes of the input files.]
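The choice between the two plans can be sketched with a deliberately simplified cost model; this is an invented illustration counting only bytes shipped over the network, not Flink's actual cost functions:

```java
// Sketch of the cost-based choice between the two join plans above:
// broadcasting the small (filtered) input to every worker vs.
// hash-partitioning both inputs across the network.
public class JoinStrategyChooser {
    enum Strategy { BROADCAST_SMALL, REPARTITION_BOTH }

    public static Strategy choose(long smallBytes, long largeBytes, int workers) {
        // Broadcasting ships a full copy of the small input to every worker.
        long broadcastCost = smallBytes * (long) workers;
        // Repartitioning ships each input across the network once.
        long repartitionCost = smallBytes + largeBytes;
        return broadcastCost <= repartitionCost
                ? Strategy.BROADCAST_SMALL
                : Strategy.REPARTITION_BOTH;
    }
}
```

With a tiny filtered build side and a huge probe side, broadcasting wins because the large input never moves; as the build side grows, partitioning both inputs becomes cheaper, which is why the optimizer needs size estimates rather than a fixed rule.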