Big Data Research in Berlin BBDC and Apache Flink



Similar documents
Towards a Thriving Data Economy: Open Data, Big Data, and Data Ecosystems

Apache Flink Next-gen data analysis. Kostas

Big Data Analytics. Chances and Challenges. Volker Markl

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Welcome to the first Workshop on Big data Open Source Systems (BOSS)

Apache Flink. Fast and Reliable Large-Scale Data Processing

Big Data looks Tiny from the Stratosphere

How Companies are! Using Spark

Unified Big Data Processing with Apache Spark. Matei

The Stratosphere Big Data Analytics Platform

Hadoop Ecosystem B Y R A H I M A.

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Challenges for Data Driven Systems

Moving From Hadoop to Spark

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Unified Big Data Analytics Pipeline. 连 城

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Brave New World: Hadoop vs. Spark

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Scaling Out With Apache Spark. DTL Meeting Slides based on

HDP Hadoop From concept to deployment.

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Spark and the Big Data Library

How To Get The Most Out Of Big Data

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

Managing large clusters resources

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Big Data. Lyle Ungar, University of Pennsylvania

Large scale processing using Hadoop. Ján Vaňo

Architectures for massive data management

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Implement Hadoop jobs to extract business value from large and varied data sets

Ali Ghodsi Head of PM and Engineering Databricks

The Internet of Things and Big Data: Intro

Hadoop and Map-Reduce. Swati Gore

How To Scale Out Of A Nosql Database

Beyond Hadoop with Apache Spark and BDAS

BIG DATA What it is and how to use?

Spark: Cluster Computing with Working Sets

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

How To Create A Data Visualization With Apache Spark And Zeppelin

Hadoop & Spark Using Amazon EMR

Big Data and Data Science: Behind the Buzz Words

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Big Data With Hadoop

Integrating a Big Data Platform into Government:

Big Data Explained. An introduction to Big Data Science.

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Advanced Big Data Analytics with R and Hadoop

Big Data Research in the AMPLab: BDAS and Beyond

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

ANALYTICS CENTER LEARNING PROGRAM

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

CSE-E5430 Scalable Cloud Computing Lecture 2

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

The Database Systems and Information Management Group at Technische Universität Berlin

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Upcoming Announcements

What s next for the Berkeley Data Analytics Stack?

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Roadmap Talend : découvrez les futures fonctionnalités de Talend

Hadoop implementation of MapReduce computational model. Ján Vaňo

Case Study : 3 different hadoop cluster deployments

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Unified Batch & Stream Processing Platform

Big Data and Scripting Systems build on top of Hadoop

Real Time Data Processing using Spark Streaming

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Big Data and Scripting Systems beyond Hadoop

Next-Gen Big Data Analytics using the Spark stack

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Introduction to Spark

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Apache Mahout's new DSL for Distributed Machine Learning. Sebastian Schelter GOTO Berlin 11/06/2014

HADOOP. Revised 10/19/2015

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Big Data and Analytics: Challenges and Opportunities

Architectures for Big Data Analytics A database perspective

HiBench Introduction. Carson Wang Software & Services Group

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Play with Big Data on the Shoulders of Open Source

HADOOP IN ENTERPRISE FUTURE-PROOF YOUR BIG DATA INVESTMENTS WITH CASCADING. Supreet Oberoi Nov. 4-6, 2014 Big Data Expo Santa Clara

DRIVING INNOVATION THROUGH DATA ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING

The basic data mining algorithms introduced may be enhanced in a number of ways.

YARN Apache Hadoop Next Generation Compute Platform

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Transcription:

Big Data Research in Berlin BBDC and Apache Flink Tilmann Rabl rabl@tu-berlin.de dima.tu-berlin.de bbdc.berlin 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2015

Agenda About Data Management, Big Data, and Data Scientists About the Berlin Big Data Center Apache Flink the Next Generation of Big Data Analytics Technologies a technology-driven perspective 2 2 DIMA 2015

ML ML Data & Analysis: Increasingly Complex! scalability DM data volume too large data rate too fast data too heterogeneous Volume Velocity Variability Reporting Ad-Hoc Queries ETL/ELT aggregation, selection SQL, XQuery MapReduce DM scalability algorithms data too uncertain Veracity Data Mining MATLAB, R, Python Predictive/Prescriptive MATLAB, R, Python Data Analysis algorithms 3 2013 Berlin Big Data Center All Rights Reserved 3 DIMA 2015

Simple Analysis Deep Analytics Deep Analysis of Big Data Small Data Big Data (3V) 4 2013 Berlin Big Data Center All Rights Reserved 4 DIMA 2015

Databases Big Data Tables Tables and unstructured files Schema on read Parallel More parallel, commodity, shared clusters Mid-query fault tolerance, resource allocation SQL SQL and Java, Scala, Python, you name it General object manipulation Data Warehousing Logs, ML, Graphs, also DW Iterative processing, user-defined functions 5 5 DIMA 2015 5

Data Scientist Jack of All Trades! Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics) Mathematical Programming Linear Algebra Stochastic Gradient Descent Error Estimation Active Sampling Regression Monte Carlo Statistics Sketches Hashing Convergence Decoupling Iterative Algorithms Curse of Dimensionality Application Data Science Relational Algebra / SQL Data Warehouse/OLAP NF 2 /XQuery Resource Management Hardware Adaptation Fault Tolerance Memory Management Parallelization Scalability Memory Hierarchy Data Analysis Language Compiler Query Optimization Indexing Data Flow Control Flow Real-Time 6 2013 Berlin Big Data Center All Rights Reserved 6 DIMA 2015

Big Data Analytics Requires Systems Programming Data Analysis Statistics Algebra Optimization Machine Learning NLP Signal Processing Image Analysis Audio-,Video Analysis Information Integration Information Extraction Data Value Chain Data Analysis Process Predictive Analytics R/Matlab: 3 million users Big Data is now where database systems were in the 70s (prior to relational algebra, query optimization and a SQL-standard)! Hadoop: 100,000 users People with Big Data Analytics Skills Indexing Parallelization Communication Memory Management Query Optimization Efficient Algorithms Resource Management Fault Tolerance Numerical Stability Declarative languages to the rescue! 7 7 DIMA 2015

Beuth Hochschule ZIB Berlin Christof Schütte Informationbased Medicine Alexander Reinefeld File Systems, Supercomputing Stefan Edlich Software Engineering Fritz-Haber-Institut, Max-Planck-Gesellschaft Technische Universität Berlin, Data Analytics Lab DFKI Matthias Scheffler Material Science Volker Markl Data Management Klaus R. Müller Machine Learning Anja Feldmann Computer Networks Odej Kao Distributed Systems Thomas Wiegand Video Mining Hans Uszkoreit Language Technology 8 2013 Berlin Big Data Center All Rights Reserved 8 DIMA 2015

Machine Learning + Data Management = X Mathematical Programming Linear Algebra Error Estimation Active Sampling Regression Monte Carlo Feature Engineering Representation Algorithms (SVM, GPs, etc.) Statistic Sketches Hashing Isolation Convergence Curse of Dimensionality Iterative Algorithms Control flow ML Technology X Think ML-algorithms in a scalable way declarative Process iterative algorithms in a scalable way DM Goal: Data Analysis without System Programming! Relational Algebra/SQL Data Warehouse/OLAP NF 2 /XQuery Scalability Hardware adaption Fault Tolerance Resource Management Declarative Languages Automatic Adaption Scalable processing Parallelization Data Analysis Language Query Optimization Dataflow Compiler Memory Management Memory Hierarchy Indexing 9 2013 Berlin Big Data Center All Rights Reserved 9 DIMA 2015

X = Big Data Analytics System Programming! ( What, not How ) Data Analyst Larger human base of data scientists Reduction of human latencies Cost reduction Machine Description of What? (declarative specification) Technology X Think ML-algorithms in a scalable way Analysis of data in motion Multimodal analysis Numerical stability Technology X Declarative specification Automatic optimization, parallelization and hardware adaption of dataflow and control flow with user-defined functions, iterations and distributed state Scalable algorithms and debugging Description of How? (State of the art in scalable data analysis) Hadoop, MPI Algorithmic fault tolerance Consistent intermediate results Software-defined networking process Iterative algorithms in a scalable way 10 10 DIMA 2015

Application Examples: Technology Drivers and Validation Application example: Marketplace for information text data flows economics-based Technology X windowing multimodal data Application Example: Health society-based integration: video, images, text Think ML-algorithms in a scalable way Declarative Process iterative algorithms in a scalable way hierarchical numerical simulation data numerical stability Application Example: Material science science-based 11 2013 Berlin Big Data Center All Rights Reserved 11 DIMA 2015

Introducing Apache Flink A Success Story from Berlin, Germany, and Europe Project started under the name Stratosphere late 2008 as a DFG funded research unit, comprised of TU Berlin, HU Berlin, and the Hasso Plattner Institute Potsdam. Apache Open Source Incubation since April 2014, Apache Top Level Project since December 2014 Fast growing community of open source users and developers in Europe and worldwide, in academia (e.g., SICS/KTH, INRIA, ELTE) and companies (e.g., Researchgate, Spotify, Amadeus) More information: http://flink.apache.org 12 12 DIMA 2015

Apache Flink: General Purpose Programming + Database Execution Draws on Database Technology Add Draws on MapReduce Technology Declarativity Query optimization Robust out-of-core Iterations Advanced Dataflows General APIs Scalability User-defined functions Complex data types Schema on read Alexandrov et al.: The Stratosphere Platform for Big Data Analytics, VLDB Journal 5/2014 http://flink.apache.org 13 DIMA 2015 13

What can it be used for? Batch processing Machine Learning at scale Stream processing Graph Analysis Flink An engine that can natively support all these workloads. 14 14 DIMA 2015

Flink in the Analytics Ecosystem Applications Hive Mahout Cascading Pig Giraph Crunch Data processing engines MapReduce Spark Storm Flink Tez App and resource management Yarn Mesos Storage, streams HDFS HBase Kafka 15 15 15 2013 Berlin Big Data Center All Rights Reserved 15 DIMA 2015

Flink Program Program Compiler Data Flow Execution Plan Flink Optimizer Picks data shipping and local strategies, operator order Job Graph Execution Graph Runtime Hash- and sort-based out-ofcore operator implementations, memory management Parallel Runtime Task scheduling, network data transfers, resource allocation 16 DIMA 2015

Built-in vs. driver-based looping Client Loop outside the system, in driver program Step Step Step Step Step Client Iterative program looks like many independent jobs Step Step Step Step Step red. Dataflows with feedback edges Flink map join join System is iterationaware, can optimize the job 17 17 DIMA 2015

Effect of optimization Execution Plan A Hash vs. Sort Partition vs. Broadcast Caching Reusing partition/sort Run on a sample on the laptop Execution Plan B Run on large files on the cluster Execution Plan C Run a month later after the data evolved 18 18 2013 Berlin Big Data Center All Rights Reserved 18 DIMA 2015

Flink Optimizer Transitive Closure replace Co-locate DISTINCT + JOIN Iterate Forward HDF S Hybrid Hash Join Group new Distinc Reduce (Sorted (on [0])) Paths Join Union t Hash Partition on [1] Co-locate JOIN + UNION Hash Partition on [1] Step function Hash Partition on [0] paths Loop-invariant data cached in memory What you write is not what is executed No need to hardcode execution strategies Flink Optimizer decides: Pipelines and dam/barrier placement Sort- vs. hash- based execution Data exchange (partition vs. broadcast) Data partitioning steps In-memory caching 19 19 DIMA 2015

Native Streaming Flink execution engine is pipelined (streaming) can implement true streaming & batch Usually distributed execution engines materializes intermediate results (batch) can only do micro-batch Streaming Batch Micro-Batch 20 20 DIMA 2015

Micro Batching vs Native Streaming Discretized Streams (D-Streams) while (true) { // get next few records // issue batch computation } Stream discretizer Job Job Job Job Native streaming while (true) { // process next record } Long-standing operators 21 21 DIMA 2015

Stream Discretization Data is unbound Interested in a (recent) part of it e.g. last 10 days Most common windows around: time, and count Mostly in sliding, fixed, and tumbling form Need for data-driven window definitions e.g., user sessions (periods of user activity followed by inactivity), price changes, etc. The world beyond batch: Streaming 101, Tyler Akidau https://beta.oreilly.com/ideas/the-world-beyond-batchstreaming-101 Great read! 22 22 DIMA 2015

Community Flink started as the Stratosphere project in in 2009, led by TU Berlin. Enterer incubation April 2014 graduated on December 2014. Now one of the most active big data projects after over a year in the Apache Software Foundation. 23 23 DIMA 2015 23

Flink Forward 2015 is the first conference to bring together the Apache Flink developer and user community Companies Use cases: Bouygues Telecom, Huawei, Telefonica, Amadeus Travel Intelligence, Ericsson, Otto Group, ResearchGate Companies analyse and compare Flink with other tools: CapitalOne, NFLabs IBM, Zalando, MongoDB Open source projects integrated with Flink: Docker, Apache Mahout, Apache SAMOA, Apache Zeppelin, Apache BigTop, Cascading, MongoDB, Apache Storm, Google Cloud Dataflow. Training Sessions: DataStream API, DataSet API, Gelly School: Large-scale graph analysis 24 24 DIMA 2015

25 25 DIMA 2015

http://flink.apache.org 26 26 DIMA 2015

Thank You Contact: Tilmann Rabl rabl@tu-berlin.de 27 2013 Berlin Big Data Center All Rights Reserved DIMA 2015