Welcome to the first Workshop on Big data Open Source Systems (BOSS)

1 Welcome to the first Workshop on Big data Open Source Systems (BOSS) September 4th, 2015 Co-located with VLDB 2015 Tilmann Rabl

2 Hands on Big Data. 8 parallel tutorials, 8 systems: open source, publicly available. Presenters: system experts. Hands on: this is not a demo! You can pick two!

3 But why? Initial idea: Malu Castellanos Mike Carey Doing It On Big Data: a Tutorial/Workshop Driving force Other people involved Volker Markl Norman Patton Lipyeow Lim Kerstin Forster Experiment Tell us what you think [email protected] MOAR SYSTEMS!

4 Presented Systems: Apache AsterixDB, Apache Spark, Apache Flink, PADRES, Apache REEF, rasdaman, Apache SINGA, SciDB

5 Massively Parallel Program (Bulk Synchronous Parallel): Introduction & Flash Session; Tutorials 1 to 8, Part 1 (in parallel); Coffee Break; Tutorials Part 2; Lunch; Panel

6 Runtime Environment You are here

7 Panel: Big Data and Exascale. Panel Chair: Chaitanya Baru, San Diego Supercomputer Center. Panelists: Arie Shoshani, LBNL; Guy Lohman, IBM; Mike Carey, UC Irvine; Paul G. Brown, Paradigm4; Peter Baumann, Jacobs University; Volker Markl, TU Berlin

8 Apache AsterixDB (Incubating)

9 AsterixDB: One Size Fits a Bunch! Wish-list: able to manage data; semistructured data management; flexible data model; full query capability; continuous data ingestion; efficient and robust parallel runtime; cost proportional to task at hand; support for today's Big Data data types. Also shown: parallel DB systems; first-gen BD analysis tools

10 Apache Flink

11 Apache Flink: Stream and Batch Processing at Scale. Marton Balassi (ELTE/SZTAKI, Hungary), Paris Carbone (KTH, Stockholm, Sweden), Gyula Fóra (SICS, Stockholm, Sweden), Vasia Kalavri (KTH, Stockholm, Sweden), Asterios Katsifodimos (TU Berlin, Germany)

12 What is Flink? Applications: Hive, Mahout, Cascading, Pig, Crunch, Giraph. Data processing engines: MapReduce, Spark, Storm, Flink, Tez. App and resource management: YARN, Mesos. Storage, streams: HDFS, HBase, Kafka

13 What can I do with Flink? Batch processing, machine learning at scale, stream processing, graph analysis. Flink: an engine that can natively support all these workloads.

14 But what will I do with Flink today? Graph processing: ETL on Datasets, graph creation & analysis. Stream processing: rolling aggregates, windows & alerts.
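To make "windows & alerts" concrete, here is a minimal sketch in plain Python. It is not Flink's DataStream API: the function names, the event tuples, and the threshold are all assumptions chosen for illustration. It groups timestamped events into fixed-size (tumbling) windows, sums values per key, and flags windows whose sum exceeds a threshold.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size):
    """Sum values per (window_start, key) over fixed-size, non-overlapping windows.

    `events` is an iterable of (timestamp, key, value) tuples (hypothetical format).
    """
    sums = defaultdict(float)
    for timestamp, key, value in events:
        # Align each event to the start of the window that contains it.
        window_start = (timestamp // window_size) * window_size
        sums[(window_start, key)] += value
    return dict(sums)


def alerts(window_sums, threshold):
    """Return the sorted (window_start, key) pairs whose sum exceeds `threshold`."""
    return sorted(k for k, total in window_sums.items() if total > threshold)


# Illustrative input: (timestamp in seconds, sensor id, reading).
events = [
    (0, "sensor-a", 3.0),
    (4, "sensor-a", 4.0),
    (7, "sensor-b", 1.0),
    (11, "sensor-a", 9.0),
    (12, "sensor-b", 2.0),
]

window_sums = tumbling_window_sums(events, window_size=10)
triggered = alerts(window_sums, threshold=5.0)
print(triggered)
```

A streaming engine would compute the same aggregates incrementally as events arrive instead of over a finished list; the batch form here only shows the grouping logic.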

15 Agenda Introduction 15 Overview 15' Gelly (Graph) API 30' Break Graph Processing 20' DataSet/Gelly Hands-on Stream processing with Flink 10 DataStream API 15 Fault Tolerance Demo 45' Streaming Hands-on

16 Apache Reef

17 Deep Dive into Apache REEF (Incubating). Byung-Gon Chun, Brian Cho (Seoul National University). BOSS 2015, Sep. 4, 2015. A meta-framework that eases the development of Big Data applications atop resource managers such as YARN and Mesos. Features: reusable control plane for coordinating data plane tasks; adaptation layer for resource managers; container and state reuse across tasks from heterogeneous frameworks; simple and safe configuration management; scalable local, remote event handling; Java and C# (.NET) support; in production use (Microsoft Azure). [Figure labels: Interactive Query, Business Intelligence, Stream Processing, Machine Learning, Batch Processing, Data Processing Lib (REEF, Third-party), REEF, Resource Manager, Distributed File System]

18 Deep Dive into Apache REEF (Incubating). Byung-Gon Chun, Brian Cho (Seoul National University). BOSS 2015, Sep. 4, 2015. Tutorial: 1. What is REEF? 2. Install REEF 3. Run your first REEF job: HelloREEF 4. Why would you want REEF? 5. Create your own Task Scheduler with REEF Client Contact: Byung-Gon Chun, Brian Cho

19 Apache Singa

20 A General Distributed Deep Learning Platform. Motivation: deep learning is effective for classification tasks, e.g., image recognition; training code is complex to write from scratch; training is time-consuming, e.g., 10 days or weeks. Goals: easy to use; general, to support popular deep learning models; extensible, for users to do customization, e.g., training new models; scalable, to reduce training time with more computation resources, e.g., machines (improve efficiency of one training iteration by synchronous training; reduce total number of training iterations by asynchronous training).
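The difference between synchronous and asynchronous training can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not SINGA code: the one-parameter model y = w * x, the function names, the data, and the learning rate are all hypothetical, and the asynchronous variant is simulated serially, so it omits the stale parameter reads a real asynchronous system would have.

```python
def gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def synchronous_step(w, shards, lr):
    """Every worker computes a gradient at the same w; the average is applied once."""
    grads = [gradient(w, shard) for shard in shards]
    return w - lr * sum(grads) / len(grads)


def asynchronous_pass(w, shards, lr):
    """Each worker's gradient is applied as soon as it is ready (serial simulation)."""
    for shard in shards:
        w = w - lr * gradient(w, shard)
    return w


# Toy data generated from y = 2x, split across two simulated workers.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

w_sync = 0.0
for _ in range(200):
    w_sync = synchronous_step(w_sync, shards, lr=0.01)

w_async = 0.0
for _ in range(200):
    w_async = asynchronous_pass(w_async, shards, lr=0.01)

print(round(w_sync, 6), round(w_async, 6))
```

In the synchronous variant the workers share the cost of a single update, while the asynchronous variant applies one update per worker without waiting for the others.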

21 Apache Spark

22 Spark Tutorial Reynold Sep 4, VLDB BOSS 2015

23 Apache Spark: fast & general distributed data processing engine, with APIs in SQL, Scala, Java, Python, and R. 800+ contributors and many academic papers. Largest open source project in (big) data & at Apache.

24 A Brief History: Berkeley; NSDI (RDD); SIGMOD Demo (Shark); OSDI (GraphX); HotCloud; SIGMOD (Shark); SOSP (Streaming); Databricks started; Donated to ASF; SIGMOD (Spark SQL)

25 Users companies Distributors + Apps 50+ companies

26 Our Goal for Spark: unified engine across data workloads and platforms. Streaming, SQL, ML, Graph, Batch

27 Agenda Today. Spark 101: RDD Fundamentals. Spark 102: DataFrames. Spark 201: Understanding Spark Internals (with exercises in Databricks notebooks)

28 PADRES

29 Presenter: Kaiwen Zhang, University of Toronto. Pub/Sub is a communication paradigm / middleware. Communication between information producers (publishers) and consumers (subscribers) is mediated by a set of brokers (p2p overlay). Features: content-based routing; composite subscription (event 1 and event 2 occurred within 2s); load balancing (offload clients to less loaded brokers); fault tolerance (maintains integrity of broker network); historic access (subscribe to past events); system monitoring (overlay monitoring & visualization). [Figure: broker overlay connecting publishers (P) and subscribers (S); labels include Matching Engine, Routing Table, Balancer, Composite and Atomic events, subscribe, publish] Middleware Systems Research Group
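Content-based routing and composite subscriptions can be illustrated with a small in-memory sketch. This is plain Python under stated assumptions, not the PADRES API: the `Broker` class, `composite_matches`, the event attributes, and the subscriber names are all hypothetical, and a single broker stands in for the whole overlay.

```python
class Broker:
    """A single in-memory broker; a toy stand-in for a broker overlay."""

    def __init__(self):
        self.subscriptions = []  # (subscriber, predicate) pairs
        self.delivered = []      # (subscriber, event) pairs, in delivery order

    def subscribe(self, subscriber, predicate):
        """Register interest as a predicate over event attributes."""
        self.subscriptions.append((subscriber, predicate))

    def publish(self, event):
        """Content-based routing: deliver to every subscriber whose predicate matches."""
        for subscriber, predicate in self.subscriptions:
            if predicate(event):
                self.delivered.append((subscriber, event))


def composite_matches(events, first, second, within):
    """Pair events matching `first` with events matching `second`
    whose timestamps differ by at most `within` seconds."""
    return [
        (e1, e2)
        for e1 in events if first(e1)
        for e2 in events if second(e2) and abs(e1["time"] - e2["time"]) <= within
    ]


broker = Broker()
broker.subscribe("alice", lambda e: e["class"] == "stock" and e["price"] > 100)
broker.subscribe("bob", lambda e: e["class"] == "weather")

events = [
    {"class": "stock", "symbol": "IBM", "price": 120, "time": 0.0},
    {"class": "stock", "symbol": "IBM", "price": 90, "time": 1.0},
    {"class": "weather", "city": "Toronto", "time": 1.5},
    {"class": "weather", "city": "Toronto", "time": 5.0},
]
for event in events:
    broker.publish(event)

receivers = [subscriber for subscriber, _ in broker.delivered]

# Composite subscription: a stock event and a weather event within 2 seconds.
pairs = composite_matches(
    events,
    first=lambda e: e["class"] == "stock",
    second=lambda e: e["class"] == "weather",
    within=2.0,
)
print(receivers, len(pairs))
```

Publishers never name their receivers here: delivery depends only on the event's content, which is what lets a broker network rebalance clients or reroute around failures without changing publishers.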

30 rasdaman

31 rasdaman: the Array Database. The pioneer Array DBMS: analytics on n-d dense/sparse arrays; optimization & parallel QP on multicore, cloud, modern hardware; scalable from cubesat to datacenter federations; seamless integration with R, Python, ...; operationally deployed on Petascale; basis for ISO Array SQL

32 SciDB

33 SciDB: No cute animals. No 5 color marketing brochure. Just an Open Source, Transactional, Massively Parallel Array DBMS with a Scalable Analytic Query Engine.

34 Let's go! Intro & Panel

Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research