Welcome to the first Workshop on Big data Open Source Systems (BOSS)



Similar documents
Apache Flink Next-gen data analysis. Kostas

Big Data Research in Berlin BBDC and Apache Flink

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Moving From Hadoop to Spark

Ali Ghodsi Head of PM and Engineering Databricks

How To Create A Data Visualization With Apache Spark And Zeppelin

How Companies are! Using Spark

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Hadoop Ecosystem B Y R A H I M A.

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

Unified Big Data Analytics Pipeline. 连 城

Unified Big Data Processing with Apache Spark. Matei

Hadoop & Spark Using Amazon EMR

Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us

Scaling Out With Apache Spark. DTL Meeting Slides based on

Why Spark on Hadoop Matters

What s next for the Berkeley Data Analytics Stack?

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

HDP Hadoop From concept to deployment.

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Roadmap Talend : découvrez les futures fonctionnalités de Talend

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Dell In-Memory Appliance for Cloudera Enterprise

Databricks. A Primer

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Databricks. A Primer

Big Data Research in the AMPLab: BDAS and Beyond

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

1. The orange button 2. Audio Type 3. Close apps 4. Enlarge my screen 5. Headphones 6. Questions Pane. SparkR 2

Big Data and Industrial Internet

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Beyond Hadoop with Apache Spark and BDAS

Next-Gen Big Data Analytics using the Spark stack

A Brief Introduction to Apache Tez

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Real Time Data Processing using Spark Streaming

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

I/O Considerations in Big Data Analytics

Apache Flink. Fast and Reliable Large-Scale Data Processing

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Introduction to Spark

Unified Batch & Stream Processing Platform

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Spark and Shark. High- Speed In- Memory Analytics over Hadoop and Hive Data

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Brave New World: Hadoop vs. Spark

BIG DATA TRENDS AND TECHNOLOGIES

Architectures for massive data management

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Making big data simple with Databricks

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Upcoming Announcements

Shark Installation Guide Week 3 Report. Ankush Arora

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

Managing large clusters resources

Big Data Processing. Patrick Wendell Databricks

Real Time Big Data Processing

Spark and the Big Data Library

#TalendSandbox for Big Data

Big Data Analytics. Lucas Rego Drumond

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Big Data Explained. An introduction to Big Data Science.

Dominik Wagenknecht Accenture

Conquering Big Data with BDAS (Berkeley Data Analytics)

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

HADOOP. Revised 10/19/2015

Bringing Big Data to People

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

HPC ABDS: The Case for an Integrating Apache Big Data Stack

CSE-E5430 Scalable Cloud Computing Lecture 11

Deploying Hadoop with Manager

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Workshop on Hadoop with Big Data

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Trend Micro Big Data Platform and Apache Bigtop. 葉 祐 欣 (Evans Ye) Big Data Conference 2015

Reference Architecture, Requirements, Gaps, Roles

Berkeley Data Analytics Stack:! Experience and Lesson Learned

Tomáš Müller IT Architekt 21/04/2010 ČVUT FEL: SOA & Enterprise Service Bus IBM Corporation

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

Hadoop IST 734 SS CHUNG

Large scale processing using Hadoop. Ján Vaňo

INTRODUCING APACHE IGNITE An Apache Incubator Project

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Big Data Storage: Should We Pop the (Software) Stack? Michael Carey Information Systems Group CS Department UC Irvine. #AsterixDB

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

HYBRID CLOUD SUPPORT FOR LARGE SCALE ANALYTICS AND WEB PROCESSING. Navraj Chohan, Anand Gupta, Chris Bunch, Kowshik Prakasam, and Chandra Krintz

Case Study : 3 different hadoop cluster deployments

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data on Microsoft Platform

Transcription:

Welcome to the first Workshop on Big data Open Source Systems (BOSS) September 4th, 2015 Co-located with VLDB 2015 Tilmann Rabl

Hands on Big Data 8 parallel tutorials 8 systems Open source Publicly available Presenters System experts Hands on This is not a demo! You can pick two!

But why? Initial idea: Malu Castellanos Mike Carey Doing It On Big Data: a Tutorial/Workshop Driving force Other people involved Volker Markl Norman Patton Lipyeow Lim Kerstin Forster Experiment Tell us what you think Email: rabl@tu-berlin.de MOAR SYSTEMS!

Presented Systems Apache AsterixDB Apache Spark Apache Flink Padres Apache Reef rasdaman Apache Singa SciDB

Massively Parallel Program Bulk Synchronous Parallel Introduction & Flash Session Tutorial 1 Tutorial Part 1 2 Tutorial Part 1 3 Tutorial Part 1 4 Tutorial Part 1 5 Tutorial Part 1 6 Tutorial Part 1 7 Tutorial Part 1 8 Part 1 Coffee Break Tutorials Part 2 Lunch Panel

Runtime Environment You are here

Panel Big Data and Exascale Panel Chair Chaitanya Baru, San Diego Supercomputing Center Panelists Arie Shoshani, LBNL Guy Lohmann, IBM Mike Carey, UC Irvine Paul G. Brown, Paradigm4 Peter Baumann, Jacobs University Volker Markl, TU Berlin

Apache AsterixDB (Incubating)

AsterixDB: One Size Fits a Bunch! Wish-list: Able to manage data Semistructured data management Flexible data model Full query capability Continuous data ingestion Efficient and robust parallel runtime Cost proportional to task at hand Support today s Big Data data types Parallel DB systems First- gen BD analysis tools

Apache Flink

Apache Flink : Stream and Batch processing at Scale MartonBalassi (ELTE/ SZTAKI, Hungary) ParisCarbone (KTH, Stockholm,Sweden) Gyula For a (SICS,Stockholm, Sweden) Vasia Kalavri (KTH, Stockholm,Sweden) Asterios Katsifodimos(TU Berlin, Germany)

What is Flink? Applications Hive Mahout Cascading Pig Crunch Giraph Data processing engines MapReduce Spark Storm Flink Tez App and resource management Yarn Mesos Storage, streams HDFS HBase Kafka 2

What can I do with Flink? Batch processing Machine Learning at scale Stream processing Graph Analysis Flink An engine that can natively support all these workloads. 3

But what will I do with Flink today? Graph processing ETL on Datasets Graph creation & analysis Stream Processing Rolling Aggregates Windows & Alerts

Agenda Introduction 15 Overview 15' Gelly (Graph)API 30' Break Graph Processing 20' DataSet/Gelly Hands-on Stream processing with Flink 10 DataStream API 15 Fault Tolerance Demo 45' Streaming Hands-on 5

Apache Reef

Deep Dive into Apache REEF (Incubating) Byung-Gon Chun, Brian Cho (Seoul National University) BOSS 2015 Sep. 4, 2015 Interactive Query A meta-framework that eases the development of Big Data applications atop resource managers such as YARN and Mesos Business Intelligence Stream Processing REEF Resource Manager Distributed File System Machine Learning Data Processing Lib (REEF, Third-party) Batch Processing Reusable control plane for coordinating data plane tasks Adaptation layer for resource managers Container and state reuse across tasks from heterogeneous frameworks Simple and safe configuration management Scalable local, remote event handling Java and C# (.NET) support In production use (Microsoft Azure)

Deep Dive into Apache REEF (Incubating) Byung-Gon Chun, Brian Cho (Seoul National University) BOSS 2015 Sep. 4, 2015 Tutorial 1. What is REEF? 2. Install REEF 3. Run your first REEF job: HelloREEF 4. Why would you want REEF? 5. Create your own Task Scheduler with REEF Client Contact: Byung-Gon Chun bgchun@gmail.com Brian Cho chobrian@gmail.com

Apache Singa

A General Distributed Deep Learning Platform Motivation Deep learning is effective for classification tasks, e.g., image recognition Training code is complex to write from the scratch Training is time consuming, e.g., 10 days or weeks Goals Easy to use General to support popular deep learning models Extensible for users to do customization, e.g., training new models Scalable Reduce training time with more computation resources, e.g. machines Improve efficiency of one training iteration by synchronous training Reduce total number of training iterations by asynchronous training 20

Apache Spark

Spark Tutorial Reynold Xin @rxin Sep 4, 2015 @ VLDB BOSS 2015

Apache Spark Fast & general distributed data processing engine, with APIs in SQL, Scala, Java, Python, and R 800+ contributo rs and many academic papers Largest open source project in (big) data & at Apache

A Brief History started @ Berkeley NDSI (RDD) SIGMOD Dem o (Shark) OSDI (Grap hx) 2009 2010 2011 2012 2013 2014 2015 HotCloud SIGMOD (Shark) SOSP (Streaming) Datab ricks started Donated to ASF SIGMOD (Spark SQL)

Users 1000+ com panies Distributors + Apps 50+ com panies

Our Goal for Spark Unified engine across data workloads and platform s Streaming SQL ML Graph Batch

Agenda Today Spark 101: RDD Fundamentals Spark 102: DataFrames Spark 201: Understanding Spark Internals (with exercises in Databricks notebooks)

PADRES

Presenter: Kaiwen Zhang University of Toronto Pub/Sub is a communication paradigm / middleware Communication between information producers (publisher) and consumers (subscriber) is mediated by a set of brokers (p2p overlay). Features Content-based routing Composite subscription (event P 1 and event P Broker 2 occurred within 2s) Load balancing (offload clients to less loaded brokers) Overlay Fault tolerance (maintains integrity of broker network) S Historic S P access S (subscribe SMatching 1 S 1 to past events) S S 1 S 1 P P 1 1 S S 1 S S System monitoring (overlay Engine monitoring & visualization) Composite P Atomic events S 1 Broker + S1 event S Broker Overlay S c Broker Balancer subscribe publish P 2 S S 1 S 1 S P Overlay P Routing Table S P S P P S Overlay S 2 S S S 08.10.2015 Middleware Systems Research Group 29

rasdaman

the Array Database the pioneer Array DBMS: analytics on n-d dense/sparse arrays optimization & parallel QP on multicore, cloud, modern hw scalable from cubesat to datacenter federations seamless integration with R, python,... operationally deployed on Petascale, basis for ISO Array SQL www.rasdaman.org

SciDB

SciDB: No cute animals No 5 color marketing brochure just an Open Source, Transactional, Massively Parallel, Array DBMS with A Scalable Analytic Query Engine.

Let s go! Intro & Panel