Ali Ghodsi Head of PM and Engineering Databricks

Transcription:

Making Big Data Simple. Ali Ghodsi, Head of PM and Engineering, Databricks

Big Data is Hard: A Big Data Project. Tasks: build a Hadoop cluster; build a data pipeline. Challenges: clusters are hard to set up and manage.

Typical Data Pipeline. Data flows through: ETL; Exploration; Dashboards & Reports; Advanced Analytics; Data Products.

From Tasks to Challenges. Tasks: build a Hadoop cluster; build a data pipeline. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools.

From Challenges to Solutions. Tasks: build a Hadoop cluster; build a data pipeline. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools. Solutions: (not yet filled in).

From Challenges to Solutions. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools. Solutions: Apache Spark.

Apache Spark: a cluster computing system that generalizes Google's MapReduce. Benefits for users: performance (up to 100x faster); ease of use (2-5x less code than MapReduce); generality (unifies previously specialized models).
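For readers unfamiliar with the model that Spark generalizes, word count is the customary illustration. The sketch below is plain single-process Python, not Spark or Hadoop API; the function names are hypothetical and only mimic the map, shuffle, and reduce phases.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Illustrative input, not from the talk.
lines = ["big data is hard", "big data in the cloud"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["cloud"])
```

In Spark the same computation is typically written as a few chained operations on a distributed dataset, which is the kind of brevity the "less code" claim refers to.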

Big Data Systems Today. General batch processing: MapReduce. Specialized systems for new workloads: Pregel, Giraph, Impala, Dremel, Drill, Presto, Storm, S4, ... Unified engine: ?

Problems with Specialized Systems. More systems to manage, tune, and deploy. Can't combine processing types in one application, even though many pipelines need to do this (e.g. load data with SQL, then run machine learning). In many pipelines, data exchange between engines is the dominant cost!

Libraries that come with Spark: Spark SQL (relational); Spark Streaming (real-time); MLlib (machine learning); GraphX (graph). All are built on Spark Core.

Performance vs Specialized Systems. [Figure: bar charts in three panels. SQL: response time (sec). Streaming: throughput (MB/s/node). ML: response time (min). Systems shown: Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem), Storm, Mahout, GraphLab, Spark. Bar values are not recoverable from the transcript.]

On-Disk Performance: Petabyte Sort. Spark beat last year's Sort Benchmark winner, Hadoop, by 3x using 10x fewer machines.

            2013 Record (Hadoop)   Spark 100 TB   Spark 1 PB
Data Size   102.5 TB               100 TB         1000 TB
Time        72 min                 23 min         234 min
Nodes       2100                   206            190
Cores       50400                  6592           6080
Rate/Node   0.67 GB/min            20.7 GB/min    22.5 GB/min

Source: tinyurl.com/spark-sort
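The Rate/Node figures and the headline ratios on this slide can be rechecked from the data size, time, and node counts. The sketch below is plain Python and assumes decimal units (1 TB = 1000 GB); the variable and function names are illustrative and not from the talk.

```python
# Figures copied from the sort benchmark slide; names are illustrative.
RUNS = {
    "2013 Record (Hadoop)": {"data_tb": 102.5, "minutes": 72, "nodes": 2100},
    "Spark 100 TB": {"data_tb": 100.0, "minutes": 23, "nodes": 206},
    "Spark 1 PB": {"data_tb": 1000.0, "minutes": 234, "nodes": 190},
}

def rate_per_node(run):
    """GB sorted per minute per node, assuming 1 TB = 1000 GB."""
    return run["data_tb"] * 1000 / run["minutes"] / run["nodes"]

hadoop = RUNS["2013 Record (Hadoop)"]
spark = RUNS["Spark 100 TB"]

speedup = hadoop["minutes"] / spark["minutes"]      # wall-clock ratio
machine_ratio = hadoop["nodes"] / spark["nodes"]    # node-count ratio

for name, run in RUNS.items():
    print(f"{name}: {rate_per_node(run):.2f} GB/min/node")
print(f"speedup: {speedup:.1f}x, machines: {machine_ratio:.1f}x fewer")
```

Under these assumptions the Hadoop and 1 PB rates match the slide. The 100 TB run comes out slightly above the slide's 20.7 GB/min, possibly because the reported time is rounded to whole minutes.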

Project History. Started in 2009 in the RAD / AMP Labs at UC Berkeley. Open sourced in 2010. Since then, a growing open source community. Became an Apache project in 2013.

Rapid Growth. [Figure: monthly contributors, 2011 to 2014, y-axis from 0 to 100.] 2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, ...

Spark Compared to Other Projects. [Figure: two bar charts of activity in the past 6 months, commits and lines of code changed, comparing MapReduce, YARN, HDFS, Storm, and Spark.]

From Challenges to Solutions. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools. Solutions: ?; Apache Spark.

Getting started with Spark is still hard: acquire machines; set up the big data pipeline; set up Hadoop; set up Hive; set up Spark.

From Challenges to Solutions. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools. Solutions: hosted service; Apache Spark.

Hosted Big Data Service. Run Spark in the cloud: machines on the fly; dynamically scale up; outsource ops to the cloud provider. Advantages: faster (months down to days); handle bursts; cheaper.

Cluster Provisioning. Deploying a cluster on-prem takes 3-6 months. Launch a cluster in the cloud (AWS/GCE/Azure): cluster up and spinning in one hour. Still need to configure and set up the cluster. Ecosystem evolving, e.g. CloudFormation. Can reuse the process to re-launch clusters.

Dynamic Scale-Out. Increase cluster capacity dynamically. Burst to 10TB cluster ($55K/week vs $390K/yr). Interactive exploration and production jobs. Cost optimizations available: spot instances; auto-resizing clusters based on activity.
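The two cost figures on this slide imply a break-even point. The sketch below assumes that $55K/week is the price of bursting on demand and $390K/yr is the price of keeping the same capacity provisioned all year; the slide does not spell out this interpretation, and the names are illustrative.

```python
# Figures from the slide; the interpretation (burst price per week vs.
# always-on price per year) is an assumption.
BURST_COST_PER_WEEK = 55_000
ALWAYS_ON_COST_PER_YEAR = 390_000

def burst_cost(weeks):
    """Total yearly cost of bursting for the given number of weeks."""
    return BURST_COST_PER_WEEK * weeks

# Number of burst weeks per year at which both options cost the same.
break_even_weeks = ALWAYS_ON_COST_PER_YEAR / BURST_COST_PER_WEEK

print(f"break-even: {break_even_weeks:.1f} weeks of bursting per year")
for weeks in (1, 4, 8):
    cheaper = "burst" if burst_cost(weeks) < ALWAYS_ON_COST_PER_YEAR else "always-on"
    print(f"{weeks} week(s): ${burst_cost(weeks):,} -> {cheaper} is cheaper")
```

Read this way, bursting is the cheaper option as long as the large cluster is needed for roughly seven weeks a year or less.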

Outsourcing Ops. Instead of buying ops or consultants, outsource to the cloud provider, which automates it. All other things being equal, this will be cheaper than buying it as a service.

From Challenges to Solutions. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools. Solutions: hosted platform; Apache Spark.

What could it look like?

Databricks Platform: start clusters in seconds; zero-cost management; dynamically scale up & down.

Interactive Workspace: Notebooks. Python, SQL, Scala; interactive queries & plots; online collaboration.

Interactive Workspace: Dashboards. WYSIWYG builder; interactive plots; one-click publishing.

Interactive Workspace: Job Launcher. Run arbitrary Spark jobs, programmatically.

Summary. Big data is as hard as rocket science, but doesn't need to be. Apache Spark unifies many existing frameworks. Big data in the cloud: faster, more scalable, cheaper.