Making Big Data Simple
Ali Ghodsi, Head of PM and Engineering, Databricks
Big Data is Hard: A Big Data Project
Tasks:
> Build a Hadoop cluster
> Build a data pipeline
Challenges:
> Clusters hard to set up and manage
Typical Data Pipeline
Data → ETL → Exploration → Dashboards & Reports → Advanced Analytics → Data Products
From Tasks to Challenges
> Build a Hadoop cluster → Clusters hard to set up and manage
> Build a data pipeline → Need to integrate a zoo of tools
From Challenges to Solutions
> Clusters hard to set up and manage → ?
> Need to integrate a zoo of tools → Apache Spark
Apache Spark
Cluster computing system that generalizes Google's MapReduce
Benefits for users:
> Performance: up to 100x faster
> Ease of use: 2-5x less code than MapReduce
> Generality: unifies previously specialized models
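To make the ease-of-use claim concrete, here is a minimal word-count sketch in PySpark. It assumes an existing SparkContext named sc (as in the Spark shell or a notebook); the HDFS paths are placeholders. The equivalent hand-written MapReduce job typically runs to dozens of lines of Java.

```python
# Word count in a few lines; `sc` is an existing SparkContext and
# the HDFS paths below are hypothetical.
counts = (sc.textFile("hdfs:///data/pages.txt")
            .flatMap(lambda line: line.split())   # split lines into words
            .map(lambda word: (word, 1))          # pair each word with a count
            .reduceByKey(lambda a, b: a + b))     # sum counts per word
counts.saveAsTextFile("hdfs:///data/word-counts")
```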
Big Data Systems Today
> General batch processing: MapReduce
> Specialized systems for new workloads: Pregel, Giraph, Impala, Dremel, Drill, Presto, Storm, S4, ...
> Unified engine: Spark
Problems with Specialized Systems
More systems to manage, tune, deploy
Can't combine processing types in one application
> Even though many pipelines need to do this!
> E.g. load data with SQL, then run machine learning
In many pipelines, data exchange between engines is the dominant cost!
Libraries that come with Spark
> Spark SQL (relational)
> Spark Streaming (real-time)
> MLlib (machine learning)
> GraphX (graph)
All built on Spark Core
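As a sketch of how these libraries combine in one application, the following loads rows with Spark SQL and clusters them with MLlib, with no data exchange between separate engines. It assumes a Spark 1.x-style SQLContext named sqlContext; the table name and columns are hypothetical.

```python
from pyspark.mllib.clustering import KMeans

# Load data with SQL, then run machine learning on the result:
# one application, one engine. `events`, `x`, `y` are placeholders.
rows = sqlContext.sql("SELECT x, y FROM events WHERE x IS NOT NULL")
points = rows.rdd.map(lambda row: [row.x, row.y])  # feature vectors
model = KMeans.train(points, k=5)                  # cluster into 5 groups
print(model.clusterCenters)
```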
Performance vs Specialized Systems
[Bar charts, one per workload:
> SQL: response time (sec) for Impala (mem), Impala (disk), Spark (mem), Spark (disk), Hive
> Streaming: throughput (MB/s/node) for Storm, Spark
> ML: response time (min) for GraphLab, Spark, Mahout]
On-Disk Performance: Petabyte Sort
Spark beat last year's Sort Benchmark winner, Hadoop, by 3x using 10x fewer machines

              2013 Record (Hadoop)   Spark 100 TB   Spark 1 PB
Data Size     102.5 TB               100 TB         1000 TB
Time          72 min                 23 min         234 min
Nodes         2100                   206            190
Cores         50400                  6592           6080
Rate/Node     0.67 GB/min            20.7 GB/min    22.5 GB/min

tinyurl.com/spark-sort
Project History
> Started in 2009 in the RAD/AMP Labs at UC Berkeley
> Open sourced in 2010
> Since then, a growing open source community
> Became an Apache project in 2013
Rapid Growth
[Chart: monthly contributors, 0-100, 2011-2014]
2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, ...
Compared to Other Projects
[Charts: activity in past 6 months, commits (0-2000) and lines of code changed (0-350,000), for Spark, MapReduce, YARN, HDFS, Storm; Spark highest on both]
From Challenges to Solutions
> Clusters hard to set up and manage → ?
> Need to integrate a zoo of tools → Apache Spark
Getting started with Spark is still hard
> Acquire machines
> Set up the big data pipeline
> Set up Hadoop
> Set up Hive
> Set up Spark
From Challenges to Solutions
> Clusters hard to set up and manage → Hosted service
> Need to integrate a zoo of tools → Apache Spark
Hosted Big Data Service
> Run Spark in the cloud
> Machines on the fly
> Dynamically scale up
> Outsource ops to cloud provider
Advantages:
> Faster (months down to days)
> Handle bursts
> Cheaper
Cluster Provisioning
Deploying a cluster on-prem takes 3-6 months
Launch a cluster in the cloud (AWS/GCE/Azure):
> Cluster up and spinning in one hour
> Still need to configure and set up the cluster
> Ecosystem evolving, e.g. CloudFormation
> Can reuse the process to re-launch clusters
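As one concrete way to do this, here is a sketch using the spark-ec2 script that ships with Spark; it assumes AWS credentials are already exported in the environment, and the key pair and cluster names are placeholders.

```
# Launch a 5-slave Spark cluster on EC2 with the bundled spark-ec2 script.
./ec2/spark-ec2 --key-pair=my-key --identity-file=my-key.pem \
    --slaves=5 --region=us-east-1 launch demo-cluster

# Tear it down when done, so you only pay for the hours used.
./ec2/spark-ec2 --region=us-east-1 destroy demo-cluster
```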
Dynamic scale-out
> Increase cluster capacity dynamically
> Burst to a 10TB cluster ($55K/week vs $390K/yr)
> Interactive exploration and production jobs
Cost optimizations available:
> Spot instances
> Auto-resizing clusters based on activity
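For the spot-instance option, the same hypothetical launch can bid a maximum hourly price instead of paying on-demand rates; the bid value here is illustrative only.

```
# Launch the workers as spot instances with a $0.25/hour ceiling.
./ec2/spark-ec2 --key-pair=my-key --identity-file=my-key.pem \
    --slaves=100 --spot-price=0.25 launch burst-cluster
```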
Outsourcing Ops
> Instead of buying ops or hiring consultants
> Outsource to the cloud provider, which automates operations
> All other things being equal, automation will be cheaper than buying ops as a service
From Challenges to Solutions
> Clusters hard to set up and manage → Hosted platform
> Need to integrate a zoo of tools → Apache Spark
What could it look like?
Databricks Platform
> Start clusters in seconds
> Zero-cost management
> Dynamically scale up & down
Interactive Workspace: Notebooks
> Python, SQL, Scala
> Interactive queries & plots
> Online collaboration
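A sketch of what a notebook cell might look like (the table name is hypothetical; this assumes a preconfigured SQLContext and Databricks' display() helper, which renders results as interactive tables or plots):

```python
# Query a registered table and render the result interactively.
top_pages = sqlContext.sql("""
    SELECT page, COUNT(*) AS hits
    FROM logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10""")
display(top_pages)   # interactive table/plot in the notebook
```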
Interactive Workspace: Dashboards
> WYSIWYG builder
> Interactive plots
> One-click publishing
Interactive Workspace: Job Launcher
> Run arbitrary Spark jobs, programmatically
Summary
> Big data is as hard as rocket science, but it doesn't need to be
> Apache Spark unifies many previously specialized frameworks
> Big data in the cloud: faster, more scalable, cheaper