Ali Ghodsi, Head of PM and Engineering, Databricks

1 Making Big Data Simple. Ali Ghodsi, Head of PM and Engineering, Databricks

2 Big Data is Hard: A Big Data Project. Tasks: build a Hadoop cluster; build a data pipeline. Challenges: clusters are hard to set up and manage.

3 Typical Data Pipeline. Stages: Data, ETL, Exploration, Dashboards & Reports, Advanced Analytics, Data Products.

4 Big Data is Hard: A Big Data Project. Tasks: build a Hadoop cluster; build a data pipeline. Challenges: clusters are hard to set up and manage.

5 From Tasks to Challenges. Tasks: build a Hadoop cluster; build a data pipeline. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools.

6 From Challenges to Solutions. Tasks: build a Hadoop cluster; build a data pipeline. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools. Solutions: (blank).

7 From Challenges to Solutions. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools. Solutions: Apache Spark.

8 Apache Spark
Cluster computing system that generalizes Google's MapReduce
Benefits for users:
> Performance: up to 100x faster
> Ease of use: 2-5x less code than MapReduce
> Generality: unifies previously specialized models
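To make the MapReduce model that Spark generalizes concrete, here is a minimal single-machine sketch in plain Python. It is not Spark's or Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration, and a real cluster would run each phase in parallel across machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def word_count(lines):
    """Chain the three phases into one word-count job."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["big data is hard", "big data made simple"])
print(counts["big"], counts["simple"])
```

The fixed map, shuffle, reduce shape is what a more general engine relaxes: arbitrary chains of such steps can be composed in one program.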

9 Big Data Systems Today. General batch processing: MapReduce. Specialized systems for new workloads: Pregel, Giraph, Impala, Dremel, Drill, Presto, Storm, S4, ... Unified engine: ?

10 Problems with Specialized Systems
More systems to manage, tune, deploy
Can't combine processing types in one application
> Even though many pipelines need to do this!
> E.g. load data with SQL, then run machine learning
In many pipelines, data exchange between engines is the dominant cost!
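The "load data with SQL, then run machine learning" pattern can be sketched in a single process as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` and a hand-written least-squares fit as stand-ins, not Spark SQL or MLlib, and the table name `points`, its columns, and the helper `fit_line` are hypothetical.

```python
import sqlite3

# Hypothetical data: valid points lie on the line y = 2x + 1,
# plus one row flagged invalid that the SQL step should filter out.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL, valid INTEGER)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [(1, 3, 1), (2, 5, 1), (3, 7, 1), (4, 9, 1), (5, 100, 0)],
)

# Step 1 (SQL): select and filter the training data.
rows = conn.execute(
    "SELECT x, y FROM points WHERE valid = 1 ORDER BY x"
).fetchall()
conn.close()


def fit_line(rows):
    """Step 2 (ML): fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(rows)
print(slope, intercept)
```

Because both steps share one process, the query result is handed to the model fit as an in-memory list, with no export and re-import between separate engines.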

11 Libraries that come with Spark. Spark SQL (relational), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph), all on top of Spark Core.

12 Performance vs Specialized Systems. [Charts. SQL: response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem). Streaming: throughput (MB/s/node) for Storm and Spark. ML: response time (min) for Mahout, GraphLab, Spark.]

13 On-Disk Performance: Petabyte Sort
Spark beat last year's Sort Benchmark winner, Hadoop, by 3x using 10x fewer machines

            2013 Record (Hadoop)   Spark 100 TB   Spark 1 PB
Data Size   TB                     100 TB         1000 TB
Time        72 min                 23 min         234 min
Nodes
Cores
Rate/Node   0.67 GB/min            20.7 GB/min    22.5 GB/min

tinyurl.com/spark-sort
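The headline claim can be checked against the figures that survive on the slide. The sketch below uses only the times and per-node rates shown; the variable names are illustrative, and no node or core counts are assumed.

```python
# Figures as printed on the slide.
hadoop_minutes = 72.0        # 2013 record (Hadoop)
spark_100tb_minutes = 23.0   # Spark, 100 TB
spark_1pb_minutes = 234.0    # Spark, 1000 TB
hadoop_rate = 0.67           # GB/min per node
spark_100tb_rate = 20.7      # GB/min per node

# How much faster Spark finished than the 2013 Hadoop record.
time_speedup = hadoop_minutes / spark_100tb_minutes

# How much more data each Spark node sorted per minute.
per_node_gain = spark_100tb_rate / hadoop_rate

# How run time grew when the Spark input grew 10x (100 TB to 1000 TB).
time_growth = spark_1pb_minutes / spark_100tb_minutes

print(f"time speedup:  {time_speedup:.2f}x")
print(f"per-node gain: {per_node_gain:.1f}x")
print(f"time growth:   {time_growth:.2f}x")
```

The elapsed-time ratio comes out a little above 3, matching "by 3x", and the per-node ratio of roughly 31 is what a run about 3x faster on about 10x fewer machines would imply. The 1 PB run takes only slightly more than 10x as long as the 100 TB run.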

14 Project History
Started in 2009 in RAD / AMP Labs at UC Berkeley
Open sourced in 2010
Since then, a growing open source community
> Became Apache project in 2013

15 Rapid Growth. [Chart: monthly contributors.] x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, ...

16 Spark Compared to Other Projects. [Charts: commits and lines of code changed, activity in past 6 months, for MapReduce, YARN, HDFS, Storm, and Spark.]

17 From Challenges to Solutions. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools. Solutions: ?; Apache Spark.

18 Getting started with Spark is still hard. Acquire machines; set up the big data pipeline; set up Hadoop; set up Hive; set up Spark.

19 From Challenges to Solutions. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools. Solutions: hosted service; Apache Spark.

20 Hosted Big Data Service. Run Spark in the cloud: machines on the fly; dynamically scale up; outsource ops to cloud provider. Advantages: faster (months down to days); handle bursts; cheaper.

21 Cluster Provisioning. Deploying a cluster on-prem takes 3-6 months. Launch a cluster in the cloud (AWS/GCE/Azure): cluster up and spinning in one hour. Still need to configure and set up the cluster. Ecosystem evolving, e.g. CloudFormation. Can reuse the process to re-launch clusters.

22 Dynamic scale-out. Increase cluster capacity dynamically. Burst to 10TB cluster ($55K/week vs $390K/yr). Interactive exploration and production jobs. Cost optimizations available: spot instances; auto-resizing clusters based on activity.
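The "$55K/week vs $390K/yr" comparison implies a break-even point, sketched below. The slide does not spell out what the yearly figure covers; this sketch assumes it is the cost of keeping equivalent capacity available all year, and the names `burst_cost` and `bursting_is_cheaper` are hypothetical.

```python
BURST_COST_PER_WEEK = 55_000    # USD, from the slide
FIXED_COST_PER_YEAR = 390_000   # USD, from the slide (assumed: year-round capacity)


def burst_cost(weeks):
    """Total cost of bursting to the large cluster for the given number of weeks."""
    return weeks * BURST_COST_PER_WEEK


def bursting_is_cheaper(weeks):
    """True if paying per week costs less than the fixed yearly figure."""
    return burst_cost(weeks) < FIXED_COST_PER_YEAR


# Number of burst weeks per year at which both options cost the same.
break_even_weeks = FIXED_COST_PER_YEAR / BURST_COST_PER_WEEK

print(f"break-even: {break_even_weeks:.2f} weeks per year")
print("7 weeks cheaper to burst:", bursting_is_cheaper(7))
print("8 weeks cheaper to burst:", bursting_is_cheaper(8))
```

Under that assumption, bursting wins whenever the large cluster is needed for about seven weeks a year or less.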

23 Outsourcing Ops. Instead of buying ops or consultants, outsource to the cloud provider, which automates it. All other things being equal, will be cheaper than buying it as a service.

24 From Challenges to Solutions. Challenges: clusters are hard to set up and manage; need to integrate a zoo of tools. Solutions: hosted platform; Apache Spark.

25 What could it look like?

26 Databricks Platform. Start clusters in seconds; zero-cost management; dynamically scale up & down.

27 Interactive Workspace: Notebooks. Python, SQL, Scala; interactive queries & plots; on-line collaboration.

28 Interactive Workspace: Dashboards. WYSIWYG builder; interactive plots; one-click publishing.

29 Interactive Workspace: Job Launcher. Run arbitrary Spark jobs, programmatically.

30 Summary. Big data is as hard as rocket science, but doesn't need to be. Apache Spark unifies many existing frameworks. Big data in the cloud: faster, more scalable, cheaper.
