Making Big Data Simple
Ali Ghodsi, Head of PM and Engineering, Databricks
Big Data is Hard: A Big Data Project
Tasks:
> Build a Hadoop cluster
> Build a data pipeline
Challenges:
> Clusters hard to set up and manage
Typical Data Pipeline
Data → ETL → Exploration → Dashboards & Reports → Advanced Analytics → Data Products
From Tasks to Challenges
> Build a Hadoop cluster → Clusters hard to set up and manage
> Build a data pipeline → Need to integrate a zoo of tools
From Challenges to Solutions
> Clusters hard to set up and manage → ?
> Need to integrate a zoo of tools → Apache Spark
Apache Spark
Cluster computing system that generalizes Google's MapReduce
Benefits for users:
> Performance: up to 100x faster
> Ease of use: 2-5x less code than MapReduce
> Generality: unifies previously specialized models
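To make the ease-of-use claim concrete, here is a minimal word-count sketch in PySpark. It assumes an existing SparkContext named sc (as in the Spark shell or a notebook); the HDFS paths are placeholders. The equivalent hand-written MapReduce job typically runs to dozens of lines of Java.

```python
# Word count in a few lines; `sc` is an existing SparkContext and
# the HDFS paths below are hypothetical.
counts = (sc.textFile("hdfs:///data/pages.txt")
            .flatMap(lambda line: line.split())   # split lines into words
            .map(lambda word: (word, 1))          # pair each word with a count
            .reduceByKey(lambda a, b: a + b))     # sum counts per word
counts.saveAsTextFile("hdfs:///data/word-counts")
```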
Big Data Systems Today
> General batch processing: MapReduce
> Specialized systems for new workloads: Pregel, Giraph, Impala, Dremel, Drill, Presto, Storm, S4, ...
> Unified engine: Spark
Problems with Specialized Systems
More systems to manage, tune, deploy
Can't combine processing types in one application
> Even though many pipelines need to do this!
> E.g. load data with SQL, then run machine learning
In many pipelines, data exchange between engines is the dominant cost!
Libraries that come with Spark
> Spark SQL (relational)
> Spark Streaming (real-time)
> MLlib (machine learning)
> GraphX (graph)
All built on Spark Core
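As a sketch of how these libraries combine in one application, the following loads rows with Spark SQL and clusters them with MLlib, with no data exchange between separate engines. It assumes a Spark 1.x-style SQLContext named sqlContext; the table name and columns are hypothetical.

```python
from pyspark.mllib.clustering import KMeans

# Load data with SQL, then run machine learning on the result:
# one application, one engine. `events`, `x`, `y` are placeholders.
rows = sqlContext.sql("SELECT x, y FROM events WHERE x IS NOT NULL")
points = rows.rdd.map(lambda row: [row.x, row.y])  # feature vectors
model = KMeans.train(points, k=5)                  # cluster into 5 groups
print(model.clusterCenters)
```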
Performance vs Specialized Systems
[Bar charts, one per workload:
> SQL: response time (sec) for Impala (mem), Impala (disk), Spark (mem), Spark (disk), Hive
> Streaming: throughput (MB/s/node) for Storm, Spark
> ML: response time (min) for GraphLab, Spark, Mahout]
On-Disk Performance: Petabyte Sort
Spark beat last year's Sort Benchmark winner, Hadoop, by 3x using 10x fewer machines

              2013 Record (Hadoop)   Spark 100 TB   Spark 1 PB
Data Size     102.5 TB               100 TB         1000 TB
Time          72 min                 23 min         234 min
Nodes         2100                   206            190
Cores         50400                  6592           6080
Rate/Node     0.67 GB/min            20.7 GB/min    22.5 GB/min

tinyurl.com/spark-sort
Project History
> Started in 2009 in the RAD/AMP Labs at UC Berkeley
> Open sourced in 2010
> Since then, a growing open source community
> Became an Apache project in 2013
Rapid Growth
[Chart: monthly contributors, 0-100, 2011-2014]
2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, ...
Compared to Other Projects
[Charts: activity in past 6 months, commits (0-2000) and lines of code changed (0-350,000), for Spark, MapReduce, YARN, HDFS, Storm; Spark highest on both]
From Challenges to Solutions
> Clusters hard to set up and manage → ?
> Need to integrate a zoo of tools → Apache Spark
Getting started with Spark is still hard
> Acquire machines
> Set up the big data pipeline
> Set up Hadoop
> Set up Hive
> Set up Spark
From Challenges to Solutions
> Clusters hard to set up and manage → Hosted service
> Need to integrate a zoo of tools → Apache Spark
Hosted Big Data Service
> Run Spark in the cloud
> Machines on the fly
> Dynamically scale up
> Outsource ops to cloud provider
Advantages:
> Faster (months down to days)
> Handle bursts
> Cheaper
Cluster Provisioning
Deploying a cluster on-prem takes 3-6 months
Launch a cluster in the cloud (AWS/GCE/Azure):
> Cluster up and spinning in one hour
> Still need to configure and set up the cluster
> Ecosystem evolving, e.g. CloudFormation
> Can reuse the process to re-launch clusters
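As one concrete way to do this, here is a sketch using the spark-ec2 script that ships with Spark; it assumes AWS credentials are already exported in the environment, and the key pair and cluster names are placeholders.

```
# Launch a 5-slave Spark cluster on EC2 with the bundled spark-ec2 script.
./ec2/spark-ec2 --key-pair=my-key --identity-file=my-key.pem \
    --slaves=5 --region=us-east-1 launch demo-cluster

# Tear it down when done, so you only pay for the hours used.
./ec2/spark-ec2 --region=us-east-1 destroy demo-cluster
```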
Dynamic scale-out
> Increase cluster capacity dynamically
> Burst to a 10TB cluster ($55K/week vs $390K/yr)
> Interactive exploration and production jobs
Cost optimizations available:
> Spot instances
> Auto-resizing clusters based on activity
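For the spot-instance option, the same hypothetical launch can bid a maximum hourly price instead of paying on-demand rates; the bid value here is illustrative only.

```
# Launch the workers as spot instances with a $0.25/hour ceiling.
./ec2/spark-ec2 --key-pair=my-key --identity-file=my-key.pem \
    --slaves=100 --spot-price=0.25 launch burst-cluster
```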
Outsourcing Ops
> Instead of buying ops or hiring consultants
> Outsource to the cloud provider, which automates operations
> All other things being equal, automation will be cheaper than buying ops as a service
From Challenges to Solutions
> Clusters hard to set up and manage → Hosted platform
> Need to integrate a zoo of tools → Apache Spark
What could it look like?
Databricks Platform
> Start clusters in seconds
> Zero-cost management
> Dynamically scale up & down
Interactive Workspace: Notebooks
> Python, SQL, Scala
> Interactive queries & plots
> Online collaboration
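A sketch of what a notebook cell might look like (the table name is hypothetical; this assumes a preconfigured SQLContext and Databricks' display() helper, which renders results as interactive tables or plots):

```python
# Query a registered table and render the result interactively.
top_pages = sqlContext.sql("""
    SELECT page, COUNT(*) AS hits
    FROM logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10""")
display(top_pages)   # interactive table/plot in the notebook
```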
Interactive Workspace: Dashboards
> WYSIWYG builder
> Interactive plots
> One-click publishing
Interactive Workspace: Job Launcher
> Run arbitrary Spark jobs, programmatically
Summary
> Big data is as hard as rocket science, but it doesn't need to be
> Apache Spark unifies many previously specialized frameworks
> Big data in the cloud: faster, more scalable, cheaper