Introduction to Big Data with Apache Spark (UC Berkeley)




This Lecture: The Big Data Problem; Hardware for Big Data; Distributing Work; Handling Failures and Slow Machines; Map Reduce and Complex Jobs; Apache Spark.

Some Traditional Analysis Tools: Unix shell commands, Pandas, R. All run on a single machine!

The Big Data Problem: Data is growing faster than computation speeds. Growing data sources: web, mobile, scientific, ... Storage is getting cheaper: size doubles every 18 months. But CPU speeds are stalling, and storage is a bottleneck.

Big Data Examples: Facebook's daily logs: 60 TB. 1000 Genomes Project: 200 TB. Google web index: 10+ PB. Cost of 1 TB of disk: ~$35. Time to read 1 TB from disk: ~3 hours (at 100 MB/s).
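The 3-hour figure follows directly from the disk bandwidth; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the slide's claim: reading 1 TB
# sequentially at 100 MB/s takes roughly 3 hours.
bytes_to_read = 10**12          # 1 TB
bandwidth = 100 * 10**6         # 100 MB/s
seconds = bytes_to_read / bandwidth
hours = seconds / 3600
print(f"{hours:.1f} hours")     # 2.8 hours, i.e. roughly 3
```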

The Big Data Problem: A single machine can no longer process, or even store, all the data! The only solution is to distribute the data over large clusters.

Google Datacenter: How do we program this thing?!

Hardware for Big Data: lots of hard drives... and CPUs.

Hardware for Big Data: One big box? (the 1990s solution). But it is expensive: low volume, all premium hardware. And it is still not big enough! [Image: Wikimedia Commons / User:Tonusamuel]

Hardware for Big Data: Consumer-grade hardware, "not gold plated": many desktop-like servers, easy to add capacity, cheaper per CPU/disk. The complexity moves into the software. [Image: Steve Jurvetson/Flickr]

Problems with Cheap Hardware: Failures; Google's numbers: 1-5% of hard drives per year, 0.2% of DIMMs per year. Network speeds versus shared memory: much more latency, and the network is slower than storage. Uneven performance.

What's Hard About Cluster Computing? How do we split work across machines?

How do you count the number of occurrences of each word in a document? "I am Sam / I am Sam / Do you like / Green eggs and ham?" → I: 3, am: 3, Sam: 3, do: 1, you: 1, like: 1, ...

One Approach: Use a Hash Table. Scan the document one word at a time, updating the table: {} → {I: 1} → {I: 1, am: 1} → {I: 1, am: 1, Sam: 1} → {I: 2, am: 1, Sam: 1} → ...
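A minimal sketch of this single-machine approach in Python, using a dict as the hash table (only the short snippet from the slide is counted here):

```python
# Single-machine word count: one pass over the words, updating a hash table.
text = "I am Sam I am Sam Do you like Green eggs and ham"
counts = {}
for word in text.lower().split():
    counts[word] = counts.get(word, 0) + 1
print(counts)
```

This works fine as long as both the document and the table fit on one machine, which is exactly the assumption the next slides break.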

What if the Document is Really Big? "I am Sam / I am Sam / Do you like / Green eggs and ham? / I do not like them / I do not like / Green eggs and ham / Would you like them / Here or there?"

What if the Document is Really Big? Split the document across Machines 1-4, each building a partial hash table ({I: 3, am: 3, Sam: 3, ...}, {do: 2, ...}, {Sam: 1, ...}, {Would: 1, ...}), then send every partial table to Machine 5 to merge into {I: 6, am: 4, Sam: 4, do: 3, ...}. What's the problem with this approach? The merged results have to fit on one machine! We can add aggregation layers in between, but the final result still must fit on one machine.

What if the Document is Really Big? Use divide and conquer! In a first phase (MAP), each of Machines 1-4 computes a partial count table for its own chunk of the document ({I: 1, am: 1, ...}, {do: 1, you: 1, ...}, {Would: 1, you: 1, ...}, ...). In a second phase (REDUCE), the words themselves are partitioned across Machines 1-4, and each machine merges the counts for its own subset of words ({I: 6, do: 3, ...}, {am: 5, Sam: 4, ...}, {you: 2, ...}, {Would: 1, ...}). No single machine has to hold the whole result. This is Google Map Reduce (2004): http://research.google.com/archive/mapreduce.html
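The map and reduce phases above can be sketched in plain Python; the "machines" are just lists, and a hash partitioner plays the role of the shuffle:

```python
from collections import defaultdict

lines = ["I am Sam", "I am Sam", "Do you like", "Green eggs and ham",
         "I do not like them", "I do not like", "Green eggs and ham",
         "Would you like them", "Here or there"]

# MAP: each line can be processed independently, emitting (word, 1) pairs.
mapped = [(w.lower(), 1) for line in lines for w in line.split()]

# SHUFFLE: hash-partition by key, so each reducer owns a disjoint word set.
R = 4  # number of "reducer machines"
shards = [defaultdict(int) for _ in range(R)]
for word, one in mapped:
    shards[hash(word) % R][word] += one

# REDUCE: each shard sums its own words independently; no single machine
# ever has to hold the full table.
counts = {w: c for shard in shards for w, c in shard.items()}
print(counts["i"])  # 4
```

Because the keys are disjoint across shards, the reduce step needs no coordination between machines; that is the property the hash partitioner buys.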

Map Reduce for Sorting: What word is used most? Map each word to its (count, word) pair and range-partition by count (e.g. counts up to 2 on one machine, 3-4 on the next, 5 and up on the last). Each machine sorts its own range: {1: would, 2: you, ...}, {3: do, 4: Sam, ...}, {5: am, ...}, {6: I}. Concatenating the machines' outputs in order gives a globally sorted result.
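The sorting trick relies on range partitioning: if the machine boundaries are chosen on the sort key (here, the count), then locally sorted partitions concatenate into a globally sorted list. A toy sketch, with the illustrative counts from the slide:

```python
# Range-partition (count, word) pairs so each "machine" owns a disjoint
# count range; sorting within machines then yields a global sort.
pairs = [(6, "I"), (5, "am"), (4, "Sam"), (3, "do"), (2, "you"), (1, "would")]

machines = {0: [], 1: [], 2: []}
def route(count):               # partition boundaries: <=2, 3-4, >=5
    if count <= 2:
        return 0
    if count <= 4:
        return 1
    return 2

for pair in pairs:
    machines[route(pair[0])].append(pair)

globally_sorted = []
for m in sorted(machines):      # concatenate machines in range order
    globally_sorted.extend(sorted(machines[m]))

print(globally_sorted[-1])      # (6, 'I'): the most-used word
```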

What's Hard About Cluster Computing? How to divide work across machines? Must consider the network and data locality; moving data may be very expensive. How to deal with failures? If 1 server fails every 3 years, then with 10,000 nodes you see about 10 faults per day. Even worse: stragglers (nodes that have not failed, but are slow).
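The 10-faults-per-day figure is simple arithmetic on the per-server failure rate:

```python
# If one server fails on average every 3 years, a 10,000-node cluster
# sees roughly ten failures per day.
nodes = 10_000
mean_time_between_failures_days = 3 * 365
failures_per_day = nodes / mean_time_between_failures_days
print(round(failures_per_day, 1))  # 9.1
```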

How Do We Deal with Machine Failures? If the machine computing one partial table ({I: 1, am: 1, ...}) fails mid-job, its output is lost. Launch another task on a different machine to recompute that chunk!

How Do We Deal with Slow Tasks? One straggler holds up the entire job. Launch another copy of the slow task on a different machine, and use whichever result arrives first!

Map Reduce: Distributed Execution. MAP, then REDUCE: each stage passes through the hard drives.

Map Reduce: Iterative Jobs. Iterative jobs (Stage 1 → Stage 2 → Stage 3 → ...) involve a lot of disk I/O on each repetition, and disk I/O is very slow!

Apache Spark Motivation: Using Map Reduce for complex jobs (Job 1, Job 2, ...), interactive queries (Query 1, Query 2, Query 3), and online processing involves lots of disk I/O; the same holds for interactive mining, stream processing, and iterative jobs. And disk I/O is very slow!

Tech Trend: Cost of Memory. [Chart: price over time for memory, disk, and flash; memory reached roughly 1¢/MB around 2010.] Lower cost means we can put more memory in each server. http://www.jcmit.com/mem2014.htm

Hardware for Big Data: lots of hard drives... and CPUs... and memory!

Opportunity: Keep more data in-memory! Create a new distributed execution engine: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

Use Memory Instead of Disk: With Map Reduce, every iteration goes through HDFS (HDFS read → iteration 1 → HDFS write → HDFS read → iteration 2 → HDFS write → ...), and every query re-reads the input from HDFS (HDFS read → query 1 → result 1, query 2 → result 2, query 3 → result 3, ...).

In-Memory Data Sharing: After a one-time HDFS read and processing step, the input is held in distributed memory, where iterations (iteration 1 → iteration 2 → ...) and queries (query 1 → result 1, query 2 → result 2, query 3 → result 3, ...) can share it. Memory is 10-100x faster than network and disk.

Resilient Distributed Datasets (RDDs): Write programs in terms of operations on distributed datasets. RDDs are partitioned collections of objects spread across a cluster, stored in memory or on disk. They are built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (count, collect, save), and they are automatically rebuilt on machine failure.
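The key RDD ideas (transformations lazily recorded as lineage, actions triggering computation, recovery by replaying lineage) can be illustrated with a toy model in plain Python. This is a didactic sketch with invented names, not the real Spark API:

```python
# A plain-Python model of the RDD idea: transformations only record
# lineage; an action replays the lineage over the source data, which is
# also how a lost partition would be rebuilt after a machine failure.
class ToyRDD:
    def __init__(self, source, lineage=()):
        self.source = source      # base data, e.g. lines read from storage
        self.lineage = lineage    # transformations recorded but not yet run

    def map(self, f):             # lazy: nothing is computed here
        return ToyRDD(self.source, self.lineage + (("map", f),))

    def filter(self, p):          # lazy as well
        return ToyRDD(self.source, self.lineage + (("filter", p),))

    def collect(self):            # action: now the work actually happens
        data = list(self.source)
        for op, f in self.lineage:
            if op == "map":
                data = [f(x) for x in data]
            else:
                data = [x for x in data if f(x)]
        return data

    def count(self):              # action built on collect
        return len(self.collect())

lines = ToyRDD(["I am Sam", "I am Sam", "Do you like"])
lengths = lines.map(str.split).map(len).filter(lambda n: n == 3)
print(lengths.count())  # 3: every line here has exactly three words
```

Note that building `lengths` does no work at all; only `count()` runs the pipeline, which is what lets Spark pipeline transformations and rebuild lost partitions from lineage.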

The Spark Computing Framework: Provides a programming abstraction and parallel runtime to hide the complexities of fault tolerance and slow machines. "Here's an operation, run it on all of the data. I don't care where it runs (you schedule that). In fact, feel free to run it twice on different nodes."

Spark Tools: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing), all built on top of Apache Spark.

Spark and Map Reduce Differences

                          Hadoop Map Reduce      Spark
Storage                   Disk only              In-memory or on disk
Operations                Map and Reduce         Map, Reduce, Join, Sample, etc.
Execution model           Batch                  Batch, interactive, streaming
Programming environments  Java                   Scala, Java, R, and Python

Other Spark and Map Reduce Differences: Generalized patterns give a unified engine for many use cases. Lazy evaluation of the lineage graph reduces wait states and enables better pipelining. Lower overhead for starting jobs. Less expensive shuffles.

In-Memory Can Make a Big Difference. Two iterative machine learning algorithms: K-means clustering: 121 sec with Hadoop MR vs. 4.1 sec with Spark. Logistic regression: 80 sec with Hadoop MR vs. 0.96 sec with Spark.

First Public Cloud Petabyte Sort: Daytona Gray 100 TB sort benchmark record (tied for 1st place). http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

Spark Expertise Tops Big Data Median Salaries: over 800 respondents across 53 countries and 41 U.S. states. http://www.oreilly.com/data/free/2014-data-science-salary-survey.csp

History Review

Historical References:
circa 1979, Stanford, MIT, CMU, etc.: set/list operations in LISP, Prolog, etc., for parallel processing. http://www-formal.stanford.edu/jmc/history/lisp/lisp.htm
circa 2004, Google: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat. http://research.google.com/archive/mapreduce.html
circa 2006: Apache Hadoop, originating from Yahoo!'s Nutch project. Doug Cutting. http://research.yahoo.com/files/cutting.pdf
circa 2008, Yahoo!: web-scale search indexing. Hadoop Summit, HUG, etc. http://developer.yahoo.com/hadoop/
circa 2009, Amazon AWS: Elastic MapReduce. Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc. http://aws.amazon.com/elasticmapreduce/

Spark Research Papers:
"Spark: Cluster Computing with Working Sets". Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. USENIX HotCloud (2010). people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
"Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI (2012). usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf