Introduction to Apache Spark


Introduction to Apache Spark
Jordan Volz, Systems Engineer @ Cloudera

Analyzing Data on Large Data Sets

Python, R, etc. are popular tools among data scientists, analysts, statisticians, and others. Why are these tools popular?
- Easy to learn, and they maximize productivity for data engineers, data scientists, and statisticians
- Support building robust software as well as interactive data analysis
- Large, diverse open source development communities
- Comprehensive libraries: data wrangling, ML, visualization, etc.

Limitations do exist, however:
- Largely confined to single-node analysis and smaller data sets
- Require sampling or aggregation for larger data
- Distributed tools compromise in various ways, adding complexity and time
- Restricted effectiveness in certain use cases

MapReduce Analysis on Large Data Sets (Hadoop)

[Diagram: waves of map tasks feeding a row of reduce tasks]

Key advances by MapReduce:
- Data locality: automatic split computation and launch of mappers close to the data
- Fault tolerance: writing out intermediate results, plus restartable mappers, meant the ability to run on commodity hardware
- Linear scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions to problems
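The programming model behind these advances can be sketched in a few lines of plain Python. This is only an illustration of the map → shuffle → reduce phases, not Hadoop itself; the `mapper`, `shuffle`, and `reducer` names are ours, not Hadoop API names.

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word, as a word-count mapper would.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Group values by key; in Hadoop this is the shuffle/sort phase.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Sum the counts for one word.
    return (key, sum(values))

lines = ["big data big ideas", "big data tools"]
pairs = (kv for line in lines for kv in mapper(line))
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'ideas': 1, 'tools': 1}
```

Because the mapper sees only one record at a time and the reducer only one key's values, the same program scales from two lines of input to a cluster-sized data set, which is the "forced scalability" the slide refers to.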

MapReduce is Not Perfect

[Diagram: chains of map and reduce stages with intermediate results written to disk between jobs]

- Limited to the map-reduce paradigm
- Lots of I/O → slower jobs
- Iterative jobs (ML) → even slower
- Redundant joins with SQL tools

MapReduce on YARN

Death by Pinprick

Apache Spark

Flexible, in-memory data processing for Hadoop.

Easy development:
- Rich APIs for Scala, Java, and Python
- Interactive shell

Flexible, extensible API, with APIs for different types of workloads:
- Batch (MR)
- Streaming
- Machine learning
- Graph

Retains: linear scalability, fault tolerance, data locality.

Fast batch and stream processing: in-memory processing and caching.

Spark Basics

- Distributed cluster framework (like MR), running tasks in parallel across a cluster
- Tasks operate in memory, spilling to disk when memory is exceeded
- Resilient Distributed Datasets (RDDs): read-only partitioned collections of records
- RDDs are operated on through parallel transformations and actions
- Lazy materialization optimizes resource use
- RDD lineage, from the storage layer through the compute and caching layers, provides fault tolerance
- Users control persistence and partitioning
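The "lazy materialization" point is worth making concrete. The idea can be sketched with plain Python generators (an illustration of the concept only, not Spark's implementation): transformations merely build up a pipeline, and nothing is computed until an action forces evaluation.

```python
def transform_map(source, fn):
    # A "transformation": returns a new lazy pipeline; computes nothing yet.
    return (fn(x) for x in source)

def transform_filter(source, pred):
    # Another lazy transformation, stacked on top of the previous one.
    return (x for x in source if pred(x))

def action_count(source):
    # An "action": forces the whole pipeline to actually run.
    return sum(1 for _ in source)

data = range(1_000_000)
squares = transform_map(data, lambda x: x * x)            # nothing computed yet
evens = transform_filter(squares, lambda x: x % 2 == 0)   # still nothing
print(action_count(evens))  # evaluation happens only here: 500000
```

In Spark the same deferral lets the scheduler see the whole chain of transformations before running anything, so it can pipeline steps, skip unneeded work, and reuse cached partitions.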

Fast Processing Using RAM, Operator Graphs

[Diagram: a DAG of RDDs (A–F) connected by map, groupBy, join, filter, and take operators, with cached partitions held in RAM]

- In-memory caching: data partitions are read from RAM instead of disk
- Operator graphs enable scheduling optimizations and fault tolerance

Logistic Regression Performance (Data Fits in Memory)

[Chart: running time (s) vs. number of iterations (1-30), MapReduce vs. Spark. MapReduce takes roughly 110 s per iteration; Spark's first iteration takes about 80 s, and further iterations take about 1 s due to caching.]

Spark on YARN

Spark Will Replace MapReduce

To become the standard execution engine for Hadoop.

Hadoop MapReduce:

    public static class WordCountMapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

    public static class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Spark:

    val spark = new SparkContext(master, appName, [sparkHome], [jars])
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

The Future of Data Processing on Hadoop

Spark complemented by specialized, fit-for-purpose engines:
- General data processing with Spark: fast batch processing, machine learning, and stream processing
- Full-text search with Solr: querying textual data
- Analytic database with Impala: low-latency, massively concurrent queries
- On-disk processing with MapReduce: jobs at extreme scale that are extremely disk-I/O intensive

Shared: data storage, metadata, resource management, administration, security, governance.

Easy Development: High-Productivity Language Support

Python:

    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

Scala:

    val lines = sc.textFile(...)
    lines.filter(s => s.contains("ERROR")).count()

Java:

    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) {
        return s.contains("ERROR");
      }
    }).count();

- Native support for multiple languages with identical APIs: Scala, Java, Python
- Use of closures, iterations, and other common language constructs to minimize code
- 2-5x less code

Easy Development: Use Interactively

    percolateur:spark srowen$ ./bin/spark-shell --master local[*]
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
          /_/

    Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
    Type in expressions to have them evaluated.
    Type :help for more information.
    ...
    scala> val words = sc.textFile("file:/usr/share/dict/words")
    ...
    words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

    scala> words.count
    ...
    res0: Long = 235886

- Interactive exploration of data for data scientists
- No need to develop applications
- Developers can prototype applications on a live system

Easy Development: Expressive API

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
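As a rough illustration of what two of the less obvious operations in this list do, the sketch below mimics `reduceByKey` (merge all values sharing a key) and an inner `join` (pair up values that share a key) in plain Python. This stands in for Spark's semantics only; it is not the Spark API, and `reduce_by_key`/`join` here are our own helper names.

```python
from collections import defaultdict

def reduce_by_key(pairs, fn):
    # Mimics Spark's reduceByKey: fold all values sharing a key with fn.
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return list(acc.items())

def join(left, right):
    # Mimics Spark's inner join on (key, value) pair collections:
    # emit (key, (left_value, right_value)) for every matching combination.
    rights = defaultdict(list)
    for k, v in right:
        rights[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in rights[k]]

sales = [("apples", 3), ("pears", 2), ("apples", 4)]
prices = [("apples", 0.5), ("pears", 0.8)]
totals = reduce_by_key(sales, lambda a, b: a + b)  # [('apples', 7), ('pears', 2)]
print(join(totals, prices))  # [('apples', (7, 0.5)), ('pears', (2, 0.8))]
```

In Spark each of these is a single method call on an RDD of pairs, which is what makes the API so much more compact than hand-written MapReduce jobs.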

Example: Logistic Regression

    data = spark.textFile(...).map(readPoint).cache()

    w = numpy.random.rand(D)
    for i in range(ITERATIONS):
        gradient = data.map(
            lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda x, y: x + y)
        w -= gradient

    print "Final w: %s" % w

The Spark Ecosystem & Hadoop

[Diagram: Spark (with Spark Streaming, MLlib, Spark SQL, GraphX, DataFrames, and SparkR) alongside Impala, Search, MR, and other engines; YARN for resource management; HDFS and HBase for storage]

One Platform, Many Workloads

- Process — Ingest: Sqoop, Flume, Kafka, Spark Streaming; Transform: MapReduce, Hive, Pig, Spark
- Discover — Analytic database: Impala; Search: Solr
- Model — Machine learning: SAS, R, Spark, Mahout
- Serve — NoSQL database: HBase; Streaming: Spark Streaming
- Security and administration: YARN, Cloudera Manager, Cloudera Navigator
- Unlimited storage: HDFS, HBase

Batch, interactive, and real-time. Leading performance and usability in one platform: end-to-end analytic workflows, access to more data, working with data in new ways, enabling new users.

Cloudera Customer Use Cases

Over 150 customers are using Spark, with Spark clusters as large as 800 nodes.

Core Spark:
- Financial services: portfolio risk analysis; ETL pipeline speed-up; 20+ years of stock data
- Health: identify disease-causing genes in the full human genome; calculate Jaccard scores on health care data sets
- ERP: optical character recognition and bill classification
- 1010 Data Services: trend analysis; document classification (LDA); fraud analytics

Spark Streaming:
- Financial services: online fraud detection
- Ad tech: real-time ad performance analysis

Uniting Spark and Hadoop: The One Platform Initiative

Investment areas:
- Management: leverage Hadoop-native resource management
- Security: full support for Hadoop security and beyond
- Scale: enable 10k-node clusters
- Streaming: support for 80% of common stream processing workloads

Spark Resources

- Learn Spark: the O'Reilly Advanced Analytics with Spark ebook (written by Clouderans); the Cloudera Developer Blog; cloudera.com/spark
- Get trained: Cloudera Spark Training
- Try it out: Cloudera Live Spark Tutorial

Thank You

jordan.volz@cloudera.com
linkedin.com/in/jordan.volz