Introduction to Apache Spark
Jordan Volz, Systems Engineer @ Cloudera
Analyzing Data on Large Data Sets
Python, R, and similar tools are popular among data scientists, analysts, and statisticians. Why?
- Easy to learn, and they maximize productivity for data engineers, data scientists, and statisticians
- Build robust software and do interactive data analysis
- Large, diverse open source development communities
- Comprehensive libraries: data wrangling, ML, visualization, etc.
Limitations do exist:
- Largely confined to single-node analysis and smaller data sets
- Larger data requires sampling or aggregation
- Distributed tools compromise in various ways, adding complexity and time
- This restricts their effectiveness in certain use cases
MapReduce Analysis on Large Data Sets (Hadoop)
(Diagram: many parallel map tasks feeding a smaller set of reduce tasks.)
Key advances by MapReduce:
- Data Locality: automatic split computation and launching of mappers close to the data
- Fault Tolerance: writing out intermediate results and restartable mappers enabled running on commodity hardware
- Linear Scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions to problems
MapReduce is Not Perfect
- Limited to the map-reduce paradigm
- Lots of I/O → slower jobs
- Iterative jobs (ML) → even slower
- Redundant joins with SQL tools
MapReduce on YARN
(Diagram.) Cloudera, Inc. All rights reserved.
Death by Pinprick
Apache Spark
Flexible, in-memory data processing for Hadoop
- Easy Development: rich APIs for Scala, Java, and Python; interactive shell
- Flexible, Extensible API: APIs for different types of workloads: batch (MR), streaming, machine learning, graph
- Retains: linear scalability, fault tolerance, data locality
- Fast Batch & Stream Processing: in-memory processing and caching
Spark Basics
- Distributed cluster framework (like MR), running tasks in parallel across a cluster
- Tasks operate in memory and spill to disk when memory is exceeded
- Resilient Distributed Datasets (RDDs): read-only partitioned collections of records
- RDDs are manipulated through parallel transformations and actions
- Lazy materialization optimizes resource use
- RDD lineage, from the storage layer through the compute and caching layers, provides fault tolerance
- Users control persistence and partitioning
Fast Processing Using RAM, Operator Graphs
- In-Memory Caching: data partitions are read from RAM instead of disk
- Operator Graphs: enable scheduling optimizations and fault tolerance
(Diagram: a DAG of RDDs connected by map, groupBy, join, filter, and take operators; the legend distinguishes plain RDDs from cached partitions.)
Logistic Regression Performance (Data Fits in Memory)
(Chart: running time in seconds vs. number of iterations, comparing MapReduce and Spark.)
- MapReduce: 110 s/iteration
- Spark: first iteration 80 s; further iterations ~1 s due to caching
Spark on YARN
Spark will replace MapReduce
To become the standard execution engine for Hadoop

Hadoop MapReduce:

    public static class WordCountMapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

    public static class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

Spark:

    val spark = new SparkContext(master, appName, [sparkHome], [jars])
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
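To show what the three Spark operations above actually compute, the word count can be mirrored in plain Python over a local list; this sketch stands in for the distributed version and uses only the standard library.

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to do is to be"]  # stand-in for the HDFS file

# flatMap: split each line into words, flattening into one stream of words.
words = chain.from_iterable(line.split(" ") for line in lines)

# map + reduceByKey: pair each word with 1, then sum the 1s per key.
# For a local collection, Counter performs both steps at once.
counts = Counter(words)

print(counts["to"])  # 4
```

Spark performs the same computation, but the per-key summation happens in parallel across partitions before results are combined.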
The Future of Data Processing on Hadoop
Spark complemented by specialized, fit-for-purpose engines:
- General Data Processing w/ Spark: fast batch processing, machine learning, and stream processing
- Full-Text Search w/ Solr: querying textual data
- Analytic Database w/ Impala: low-latency, massively concurrent queries
- On-Disk Processing w/ MapReduce: jobs at extreme scale that are extremely disk-I/O intensive
Shared: data storage, metadata, resource management, administration, security, governance
Easy Development
High-Productivity Language Support
- Native support for multiple languages with identical APIs: Scala, Java, Python
- Use of closures, iterations, and other common language constructs to minimize code: 2-5x less code

Python:

    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

Scala:

    val lines = sc.textFile(...)
    lines.filter(s => s.contains("ERROR")).count()

Java:

    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) { return s.contains("ERROR"); }
    }).count();
Easy Development
Use Interactively
- Interactive exploration of data for data scientists
- No need to develop applications
- Developers can prototype applications on a live system

    percolateur:spark srowen$ ./bin/spark-shell --master local[*]
    ...
    Welcome to
      [Spark ASCII logo]  version 1.5.0-SNAPSHOT
    Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
    Type in expressions to have them evaluated.
    Type :help for more information.
    ...
    scala> val words = sc.textFile("file:/usr/share/dict/words")
    words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
    scala> words.count
    res0: Long = 235886
    scala>
Easy Development
Expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save
Example: Logistic Regression

    data = spark.textFile(...).map(readPoint).cache()

    w = numpy.random.rand(D)

    for i in range(iterations):
        gradient = data.map(
            lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda x, y: x + y)
        w -= gradient

    print "Final w: %s" % w
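The same update can be run locally in plain Python to see that it converges: the distributed data.map(...).reduce(...) step collapses to an ordinary sum over points. The tiny dataset and learning rate below are made up for illustration; labels are y ∈ {-1, +1} and the data is separable on the first feature.

```python
import math

# Hypothetical toy dataset: (feature vector, label), separable on x[0].
data = [([1.0, 0.5], 1), ([0.8, -0.2], 1),
        ([-1.0, 0.3], -1), ([-0.7, -0.6], -1)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

w = [0.0, 0.0]
for _ in range(100):
    # Same per-point gradient term as the slide:
    # (1 / (1 + exp(-y * w.x)) - 1) * y * x, summed over all points.
    grad = [0.0, 0.0]
    for x, y in data:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        grad = [g + scale * xi for g, xi in zip(grad, x)]
    w = [wi - 0.5 * gi for wi, gi in zip(w, grad)]

# Every point should now fall on the correct side of the boundary.
assert all((1 if dot(w, x) > 0 else -1) == y for x, y in data)
print("Final w: %s" % w)
```

Because the data is cached, Spark repeats only this cheap gradient step each iteration, which is what produces the ~1 s iterations on the earlier performance slide.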
The Spark Ecosystem & Hadoop
- Spark: Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, SparkR
- Alongside: Impala, Search, MR, others
- RESOURCE MANAGEMENT: YARN
- STORAGE: HDFS, HBase
One Platform, Many Workloads
- Process: Ingest (Sqoop, Flume, Kafka, Spark Streaming); Transform (MapReduce, Hive, Pig, Spark)
- Discover: Analytic Database (Impala); Search (Solr)
- Model: Machine Learning (SAS, R, Spark, Mahout)
- Serve: NoSQL Database (HBase); Streaming (Spark Streaming)
- Security and Administration: YARN, Cloudera Manager, Cloudera Navigator
- Unlimited Storage: HDFS, HBase
Batch, interactive, and real-time. Leading performance and usability in one platform.
- End-to-end analytic workflows
- Access more data
- Work with data in new ways
- Enable new users
Cloudera Customer Use Cases
Over 150 customers using Spark; Spark clusters as large as 800 nodes
Core Spark:
- Financial Services: portfolio risk analysis; ETL pipeline speed-up; 20+ years of stock data
- Health: identify disease-causing genes in the full human genome; calculate Jaccard scores on health care data sets
- ERP: optical character recognition and bill classification
- 1010 Data Services: trend analysis; document classification (LDA); fraud analytics
Spark Streaming:
- Financial Services: online fraud detection
- Ad Tech: real-time ad performance analysis
Uniting Spark and Hadoop
The One Platform Initiative: Investment Areas
- Management: leverage Hadoop-native resource management
- Security: full support for Hadoop security and beyond
- Scale: enable 10k-node clusters
- Streaming: support for 80% of common stream processing workloads
Spark Resources
- Learn Spark: O'Reilly Advanced Analytics with Spark ebook (written by Clouderans); Cloudera Developer Blog; cloudera.com/spark
- Get Trained: Cloudera Spark Training
- Try It Out: Cloudera Live Spark Tutorial
Thank You
jordan.volz@cloudera.com
linkedin.com/in/jordan.volz