Apache Spark (11/10/15)


Apache Spark
Document Analysis Course (Fall 2015 - Scott Sanner)
Zahra Iman
Some slides from Matei Zaharia (UC Berkeley / MIT) and Harold Liu

Reminder
SparkConf, JavaSparkContext
RDD: Resilient Distributed Datasets
» A representation of the data coming into your system, as objects
» Relies on lineage to recover in case of failure
Transformations: what you apply to an RDD to get another RDD (open a file, filter, ...)
Actions: asking for an answer that the system needs to provide (count, ...)
Lazy evaluation: work is only done when there is an actual action to perform (see the sketch below)

What is Spark? A Growing Stack
Fast and expressive cluster computing system compatible with Apache Hadoop
Improves efficiency through:
» General execution graphs
» In-memory storage
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 10x faster on disk, 100x in memory; 2-5x less code
The stack: Spark core with Shark (SQL), Spark Streaming (real-time), GraphX (graph), and MLbase (machine learning) on top
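As a quick illustration of the lazy-evaluation point above, here is a minimal sketch (not from the slides) in the same Java API used later in these slides; the input file name "events.log" and the local master setting are assumptions made for the example.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("LazyEvalSketch").setMaster("local[*]"); // assumption: local mode
JavaSparkContext sc = new JavaSparkContext(conf);

// Transformations only record the lineage; nothing is read or filtered yet.
JavaRDD<String> lines  = sc.textFile("events.log");                    // hypothetical input file
JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

// The action triggers the actual work: read the file, apply the filter, count.
long numErrors = errors.count();
System.out.println("errors: " + numErrors);

sc.stop();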

Why a New Programming Model?
Easy to use and composes well for large applications
(Implementation) A higher-level computational model: fast data sharing and DAGs make the engine more efficient and are much simpler for the end users
Spark's goal was to generalize MapReduce to support new applications within the same engine

A Brief History: RDD
An RDD is an immutable, partitioned, logical collection of records
Spark enables distributed data processing through functional transformations on distributed collections of data (RDDs)
Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, ...
Parallel operations / actions (return a result to the driver): reduce, collect, count, save, lookupKey, ...

RDD Essentials
Transformations create a new dataset from an existing one
All transformations in Spark are lazy: they do not compute their results right away
Spark remembers the transformations applied to the base datasets, which lets it
» Optimize the required calculations
» Recover from lost data partitions

DataFrame
A distributed collection of data organized into named columns
Conceptually equivalent to a table in a relational database or a data frame in R/Python
Under the hood, a DataFrame contains an RDD of Row objects with additional schema information about the column types
Can incorporate SQL while working with DataFrames, using Spark SQL
Can be constructed from a wide array of sources: structured data files, tables in Hive, external databases, existing RDDs

RDD vs. DataFrame
Goal of the new DataFrame API: enable wider audiences beyond Big Data engineers to leverage the power of distributed processing
DataFrames can still be operated on with existing RDD transformations like map(), but they provide additional capabilities (see the sketch below):
» Register a DataFrame as a temporary table to query it
» Functions with behavior similar to their SQL counterparts, such as select()
» Cache tables
» Queries written in SQL return DataFrames
Because a DataFrame carries additional metadata from its tabular format, Spark can run certain optimizations on the finalized query
JSON data, Parquet data, and HiveQL data can be processed together by loading each into a DataFrame
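To make the RDD vs. DataFrame contrast concrete, here is a minimal sketch (not from the slides) that runs the same "people older than 21" query both ways; it assumes Spark 1.x, an existing JavaSparkContext sc, and a hypothetical people.json file with "name" and "age" fields (age read as a long).

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

JavaSparkContext sc = ...; // An existing JavaSparkContext
SQLContext sqlContext = new SQLContext(sc);

// DataFrame style: named columns and SQL-like operators, so Spark can optimize the plan.
DataFrame people = sqlContext.read().json("people.json");             // hypothetical file
people.filter(people.col("age").gt(21)).select("name").show();

// RDD style on the same data: rows are plain objects and the functions are opaque to Spark.
JavaRDD<String> names = people.javaRDD()
    .filter(row -> row.getLong(row.fieldIndex("age")) > 21)
    .map(row -> row.getString(row.fieldIndex("name")));
names.collect().forEach(System.out::println);

The DataFrame version expresses the intent declaratively, which is what lets Spark SQL apply query optimizations; the RDD version only tells Spark to run two arbitrary functions.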

DataFrame Example
JavaSparkContext sc = ...; // An existing JavaSparkContext
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
DataFrame df = sqlContext.read().json("examples/src/main/resources/people.json");
// Displays the content of the DataFrame to stdout
df.show();

DataFrame Operations
// Print the schema in a tree format
df.printSchema();
// Select only the "name" column
df.select("name").show();
// Select everybody, but increment the age by 1
df.select(df.col("name"), df.col("age").plus(1)).show();
// Select people older than 21
df.filter(df.col("age").gt(21)).show();
// Count people by age
df.groupBy("age").count().show();

Running SQL Queries Programmatically
SQLContext sqlContext = ...; // An existing SQLContext
DataFrame df = sqlContext.sql("SELECT * FROM table");

JavaRDD<Person> people = ...;
// Apply a schema to an RDD of JavaBeans and register it as a table.
DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
schemaPeople.registerTempTable("people");
// SQL can be run over RDDs that have been registered as tables.
DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");

DataFrame Supported Operators
map, reduce, sample, filter, groupBy, sort, union, join, count, fold, reduceByKey, groupByKey, cogroup, take, first, partitionBy, save, leftOuterJoin, rightOuterJoin, cross, zip, ...

I/O Process in Spark: write as a text file in one partition?
By default, Spark creates one partition for each block of the file
Make the number of partitions equal to n times the number of cores in the cluster, so that all partitions are processed in parallel and the resources are used equally
What if the data does not fit in memory when written as one partition? Use multiple partitions (a repartition sketch follows below)

Different formats of input/output files
Parquet files
CSV files
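Here is a minimal sketch (not from the slides) of the partitioning advice above; the HDFS paths are hypothetical and the partition count assumes a cluster with 4 cores, following the "n times the number of cores" rule of thumb.

import org.apache.spark.api.java.JavaRDD;

JavaSparkContext sc = ...; // An existing JavaSparkContext

// By default there is one partition per HDFS block of the input file.
JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");        // hypothetical path

// Rule of thumb from the slide: #partitions = n * #cores (here 2 * 4 = 8, an assumption).
JavaRDD<String> rebalanced = lines.repartition(8);
rebalanced.saveAsTextFile("hdfs:///data/output");                     // one part-* file per partition

// Writing through a single partition produces a single output file, but only works
// when the data fits comfortably on one executor.
lines.coalesce(1).saveAsTextFile("hdfs:///data/output-single");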

Parquet Files
A columnar format supported by many other data processing systems
Spark SQL provides support for both reading and writing Parquet files, and automatically preserves the schema of the original data

Loading & Writing Data
// sqlContext from the previous example is used in this example.
DataFrame schemaPeople = ...; // The DataFrame from the previous example.
// DataFrames can be saved as Parquet files, maintaining the schema information.
schemaPeople.write().parquet("people.parquet");
// Read in the Parquet file created above. Parquet files are self-describing, so the schema is preserved.
// The result of loading a Parquet file is also a DataFrame.
DataFrame parquetFile = sqlContext.read().parquet("people.parquet");

Performance Tuning
Partitions
» Fragmentation enables Spark to execute in parallel; the level of fragmentation is a function of the number of partitions in your RDD
Caching data in memory
» Spark SQL can cache tables using an in-memory columnar format
DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
// cache the DataFrame in memory
schemaPeople.cache();
sqlContext.cacheTable("tableName");
Serialization (something transparent that Spark does)
» Avoid writing back and forth: translate objects into an ideally compressed format for transfer over the network => Kryo serialization
Other configuration options: see the Spark documentation

Example Config File
vi spark/conf/spark-defaults.conf
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.consolidateFiles true
spark.kryo.referenceTracking false
spark.driver.extraJavaOptions "-XX:+UseCompressedOops"
spark.executor.extraJavaOptions "-XX:+UseCompressedOops"
spark.default.parallelism 48
spark.driver.memory 2560M

Spark / MapReduce Comparison - The Bottom Line
Hadoop MapReduce is meant for data that does not fit in memory, whereas Apache Spark performs better on data that fits in memory, particularly on dedicated clusters
The Hadoop processing model is on-disk (disk-based parallelization), while Spark can be in-memory or on-disk
Apache Spark uses a DAG (Directed Acyclic Graph) execution engine
In a distributed system a conventional program would not work because the data is split across nodes; the DAG is a programming style for distributed systems
The DAG scheduler divides operators into stages of tasks; a stage is comprised of tasks based on partitions of the input data
The DAG scheduler pipelines operators together; its final result is a set of stages (a sketch of inspecting these stages follows below)
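To see the stages the DAG scheduler produces, here is a minimal sketch (not from the slides) that introduces a shuffle with reduceByKey and prints the lineage; it assumes Spark 1.x, an existing JavaSparkContext sc, and a hypothetical input file "events.log".

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

JavaSparkContext sc = ...; // An existing JavaSparkContext

JavaRDD<String> lines = sc.textFile("events.log");                    // hypothetical input file
JavaPairRDD<String, Integer> counts = lines
    .mapToPair(line -> new Tuple2<>(line.split(" ")[0], 1))           // narrow transformation: pipelined into one stage
    .reduceByKey((a, b) -> a + b);                                     // shuffle: the DAG scheduler starts a new stage here

// toDebugString() prints the lineage; the ShuffledRDD level marks the stage boundary.
System.out.println(counts.toDebugString());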

DAG Example (diagram)

Hadoop MapReduce vs. Tez vs. Spark
License: MapReduce - Open Source, Apache 2.0, version 2.x; Tez - Open Source, Apache 2.0, version 0.x; Spark - Open Source, Apache 2.0, version 1.x
Processing model: MapReduce - on-disk (disk-based parallelization), batch; Tez - on-disk, batch, interactive; Spark - on-disk or in-memory, batch, interactive, streaming (near real-time)
Language written in: MapReduce - Java; Tez - Java; Spark - Scala
API: MapReduce - [Java, Python, Scala], user-facing; Tez - [Java, ISV/Engine/Tool builder]; Spark - [Scala, Java, Python], user-facing
Libraries: MapReduce - none, separate tools; Tez - none; Spark - [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
Installation: MapReduce - bound to Hadoop; Tez - bound to Hadoop; Spark - isn't bound to Hadoop
Ease of use: MapReduce - difficult to program, needs abstractions, no interactive mode except Hive; Tez - difficult to program; Spark - easy to program, no need for abstractions, interactive mode
Compatibility to data types and data sources: the same for all three
YARN integration: MapReduce - YARN application; Tez - ground-up YARN application; Spark - moving towards YARN

Conclusion
Why did we need Spark after Hadoop?
» It handles batch, interactive, and real-time processing within a single framework
» Easier to code: programming at a higher level of abstraction
» More general: map/reduce is just one set of the supported constructs
Spark's important data structures and I/O file formats: DataFrames, Parquet files
Performance tuning of Spark: change the default configurations in Spark's default config file
Computational model of Spark: Hadoop for very big datasets, Spark for when the data fits in memory

Spark User Community
1000+ meetup members
80+ contributors
24 companies contributing