Functional Query Optimization with Spark SQL


Michael Armbrust, @michaelarmbrust
spark.apache.org

What is Apache Spark?
Fast and general cluster computing system, interoperable with Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Up to 100× faster (2-10× on disk)
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
2-5× less code

A General Stack
Libraries layered on Spark:
» Spark SQL
» Spark Streaming (real-time)
» GraphX (graph)
» MLlib (machine learning)

Spark Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs):
» Collections of objects that can be stored in memory or on disk across a cluster
» Parallel functional transformations (map, filter, …)
» Automatically rebuilt on failure

More than Map/Reduce
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:
    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_ startsWith "ERROR")
    val messages = errors.map(_.split("\t")(2))
    messages.cache()
    messages.filter(_ contains "foo").count()
    messages.filter(_ contains "bar").count()
    ...
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
[Diagram: the driver sends tasks to three workers and collects results; each worker reads one block of lines (Block 1-3) and caches its share of messages (Cache 1-3). Annotations on the code: Base RDD, Transformed RDD, Action.]

Fault Tolerance
RDDs track lineage info to rebuild lost data:
    file.map(record => (record.tpe, 1))
        .reduceByKey(_ + _)
        .filter { case (_, count) => count > 10 }
[Diagram: Input file → map → reduce → filter]
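To make the lineage idea concrete, here is a minimal Spark-free sketch. It assumes a toy Dataset type that records how it was derived, plus a Cached wrapper whose evict() stands in for losing an in-memory copy. All names (LineageSketch, Source, Mapped, Filtered, Cached) are hypothetical and are not Spark's classes.

```scala
// Toy model of lineage-based recovery; not Spark's implementation.
object LineageSketch {
  // A dataset is described by how to compute it, not by its contents.
  sealed trait Dataset[A] {
    def compute(): Vector[A]
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
  }
  final case class Source[A](load: () => Vector[A]) extends Dataset[A] {
    def compute(): Vector[A] = load()
  }
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Vector[B] = parent.compute().map(f)
  }
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): Vector[A] = parent.compute().filter(p)
  }
  // Keeps results in memory; evict() simulates losing them on a failed node.
  final class Cached[A](parent: Dataset[A]) extends Dataset[A] {
    private var data: Option[Vector[A]] = None
    def compute(): Vector[A] = data.getOrElse {
      val rebuilt = parent.compute() // replay the recorded lineage
      data = Some(rebuilt)
      rebuilt
    }
    def evict(): Unit = { data = None }
  }

  var loads = 0 // counts how often the input "file" is read
  val file: Dataset[String] =
    Source[String](() => { loads += 1; Vector("a", "b", "a", "c", "a") })
  val kept = new Cached(file.map(_.toUpperCase).filter(_ != "C"))
}
```

Calling kept.compute() twice reads the input once; after kept.evict(), the next compute() replays map and filter from the source instead of restoring a replica.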

Spark and Scala
Scala provides:
» Concise Serializable* Functions
» Easy interoperability with the Hadoop ecosystem
» Interactive REPL
* Made even better by Spores (5pm today)
[Chart: lines of code by language: Scala, Python, Java, Shell, Other]

Reduced Developer Complexity
[Bar chart: non-test, non-example source lines (axis 0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark. Successive builds of the slide add the labels Streaming, SparkSQL, and GraphX.]

Spark Community
One of the largest open source projects in big data
» 150+ developers contributing
» 30+ companies contributing
[Bar chart: contributors in past year (axis 0 to 150); recoverable labels: Giraph, Storm, Tez]

Community Growth
Spark 0.6 (Oct '12): 17 contributors
Spark 0.7 (Feb '13): 31 contributors
Spark 0.8 (Sept '13): 67 contributors
Spark 0.9 (Feb '14): 83 contributors
Spark 1.0 (May '14): 110 contributors

With great power…
Strict project coding guidelines to make it easier for non-Scala users and contributors:
» Absolute imports only
» Minimize infix function use
» Java/Python friendly wrappers for user APIs

Spark SQL

Relationship to Shark
Shark modified the Hive backend to run over Spark, but had two challenges:
» Limited integration with Spark programs
» Hive optimizer not designed for Spark
Spark SQL reuses the best parts of Shark:
Borrows:
» Hive data loading
» In-memory column store
Adds:
» RDD-aware optimizer
» Rich language interfaces

Spark SQL Components
[Pie chart with three slices: 38%, 36%, 26%]
Catalyst Optimizer:
» Relational algebra + expressions
» Query optimization
Spark SQL Core:
» Execution of queries as RDDs
» Reading in Parquet, JSON
Hive Support:
» HQL, MetaStore, SerDes, UDFs

Adding Schema to RDDs
Spark + RDDs: functional transformations on partitioned collections of opaque objects.
SQL + SchemaRDDs: declarative transformations on partitioned collections of tuples.
[Diagram: partitions of opaque User objects next to partitions of rows with columns Name, Age, Height]

Using Spark SQL
SQLContext
» Entry point for all SQL functionality
» Wraps/extends existing Spark context
    val sc: SparkContext // An existing SparkContext.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // Importing the SQL context gives access to all the SQL functions and conversions.
    import sqlContext._

Example Dataset
A text file filled with people's names and ages:
    Michael, 30
    Andy, 31
    Justin Bieber, 19

Turning an RDD into a Relation
    // Define the schema using a case class.
    case class Person(name: String, age: Int)
    // Create an RDD of Person objects and register it as a table.
    val people =
      sc.textFile("examples/src/main/resources/people.txt")
        .map(_.split(","))
        .map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")

Querying Using SQL
    // SQL statements are run with the sql method from sqlContext.
    val teenagers = sql("""
      SELECT name FROM people WHERE age >= 13 AND age <= 19""")
    // The results of SQL queries are SchemaRDDs but also
    // support normal RDD operations.
    // The columns of a row in the result are accessed by ordinal.
    val nameList = teenagers.map(t => "Name: " + t(0)).collect()

Querying Using the Scala DSL
Express queries using functions, instead of SQL strings.
    // The following is the same as:
    // SELECT name FROM people
    // WHERE age >= 10 AND age <= 19
    val teenagers =
      people
        .where('age >= 10)
        .where('age <= 19)
        .select('name)

Caching Tables In-Memory
Spark SQL can cache tables using an in-memory columnar format:
» Scan only required columns
» Fewer allocated objects (less GC)
» Automatically selects best compression
    cacheTable("people")
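As a rough illustration of why a columnar layout helps, the sketch below stores the example people as one array per column and answers the teenager query by scanning only the age column. It also shows run-length encoding as one simple compression scheme a column store could choose. This is a plain-Scala toy under assumed names (ColumnarSketch, PeopleColumns, runLength); it says nothing about how Spark SQL lays out or compresses its cache.

```scala
// Toy row layout vs. columnar layout; not Spark SQL's cache format.
object ColumnarSketch {
  final case class Person(name: String, age: Int)

  // Row layout: one object per record.
  val rows: Vector[Person] =
    Vector(Person("Michael", 30), Person("Andy", 31), Person("Justin Bieber", 19))

  // Columnar layout: one array per column; ages are stored as unboxed Ints.
  final case class PeopleColumns(names: Array[String], ages: Array[Int])

  def toColumns(people: Seq[Person]): PeopleColumns =
    PeopleColumns(people.map(_.name).toArray, people.map(_.age).toArray)

  // SELECT name FROM people WHERE age >= 13 AND age <= 19
  // The predicate reads only the ages array; names are touched only for matches.
  def teenagerNames(cols: PeopleColumns): Vector[String] =
    cols.ages.indices.collect {
      case i if cols.ages(i) >= 13 && cols.ages(i) <= 19 => cols.names(i)
    }.toVector

  // Run-length encoding: store each run of equal values as (value, count).
  def runLength(values: Seq[Int]): Vector[(Int, Int)] =
    values.foldLeft(Vector.empty[(Int, Int)]) { (acc, x) =>
      acc.lastOption match {
        case Some((v, n)) if v == x => acc.init :+ ((v, n + 1))
        case _                      => acc :+ ((x, 1))
      }
    }
}
```

Run-length encoding pays off on sorted or low-cardinality columns; a real column store would pick among several encodings per column.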

Parquet Compatibility
Native support for reading data in Parquet:
» Columnar storage avoids reading unneeded data.
» RDDs can be written to Parquet files, preserving the schema.

Using Parquet
    // Any SchemaRDD can be stored as Parquet.
    people.saveAsParquetFile("people.parquet")
    // Parquet files are self-describing so the schema is preserved.
    val parquetFile = sqlContext.parquetFile("people.parquet")
    // Parquet files can also be registered as tables and then used
    // in SQL statements.
    parquetFile.registerAsTable("parquetFile")
    val teenagers = sql(
      "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")

Hive Compatibility
Interfaces to access data and code in the Hive ecosystem:
o Support for writing queries in HQL
o Catalog info from Hive MetaStore
o TableScan operator that uses Hive SerDes
o Wrappers for Hive UDFs, UDAFs, UDTFs

Reading Data Stored In Hive
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext._
    hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hql("LOAD DATA LOCAL INPATH '.../kv1.txt' INTO TABLE src")
    // Queries can be expressed in HiveQL.
    hql("FROM src SELECT key, value")

SQL and Machine Learning
    val trainingDataTable = sql("""
      SELECT e.action, u.age, u.latitude, u.longitude
      FROM Users u
      JOIN Events e ON u.userid = e.userid""")
    // SQL results are RDDs, so they can be used directly in MLlib.
    val trainingData = trainingDataTable.map { row =>
      val features = Array[Double](row(1), row(2), row(3))
      LabeledPoint(row(0), features)
    }
    val model = new LogisticRegressionWithSGD().run(trainingData)

Supports Java Too!
    public class Person implements Serializable {
      private String _name;
      private int _age;
      public String getName() { return _name; }
      public void setName(String name) { _name = name; }
      public int getAge() { return _age; }
      public void setAge(int age) { _age = age; }
    }
    // sc is an existing JavaSparkContext.
    JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(sc);
    JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
      new Function<String, Person>() {
        public Person call(String line) throws Exception {
          String[] parts = line.split(",");
          Person person = new Person();
          person.setName(parts[0]);
          person.setAge(Integer.parseInt(parts[1].trim()));
          return person;
        }
      });
    JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);

Supports Python Too!
    from pyspark.context import SQLContext
    sqlCtx = SQLContext(sc)
    lines = sc.textFile("examples/src/main/resources/people.txt")
    parts = lines.map(lambda l: l.split(","))
    people = parts.map(lambda p: {"name": p[0], "age": int(p[1])})
    peopleTable = sqlCtx.applySchema(people)
    peopleTable.registerAsTable("people")
    teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenNames = teenagers.map(lambda p: "Name: " + p.name)

Optimizing Queries with Catalyst

What is Query Optimization?
SQL is a declarative language: queries express what data to retrieve, not how to retrieve it.
The database must pick the best execution strategy through a process known as optimization.

Naïve Query Planning
    SELECT name
    FROM (
      SELECT id, name
      FROM People) p
    WHERE p.id = 1
Logical Plan (top to bottom): Project name → Filter id = 1 → Project id,name → People
Physical Plan (top to bottom): Project name → Filter id = 1 → Project id,name → TableScan People

Optimized Execution
Logical Plan (top to bottom): Project name → Filter id = 1 → Project id,name → People
Physical Plan: IndexLookup id = 1, return: name
Writing imperative code to optimize all possible patterns is hard.
Instead write simple rules:
» Each rule makes one change
» Run many rules together to fixed point.

Optimizing with Rules
Each plan is listed top to bottom:
Original Plan: Project name → Filter id = 1 → Project id,name → People
Filter Push-Down: Project name → Project id,name → Filter id = 1 → People
Combine Projection: Project name → Filter id = 1 → People
Physical Plan: IndexLookup id = 1, return: name

Prior Work: Optimizer Generators
Volcano / Cascades:
» Create a custom language for expressing rules that rewrite trees of relational operators.
» Build a compiler that generates executable code for these rules.
Cons: Developers need to learn this custom language. Language might not be powerful enough.

TreeNode Library
Easily transformable trees of operators
» Standard collection functionality: foreach, map, collect, etc.
» transform function: recursive modification of tree fragments that match a pattern.
» Debugging support: pretty printing, splicing, etc.

Tree Transformations
Developers express tree transformations as PartialFunction[TreeType, TreeType]
1. If the function does apply to an operator, that operator is replaced with the result.
2. When the function does not apply to an operator, that operator is left unchanged.
3. The transformation is applied recursively to all children.
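The three rules above can be sketched in plain Scala on a small expression tree. The types below (Expr, Literal, Attribute, Add), the withChildren helper, and the constant-folding rule are illustrative assumptions rather than Catalyst's classes, and a pre-order traversal (parent before children) is assumed.

```scala
// Toy tree with a PartialFunction-based transform; not Catalyst's TreeNode.
object TransformSketch {
  sealed trait Expr {
    def children: Seq[Expr]
    def withChildren(newChildren: Seq[Expr]): Expr

    def transform(rule: PartialFunction[Expr, Expr]): Expr = {
      // Rules 1 and 2: use the rule's result if it applies, else keep the node.
      val afterRule = rule.applyOrElse(this, (e: Expr) => e)
      // Rule 3: recurse into the children of whatever node we now have.
      afterRule.withChildren(afterRule.children.map(_.transform(rule)))
    }
  }
  final case class Literal(value: Int) extends Expr {
    def children: Seq[Expr] = Nil
    def withChildren(newChildren: Seq[Expr]): Expr = this
  }
  final case class Attribute(name: String) extends Expr {
    def children: Seq[Expr] = Nil
    def withChildren(newChildren: Seq[Expr]): Expr = this
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
    def withChildren(newChildren: Seq[Expr]): Expr = Add(newChildren(0), newChildren(1))
  }

  // Example rule: add two literals at planning time.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
  }

  // a + (1 + 2) becomes a + 3
  val before: Expr = Add(Attribute("a"), Add(Literal(1), Literal(2)))
  val after: Expr = before.transform(foldConstants)
}
```

A single pre-order pass does not catch patterns that only appear after a child has been rewritten, for example (1 + 2) + 3, which is one reason rules are run repeatedly until the tree stops changing.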

Writing Rules as Tree Transformations
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.
Original Plan (top to bottom): Project name → Filter id = 1 → Project id,name → People
Filter Push-Down (top to bottom): Project name → Project id,name → Filter id = 1 → People

Filter Push Down Transformation
    val newPlan = queryPlan transform {
      case f @ Filter(_, p @ Project(_, grandChild))
          if (f.references subsetOf grandChild.output) =>
        p.copy(child = f.copy(child = grandChild))
    }
Successive builds of the slide annotate the parts of this code:
» Tree: queryPlan
» Partial Function: the { case ... } block passed to transform
» Find Filter on Project: the pattern f @ Filter(_, p @ Project(_, grandChild))
» Check that the filter can be evaluated without the result of the project: the guard f.references subsetOf grandChild.output
» If so, switch the order: p.copy(child = f.copy(child = grandChild))
Scala features used:
» Pattern Matching
» Collections library (subsetOf)
» Copy Constructors (copy)
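A Spark-free version of this rule can be run against the example plan from the earlier slides. The sketch below is a toy under stated assumptions: Plan, Relation, Project, Filter, and toFixedPoint are hypothetical stand-ins, the filter condition is simplified to "column = value", and the People relation is given an extra age column for illustration. It adds a projection-combining rule and applies both rules repeatedly until the plan stops changing.

```scala
// Toy logical plans with two rewrite rules run to a fixed point; not Catalyst.
object PushDownSketch {
  sealed trait Plan {
    def output: Set[String] // column names this operator produces
    def children: Seq[Plan]
    def withChildren(newChildren: Seq[Plan]): Plan
    def transform(rule: PartialFunction[Plan, Plan]): Plan = {
      val applied = rule.applyOrElse(this, (p: Plan) => p)
      applied.withChildren(applied.children.map(_.transform(rule)))
    }
  }
  final case class Relation(name: String, output: Set[String]) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }
  final case class Project(columns: Set[String], child: Plan) extends Plan {
    def output: Set[String] = columns
    def children: Seq[Plan] = Seq(child)
    def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
  }
  // Simplified condition: column = value.
  final case class Filter(column: String, value: Int, child: Plan) extends Plan {
    def references: Set[String] = Set(column)
    def output: Set[String] = child.output
    def children: Seq[Plan] = Seq(child)
    def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
  }

  // Same shape as the rule on the slide.
  val pushFilterBelowProject: PartialFunction[Plan, Plan] = {
    case f @ Filter(_, _, p @ Project(_, grandChild))
        if f.references.subsetOf(grandChild.output) =>
      p.copy(child = f.copy(child = grandChild))
  }

  // Two stacked projections collapse into the outer one.
  val combineProjections: PartialFunction[Plan, Plan] = {
    case Project(outer, Project(inner, child)) if outer.subsetOf(inner) =>
      Project(outer, child)
  }

  // Apply every rule in turn; repeat until a pass changes nothing.
  @annotation.tailrec
  def toFixedPoint(plan: Plan, rules: Seq[PartialFunction[Plan, Plan]], maxPasses: Int = 100): Plan = {
    val next = rules.foldLeft(plan)((p, rule) => p.transform(rule))
    if (next == plan || maxPasses <= 1) next else toFixedPoint(next, rules, maxPasses - 1)
  }

  val people: Plan = Relation("People", Set("id", "name", "age"))
  // Project name -> Filter id = 1 -> Project id,name -> People
  val original: Plan = Project(Set("name"), Filter("id", 1, Project(Set("id", "name"), people)))
  val optimized: Plan = toFixedPoint(original, Seq(pushFilterBelowProject, combineProjections))
}
```

Here optimized is Project name → Filter id = 1 → People, matching the Combine Projection stage of the Optimizing with Rules slide. The maxPasses bound is a safety net in case a rule set never converges.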

Efficient Expression Evaluation
Interpreting expressions (e.g., a + b) can be very expensive on the JVM:
» Virtual function calls
» Branches based on expression type
» Object creation due to primitive boxing
» Memory consumption by boxed primitive objects

Interpreting a+b
1. Virtual call to Add.eval()
2. Virtual call to a.eval()
3. Return boxed Int
4. Virtual call to b.eval()
5. Return boxed Int
6. Integer addition
7. Return boxed result
[Diagram: Add node with children Attribute a and Attribute b]

Using Runtime Reflection
    def generateCode(e: Expression): Tree = e match {
      case Attribute(ordinal) =>
        q"inputRow.getInt($ordinal)"
      case Add(left, right) =>
        q"""
          {
            val leftResult = ${generateCode(left)}
            val rightResult = ${generateCode(right)}
            leftResult + rightResult
          }
        """
    }

Executing a + b
    val left: Int = inputRow.getInt(0)
    val right: Int = inputRow.getInt(1)
    val result: Int = left + right
    resultRow.setInt(0, result)
» Fewer function calls
» No boxing of primitives
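The contrast between the two evaluation styles can be tried without quasiquotes. The sketch below swaps in a simpler technique, closure compilation, for the runtime code generation shown above: the expression tree is walked once and turned into an evaluator that returns a primitive Int. The names EvalSketch, Row, and CompiledInt are assumptions made for this example.

```scala
// Interpreted evaluation vs. a pre-built evaluator; closure compilation is
// used here as a stand-in for real code generation.
object EvalSketch {
  type Row = Array[Int]

  sealed trait Expression {
    // Interpreted path: one virtual call per node, result boxed as Any.
    def eval(row: Row): Any
  }
  final case class Attribute(ordinal: Int) extends Expression {
    def eval(row: Row): Any = row(ordinal)
  }
  final case class Add(left: Expression, right: Expression) extends Expression {
    def eval(row: Row): Any =
      left.eval(row).asInstanceOf[Int] + right.eval(row).asInstanceOf[Int]
  }

  // Evaluator with a primitive Int result, so intermediate values are not boxed.
  trait CompiledInt {
    def apply(row: Row): Int
  }

  // Walk the tree once, up front, instead of once per row.
  def compile(e: Expression): CompiledInt = e match {
    case Attribute(ordinal) =>
      new CompiledInt { def apply(row: Row): Int = row(ordinal) }
    case Add(left, right) =>
      val l = compile(left)
      val r = compile(right)
      new CompiledInt { def apply(row: Row): Int = l(row) + r(row) }
  }

  val aPlusB: Expression = Add(Attribute(0), Attribute(1))
  val compiled: CompiledInt = compile(aPlusB)
}
```

This removes the boxing and the per-row dispatch on expression type, but it still makes one call per node; generating straight-line code, as on the slide, removes those calls as well.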

Performance Comparison
[Bar chart: Evaluating 'a+a+a' One Billion Times. Y axis: milliseconds (0 to 40). Bars: Interpreted Evaluation, Hand-written Code, Generated with Scala Reflection.]

TPC-DS Results
[Bar chart: seconds (axis 0 to 400) for Query 19, Query 53, Query 34, and Query 59, comparing Shark 0.9.2 with SparkSQL + codegen.]

Code Generation Made Simple
Other systems (Impala, Drill, etc.) do code generation.
Scala Reflection + Quasiquotes made our implementation an experiment done over a few weekends instead of a major system overhaul.
Initial version: ~1000 LOC

Future Work: Typesafe Results
Currently:
    people.registerAsTable("people")
    val results = sql("SELECT name FROM people")
    results.map(r => r.getString(0))
What we want:
    val results = sql"SELECT name FROM $people"
    results.map(_.name)
Joint work with: Heather Miller, Vojin Jovanovic, Hubert Plociniczak

Get Started
Visit spark.apache.org for videos & tutorials
Download Spark bundle for CDH
Easy to run on just your laptop
Free training talks and hands-on exercises: spark-summit.org

Conclusion
Big data analytics is evolving to include:
» More complex analytics (e.g. machine learning)
» More interactive ad-hoc queries, including SQL
» More real-time stream processing
Spark is a fast platform that unifies these apps
Join us at Spark Summit 2014! June 30-July 2, San Francisco