Unified Data Access with Spark SQL. Michael Armbrust Spark Summit 2014 @michaelarmbrust




Spark SQL Components

Share of the code base, by component:

- Catalyst Optimizer (38%): relational algebra + expressions; query optimization
- Spark SQL Core (36%): execution of queries as RDDs; reading in Parquet, JSON
- Hive Support (26%): HQL, MetaStore, SerDes, UDFs

Relationship to Shark

Shark modified the Hive backend to run over Spark, but had two challenges:
» Limited integration with Spark programs
» Hive optimizer not designed for Spark

Spark SQL reuses the best parts of Shark:
» Borrows: Hive data loading, in-memory column store
» Adds: RDD-aware optimizer, rich language interfaces

Migration from Shark

Ending active development of Shark. Path forward for current users: Spark SQL to support CLI and JDBC/ODBC.
- Preview release compatible with 1.0
- Full version to be included in 1.1

https://github.com/apache/spark/tree/branch-1.0-jdbc

Migration from Shark

To start the JDBC server, run the following in the Spark directory:

./sbin/start-thriftserver.sh

The default port the server listens on is 10000. Now you can use beeline to test the Thrift JDBC server:

./bin/beeline

Connect to the JDBC server in beeline with:

beeline> !connect jdbc:hive2://localhost:10000

*Requires: https://github.com/apache/spark/tree/branch-1.0-jdbc

Adding Schema to RDDs

Spark + RDDs: functional transformations on partitioned collections of opaque objects.

SQL + SchemaRDDs: declarative transformations on partitioned collections of tuples.

[Diagram: opaque User objects on the RDD side vs. (Name, Age, Height) tuples on the SchemaRDD side]

Unified Data Abstraction

[Diagram: SchemaRDD as a single abstraction unifying Hive QL, SQL-92, Parquet, and JSON data sources]

Image credit: http://barrymieny.deviantart.com/

Using Spark SQL

SQLContext: entry point for all SQL functionality. Wraps/extends the existing Spark context.

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

Example Dataset

A text file filled with people's names and ages:

Michael, 30
Andy, 31
Justin Bieber, 19
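To follow along locally, a minimal sketch that writes these records to a file; the later examples load it from examples/src/main/resources/people.txt inside a Spark checkout, so adjust the path to match your setup:

# Write the three sample records to a local people.txt.
# (Path is an assumption; point the later examples at the same file.)
with open("people.txt", "w") as f:
    f.write("Michael, 30\nAndy, 31\nJustin Bieber, 19\n")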

RDDs into Relations (Python)

# Load a text file and convert each line to a dictionary.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: {"name": p[0], "age": int(p[1])})

# Infer the schema, and register the SchemaRDD as a table.
peopleTable = sqlCtx.inferSchema(people)
peopleTable.registerAsTable("people")

RDDs into Relations (Scala)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

RDDs into Relations (Java)

public class Person implements Serializable {
  private String _name;
  private int _age;
  public String getName() { return _name; }
  public void setName(String name) { _name = name; }
  public int getAge() { return _age; }
  public void setAge(int age) { _age = age; }
}

JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(sc);
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
  new Function<String, Person>() {
    public Person call(String line) throws Exception {
      String[] parts = line.split(",");
      Person person = new Person();
      person.setName(parts[0]);
      person.setAge(Integer.parseInt(parts[1].trim()));
      return person;
    }
  });
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);

Querying Using SQL

# SQL can be run over SchemaRDDs that have been registered
# as a table.
teenagers = sqlCtx.sql("""
  SELECT name FROM people WHERE age >= 13 AND age <= 19""")

# The results of SQL queries are RDDs and support all the normal
# RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
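Because the result is an ordinary RDD, the usual actions apply. A short sketch printing the results, assuming the people table registered above:

# collect() brings the transformed names back to the driver.
for name in teenNames.collect():
    print(name)  # prints "Name: Justin Bieber"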

Caching Tables In-Memory

Spark SQL can cache tables using an in-memory columnar format:
- Scan only required columns
- Fewer allocated objects (less GC)
- Automatically selects best compression

cacheTable("people")
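A minimal sketch of the call in context: in PySpark the method lives on the SQLContext, and uncacheTable releases the memory (assumes the people table from the earlier slides):

# Cache the registered "people" table in the in-memory columnar format.
sqlCtx.cacheTable("people")

# Subsequent queries against the table read from the columnar cache.
teenagers = sqlCtx.sql("SELECT name FROM people WHERE age <= 19")

# Release the cached columns when the table is no longer needed.
sqlCtx.uncacheTable("people")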

Language Integrated UDFs

import re

registerFunction("countMatches",
  lambda pattern, text: re.subn(pattern, '', text)[1])

sql("SELECT countMatches('a', text) ...")
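A runnable sketch of the same idea as it landed in the 1.1 Python API. The docs table and its text column are hypothetical; registerFunction takes an optional return type (defaulting to StringType):

import re
from pyspark.sql import SQLContext, IntegerType

sqlCtx = SQLContext(sc)  # assumes an existing SparkContext sc

# Register a Python lambda as a SQL function; declare IntegerType
# since the UDF returns a count rather than a string.
sqlCtx.registerFunction("countMatches",
    lambda pattern, text: re.subn(pattern, '', text)[1],
    IntegerType())

# Hypothetical "docs" table with a "text" column, registered beforehand.
counts = sqlCtx.sql("SELECT countMatches('a', text) FROM docs")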

SQL and Machine Learning

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

training_data_table = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

def featurize(u):
    return LabeledPoint(u.action, [u.age, u.latitude, u.longitude])

# SQL results are RDDs so can be used directly in MLlib.
training_data = training_data_table.map(featurize)
model = LogisticRegressionWithSGD.train(training_data)

Hive Compatibility

Interfaces to access data and code in the Hive ecosystem:
o Support for writing queries in HQL
o Catalog info from Hive MetaStore
o Tablescan operator that uses Hive SerDes
o Wrappers for Hive UDFs, UDAFs, UDTFs

Reading Data Stored in Hive

from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)

hiveCtx.hql("""
  CREATE TABLE IF NOT EXISTS src (key INT, value STRING)""")

hiveCtx.hql("""
  LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt'
  INTO TABLE src""")

# Queries can be expressed in HiveQL.
results = hiveCtx.hql("FROM src SELECT key, value").collect()
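The collected results are Row objects with named fields, so they behave like ordinary local data. A short sketch, assuming the src table above:

# Rows expose their columns as attributes.
for row in results[:5]:
    print("%s: %s" % (row.key, row.value))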

Parquet Compatibility

Native support for reading data in Parquet:
- Columnar storage avoids reading unneeded data.
- RDDs can be written to Parquet files, preserving the schema.

Using Parquet

# SchemaRDDs can be saved as Parquet files, maintaining the
# schema information.
peopleTable.saveAsParquetFile("people.parquet")

# Read in the Parquet file created above. Parquet files are
# self-describing so the schema is preserved. The result of
# loading a Parquet file is also a SchemaRDD.
parquetFile = sqlCtx.parquetFile("people.parquet")

# Parquet files can be registered as tables used in SQL.
parquetFile.registerAsTable("parquetFile")
teenagers = sqlCtx.sql("""
  SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19""")

Features Slated for 1.1

- Code generation
- Language integrated UDFs
- Auto-selection of broadcast (map-side) join
- JSON and nested Parquet support
- Many other performance / stability improvements
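For the JSON support, the 1.1 Python API ended up looking like the sketch below; jsonFile infers a schema from the documents, and the path and field names here are illustrative:

# Load a file of JSON documents; the schema is inferred automatically,
# producing a SchemaRDD like any other source.
people = sqlCtx.jsonFile("examples/src/main/resources/people.json")
people.registerAsTable("jsonPeople")
teenagers = sqlCtx.sql(
    "SELECT name FROM jsonPeople WHERE age >= 13 AND age <= 19")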

Preview: TPC-DS Results

[Bar chart: runtime in seconds (0-400) for Query 19, Query 53, Query 34, and Query 59, comparing Shark 0.9.2 against Spark SQL + codegen]