Ibis: Scaling Python Analytics on Hadoop and Impala

Wes McKinney, Budapest BI Forum, 2015-10-14
@wesmckinn

Me
  R&D at Cloudera
  Serial creator of structured data tools / user interfaces
  Mathematician, MIT '07
  Professional SQL programmer 2007-2010 (@ AQR)
  Created pandas (Python library) in 2008
  Wrote bestseller Python for Data Analysis, 2012
  Founder of DataPad

Python is popular
  Python has become a standard language of data science
  Why is it popular?
    Maximizes productivity for data engineers and data scientists
    Build robust software and do interactive data analysis with 100% Python code
    Easy to learn, and makes for happy and productive data teams
    Large, diverse open source development community
    Comprehensive libraries: data wrangling, ML, visualization, etc.
  Main use case: data science & engineering swiss army knife on small-to-medium size data

but Python does not scale today
  Python ecosystem confined to single-node analysis
    Great for smaller data sets
    Requires sampling or aggregations for larger data
  Distributed tools compromise in various ways
  Extracting samples or aggregations for larger data means:
    Scales by losing more fidelity
    Additional ETL overhead to extract samples/aggregations
    Loss of productivity with multiple languages, tools, etc.
    Blocks certain analysis and use cases

Some simplistic generalizations
  Industry Analytics
    Heterogeneous data
    Flat tables and JSON
    Spark / MapReduce
    SQL
    DFS-friendly / streaming data formats
    More physical machines
    Python: light investment, generally
  Scientific Computing
    Homogeneous data
    Multidimensional arrays
    HPC tools
    Linear algebra
    Scientific data formats (e.g. HDF5)
    Fewer physical machines
    Python: heavy investment, generally

pandas
  Hugely popular Python table / data frame library
  Labeled table, array, and time series data structures
  Popular for data preparation, ETL, and in-memory analytics
  Built using Python's scientific computing stack
  User API / domain specific language
  Bespoke in-memory analytics / relational algebra engine
  IO interfaces (CSV, SQL, etc.)
  Expanded data type system (beyond NumPy)
  Supports flat data only (or semi-structured data that can be flattened)

Many SQL engines and more

The Great Decoupling for Big Data
  UI: Ibis, SQL, Spark API
  Storage: HDFS, Kudu, HBase
  Compute: Analytic SQL, Spark, MapReduce

A sample big data architecture
  [Diagram: application data, Kafka, JSON on HDFS, Spark/MapReduce, columnar storage, analytic SQL engine, user SQL]

Nested / Complex types support
  Arrays, structs, maps, and unions as first-class value types
  Analyze JSON-like data directly without flattening or normalization
  Most new SQL engines have some level of support
    Impala
    Presto
    Drill
    BigQuery
    Spark SQL
    Hive

Ibis in a nutshell
  For Python programmers doing analytics in industry
  Project blog: http://blog.ibis-project.org
  Joint project with Impala team @ Cloudera
  Apache-licensed, open source: http://github.com/cloudera/ibis
  Crafting a compelling Python-on-Hadoop user experience
  Remove SQL coding from user workflows
  Develop high performance Python extension APIs

Ibis in a nutshell, cont'd
  Composable Python DSL ("Ibis expressions") makes hand-coding SQL SELECT statements unnecessary
  Ibis for SQL Programmers: http://docs.ibis-project.org/sql.html
  Development roadmap targets the Impala (C++ / LLVM) query engine, but the SQL compiler toolchain is general purpose
  Currently supports Impala and SQLite, with other dialects to follow soon
  We welcome external contributors for other analytic SQL engines
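To make the "composable DSL compiles to SQL" idea concrete, here is a minimal sketch of how such a layer might work. It is a toy written for this transcript, not the Ibis API: the `Expr` and `Table` classes, their methods, and the `events` table are all hypothetical. Each method returns a new expression, so intermediate results can be reused and extended before anything is compiled.

```python
class Expr:
    """Hypothetical scalar expression that remembers its SQL text."""

    def __init__(self, sql):
        self.sql = sql

    def _binop(self, op, other):
        # Python literals are rendered with repr(), which is only adequate
        # for simple numbers and quote-free strings in this toy.
        rhs = other.sql if isinstance(other, Expr) else repr(other)
        return Expr(f"({self.sql} {op} {rhs})")

    def __eq__(self, other):
        return self._binop("=", other)

    def __gt__(self, other):
        return self._binop(">", other)


class Table:
    """Hypothetical immutable table expression (not the Ibis API)."""

    def __init__(self, name, predicates=(), columns=()):
        self.name = name
        self.predicates = tuple(predicates)
        self.columns = tuple(columns)

    def __getitem__(self, column):
        return Expr(column)

    def filter(self, predicate):
        return Table(self.name, self.predicates + (predicate,), self.columns)

    def select(self, *columns):
        return Table(self.name, self.predicates, columns)

    def compile(self):
        cols = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {cols} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(p.sql for p in self.predicates)
        return sql


t = Table("events")
recent = t.filter(t["year"] == 2015)            # reusable building block
expr = recent.filter(t["country"] == "HU").select("user_id", "amount")
print(expr.compile())
```

The user composes Python objects and never writes a SELECT statement by hand; the SQL string only appears when `compile()` is called, which is where a real toolchain would target a specific dialect.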

Benefits of Ibis
  Maximize developer productivity
  Mirrors single-node Python experience
  Solve big data problems without leaving Python
  Leverage Python skills, ecosystem, and tools
  Python as first-class language for Hadoop
  Full-fidelity analysis without extractions
  Python analysis at any scale
  Native hardware speeds for a broad set of use cases

Brief interactive demo

Ibis/Impala Joint Roadmap
  More natural data modeling
  Complex types support
  Integration with full Python data ecosystem
  Advanced analytics + machine learning
  Enable use of performance computing tools
  User extensibility with native performance
  In-memory columnar format
  Python-to-LLVM IR compilation
  Workflow and usability tools

Executing data science languages in the compute layer
  UI: Ibis, SQL, Spark API, Python, R, Julia, ?
  Storage: HDFS, Kudu, HBase
  Compute: Analytic SQL, Spark, MapReduce

Enabling interoperability with big data systems
  Distributed / MPP query engines: implemented in a host language
    Typically C/C++ or Java/Scala
  User-defined functions (UDFs) through various means
    Implement in host language
    Implement in user language through some external language protocol (often RPC-based)
  External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)

What are UDFs good for?
  Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
  Custom data transformations
  Custom domain logic (date / time / data types)
  Custom data types
  Custom aggregations (incl. machine learning / statistics expressible as reductions)

Why are external UDFs slow?
  Serialization / deserialization overhead
  Scalar vs vectorized computations
  RPC overhead

Example: Vectorization for interpreted languages
  SUM(CASE WHEN x > y THEN x ELSE x + y END)
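The difference between a scalar and a vectorized UDF interface for this expression can be sketched as follows. This is an illustration only: the function names and the batch size of 1024 are assumptions, and the batch body is plain Python here, whereas a real engine would run it in native or compiled code. What the sketch does show is how many times the engine has to cross into the interpreted function.

```python
def case_scalar(x, y):
    """Scalar UDF: the engine calls this once per row."""
    return x if x > y else x + y


def case_batch(xs, ys):
    """Vectorized UDF: the engine calls this once per batch of rows."""
    return sum(x if x > y else x + y for x, y in zip(xs, ys))


def run_scalar(xs, ys):
    total, calls = 0, 0
    for x, y in zip(xs, ys):
        total += case_scalar(x, y)
        calls += 1
    return total, calls


def run_vectorized(xs, ys, batch_size=1024):
    total, calls = 0, 0
    for start in range(0, len(xs), batch_size):
        stop = start + batch_size
        total += case_batch(xs[start:stop], ys[start:stop])
        calls += 1
    return total, calls


xs = list(range(10_000))
ys = [5_000] * 10_000

scalar_total, scalar_calls = run_scalar(xs, ys)
vector_total, vector_calls = run_vectorized(xs, ys)
print(scalar_calls, vector_calls)
```

Both paths compute the same sum, but the scalar interface pays the per-call overhead for every row while the vectorized one pays it once per batch.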

Vectorized vs Interpreted perf

How to make them fast?
  Common runtime memory representation for tabular data
  Shared-memory (zero-copy or memcpy-only) external UDF protocol
  Vectorized UDF interface (for interpreted languages)
  Impala is uniquely positioned to play well with Ibis
    Best-in-class performance and scalability
    C++ and LLVM-based (JIT compiler) runtime
  Unified, efficient data interchange amongst Ibis, Impala, and Kudu will enable high performance real-time analytics from Python
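The contrast between a serializing protocol and a zero-copy one can be illustrated inside a single Python process, using only the standard library. This is a sketch under that simplification: `pickle` stands in for whatever wire format an RPC-based UDF protocol might use, and `memoryview` stands in for a shared-memory buffer; neither is claimed to be what Impala or Ibis uses.

```python
import pickle
from array import array

# A column of doubles held in one contiguous buffer.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Serialization path: the data is encoded, shipped, and decoded into a copy.
payload = pickle.dumps(column)
received = pickle.loads(payload)
received[0] = 99.0  # changes only the copy

# Zero-copy path: the consumer gets a view onto the same memory.
view = memoryview(column)
view[1] = 42.0  # visible through the original buffer

print(column.tolist())
print(view.nbytes, len(payload))
```

The pickled payload has to be produced and parsed on every hand-off, while the view costs nothing to create and shares the producer's bytes directly.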

Memory representation
  Many query engines are standardizing on in-memory columnar rep'n of materialized transient data
    Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
    Apache Drill: https://drill.apache.org/faq/
  Industry-standard serialization format: Apache Parquet, https://parquet.apache.org/

Serialization vs In-memory
  Serialization formats (e.g. Parquet)
    Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
    Do not consider random access or in-memory analytics as a goal
  No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)

Standardized in-memory columnar (IMC)
  Compact in-memory representation for semistructured data
  Part of Impala's upcoming dev roadmap
  Some prior IMC-for-SQL work: Apache Drill
  Standardized memory representation means data can be shared without serialization
  Create a canonical C/C++ implementation for use in Python / R / Julia
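One common way to hold semistructured data compactly in columnar form is an offsets buffer plus a flat values buffer. The slides do not specify a layout, so the sketch below is an assumption chosen for illustration; `to_columnar` and `get_row` are hypothetical helper names.

```python
from array import array

# A nested column: each row is a variable-length list of integers.
rows = [[1, 2], [], [3, 4, 5]]


def to_columnar(rows):
    """Flatten list-valued rows into (offsets, values) buffers."""
    offsets = array("q", [0])   # row i spans values[offsets[i]:offsets[i + 1]]
    values = array("q")
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return offsets, values


def get_row(offsets, values, i):
    """Random access to one row without touching the others."""
    return values[offsets[i]:offsets[i + 1]].tolist()


offsets, values = to_columnar(rows)
print(offsets.tolist(), values.tolist())
print(get_row(offsets, values, 2))
print(sum(values))  # a scan over all nested values reads one contiguous buffer
```

Because both buffers are plain contiguous memory, a layout like this could be handed between runtimes without re-encoding, which is the property the slide is after.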

Ibis's Vision
  Uncompromised Python experience
  100% Python end-to-end user workflows
  Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
  Interactive at big data scale
  Full-fidelity analysis without extractions
  Scalability for big data
  Native hardware speeds for a broad set of use cases

Thank you
  Wes McKinney @wesmckinn
  Views are my own