Ibis: Scaling Python Analytics on Hadoop and Impala

Wes McKinney (@wesmckinn), Budapest BI Forum, 2015-10-14

Me
- R&D at Cloudera
- Serial creator of structured data tools / user interfaces
- Mathematician, MIT '07
- Professional SQL programmer 2007-2010 (at AQR)
- Created pandas (Python library) in 2008
- Wrote bestseller Python for Data Analysis (2012)
- Founder of DataPad

Python is popular
- Python has become a standard language of data science
- Why is it popular?
  - Maximizes productivity for data engineers and data scientists
  - Build robust software and do interactive data analysis with 100% Python code
  - Easy to learn, and makes for happy and productive data teams
  - Large, diverse open source development community
  - Comprehensive libraries: data wrangling, ML, visualization, etc.
- Main use case: data science and engineering Swiss army knife on small-to-medium size data

But Python does not scale today
- Python ecosystem confined to single-node analysis
  - Great for smaller data sets
  - Requires sampling or aggregations for larger data
- Distributed tools compromise in various ways
- Extracting samples or aggregations for larger data means:
  - Scales by losing more fidelity
  - Additional ETL overhead to extract samples/aggregations
  - Loss of productivity with multiple languages, tools, etc.
  - Blocks certain analysis and use cases

Some simplistic generalizations

Industry Analytics
- Heterogeneous data
- Flat tables and JSON
- Spark / MapReduce
- SQL
- DFS-friendly / streaming data formats
- More physical machines
- Python: light investment, generally

Scientific Computing
- Homogeneous data
- Multidimensional arrays
- HPC tools
- Linear algebra
- Scientific data formats (e.g. HDF5)
- Fewer physical machines
- Python: heavy investment, generally

pandas
- Hugely popular Python table / data frame library
- Labeled table, array, and time series data structures
- Popular for data preparation, ETL, and in-memory analytics
- Built using Python's scientific computing stack
- User API / domain-specific language
- Bespoke in-memory analytics / relational algebra engine
- IO interfaces (CSV, SQL, etc.)
- Expanded data type system (beyond NumPy)
- Supports flat data only (or semistructured data that can be flattened)
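
As a rough illustration of the labeled-table, in-memory relational style listed above, here is a minimal sketch. The column names and values are made up for this example and do not come from the talk.

```python
import pandas as pd

# Hypothetical toy data: an in-memory table with labeled columns.
df = pd.DataFrame({
    "city": ["Budapest", "Budapest", "Vienna", "Vienna"],
    "year": [2014, 2015, 2014, 2015],
    "sales": [10.0, 12.5, 7.0, 9.5],
})

# Relational-style aggregation (like GROUP BY city, SUM(sales)) without SQL.
total_by_city = df.groupby("city")["sales"].sum()

# Row selection with a boolean mask built from a labeled column.
recent = df[df["year"] == 2015]
```

Everything here runs on a single node and in memory, which is the limitation the rest of the talk addresses.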

Many SQL engines, and more

The Great Decoupling for Big Data
- UI: Ibis, SQL, Spark API
- Storage: HDFS, Kudu, HBase
- Compute: Analytic SQL, Spark, MapReduce

A sample big data architecture
[Diagram. Components shown: application data, JSON, Kafka, HDFS, Spark/MapReduce, columnar storage, analytic SQL engine, user SQL]

Nested / complex types support
- Arrays, structs, maps, and unions as first-class value types
- Analyze JSON-like data directly without flattening or normalization
- Most new SQL engines have some level of support: Impala, Presto, Drill, BigQuery, Spark SQL, Hive
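
To make the flattening point concrete, the sketch below contrasts the two approaches in plain Python. The `orders` records and their field names are hypothetical; a SQL engine with complex type support would express the nested version in its own syntax.

```python
# Hypothetical JSON-like records: each order carries an array of structs.
orders = [
    {"order_id": 1, "items": [{"sku": "a", "qty": 2, "price": 3.0},
                              {"sku": "b", "qty": 1, "price": 5.0}]},
    {"order_id": 2, "items": [{"sku": "a", "qty": 4, "price": 3.0}]},
]

# Flattened form: one row per (order, item), repeating the parent key.
flat_rows = [
    {"order_id": o["order_id"], **item}
    for o in orders
    for item in o["items"]
]

# Nested form analyzed directly: aggregate over the array inside each record.
order_totals = {
    o["order_id"]: sum(i["qty"] * i["price"] for i in o["items"])
    for o in orders
}
```

The flattened form needs an extra pass and duplicates `order_id` on every row; the nested form keeps one record per order.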

Ibis in a nutshell
- For Python programmers doing analytics in industry
- Project blog: http://blog.ibis-project.org
- Joint project with the Impala team at Cloudera
- Apache-licensed, open source: http://github.com/cloudera/ibis
- Crafting a compelling Python-on-Hadoop user experience
- Remove SQL coding from user workflows
- Develop high-performance Python extension APIs

Ibis in a nutshell, cont'd
- Composable Python DSL ("Ibis expressions") makes hand-coding SQL SELECT statements unnecessary
- Ibis for SQL Programmers: http://docs.ibis-project.org/sql.html
- Development roadmap targets the Impala (C++ / LLVM) query engine, but the SQL compiler toolchain is general purpose
- Currently supports Impala and SQLite, with other dialects to follow soon
- We welcome external contributors for other analytic SQL engines
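
The idea of a composable expression DSL that compiles to SELECT statements can be sketched with a toy example. This is not the Ibis API: the `Table`, `Column`, `Literal`, `BinaryOp`, and `wrap` names below are hypothetical, and the sketch only covers filters and projections.

```python
class Expr:
    """Base class for toy expression nodes (hypothetical, not Ibis)."""

    def __gt__(self, other):
        return BinaryOp(">", self, wrap(other))

    def __add__(self, other):
        return BinaryOp("+", self, wrap(other))


class Column(Expr):
    def __init__(self, name):
        self.name = name

    def to_sql(self):
        return self.name


class Literal(Expr):
    def __init__(self, value):
        self.value = value

    def to_sql(self):
        # Naive rendering; a real compiler would handle quoting and escaping.
        return repr(self.value)


class BinaryOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def to_sql(self):
        return f"({self.left.to_sql()} {self.op} {self.right.to_sql()})"


def wrap(value):
    """Turn plain Python values into Literal nodes."""
    return value if isinstance(value, Expr) else Literal(value)


class Table:
    """Immutable table expression: each method returns a new Table."""

    def __init__(self, name, predicates=(), projections=()):
        self.name = name
        self.predicates = tuple(predicates)
        self.projections = tuple(projections)

    def __getitem__(self, column):
        return Column(column)

    def filter(self, predicate):
        return Table(self.name, self.predicates + (predicate,), self.projections)

    def select(self, *exprs):
        return Table(self.name, self.predicates, exprs)

    def compile(self):
        cols = ", ".join(e.to_sql() for e in self.projections) or "*"
        sql = f"SELECT {cols} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(p.to_sql() for p in self.predicates)
        return sql


t = Table("trips")
query = t.filter(t["distance"] > 10).select(t["fare"] + t["tip"])
sql = query.compile()
# sql == "SELECT (fare + tip) FROM trips WHERE (distance > 10)"
```

Because expressions are ordinary Python objects, they can be built up, reused, and combined before any SQL text is generated, which is the property the slide calls composable.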

Benefits of Ibis
- Maximize developer productivity
- Mirrors single-node Python experience
- Solve big data problems without leaving Python
- Leverage Python skills, ecosystem, and tools
- Python as first-class language for Hadoop
- Full-fidelity analysis without extractions
- Python analysis at any scale
- Native hardware speeds for a broad set of use cases

Brief interactive demo

Ibis/Impala Joint Roadmap
- More natural data modeling
- Complex types support
- Integration with full Python data ecosystem
- Advanced analytics + machine learning
- Enable use of performance computing tools
- User extensibility with native performance
- In-memory columnar format
- Python-to-LLVM IR compilation
- Workflow and usability tools

Executing data science languages in the compute layer
- UI: Ibis, SQL, Spark API
- Python, R, Julia, ?
- Storage: HDFS, Kudu, HBase
- Compute: Analytic SQL, Spark, MapReduce

Enabling interoperability with big data systems
- Distributed / MPP query engines are implemented in a host language
  - Typically C/C++ or Java/Scala
- User-defined functions (UDFs) through various means
  - Implement in host language
  - Implement in user language through some external language protocol (often RPC-based)
- External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)

What are UDFs good for?
- Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
- Custom data transformations
- Custom domain logic (date / time / data types)
- Custom data types
- Custom aggregations (incl. machine learning / statistics expressible as reductions)
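
A custom aggregation "expressible as a reduction" typically needs a mergeable partial state so that each node can aggregate its own partition. The sketch below uses variance as the example, with Welford's online update and the standard pairwise merge. The `VarState`, `update`, `merge`, and `finalize` names are assumptions for illustration and do not correspond to any particular engine's UDF interface.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class VarState:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean


def update(state, x):
    """Fold one value into a partial state (Welford's online update)."""
    n = state.n + 1
    delta = x - state.mean
    mean = state.mean + delta / n
    m2 = state.m2 + delta * (x - mean)
    return VarState(n, mean, m2)


def merge(a, b):
    """Combine two partial states computed on different partitions."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return VarState(n, mean, m2)


def finalize(state):
    """Population variance, or None when no rows were seen."""
    return state.m2 / state.n if state.n else None


# Hypothetical partitions, as if the data lived on three nodes.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
partials = [reduce(update, part, VarState()) for part in partitions]
variance = finalize(reduce(merge, partials, VarState()))
```

The result matches a single-pass computation over all six values, up to floating-point rounding, while only the small `VarState` objects need to move between nodes.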

Why are external UDFs slow?
- Serialization / deserialization overhead
- Scalar vs. vectorized computations
- RPC overhead
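
The first and third costs can be sketched by counting how many serialized messages cross a process boundary. In this simplified model, `pickle` stands in for the wire format and `call_remote` for the RPC hop; both are assumptions for illustration, not how any specific engine works.

```python
import pickle


def udf(x):
    # Hypothetical user-defined function.
    return x * 2 + 1


def call_remote(payload, counter):
    """Stand-in for one hop across a process boundary: serialize, then deserialize."""
    counter["messages"] += 1
    return pickle.loads(pickle.dumps(payload))


def scalar_protocol(values, counter):
    """One message per argument and one per result."""
    out = []
    for v in values:
        arg = call_remote(v, counter)
        out.append(call_remote(udf(arg), counter))
    return out


def batch_protocol(values, counter):
    """One message for the whole batch of arguments and one for the results."""
    args = call_remote(values, counter)
    return call_remote([udf(v) for v in args], counter)


values = list(range(1000))
scalar_calls = {"messages": 0}
batch_calls = {"messages": 0}
scalar_result = scalar_protocol(values, scalar_calls)
batch_result = batch_protocol(values, batch_calls)
```

Both protocols return the same values, but the scalar one pays the per-message overhead 2000 times for 1000 rows, against 2 times for the batch one.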

Example: Vectorization for interpreted languages
SUM(CASE WHEN x > y THEN x ELSE x + y END)
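
The expression on this slide can be evaluated in Python either one row at a time or over whole columns at once. The sketch below shows both, using NumPy for the vectorized form; the function names and the random test data are made up for illustration.

```python
import numpy as np


def interpreted_sum(xs, ys):
    """Row-at-a-time evaluation: the interpreter runs the CASE logic per row."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x if x > y else x + y
    return total


def vectorized_sum(xs, ys):
    """Column-at-a-time evaluation: each operation runs in compiled code."""
    return float(np.where(xs > ys, xs, xs + ys).sum())


# Hypothetical input columns.
rng = np.random.default_rng(0)
x = rng.random(10_000)
y = rng.random(10_000)

slow = interpreted_sum(x.tolist(), y.tolist())
fast = vectorized_sum(x, y)
```

The two functions agree up to floating-point rounding. The vectorized form pays the interpreter overhead once per operation instead of once per row.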

Vectorized vs. interpreted performance

How to make them fast?
- Common runtime memory representation for tabular data
- Shared-memory (zero-copy or memcpy-only) external UDF protocol
- Vectorized UDF interface (for interpreted languages)
- Impala is uniquely positioned to play well with Ibis
  - Best-in-class performance and scalability
  - C++ and LLVM-based (JIT compiler) runtime
- Unified, efficient data interchange amongst Ibis, Impala, and Kudu will enable high-performance real-time analytics from Python
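
The difference between a serializing protocol and a zero-copy one can be shown inside a single process with the standard library. This is only an analogy for the cross-process shared-memory protocol the slide describes: `pickle` stands in for serialization and `memoryview` for a shared buffer.

```python
import pickle
from array import array

# A column of doubles stored in one contiguous buffer.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Serialization path: the consumer receives an independent copy.
copied = pickle.loads(pickle.dumps(column))
copied[0] = 99.0  # does not affect the original column

# Zero-copy path: the consumer reads and writes the same memory.
view = memoryview(column)
view[1] = 42.0  # visible through the original column
```

After this runs, `column[0]` is still `1.0` while `column[1]` is `42.0`. The view exposes the same 32 bytes and no data was copied or re-encoded.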

Memory representation
- Many query engines are standardizing on an in-memory columnar representation of materialized transient data
  - Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
  - Apache Drill: https://drill.apache.org/faq/
- Industry-standard serialization format: Apache Parquet, https://parquet.apache.org/

Serialization vs. in-memory
- Serialization formats (e.g. Parquet)
  - Optimize for IO / DFS throughput at the expense of CPU / memory bus throughput
  - Do not consider random access or in-memory analytics as a goal
- No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)

Standardized in-memory columnar (IMC)
- Compact in-memory representation for semistructured data
- Part of Impala's upcoming dev roadmap
- Some prior IMC-for-SQL work: Apache Drill
- Standardized memory representation means data can be shared without serialization
- Create a canonical C/C++ implementation for use in Python / R / Julia
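
A columnar layout stores each field in its own contiguous, typed buffer, with nulls tracked separately. The sketch below is a toy version in plain Python. The field names, the `to_columnar` helper, and the one-byte-per-row validity mask are assumptions for illustration, not the format on the Impala roadmap.

```python
from array import array

# Hypothetical row-oriented records, with one null score.
rows = [
    {"id": 1, "score": 0.5},
    {"id": 2, "score": None},
    {"id": 3, "score": 2.5},
]


def to_columnar(rows):
    """Convert rows into one typed buffer per column plus a validity mask."""
    ids = array("q", (r["id"] for r in rows))
    valid = array("B", (r["score"] is not None for r in rows))
    scores = array(
        "d", (r["score"] if r["score"] is not None else 0.0 for r in rows)
    )
    return {"id": ids, "score": scores, "score_valid": valid}


table = to_columnar(rows)

# Column-wise scan: touches only the buffers it needs and skips nulls.
score_sum = sum(
    s for s, ok in zip(table["score"], table["score_valid"]) if ok
)
null_count = len(rows) - sum(table["score_valid"])
```

Because each column is a flat buffer of fixed-width values, it could be handed to another runtime without per-row decoding, which is what makes a standardized layout attractive for sharing data across languages.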

Ibis's Vision
- Uncompromised Python experience
- 100% Python end-to-end user workflows
- Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
- Interactive at big data scale
- Full-fidelity analysis without extractions
- Scalability for big data
- Native hardware speeds for a broad set of use cases

Thank you
Wes McKinney, @wesmckinn
Views are my own