Parquet: Columnar storage for the people




Parquet: Columnar storage for the people
Julien Le Dem @J_ , processing tools lead, analytics infrastructure at Twitter
Nong Li nong@cloudera.com, software engineer, Cloudera Impala

Outline: context from various companies, results in production and benchmarks, format deep-dive.

Twitter Context
Twitter's data: 230M+ monthly active users generating and consuming 500M+ tweets a day; 100TB+ a day of compressed data. Scale is huge: instrumentation, user graph, derived data, ...
Analytics infrastructure: several 1K+ node Hadoop clusters, a log collection pipeline, processing tools.
(Slide image: The Parquet Planers, Gustave Caillebotte)

Twitter's use case
Logs available on HDFS, stored with Thrift; for example, one schema has 87 columns and up to 7 levels of nesting.

struct LogEvent {
  1: optional logbase.LogBase log_base
  2: optional i64 event_value
  3: optional string context
  4: optional string referring_event
  ...
  18: optional EventNamespace event_namespace
  19: optional list<Item> items
  20: optional map<AssociationType, Association> associations
  21: optional MobileDetails mobile_details
  22: optional WidgetDetails widget_details
  23: optional map<ExternalService, string> external_ids
}

struct LogBase {
  1: string transaction_id,
  2: string ip_address,
  ...
  15: optional string country,
  16: optional string pid,
}

Goal
To have a state-of-the-art columnar storage format available across the Hadoop platform. Hadoop is very reliable for big, long-running queries, but it is also IO-heavy. Incrementally take advantage of column-based storage in existing frameworks. Not tied to any framework in particular.

Columnar Storage (image credit: @EmrgencyKittens)
Limits IO to the data actually needed: loads only the columns that need to be accessed. Saves space: a columnar layout compresses better and allows type-specific encodings. Enables vectorized execution engines.

Collaboration between Twitter and Cloudera: a common file format definition, language independent and formally specified. Implementation in Java for Map/Reduce: https://github.com/parquet/parquet-mr. C++ and code generation in Cloudera Impala: https://github.com/cloudera/impala

Results in Impala: TPC-H lineitem table @ 1TB scale factor (chart of sizes in GB).

Impala query times on TPC-DS (chart: seconds of wall-clock time for queries Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79 and Q96, comparing Text, Seq w/ Snappy, RC w/ Snappy and Parquet w/ Snappy).

Criteo: The Context. Billions of new events per day, ~60 columns per log, heavy analytic workload, BI analysts using Hive and RCFile, frequent schema modifications. A perfect use case for Parquet + Hive!

Parquet + Hive: Basic Requirements. MapReduce compatibility, since Hive runs on it. Correctly handle evolving schemas across Parquet files. Read only the columns used by the query to minimize the data read. Interoperability with other execution engines (e.g. Pig, Impala, etc.).

Performance of Hive 0.11 with Parquet vs ORC. Size relative to text: orc-snappy 35%, parquet-snappy 33%. TPC-DS at scale factor 100; all jobs calibrated to run ~50 mappers; nodes: 2 x 6 cores, 96 GB RAM, 14 x 3TB disks. (Chart: total CPU seconds for orc-snappy vs parquet-snappy on queries q7, q8, q19, q34, q42, q43, q46, q52, q55, q59, q63, q65, q68, q73, q79, q89 and q98.)

Twitter: production results. Data converted: similar to access logs, 30 columns. Original format: Thrift binary in block-compressed files (LZO); new format: Parquet (LZO). (Charts: space and scan time, Thrift vs Parquet, reading 1 column vs all 30 columns.) Space saving: 30% using the same compression algorithm. Scan + assembly time compared to the original: one column 10%, all columns 110%.

Production savings at Twitter. Petabytes of storage saved. Example jobs taking advantage of projection push down: Job 1 (Pig): reading 32% less data => 20% task time saving. Job 2 (Scalding): reading 14 out of 35 columns, i.e. 80% less data => 66% task time saving. Terabytes of scanning saved every day.

Format
(Diagram: a row group contains one column chunk per column a, b, c; each column chunk is split into pages.)
Row group: a group of rows in columnar format. Max size buffered in memory while writing; one (or more) per split while reading. Roughly: 50MB < row group < 1GB.
Column chunk: the data for one column in a row group. Column chunks can be read independently for efficient scans.
Page: unit of access in a column chunk. Should be big enough for compression to be efficient. Minimum size to read to access a single record (when index pages are available). Roughly: 8KB < page < 1MB.
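To make the hierarchy concrete, here is a minimal conceptual sketch in Java of the layout described above. The class names and fields are illustrative only (they are not the parquet-mr types); the size constants just restate the indicative bounds from the slide.

import java.util.List;

// Conceptual model of the file layout; illustrative only, not parquet-mr classes.
public class LayoutSketch {

    // Indicative size bounds from the slide.
    static final long MIN_ROW_GROUP_BYTES = 50L * 1024 * 1024;   // ~50 MB
    static final long MAX_ROW_GROUP_BYTES = 1024L * 1024 * 1024; // ~1 GB
    static final int  MIN_PAGE_BYTES      = 8 * 1024;            // ~8 KB
    static final int  MAX_PAGE_BYTES      = 1024 * 1024;         // ~1 MB

    static class Page {            // unit of access within a column chunk
        byte[] encodedValues;      // encoded and compressed values
    }
    static class ColumnChunk {     // all the data for one column in a row group
        String columnPath;         // e.g. "links.backward"
        List<Page> pages;          // readable independently for efficient scans
    }
    static class RowGroup {        // a group of rows, stored column by column
        List<ColumnChunk> columns; // one chunk per column
    }
    static class ParquetFile {     // row groups followed by a footer
        List<RowGroup> rowGroups;
        // footer: schema + column chunk offsets (see the next slide)
    }
}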

Format. Layout: row groups in columnar format; a footer contains the column chunk offsets and the schema. Language independent: a well-defined format, with Hadoop and Cloudera Impala support.

Nested record shredding/assembly
Algorithm borrowed from Google Dremel's column IO. Each cell is encoded as a triplet: repetition level, definition level, value. Level values are bounded by the depth of the schema and stored in a compact form.

Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
}

Columns and their maximum levels:
  Column          Max rep. level  Max def. level
  DocId           0               0
  Links.Backward  1               2
  Links.Forward   1               2

Record:
  DocId: 20
  Links
    Backward: 10
    Backward: 30
    Forward: 80

Shredded columns (value, R, D):
  DocId           20  R=0  D=0
  Links.Backward  10  R=0  D=2
  Links.Backward  30  R=1  D=2
  Links.Forward   80  R=0  D=2

Repetition level
Schema:
message nestedLists {
  repeated group level1 {
    repeated string level2;
  }
}

Records: [[a, b, c], [d, e, f, g]] and [[h], [i, j]]

Repetition level per value:
  a: 0 (new record)         e: 2 (new level2 entry)   h: 0 (new record)
  b: 2 (new level2 entry)   f: 2 (new level2 entry)   i: 1 (new level1 entry)
  c: 2 (new level2 entry)   g: 2 (new level2 entry)   j: 2 (new level2 entry)
  d: 1 (new level1 entry)

Column: Levels: 0,2,2,1,2,2,2,0,1,2  Data: a,b,c,d,e,f,g,h,i,j

More details: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
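As a sanity check of the level computation above, here is a minimal Java sketch (illustrative only, not the Parquet implementation) that walks the two records of the nestedLists schema and emits the repetition level of every value. It ignores definition levels and empty lists.

import java.util.Arrays;
import java.util.List;

public class RepetitionLevels {

    // A record is a list of level1 groups, each group a list of level2 strings.
    static void shred(List<List<List<String>>> records) {
        for (List<List<String>> record : records) {
            int r = 0;                    // 0: the value starts a new record
            for (List<String> level1 : record) {
                for (String level2 : level1) {
                    System.out.println(r + " : " + level2);
                    r = 2;                // next value repeats at level2
                }
                r = 1;                    // next group repeats at level1
            }
        }
    }

    public static void main(String[] args) {
        shred(Arrays.asList(
                Arrays.asList(Arrays.asList("a", "b", "c"),
                              Arrays.asList("d", "e", "f", "g")),
                Arrays.asList(Arrays.asList("h"),
                              Arrays.asList("i", "j"))));
        // Prints levels 0,2,2,1,2,2,2,0,1,2 for values a..j, matching the slide.
    }
}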

Differences between Parquet and ORC: nesting support.
Parquet: repetition/definition levels capture the structure => one column per leaf in the schema; Array<int> is one column. Nullity/repetition of an inner node is stored in each of its children => one column independently of nesting, with some redundancy.
ORC: an extra column for each Map or List to record its size => one column per node in the schema; Array<int> is two columns: array size and content => an extra column per nesting level.

Reading assembled records
Record-level API to integrate with existing row-based engines (Hive, Pig, M/R). Aware of dictionary encoding, enabling optimizations. Assembles a projection for any subset of the columns: only those are loaded from disk.
(Diagram: columns a1..a3 and b1..b3 reassembled into rows; the Document record with DocId 20, Links.Backward 10 and 30, Links.Forward 80 rebuilt from different column projections.)

Projection push down
Automated in Pig and Hive: based on the query being executed, only the columns for the fields accessed will be fetched. Explicit in MapReduce, Scalding and Cascading using a globbing syntax, as in the sketch below.
Example: field1;field2/**;field4/{subfield1,subfield2}
Will return: field1, all the columns under field2, and subfield1 and subfield2 under field4, but not field3.
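The following is a small illustrative Java sketch of how such a glob could select column paths. It is not the parquet-mr implementation and it simplifies the semantics: an exact path selects only that column, and ** selects everything below its prefix.

import java.util.Arrays;

public class ProjectionGlob {

    // True if the column path (dot separated, e.g. "field4.subfield1") is
    // selected by one of the ';'-separated glob patterns.
    static boolean selected(String pattern, String columnPath) {
        for (String p : pattern.split(";")) {
            if (matches(p.split("/"), columnPath.split("\\."), 0, 0)) {
                return true;
            }
        }
        return false;
    }

    static boolean matches(String[] pat, String[] path, int i, int j) {
        if (i == pat.length) return j == path.length;  // pattern fully consumed
        if (pat[i].equals("**")) return true;          // everything below matches
        if (j == path.length) return false;
        if (pat[i].startsWith("{")) {                  // {a,b}: set of alternatives
            String[] alts = pat[i].substring(1, pat[i].length() - 1).split(",");
            if (!Arrays.asList(alts).contains(path[j])) return false;
        } else if (!pat[i].equals(path[j])) {
            return false;
        }
        return matches(pat, path, i + 1, j + 1);
    }

    public static void main(String[] args) {
        String pattern = "field1;field2/**;field4/{subfield1,subfield2}";
        for (String col : Arrays.asList("field1", "field2.x.y", "field3",
                                        "field4.subfield1", "field4.subfield3")) {
            System.out.println(col + " -> " + selected(pattern, col));
        }
        // Selected: field1, field2.x.y, field4.subfield1.
        // Not selected: field3, field4.subfield3.
    }
}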

Reading columns
To implement a column-based execution engine: iteration on triplets (repetition level, definition level, value). R = 1 means the value belongs to the same row as the previous one; a definition level below the maximum (here D < 1) means the value is null; repetition level 0 indicates a new record. Encoded or decoded values: computing aggregations on integers is faster than on strings.

  Row  R  D  Value
  0    0  1  A
  1    0  1  B
       1  1  C
  2    0  0  (null)
  3    0  1  D
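Here is a minimal Java sketch of reading such a column as triplets and rebuilding one list of values per row. It assumes max repetition level 1 and max definition level 1, and, following the slide, it represents an undefined value (D below the max) as a null entry; it is an illustration, not the Parquet reader.

import java.util.ArrayList;
import java.util.List;

public class TripletReader {

    // Rebuilds one list of values per row from the level and value streams.
    static List<List<String>> assemble(int[] r, int[] d, String[] values) {
        List<List<String>> rows = new ArrayList<>();
        List<String> current = null;
        int vi = 0;                         // nulls are not stored in 'values'
        for (int i = 0; i < r.length; i++) {
            if (r[i] == 0) {                // repetition level 0: new record
                current = new ArrayList<>();
                rows.add(current);
            }
            // definition level below the max means the value is null
            current.add(d[i] < 1 ? null : values[vi++]);
        }
        return rows;
    }

    public static void main(String[] args) {
        int[]    r = {0, 0, 1, 0, 0};
        int[]    d = {1, 1, 1, 0, 1};
        String[] v = {"A", "B", "C", "D"};
        System.out.println(assemble(r, d, v));
        // Prints [[A], [B, C], [null], [D]]: rows 0..3 of the table above.
    }
}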

Integration APIs
Schema definition and record materialization: Hadoop does not have a notion of schema; however, Impala, Pig, Hive, Thrift, Avro and Protocol Buffers do. An event-based, SAX-style record materialization layer avoids double conversion. Integration with existing type systems and processing frameworks: Impala, Pig, Thrift and Scrooge for M/R, Cascading and Scalding, Cascading tuples, Avro, Hive, Spark.

Encodings
Bit packing: small integers encoded in the minimum bits required, e.g. the values 1 3 2 0 0 2 2 0 become the 2-bit codes 01 11 10 00 00 10 10 00 (see the sketch below). Useful for repetition levels, definition levels and dictionary keys.
Run-length encoding: used in combination with bit packing; cheap compression; works well for the definition levels of sparse columns (e.g. a run of eight 1s stored as the pair 8, 1).
Dictionary encoding: useful for columns with few (< 50,000) distinct values. When applicable, compresses better and faster than heavyweight algorithms (gzip, lzo, snappy).
Extensible: defining new encodings is supported by the format.
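A minimal Java sketch of the bit-packing idea, reproducing the example above: each value is packed into 2 bits instead of a full integer. It only shows the space saving; the actual Parquet encodings (including the RLE/bit-packing combination) are defined by the format specification and use their own bit layout.

public class BitPacking {

    // Packs each value into 'bitWidth' bits, most significant bit first
    // within each byte (illustrative bit order).
    static byte[] pack(int[] values, int bitWidth) {
        byte[] out = new byte[(values.length * bitWidth + 7) / 8];
        int bitPos = 0;
        for (int v : values) {
            for (int b = bitWidth - 1; b >= 0; b--) {
                if (((v >> b) & 1) == 1) {
                    out[bitPos / 8] |= 1 << (7 - bitPos % 8);
                }
                bitPos++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = pack(new int[]{1, 3, 2, 0, 0, 2, 2, 0}, 2);
        for (byte b : packed) {
            System.out.print(String.format("%8s",
                    Integer.toBinaryString(b & 0xFF)).replace(' ', '0') + " ");
        }
        // Prints 01111000 00101000: the 2-bit codes 01 11 10 00 00 10 10 00
        // from the slide, stored in two bytes instead of eight 32-bit ints.
    }
}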

Parquet 2.0
More encodings for compact storage without heavyweight compression. Delta encodings for integers, strings and sorted dictionaries; improved encodings for strings and booleans. Statistics, to be used by query planners and for predicate pushdown. A new page format to facilitate skipping ahead at a more granular level.
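To illustrate why delta encodings help, here is a minimal Java sketch of the basic idea: store the first value and then only the differences between consecutive values, which are small (and therefore bit-pack into very few bits) when the column is sorted or slowly varying. The delta encodings in Parquet 2.0 are block-based and more elaborate than this sketch.

import java.util.Arrays;

public class DeltaEncoding {

    // First value stored as-is, then only consecutive differences.
    static int[] encode(int[] values) {
        int[] out = new int[values.length];
        out[0] = values[0];
        for (int i = 1; i < values.length; i++) {
            out[i] = values[i] - values[i - 1];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] timestamps = {1000000, 1000003, 1000007, 1000012};
        System.out.println(Arrays.toString(encode(timestamps)));
        // Prints [1000000, 3, 4, 5]: after the first value, every delta
        // fits in a handful of bits.
    }
}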

Main contributors:
Julien Le Dem (Twitter): format, core, Pig, Thrift integration, encodings.
Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): format, Impala.
Jonathan Coveney, Alex Levenson, Aniket Mokashi, Tianshuo Deng (Twitter): encodings, projection push down.
Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration.
Dmitriy Ryaboy (Twitter): format, Thrift and Scrooge Cascading integration.
Tom White (Cloudera): Avro integration.
Avi Bryant, Colin Marc (Stripe): Cascading tuples integration.
Matt Massie (Berkeley AMP Lab): predicate and projection push down.
David Chen (LinkedIn): Avro integration improvements.

How to contribute. Questions? Ideas? Want to contribute? Contribute at github.com/parquet, or come talk to us: Cloudera, Criteo, Twitter.