RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG




Background 2 Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems (HDFS, KFS). Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. The language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.

Background 3 Hive architecture.

Big Data Processing Requirements 4 Requirements for big data processing systems like Hive. Fast data loading. Fast query processing. Highly efficient storage space utilization. Strong adaptivity to highly dynamic workload patterns.

Data Placement for MapReduce 5 What is a data placement structure? It is the way we map data from the logical view (relational tables in Hive) to the physical placement (HDFS blocks in Hive). Hive perspective.

Data Placement for MapReduce 6 Data placement structures in conventional database systems. Horizontal row-store structure. Vertical column-store structure. Hybrid PAX store structure.

Merits and Limitations of Existing Data Placement Structures - Row-store 7 Merits It has fast data loading and strong adaptive ability to dynamic workloads. Limitations Row-store cannot provide fast query processing, because it reads unnecessary columns when a query needs only a subset of the columns in a table. It is also hard for row-store to achieve a high data compression ratio, since each record mixes columns from different data domains.

Merits and Limitations of Existing Data Placement Structures - Column-store 8 Merits Can avoid reading unnecessary columns during query execution. Can achieve a high compression ratio by compressing each column within its own data domain. Limitations Cannot provide fast query processing due to the high overhead of tuple reconstruction.

Merits and Limitations of Existing Data Placement Structures - Hybrid-store: PAX 9 Merits Strong adaptive ability to various dynamic workloads. Limitations Cannot provide an opportunity to do column-wise data compression. Cannot improve I/O performance. Cannot efficiently store data sets with a highly-diverse range of data resource types.

Data Placement for MapReduce 10 Row-store cannot support fast query processing because it cannot skip unnecessary column reads. Column-store often causes high record reconstruction overhead, with expensive network transfers in a cluster. The PAX structure, which uses column-store inside each disk page, cannot improve I/O performance, since every page still holds all columns of its rows.

RCFile 11 RCFile: Record Columnar File. It first partitions a table horizontally into row groups, then partitions each row group vertically by column. RCFile guarantees that all the data of a row are located on the same node, and it can exploit column-wise data compression and skip unnecessary column reads.
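The horizontal-then-vertical layout can be sketched in a few lines of Python (hypothetical helper names and a toy row-group size, not the actual RCFile implementation):

```python
# Sketch of RCFile's horizontal-then-vertical partitioning.
# Function names and the row-group size are illustrative only.

def to_row_groups(records, rows_per_group):
    """Horizontal partition: split the table into row groups."""
    for i in range(0, len(records), rows_per_group):
        yield records[i:i + rows_per_group]

def columnize(row_group):
    """Vertical partition: lay out each column of a row group contiguously."""
    return [list(col) for col in zip(*row_group)]

table = [
    (1, "alice", 30),
    (2, "bob",   25),
    (3, "carol", 41),
    (4, "dave",  38),
]

for group in to_row_groups(table, rows_per_group=2):
    print(columnize(group))
# Each printed list-of-lists is one row group laid out column by column;
# all columns of a row stay inside the same group (and thus the same HDFS block).
```

Because the vertical split happens only inside a row group, reconstructing a tuple never requires fetching data from another node, which is the key difference from a pure column-store.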

RCFile 12 RCFile is designed and implemented on top of the Hadoop Distributed File System (HDFS). 1. A table can span multiple HDFS blocks. 2. In each HDFS block, RCFile organizes records with the basic unit of a row group. 3. A row group contains: a sync marker (separating two consecutive row groups), a metadata header, and the table data (stored column-wise).

RCFile 13 The metadata header section and the table data section are compressed independently. 1. For the metadata header section, RCFile uses the RLE (Run-Length Encoding) algorithm. 2. Each column of the table data is independently compressed with the Gzip compression algorithm, which enables the lazy decompression technique.
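The two compression paths can be illustrated as follows (illustrative encodings only: Python's zlib stands in for Hadoop's Gzip codec, and the RLE is a toy version, not RCFile's actual format):

```python
import zlib

# Sketch of the two compression paths in an RCFile row group.

def rle_encode(values):
    """Run-length encode a list, in the spirit of the metadata header:
    field lengths in a column are often identical run after run."""
    runs, i = [], 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        runs.append((values[i], j - i))
        i = j
    return runs

# Metadata header: the byte length of each field in a column.
field_lengths = [11, 11, 11, 11, 7, 7]        # e.g. URL column lengths
print(rle_encode(field_lengths))              # -> [(11, 4), (7, 2)]

# Table data: each column holder is compressed on its own, so values
# from a single data domain are compressed together.
url_column = b"\n".join([b"/index.html"] * 4 + [b"/ab.css"] * 2)
compressed = zlib.compress(url_column)
assert len(compressed) < len(url_column)
```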

RCFile 14 Only an append interface is provided for data writing in RCFile. 1. RCFile creates and maintains an in-memory column holder for each column. When a record is appended, all its fields are scattered, and each field is appended into its corresponding column holder. 2. RCFile limits either the number of buffered records or the size of the memory buffer. 3. When a limit is reached, RCFile compresses the metadata header and stores it on disk, then compresses each column holder separately and flushes everything into one row group.
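A minimal writer sketch along these lines (a hypothetical class, not the RCFile API; zlib stands in for the Gzip codec, and the metadata header is omitted for brevity):

```python
import zlib

class RowGroupWriter:
    """Append-only writer sketch: scatters each record's fields into
    per-column holders and flushes a compressed row group when the
    record limit is reached."""

    def __init__(self, num_columns, max_records):
        self.holders = [[] for _ in range(num_columns)]
        self.max_records = max_records
        self.flushed_groups = []

    def append(self, record):
        for holder, field in zip(self.holders, record):
            holder.append(field)          # scatter fields column-wise
        if len(self.holders[0]) >= self.max_records:
            self.flush()

    def flush(self):
        if not self.holders[0]:
            return
        # Compress each column holder independently, then emit the group.
        group = [zlib.compress("\n".join(map(str, h)).encode())
                 for h in self.holders]
        self.flushed_groups.append(group)
        self.holders = [[] for _ in self.holders]

w = RowGroupWriter(num_columns=3, max_records=2)
for rec in [(1, "a", 1.0), (2, "b", 2.0), (3, "c", 3.0)]:
    w.append(rec)
w.flush()                                  # flush the partial last group
print(len(w.flushed_groups))               # -> 2 row groups
```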

RCFile 15 Under the MapReduce framework, a mapper is started for each HDFS block, and it sequentially processes every row group in the block. 1. When processing a row group, RCFile reads only the metadata header and the columns needed for the given query. 2. The metadata header is always decompressed and held in memory until RCFile processes the next row group. 3. A column, however, is not decompressed in memory until RCFile has determined that its data will really be useful for query execution.
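The read path's lazy decompression can be sketched as follows (illustrative, not the RCFile code; a row group is modeled as a dict of compressed column blobs, and the metadata header is omitted):

```python
import zlib

# Reader-side sketch of lazy decompression.

def read_row_group(group, needed_columns, predicate_column, predicate):
    """Decompress the predicate column first; touch the other needed
    columns only if at least one row actually passes the predicate."""
    pred_values = zlib.decompress(group[predicate_column]).split(b"\n")
    hit_rows = [i for i, v in enumerate(pred_values) if predicate(v)]
    if not hit_rows:
        return []                 # lazy: other columns never decompressed
    columns = {c: zlib.decompress(group[c]).split(b"\n")
               for c in needed_columns}
    return [tuple(columns[c][i] for c in needed_columns) for i in hit_rows]

group = {
    "id":   zlib.compress(b"1\n2\n3"),
    "name": zlib.compress(b"alice\nbob\ncarol"),
    "age":  zlib.compress(b"30\n25\n41"),
}
# SELECT name FROM t WHERE age > 40, over this one row group:
print(read_row_group(group, ["name"], "age", lambda v: int(v) > 40))
# -> [(b'carol',)]
```

If no row in the group satisfies the predicate, the "name" column is never decompressed at all, which is exactly why lazy decompression helps at low query selectivity.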

RCFile 16 Row group size. 1. A large row group can achieve better data compression efficiency than a small one (up to a threshold). 2. However, a large row group may have lower read performance, because a large size reduces the performance benefits of lazy decompression.

Performance Evaluation 17 Storage Space 1. Row-store has the worst compression efficiency. 2. RCFile can reduce even more space than column-store does: Zebra stores column metadata and real column data together, whereas RCFile can compress the two separately.

Performance Evaluation 18 Data Loading Time 1. Row-store has the smallest data loading time, because it has the minimum overhead to reorganize records in the raw text file. 2. With column-store, each record in the raw data file is written to multiple HDFS blocks for different columns, which causes much more network overhead. 3. RCFile is comparable to row-store, since it only needs to reorganize records inside each row group, whose size is significantly smaller than the file size.

Performance Evaluation 19 Query Execution Time 1. RCFile significantly outperforms the other three structures, because the lazy decompression technique accelerates query execution when query selectivity is low. 2. High selectivity makes lazy decompression useless. Moreover, column-group relies heavily on column combinations pre-defined before query execution.

Conclusion 20 RCFile has data loading speed and workload adaptivity comparable to row-store. RCFile is read-optimized, avoiding unnecessary column reads during table scans. RCFile uses column-wise compression and thus provides efficient storage space utilization.