RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
CLOUD COMPUTING GROUP - LITAO DENG
Background
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems (HDFS, KFS). Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. This language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
Background
Hive architecture (diagram).
Big Data Processing Requirements
Requirements for big data processing systems like Hive:
- Fast data loading.
- Fast query processing.
- Highly efficient storage space utilization.
- Strong adaptivity to highly dynamic workload patterns.
Data Placement for MapReduce
What is a data placement structure? It is the way we map data from the logical view (relational tables in Hive) to the physical placement (HDFS blocks in Hive). (Hive perspective diagram.)
Data Placement for MapReduce
Data placement structures in conventional database systems:
- Horizontal row-store structure.
- Vertical column-store structure.
- Hybrid PAX store structure.
Merits and Limitations of Existing Data Placement Structures - Row-store
Merits:
- Fast data loading and strong adaptive ability to dynamic workloads.
Limitations:
- Row-store cannot provide fast query processing: it reads unnecessary columns when a query needs only a subset of the columns in a table.
- It is hard for row-store to achieve a high data compression ratio, because each row mixes columns from different data domains.
Merits and Limitations of Existing Data Placement Structures - Column-store
Merits:
- Avoids reading unnecessary columns during query execution.
- Can achieve a high compression ratio by compressing each column within its own data domain.
Limitations:
- Cannot provide fast query processing due to the high overhead of tuple reconstruction.
Merits and Limitations of Existing Data Placement Structures - Hybrid-store: PAX
Merits:
- Strong adaptive ability to various dynamic workloads.
Limitations:
- Provides no opportunity for column-wise data compression.
- Cannot improve I/O performance (it uses a column layout only inside each disk page).
- Cannot efficiently store data sets with a highly diverse range of data resource types.
Data Placement for MapReduce
- Row-store cannot support fast query processing because it cannot skip unnecessary column reads.
- Column-store can often cause high record reconstruction overhead with expensive network transfers in a cluster.
- The PAX structure, which uses column-store only inside each disk page, cannot improve I/O performance.
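The row-store versus column-store contrast above can be illustrated with a minimal Python sketch; the table and its column names are made up for illustration, not taken from the paper:

```python
# Three hypothetical records of a table with columns (id, name, age).
records = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]

# Row-store: each record's fields are stored contiguously.
row_layout = [field for record in records for field in record]

# Column-store: all values of one column are stored contiguously,
# which enables skipping whole columns and per-column compression.
column_layout = [list(col) for col in zip(*records)]

print(row_layout)     # fields grouped by record
print(column_layout)  # fields grouped by column
```

Reading only the `age` column touches one contiguous run in the column layout, but every third field in the row layout.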
RCFile
RCFile: Record Columnar File. First horizontally partition a table into row groups, then vertically partition each row group by column. RCFile guarantees that the data in one row are located on the same node, and it can exploit column-wise data compression and skip unnecessary column reads.
RCFile
RCFile is designed and implemented on top of the Hadoop Distributed File System (HDFS).
1. A table can have multiple HDFS blocks.
2. In each HDFS block, RCFile organizes records with the basic unit of a row group.
3. A row group contains: a sync marker (separating two contiguous row groups), a metadata header, and the table data (a column-store).
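The horizontal-then-vertical partitioning can be sketched in a few lines of Python, assuming an in-memory list of records standing in for an HDFS block (real RCFile operates on HDFS blocks and adds sync markers and metadata headers):

```python
def to_row_groups(records, group_size):
    """First horizontally partition records into row groups,
    then store each group column-wise (as a list of columns)."""
    groups = []
    for start in range(0, len(records), group_size):
        group = records[start:start + group_size]
        # Vertical partition: transpose the row group into columns.
        groups.append([list(col) for col in zip(*group)])
    return groups

table = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
groups = to_row_groups(table, group_size=2)
# groups[0] == [[1, 2], ["a", "b"]]: rows 1-2 stay together, stored column-wise
```

Because a whole row group lives in one block, all fields of a row stay on the same node, unlike a pure column-store.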
RCFile
The metadata header section and the table data section are compressed independently.
1. For the metadata header section, RCFile uses the RLE (Run-Length Encoding) algorithm to compress data.
2. Each column is independently compressed with the Gzip compression algorithm (enabling the lazy decompression technique).
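RLE suits the metadata header because the header stores field lengths, and many values in one column have the same length. A minimal sketch of run-length encoding (the length values below are invented for illustration):

```python
def rle_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return runs

# Byte lengths of six values in one column: four 4-byte, two 7-byte.
lengths = [4, 4, 4, 4, 7, 7]
print(rle_encode(lengths))  # [[4, 4], [7, 2]]
```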
RCFile
Only an append interface is provided for data writing in RCFile.
1. RCFile creates and maintains an in-memory column holder for each column. When a record is appended, all its fields are scattered, and each field is appended to its corresponding column holder.
2. RCFile enforces a limit on the number of records, or a limit on the size of the memory buffer, before a row group is flushed.
3. RCFile compresses the metadata header and stores it on disk, compresses each column holder separately, and flushes them to disk as one row group.
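The append path can be sketched as follows; this is a toy model, not the Hadoop API: the class name and record limit are invented, zlib stands in for the Gzip codec, and the metadata header is omitted:

```python
import zlib

class RowGroupWriter:
    """Append-only writer: one in-memory holder per column,
    flushed as a compressed row group at a record-count limit."""

    def __init__(self, num_columns, record_limit):
        self.holders = [[] for _ in range(num_columns)]
        self.record_limit = record_limit
        self.flushed_groups = []  # stands in for writes to disk

    def append(self, record):
        # Scatter the record's fields into their column holders.
        for holder, field in zip(self.holders, record):
            holder.append(field)
        if len(self.holders[0]) >= self.record_limit:
            self.flush()

    def flush(self):
        # Compress each column holder independently, then reset.
        group = [zlib.compress("\n".join(map(str, h)).encode())
                 for h in self.holders]
        self.flushed_groups.append(group)
        self.holders = [[] for _ in self.holders]

w = RowGroupWriter(num_columns=2, record_limit=2)
for rec in [(1, "a"), (2, "b"), (3, "c")]:
    w.append(rec)
# Two records triggered one flush; the third still sits in the holders.
```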
RCFile
Under the MapReduce framework, a mapper is started for each HDFS block. The mapper sequentially processes each row group in the HDFS block.
1. When processing a row group, RCFile reads only the metadata header and the columns needed by the given query.
2. The metadata header is always decompressed and held in memory until RCFile processes the next row group.
3. However, a column is not decompressed in memory until RCFile has determined that its data is actually useful for query execution.
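The lazy-decompression idea can be sketched like this, assuming a row group modeled as a dict from column name to compressed bytes (the column names are illustrative, and zlib again stands in for Gzip):

```python
import zlib

def read_columns(row_group, needed):
    """Decompress only the columns the query actually needs;
    all other columns stay compressed and untouched."""
    decompressed = {}
    for name in needed:
        decompressed[name] = (
            zlib.decompress(row_group[name]).decode().split("\n")
        )
    return decompressed

row_group = {
    "id": zlib.compress(b"1\n2"),
    "name": zlib.compress(b"a\nb"),
    "payload": zlib.compress(b"x\ny"),  # never touched by the query below
}
result = read_columns(row_group, needed=["id", "name"])
```

With low query selectivity, most row groups never pay the decompression cost for the untouched columns, which is where the speedups reported later come from.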
RCFile
Row group size.
1. A large row group size can have better data compression efficiency than a small one, but only up to a threshold.
2. A large row group size may have lower read performance than a small one, because a large size can decrease the performance benefits of lazy decompression.
Performance Evaluation
Storage Space
1. Row-store has the worst compression efficiency.
2. RCFile can reduce even more space than column-store does: Zebra stores column metadata and the real column data together, while RCFile compresses the two separately.
Performance Evaluation
Data Loading Time
1. Row-store has the smallest data loading time, because it has the minimum overhead to reorganize records in the raw text file.
2. Column-store has the longest loading time: each record in the raw data file is written to multiple HDFS blocks for different columns, which causes much more network overhead.
3. RCFile is comparable to row-store, since it only needs to reorganize records inside each row group, whose size is significantly smaller than the file size.
Performance Evaluation
Query Execution Time
1. RCFile outperforms the other three structures significantly, because the lazy decompression technique accelerates query execution when query selectivity is low.
2. High selectivity makes lazy decompression useless; however, the column-group structure relies heavily on column combinations pre-defined before query execution.
Conclusion
- RCFile has data loading speed and workload adaptivity comparable to row-store.
- RCFile is read-optimized: it avoids unnecessary column reads during table scans.
- RCFile uses column-wise compression and thus provides efficient storage space utilization.