Hypertable Goes Realtime at Baidu
Yang Dong (Sherlock Yang)
yangdong01@baidu.com
http://weibo.com/u/2624357843
Agenda
Motivation
Related Work
Model Design
Evaluation
Conclusion
Motivation
Failure interval: 60 days
Data interval: 1 second
What is Noah
Noah system: an operational data store
  System metrics (CPU, memory, IO, network)
  Application metrics (web, DB, caches)
  Baidu metrics (usage, revenue)
Easily visualize data over time
Supports complex aggregation, transformations, etc.
Components: monitoring sub-system, collection sub-system, ...
System Requirements
Storage capacity: 10 TB to 100 TB, 100 to 1000 billion records
Automatic sharding: irregular data growth patterns
Heavy writes: 10,000 to 30,000 inserts/s
Fast reads of recent data: recently written data should be available quickly
Table scans: e.g. the entire dataset is periodically scanned to perform time-based rollups
Problems with the Existing Stack
MySQL
  Not inherently distributed; difficult to scale
  Irregular data growth patterns force frequent manual re-sharding
  Table size limitations
  Inflexible schema
Hadoop
  No support for random writes
  Poor support for random reads
Hypertable Meets the Requirements
Hypertable + Hadoop provides:
  Elasticity
  High write throughput
  High availability and disaster recovery
  Fault isolation
  Range scans
Typical Applications of Hypertable
Common library: AsyncComm / compression / HQL in Hypertable
Database (query engine): Hypertable
MapReduce (batch processing): MapReduce + Hypertable
OLAP (online analytical processing): MySQL + Hypertable
Related Work
Apache Hadoop Goes Realtime at Facebook:
  Titan (Facebook Messages)
  Puma (Facebook Insights)
  ODS (Facebook internal metrics)
Model Design
Storage System Architecture
Functional Model
Processing Model
Data Model
Storage System Architecture
[Architecture diagram] Data inputs (Java/C++ client, Thrift client, MR-like client, SQL-like client, Scribe, Flume) feed a hybrid storage system (KV, file, block, and object storage over memory / SSD / disk) built on distributed data structures; data outputs serve the data warehouse / computing system, the analytics system, middleware / DB / K-V stores, the web frontend, and OLTP systems.
Functional Model
[Diagram] Clients (transfer, random read, scan) go through a client IO stub; the real-time flow feeds the stream computing engine and the history flow feeds the query engine, both backed by Hypertable over HDFS in the Noah storage cluster.
Processing Model: Query Interface
InfoQuery <table> <rowkey_range_start> <rowkey_range_end> <column_family> <to_file> <human|machine> [max_lines]
  table: table name
  rowkey_range_start & rowkey_range_end: row key string range
  column_family: "value" for monitor_data_status; "value_pack" for the other tables
  to_file: file the result is written into
  human|machine: text or binary output
  max_lines: default < 20,000
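For illustration only, a hedged example of composing an InfoQuery invocation in C++; the table name, hostname, timestamps, and output path below are made-up values, not from the deck:

```cpp
#include <cstdio>
#include <string>

int main() {
    // Hypothetical arguments following the InfoQuery signature above.
    std::string cmd = std::string("InfoQuery")
        + " monitor_data_history"        // <table>
        + " host0001_1370000000"         // <rowkey_range_start>
        + " host0001_1370086400"         // <rowkey_range_end> (one day later)
        + " value_pack"                  // <column_family> for a history table
        + " /tmp/host0001.out"           // <to_file>
        + " machine"                     // <human|machine>: binary output
        + " 10000";                      // [max_lines], under the 20,000 default
    std::printf("%s\n", cmd.c_str());    // shown only; not executed here
    return 0;
}
```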
Processing Model: Query Modes
Normal queries
  Hostname + MonitorItemIDs + Interval -> results
  Graph: Hostnames + MonitorItemIDs -> sorted latest results
  Hostnames -> status results
Special queries
  Many hosts + MonitorItemIDs + Interval: issued as multiple queries
  Long interval: default <= 20,000 records
  Low latency: not guaranteed
  Continuous batch queries: impact on the whole system
Data Model
Tradeoff: query efficiency vs. record volume

Table          | Row Key                  | Column                     | Versions | TTL
History Tables | Hostname + Timestamp     | MonitorItem + MonitorItem* | 1        | 0.5 ~ 24 months
Latest Table   | Hostname                 | MonitorItem                | 10       | 24 months
Status Table   | Hostname + MonitorItemID | MonitorItem + MonitorItem* | 1        |
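A minimal sketch of how such row keys might be composed (the "_" separator and zero-padded 10-digit fields are assumptions, not the actual Noah format). Zero-padding keeps lexicographic order equal to numeric order, so one host's time interval maps directly onto a contiguous row key range that InfoQuery can scan:

```cpp
#include <cstdio>
#include <cstdint>
#include <string>

// Hypothetical row-key builders for the history and status tables above.
std::string history_row_key(const std::string& hostname, uint32_t unix_ts) {
    char buf[16];
    std::snprintf(buf, sizeof(buf), "%010lu", static_cast<unsigned long>(unix_ts));
    return hostname + "_" + buf;   // e.g. host0001_1370000000
}

std::string status_row_key(const std::string& hostname, uint32_t monitor_item_id) {
    char buf[16];
    std::snprintf(buf, sizeof(buf), "%010lu", static_cast<unsigned long>(monitor_item_id));
    return hostname + "_" + buf;
}

int main() {
    // All history rows for host0001 in [t1, t2) form one contiguous key range.
    std::printf("%s\n", history_row_key("host0001", 1370000000).c_str());
    std::printf("%s\n", status_row_key("host0001", 42).c_str());
}
```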
Evaluation
High Availability
Memory Usage
Read/Write Performance
High Availability: Challenges
Recovery: no sync support for the append operation in Hadoop v2 (based on 0.18.2)
Load balance: not implemented in Hypertable-B2 (based on 0.9.2.0)
High Availability: Improvements
Recovery
  Data in HDFS, commit log in the local FS; local recovery
  Centralized meta ranges: 2048 PB of user data indexed by 16 TB of metadata
  Master standby
  Application-level duplicate removal
Load balance
  Split-driven
  Manual online and offline operations
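A rough back-of-the-envelope check of the centralized-metadata figure (my own arithmetic, not from the deck):

  2048 PB / 16 TB = (2048 x 1024 TB) / 16 TB = 131072

i.e. on the order of 8 bytes of metadata per megabyte of user data (1 MB / 131072 = 8 B).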
High Availability: Deployment Model (before improvement)
[Diagram] Hyperspace, Master, and Master (standby) coordinate several range servers, each serving both META and user ranges; the META table, user tables, and commit log all live in a distributed file system (e.g. HDFS, KFS).
High Availability: Deployment Model (after improvement)
[Diagram] Hyperspace, Master, and Master (standby) coordinate a dedicated Meta Range Server plus User Range Servers; the META table and user tables stay in the distributed file system (e.g. HDFS, KFS), while the commit log moves to the local file system.
Weaknesses
Each range is managed by a single range server
Though no data is lost, this can cause periods of unavailability
Can be mitigated by a client-side cache or memcached
Recovery time can be minimized with an active standby cluster
Memory Usage: Challenge
Hypertable is memory intensive. Profile during execution:

Function (during execution)                                     | Memory Usage
Hypertable::CellCache::add                                      | 75.6%
__gnu_cxx::new_allocator::allocate                              | 18.8%
Hypertable::DynamicBuffer::grow                                 | 4.1%
Hypertable::IOHandlerData::handle_event                         | 1.0%
Hypertable::BlockCompressionCodecLzo::BlockCompressionCodecLzo  | 0.5%
Memory Usage: Improvement
[Diagram] Two allocation strategies for the cell cache: TCMalloc and a simple pool malloc. <Key, value> cells are allocated from a MemPool and indexed by the CellMap; a frozen cell cache is compacted into CellStores in the distributed file system while a new cell cache continues to absorb writes.
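A minimal sketch of the pool-allocation idea (class and method names are assumptions, not the actual CellCachePool): serialized key/value cells are appended into large blocks, and the blocks are released all at once after the frozen cell cache has been compacted, avoiding per-cell allocator overhead and fragmentation.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Simplified arena/pool allocator in the spirit of the pool malloc above.
// Cells are appended into large blocks; there is no per-cell free. The whole
// pool is dropped once the frozen cell cache has been written to CellStores.
class CellPool {
public:
    explicit CellPool(std::size_t block_size = 4 * 1024 * 1024)
        : block_size_(block_size), offset_(block_size) {}

    // Copy a serialized key+value cell into the pool and return its address.
    const char* add(const char* data, std::size_t len) {
        if (blocks_.empty() || offset_ + len > block_size_) {
            // Start a new block (oversized cells get a dedicated block).
            std::size_t sz = len > block_size_ ? len : block_size_;
            blocks_.emplace_back(new char[sz]);
            offset_ = 0;
        }
        char* dst = blocks_.back().get() + offset_;
        std::memcpy(dst, data, len);
        offset_ += len;
        return dst;
    }

private:
    std::size_t block_size_;  // default block size; the deck compares 4 KB to 4 MB pools
    std::size_t offset_;
    std::vector<std::unique_ptr<char[]>> blocks_;
};
```

Because every cell allocation is just a pointer bump inside a block, nearly all cell cache memory is attributed to a single pool call, which is what the before/after profile on the next slide reflects.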
Memory Usage
Profile with the cell cache pool vs. the original allocator:

With CellCachePool:
  CellCachePool::get_memory                                       94.3%
  Hypertable::DynamicBuffer::grow                                  3.8%
  Hypertable::BlockCompressionCodecLzo::BlockCompressionCodecLzo   1.1%
  Hypertable::IOHandlerData::handle_event                          0.5%
Original allocator:
  Hypertable::CellCache::add                                      75.6%
  __gnu_cxx::new_allocator::allocate                              18.8%
  Hypertable::DynamicBuffer::grow                                  4.1%
  Hypertable::IOHandlerData::handle_event                          1.0%
  Hypertable::BlockCompressionCodecLzo::BlockCompressionCodecLzo   0.5%
Memory Usage
[Chart: RangeServer memory usage (KB) over time (s), comparing new/delete, TCMalloc, pool malloc (non-map), and pool malloc (with map)]
Memory Usage: Opportunity
Compaction efficiency
[Charts: RangeServer memory usage (KB) over time (s), total vs. pool cache memory; maintenance queue element count over time with a single maintenance task thread]
Memory Usage: Improvement
Compaction optimization; factors: cell cache buffer size, number of maintenance threads
[Chart: RangeServer memory usage (KB) over time (s), comparing the original allocator, TCMalloc, and the pool malloc with 1 or 2 maintenance tasks and pool block sizes from 4 KB to 4 MB]
Read Performance: Challenge
Noah generates lots of random read operations.

Machine | Sequential Write (KB/s) | Random Read (KB/s) | Random Write (KB/s)
X1      | 284191                  | 1932               | 79834
X2      | 165264                  | 973                | 1627
X3      | 93919                   | 863                | 7256
X4      | 63565                   | 471                | 23149
X5      | 55598                   | 424                | 18768
X6      | 60858                   | 336                | 23188

Measured with:
  ./iozone -f t3 -s 10g -c -e -w -+n -i 0
  ./iozone -f t3 -s 10g -c -e -w -+n -i 2
Read Performance: Memory Read
Latency: millisecond level
Memory usage
  Sparse table: value + 40 bytes of overhead per cell (keys + map)
  MVCC: old record versions are deleted at cell cache compaction
Read Performance: Cache Read
Block cache: uncompressed CellStore blocks
Query cache: query results
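A minimal sketch of a query cache (not Hypertable's implementation; the key layout and LRU eviction policy are assumptions), keyed by the scan parameters and holding serialized results:

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy LRU query cache: maps a scan key (e.g. table + row-key range + column
// family) to a serialized result. A real cache would also bound memory by
// bytes and invalidate entries when the underlying range is updated.
class QueryCache {
public:
    explicit QueryCache(std::size_t capacity) : capacity_(capacity) {}

    bool lookup(const std::string& key, std::string* result) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        lru_.splice(lru_.begin(), lru_, it->second);   // move hit to front
        *result = it->second->second;
        return true;
    }

    void insert(const std::string& key, const std::string& result) {
        auto it = index_.find(key);
        if (it != index_.end()) {                      // refresh existing entry
            it->second->second = result;
            lru_.splice(lru_.begin(), lru_, it->second);
            return;
        }
        lru_.emplace_front(key, result);
        index_[key] = lru_.begin();
        if (index_.size() > capacity_) {               // evict least recently used
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
    }

private:
    std::size_t capacity_;
    std::list<std::pair<std::string, std::string>> lru_;
    std::unordered_map<std::string,
        std::list<std::pair<std::string, std::string>>::iterator> index_;
};
```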
Read Performance: Disk Read
CellStore: 5 CSs/Range, 5*216=950 random reads, 950*2=1900 random reads; 500 IOPS, IO util 100%
Bloom filter test results (CS = CellStores per range):

                     |      CS 10       |      CS 5        |      CS 1
Threads              |  1     16    32  |  1     16    32  |  1     16     32
Queries/s            |  3.8   3.8   2.4 |  9.8   13.5  5   |  23.1  52.6   25.6
Aggregate Queries/s  |  3.8   61    77  |  9.8   216   160 |  23.1  1684   1641
Average Latency (ms) |  260   260   420 |  102   74    200 |  43    19     39
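A minimal sketch of the Bloom filter idea (sizes and hash construction are assumptions, not Hypertable's implementation): before a CellStore is read from disk, the filter answers whether a row key might be present, letting the scanner skip files that definitely do not contain it and cutting random reads per query.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter keyed by row key, one filter per CellStore.
class BloomFilter {
public:
    BloomFilter(std::size_t bits, int num_hashes)
        : bits_(bits), num_hashes_(num_hashes), table_(bits, false) {}

    void insert(const std::string& key) {
        for (int i = 0; i < num_hashes_; ++i)
            table_[hash(key, i) % bits_] = true;
    }

    // False means "definitely absent"; true means "possibly present".
    bool may_contain(const std::string& key) const {
        for (int i = 0; i < num_hashes_; ++i)
            if (!table_[hash(key, i) % bits_]) return false;
        return true;
    }

private:
    static std::size_t hash(const std::string& key, int seed) {
        // Derive roughly independent hashes by salting the key with the seed.
        return std::hash<std::string>{}(key + static_cast<char>('0' + seed));
    }
    std::size_t bits_;
    int num_hashes_;
    std::vector<bool> table_;
};
```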
Read Performance: Disk Read
Compression: LZO / QuickLZ / Gzip / Snappy
SSD / SATA / SAS: read access time, capacity, price per GB, idle vs. full power, MTBF
Concurrent writes/reads: resource isolation, manual control
Read Performance
Hypertable vs. HBase
Write Performance
Challenges
  HDFS operation pressure
  RangeServer write pressure
Improvements
  Compaction frequency tuning
  Row key reversal (see the sketch below)
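A minimal sketch of the row-key-reversal idea (the deck does not say exactly which part of the key Baidu reverses; this assumes the hostname prefix): keys that share a long common prefix concentrate consecutive inserts on one range, while reversing the prefix makes the leading characters vary, spreading inserts across ranges and range servers.

```cpp
#include <cstdio>
#include <string>

// Hypothetical illustration of row key reversal for keys like
// "hostNNNN_<timestamp>": reversing the host part scatters writes.
std::string reversed_row_key(const std::string& hostname, const std::string& ts) {
    std::string rev(hostname.rbegin(), hostname.rend());
    return rev + "_" + ts;
}

int main() {
    std::printf("%s\n", reversed_row_key("host0001", "1370000000").c_str());  // 1000tsoh_1370000000
    std::printf("%s\n", reversed_row_key("host0002", "1370000000").c_str());  // 2000tsoh_1370000000
}
```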
Conclusion
Application level: table design, load-in strategy, duplicate removal
High availability: metadata centralization, master standby, log/data separation, load balance, system active standby
Memory usage: memory pool, compaction strategy
Read/write performance: mem/SSD/SAS/SATA, block/query caching, compression strategy, resource isolation, compaction strategy
Hypertable versus HBase

Aspect                  | Hypertable                 | HBase
Community               | Hypertable                 | Apache
Company                 | Hypertable Inc             | Cloudera, HortonWorks
Implementation Language | Boost C++                  | Java
Memory Management       | Explicit memory management | Garbage collection
Cache Management        | Dynamic cache management   | Java heap for caching
Performance             | High                       | Normal
Compiler Optimization   | Easy                       | Hard
Compression             | Direct native compression  | JNI-based compression
Thanks for your Attention!
Questions?
Follow us: t.baidu-tech.com
Slides and more information: infoq.com/cn/zones/baidu-salon
Open exchange of ideas is the purpose of the Baidu Technology Salon, an offline technical meetup organized regularly by Baidu and InfoQ China. Its goal is to give mid- and senior-level engineers a relatively free platform for exchanging ideas and meeting peers. Each session focuses on a single topic and has two main parts: speaker presentations and OpenSpace. The presentations and on-site Q&A introduce advanced engineering practices at Baidu and other well-known sites; the OpenSpace session deepens and extends the salon's theme, providing an open forum in which any participant can raise a topic for discussion. Planned and organized by InfoQ.
Follow us: weibo.com/infoqchina