Neptune Distributed Data Storage
H.J. Kim, 2009.07
http://dev.naver.com/projects/neptune
http://www.openneptune.com
Data Tsunami: 1 billion books, 40 billion Web pages, 55 trillion Web links, 281 exabytes, 45 GB per person, 2x growth in 4 years
Think Various Storage! [Chart: storage options positioned by Data Complexity vs. Data Volume, e.g. NAS, Dynamo, Cassandra]
Neptune Distributed Data Storage
- Semi-structured data store (not a file system)
- Uses a distributed file system for data files
- Supports real-time and batch processing
- Google Bigtable clone (data model, architecture, features)
- Goal: 1,000 nodes, 100~200 GB per node, petabytes in total
Features
- Schema management: create, drop, modify table schema
- Real-time transaction
  - Single-row operations (no join, group by, order by)
  - Multi-row operations: like, between
- Batch transaction: scanner, uploader, MapReduce adapter
- Scalability: automatic table split & re-assignment
- Reliability
  - Data files stored in a distributed file system (HDFS, others)
  - Commit log stored in a ChangeLog cluster
  - Failover: tablet takeover time max. 2 min.
- Utility: Web Console, Shell (simple query), Data Verifier
Architecture
[Diagram: user applications and a distributed/parallel computing platform (MapReduce) sit on top of Neptune, the large-scale distributed data store, which consists of a Neptune Master and TabletServer #1..#n; a logical Table is mapped onto physical storage in a distributed file system (Hadoop or other)]
System Components
[Diagram]
- Master lock server: ZooKeeper / NChubby / Pleidas (master failover and events)
- Neptune Client: NTable, Scanner, Shell
- Neptune Master: active/standby pair (control, failover/event)
- Per node: TabletServer (Neptune), LogServer, DFS DataNode, Computing (Map&Reduce), local disk
- Data/control path between clients and TabletServers; control path through the Master
Data Model
[Diagram: a Table is split into Tablets (TabletA-1 .. TabletA-n); each Tablet holds a contiguous range of rows sorted by rowkey; each row (Row.Key) has columns (Column1 .. Column-n); each column holds cells sorted by cell key; each cell (Cell.Key) stores multiple timestamped values Cell.Value(t1) .. Cell.Value(tn)]
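The hierarchy above can be pictured, very roughly, as nested sorted maps. The sketch below is an illustration only (the class name, field names, and use of TreeMap are made up for this example); it is not Neptune's internal representation.

import java.util.TreeMap;

// Illustrative sketch only: Table -> rows sorted by rowkey -> named columns ->
// cells sorted by cell key -> timestamped values. Not Neptune's real classes.
public class DataModelSketch {
    // rowkey -> (column name -> (cell key -> (timestamp -> value)))
    static TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>>> table =
            new TreeMap<>();

    static void put(String rowKey, String column, String cellKey, long ts, byte[] value) {
        table.computeIfAbsent(rowKey, k -> new TreeMap<>())
             .computeIfAbsent(column, k -> new TreeMap<>())
             .computeIfAbsent(cellKey, k -> new TreeMap<>())
             .put(ts, value);
    }

    public static void main(String[] args) {
        put("rk-1", "Column1", "ck-1", 1L, "v1".getBytes());
        put("rk-1", "Column1", "ck-1", 2L, "v2".getBytes()); // second timestamped version
        put("rk-1", "Column1", "ck-2", 2L, "v3".getBytes());
        System.out.println(table.keySet()); // rows come back sorted by rowkey
    }
}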
Data Model Examples: 1:N relation
- One user has one or more friends; we want to look up all friends of a user.
- RDBMS: T_USER(id(pk), name, sex, age) joined with T_FRIEND(user_id, friend_id, type):
  select * from T_USER, T_FRIEND where T_USER.id = ? and T_USER.id = T_FRIEND.user_id
- Neptune: a single table T_USER_FRIEND with rowkey <user_id>, an "info" column holding the name, sex, and age cells, and a "friend" column holding one cell per friend, keyed by the friend's <user_id> with the friend type as its value (see the sketch below).
(HBase Schema Design Case Studies, http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies)
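A hedged sketch of this layout written against the client API shown later in this deck (NTable, Row, Cell); the table and column names follow the slide, while the value encoding and the iteration over getCellList() are assumptions.

// Hedged sketch: 1:N user -> friends layout with the NTable API from this deck.
// Table/column names follow the slide; the value encoding is an assumption.
NTable table = NTable.openTable("T_USER_FRIEND");

Row row = new Row(new Row.Key("user_123"));                    // rowkey = <user_id>
row.addCell("info", new Cell(new Cell.Key("name"), "Kim".getBytes()));
row.addCell("info", new Cell(new Cell.Key("sex"), "M".getBytes()));
// one cell per friend: cell key = friend's user id, value = friend type
row.addCell("friend", new Cell(new Cell.Key("user_456"), "school".getBytes()));
row.addCell("friend", new Cell(new Cell.Key("user_789"), "work".getBytes()));
table.put(row);

// "look up all friends of a user" = read one row and walk the friend column
Row stored = table.get(new Row.Key("user_123"));
for (Cell cell : stored.getCellList("friend")) {
    System.out.println(cell);                                  // friend id + type
}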
Data Model Examples: access log
- Each log line contains time, ip, domain, url (plus referer, login_id).
- The log will be analyzed every 5 minutes, hourly, daily, and weekly.
- RDBMS: T_ACCESS_LOG(time, ip, domain, url, referer, login_id)
- Neptune: T_ACCESS_LOG with rowkey <time><inc_counter> and columns "http" (ip, domain, url, referer cells) and "user" (login_id cell); see the key-building sketch below.
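The rowkey here concatenates the log time with an incrementing counter so rows sort chronologically yet stay unique. A minimal sketch of building such a key; the timestamp format, padding width, and the in-process AtomicLong counter are assumptions for illustration.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch of the <time><inc_counter> rowkey from the slide: rows sort
// chronologically, and the counter keeps keys unique within one second.
public class AccessLogKey {
    private static final AtomicLong counter = new AtomicLong();

    static String rowKey(Date logTime) {
        String time = new SimpleDateFormat("yyyyMMddHHmmss").format(logTime);
        long seq = counter.incrementAndGet() % 1000000;        // <inc_counter>
        return time + String.format("%06d", seq);              // <time><inc_counter>
    }

    public static void main(String[] args) {
        System.out.println(rowKey(new Date())); // e.g. 20090701123055000001
    }
}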
Data Model Examples: N:M relation
- One student takes many courses; one course has many students.
- RDBMS: T_Student(id(pk), name, sex, age), T_Course(id(pk), title, teacher_id), and a join table T_S_C(s_id, c_id, type).
- Neptune: T_Student with rowkey s_id, an "info" column (name, sex, age) and a "course" column whose cells are keyed c_id:<type>; T_Course with rowkey c_id, an "info" column (title, teacher_id) and a column whose cells are keyed s_id:<type>.
Data Operation
[Diagram: on put(key, value) the client sends the write to a TabletServer, which records it in the ChangeLog via a ChangeLogServer and inserts it into an in-memory MemoryTable; a minor compaction flushes the MemoryTable to a MapFile on HDFS (MapFile#1 .. MapFile#n); a major compaction merges these into a single merged MapFile; on get(key) a Searcher serves the read from the MemoryTable and the MapFiles]
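To make this flow concrete, below is a heavily simplified sketch of the same Bigtable-style write path (log append, sorted in-memory buffer, flush on a threshold, merge of flushed files). It illustrates the general technique only; the class, thresholds, and in-memory stand-ins for the ChangeLog and MapFiles are assumptions, not Neptune's code.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified, hedged sketch of the write path on this slide:
// put() appends to a change log, then updates an in-memory sorted table;
// a "minor compaction" flushes the memory table to an immutable sorted file;
// get() consults the memory table first, then the flushed files.
public class WritePathSketch {
    private final List<String> changeLog = new ArrayList<>();          // stands in for ChangeLogServer
    private TreeMap<String, String> memoryTable = new TreeMap<>();     // MemoryTable
    private final List<TreeMap<String, String>> mapFiles = new ArrayList<>(); // MapFiles on DFS
    private static final int FLUSH_THRESHOLD = 1000;

    public void put(String key, String value) {
        changeLog.add(key + "=" + value);          // 1. durable log record
        memoryTable.put(key, value);               // 2. in-memory sorted insert
        if (memoryTable.size() >= FLUSH_THRESHOLD) {
            minorCompaction();                     // 3. flush when the buffer is full
        }
    }

    public String get(String key) {
        String v = memoryTable.get(key);           // newest data first
        if (v != null) return v;
        for (int i = mapFiles.size() - 1; i >= 0; i--) {   // then the newest flushed file
            v = mapFiles.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }

    private void minorCompaction() {
        mapFiles.add(memoryTable);                 // write out one sorted "MapFile"
        memoryTable = new TreeMap<>();
        changeLog.clear();                         // logged entries are now persisted
    }

    private void majorCompaction() {               // merge all MapFiles into one
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> f : mapFiles) merged.putAll(f); // newer files overwrite older keys
        mapFiles.clear();
        mapFiles.add(merged);
    }
}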
Failover
[Diagram: master election through ZooKeeper. NeptuneMaster #1..#3 each try to lock /neptune_master on a 5-node ZooKeeper cluster (1. try lock); the master that gets the lock (2) is elected active (3) and an event is sent; when the active master fails (4), a standby gets the lock (5) and is elected active (6). A NeptuneClient asks ZooKeeper where the master is. The master holds no shared data and only manages tablet assignment. Each TabletServer holds its own lock (e.g. /tserver_host01); on a network failure, a TabletServer that cannot keep its lock kills itself.]
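Since Neptune's own lock layer (NChubby/Pleidas) is not shown in detail, here is a hedged, generic ZooKeeper sketch of the try-lock/elect pattern in the diagram: each candidate tries to create an ephemeral /neptune_master node, whoever succeeds is active, and the others watch the node so they can retry when it disappears. The ensemble address, timeouts, and node data are assumptions.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hedged sketch of the election pattern in the failover diagram, written
// against the plain ZooKeeper API (not Neptune's NChubby/Pleidas layer).
public class MasterElectionSketch implements Watcher {
    private final ZooKeeper zk;
    private final String myHost;

    public MasterElectionSketch(String ensemble, String myHost) throws Exception {
        this.zk = new ZooKeeper(ensemble, 3000, this);   // e.g. "zk1:2181,...,zk5:2181"
        this.myHost = myHost;
    }

    /** 1. Try lock: create an ephemeral node; only one candidate can succeed. */
    public boolean tryBecomeMaster() throws Exception {
        try {
            zk.create("/neptune_master", myHost.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;                                  // 2-3. got the lock -> active master
        } catch (KeeperException.NodeExistsException e) {
            zk.exists("/neptune_master", true);           // standby: watch for master failure
            return false;
        }
    }

    /** 4-6. When the active master's session dies, its ephemeral node is removed,
     *  the standbys are notified and race for the lock again. */
    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            try { tryBecomeMaster(); } catch (Exception ignored) { }
        }
    }
}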
Failover
- Master failure: only table schema management and tablet split are affected; handled by an active-standby pair.
- TabletServer failure: tablets are re-assigned by the Master; recovery within a few seconds to a few tens of seconds.
- ZooKeeper failure: configured as a 5-node cluster, so an outright failure is not expected.
- Hadoop NameNode failure: requires a separate high-availability scheme.
- Failure of the whole Hadoop cluster: the Neptune cluster fails as well.
TabletInputFormat
[Diagram: a MapReduce job reads TableA tablet by tablet; tablet locations come from the META table, and each tablet (TabletA-1 .. TabletA-N) is processed by a map task on a TaskTracker; map output is partitioned by key across reduce tasks, which write their results to TableB (Tablet B-1, B-2), a DBMS, or HDFS]
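A hedged sketch of wiring a Hadoop 0.19-era (old API) job to the TabletInputFormat named on this slide. The record types handed to the mapper (Row.Key/Row) and the "neptune.input.table" property are assumptions for illustration; the real contract should be checked against the Neptune source.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hedged sketch: scan a Neptune table with TabletInputFormat (one InputSplit
// per tablet, as in the diagram). Input key/value types are assumed.
public class TabletScanJob {

    /** Emits one (rowkey, 1) pair per row read from the table. */
    public static class ScanMapper extends MapReduceBase
            implements Mapper<Row.Key, Row, Text, LongWritable> {
        public void map(Row.Key rowKey, Row row,
                        OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            out.collect(new Text(rowKey.toString()), new LongWritable(1));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(TabletScanJob.class);
        conf.setJobName("neptune-tablet-scan");
        conf.set("neptune.input.table", "TableA");        // assumed property name
        conf.setInputFormat(TabletInputFormat.class);      // one map task per tablet
        conf.setMapperClass(ScanMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setNumReduceTasks(2);                         // partitioned by key, as in the diagram
        FileOutputFormat.setOutputPath(conf, new Path(args[0]));
        JobClient.runJob(conf);
    }
}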
Client
- Client API
  - Single-row operations: put/get
  - Multi-row operations: like, between
  - Batch operations: scanner/uploader
  - MapReduce: TabletInputFormat
- Command-line shell: NQL (Neptune Query Language), JDBC support
- Web Console
Client API Example TableShema tableschema = new TableSchema( T_TEST, new String[]{ col1, col2 }); NTable.createTable(tableSchema); NTable ntable = Ntable.openTable( T_TEST ); Row row = new Row(new Row.Key( RK1 )); Row.addCell( col1, new Cell(new Cell.Key( CK1 ), test_value.getbytes())); ntable.put(row); Row selectedrow = ntable.get(new Row.Key( RK1 )); System.out.println(selectedRow.getCellList( col1 ).get(0)); TableScanner scanner = ScannerFactory.openScanner(ntable, new String[]{ col1 }); Row scanrow = null; while( (scanrow = scanner.next()) = null) { System.out.println(selectedRow.getCellList( col1 ).get(0)); } scanner.close();
Neptune Shell
- Data definition: CREATE TABLE, DROP TABLE, SHOW TABLES, DESC
- Data manipulation: SELECT, DELETE, INSERT, TRUNCATE COLUMN, TRUNCATE TABLE, SET CHARSET
- Cluster monitoring: PING, TABLETSERVER REPORT, TABLE SHOW, USERS, STOP ACTION
Web Console
Performance Experiment (number of 1000-byte values read/written per second)

Operation        | Neptune | HBase  | HBase (cache)
Random read      | 495     | 578    | 1,623
Random write     | 1,223   | 2,864  | 8,300
Sequential read  | 498     | 600    | 2,109
Sequential write | 1,327   | 2,635  | 6,553
Scan             | 40,329  | 22,795 | 30,840
Comparison with Bigtable and HBase

                | Neptune                 | Bigtable     | HBase
File system     | Hadoop DFS or other DFS | GFS          | Hadoop DFS
Computing       | Hadoop or others        | MapReduce    | Hadoop
Master failover | Yes (ZooKeeper)         | Yes (Chubby) | 0.20 (ZooKeeper)
Script language | No (NQL)                | Sawzall      | No
Change log      | Separately configured   | GFS          | HDFS + memory
API             | Java, Thrift, REST      | C++          | Java, Thrift, REST
ACL             | Yes                     | Yes          | No
Memory table    | No                      | Yes          | No
Scanner         | Yes                     | Yes          | Yes
Uploader        | Yes                     | Unknown      | No
Storage Comparison

Storage                         | Data volume (scalability) | Real-time processing | Data complexity | Reliability | Analysis integration | Cost
Local Disk                      | X             | X | X | X             | X | Low
NAS                             | O             | X | X | O             | X | Middle
RDBMS                           | X             | O | O | O             | X | High
Distributed RDBMS               | O (difficult) | O | O | O (difficult) | X | Very High
Hadoop                          | O             | X | X | O             | O | Low
Bigtable family (Neptune, etc.) | O             | O | O | O             | O | Low
Dynamo family (Dynomite, etc.)  | O             | O | X | O             | O | Low
Google Infra Usage
- Application type: real-time lookup of large-scale data
- Application type: large-scale data analysis + real-time lookup
- Application type: large-scale data storage + analysis
- Application type: large-scale data storage
Neptune Usage
[Diagram: web servers use an RDBMS master/slave pair for looking up analysis results, and Neptune clusters for real-time processing of META data, real-time processing of bulk data, and real-time lookup of input/analysis data; batch processing on Hadoop uses Neptune for analysis input/output and analysis output, with attachment files also stored in Neptune]
Stress Test
- Cluster: 43 nodes
  - 1 Hadoop NameNode, 42 DataNodes
  - 1 JobTracker, 20 TaskTrackers
  - 7 TabletServer nodes (2 GB heap each)
  - 15 ChangeLogServer nodes
  - Disk: Hadoop and ChangeLogServer use different disks
  - Hadoop 0.19.0
- Map tasks: 1,024 map tasks, 1 GB per map, 2 maps per TaskTracker, 40 map tasks running concurrently, map-only job
- Data: 1 row = 10,000 bytes; 1 TB in total, 110 million rows
[Chart: TPS per TabletServer and accumulated data (GB) per TabletServer over elapsed time (min)]
[Chart: elapsed time (sec) of each map task, by map ID]
Test Result
- Elapsed time: 11 h 40 min
- Average TPS per TabletServer: 394; average TPS per cluster: 2,758
- Average put latency: 9 ms
- Total number of tablets: 8,133; average tablet size: 130 MB
- Per TabletServer: 1,162 tablets, 143 GB of service data
- Heap usage per TabletServer: 935,746 KB free of 2,080,128 KB total
Powered by Neptune: GAIA (http://www.gaiaville.com), Cloud Searchable Storage Service
Milestone
- Neptune 1.4 release (2009.07): minimize lock time during tablet split, support Ganglia metrics, Tablet Balancer, add start key to META records
- Neptune 1.5 (2009.10): get performance improvements (DFS block cache, Bloom filter), Hive query integration, tablet assignment policy
Join the Neptune project: http://www.openneptune.com
Question http://dev.naver.com/projects/neptune http://www.openneptune.com babokim@gmail.com