Hadoop Open Platform-as-a-Service (Hops)

1 Hadoop Open Platform-as-a-Service (Hops)
Academics: Jim Dowling, Seif Haridi
PostDocs: Gautier Berthou (SICS)
PhDs: Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Ali Gholami
R/Engineers: Stig Viaene (SICS), Steffen Grohschmeidt
MSc Students: Theofilos Kakantousis, Nikolaos Stangios, Sri Srijeyanthan, Vangelos Savvidis, Seçkin Savaşçı

2 Why is Big Data Important? In a wide array of academic fields, the ability to process data effectively is superseding other, more classical modes of research. "More data trumps better algorithms"* (* The Unreasonable Effectiveness of Data [Halevy, Norvig et al. 09])

3 Background: Hadoop Filesystem and MapRed

4-13 HDFS: Hadoop Filesystem [Diagram built up over slides 4 to 13. Labels: Name node; Data nodes; write /crawler/bot/jd.io/1; Rebalance; Heartbeats; Under-replicated blocks]

14-19 Big Data Processing with No Data Locality [Diagram built up over slides 14 to 19. Labels: Job( /genomes/jim.bam ); submit; Workflow Manager; Compute Grid Node; Job] This doesn't scale. Bandwidth is the bottleneck.

20-24 MapReduce Data Locality [Diagram built up over slides 20 to 24. Labels: Job( /genomes/jim.bam ); submit; Job Tracker; Task Tracker x6; DN x6; a Job on each Task Tracker; R = result file(s)]
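
The contrast between the two diagrams (ship the data to the job versus ship the job to the data) can be illustrated with a toy placement function. This is only a sketch under stated assumptions: `assign_tasks`, `block_locations`, and `free_slots` are hypothetical names, and the greedy policy is a simplification, not the Job Tracker's actual scheduler.

```python
def assign_tasks(block_locations, free_slots):
    """Place each block's task on a node that already stores the block if possible.

    block_locations: block id -> list of nodes holding a replica (hypothetical input)
    free_slots:      node -> number of free task slots (hypothetical input)
    Returns block id -> (chosen node, True if the read is node-local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    placement = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        # Fall back to any node with a free slot; that task reads over the network.
        candidates = local or [n for n, s in slots.items() if s > 0]
        if not candidates:
            break  # no capacity left; remaining blocks stay unplaced
        node = candidates[0]
        slots[node] -= 1
        placement[block] = (node, bool(local))
    return placement
```

Every placement flagged `False` costs a block transfer over the network, which is the bandwidth bottleneck the earlier slides point to.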

25 The NameNode

26 HDFS: Hadoop Filesystem [Diagram labels: Name node; Data nodes]

27 HDFS NameNode
- Stores mappings: path_component -> inode; inode -> {block1, block2, block3, ...}; block -> {replica1, replica2, replica3}
- External API to HDFS Clients; Internal API to DataNodes
- Monitors DataNodes for failures and corrupted data
- Manages Leases, Quotas, (re-)replication
- Must do all this in a single JVM: Spotify have a 90 GB heap storing references to 300m files
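
The three mappings on this slide can be pictured as chained lookups. The following is a minimal Python sketch using plain dictionaries; the class and method names (`Namespace`, `resolve`, `locate`) are hypothetical, and the real NameNode is Java code that keeps far more state per inode.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    # (parent inode id, path component) -> inode id
    children: dict = field(default_factory=dict)
    # inode id -> ordered list of block ids
    blocks: dict = field(default_factory=dict)
    # block id -> set of DataNode names holding a replica
    replicas: dict = field(default_factory=dict)

    ROOT = 0  # inode id of "/" (an assumption of this sketch)

    def resolve(self, path):
        """Walk the path one component at a time and return the inode id."""
        inode = self.ROOT
        for component in (c for c in path.split("/") if c):
            inode = self.children[(inode, component)]
        return inode

    def locate(self, path):
        """Return [(block id, sorted replica locations)] for a file."""
        inode = self.resolve(path)
        return [(b, sorted(self.replicas[b])) for b in self.blocks.get(inode, [])]
```

All three dictionaries live in one process here, which mirrors the slide's point: the whole namespace must fit in a single heap.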

28-35 High Availability for the NameNode (HDFS 2.x) [Diagram built up over slides 28 to 35. Labels: NN Active; NN Standby; Checkpt NN; JN x3; ZK x3; DN x4]
- Master-Slave Replication of NN State.
- Shared NN log stored in a quorum of journal nodes.
- Agreement on the Active Master.
- Faster Recovery, Cut Journal Log.
- DOESN'T SCALE OUT!

36-37 The Evolution of the NameNode
- HDFS (2006): In-memory metadata
- HDFS 0.07 (2006): WAL (EditLog), FSImage
- HDFS 0.21 (2009): Weaken Global Lock
- HDFS 2.0 (2011): Eventually Consistent Replication: HA-NameNode
They reinvented the Database for the NameNode!

38-42 Databases had these features long ago
- Oracle v6 (1988): Redo and Undo Logs, Rollback Segments
- Oracle V7.1 (1994): Symmetric Replication
... and have continued to evolve:
- Oracle 9i RAC (2001): Shared State Replication

43-57 The end of the One-size-fits-All Database
- Columnar Databases: Vertica, Hana
- NewSQL Databases: MySQL Cluster, VoltDB, Memstore, AtlasDB, FoundationDB
- Graph Databases: Neo4J
- RDBMSes: MySQL, Postgres, DB2, Oracle, SQLServer
- In-Memory Stores: Memcached, Redis
- Key-Value Stores: Dynamo, Cassandra, MongoDB, Riak
- Petabyte Databases: BigQuery (Google), RedShift (Amazon), Impala (Cloudera)
Stonebraker et al, One Size Fits All: An Idea Whose Time Has Come and Gone,

58 MySQL Cluster (NDB): Shared Nothing DB
- SQL API and NDB API
- 30+ million update transactions/second on a 30-node cluster
- Distributed, In-memory
- 2-Phase Commit: Replicate DB, not the Log!
- Real-time: Low TransactionInactive timeouts
- Commodity Hardware
- Scales out: Millions of transactions/sec, TB-sized datasets (48 nodes)
- Split-Brain solved with Arbitrator Pattern
- SQL and Native Blocking/Non-Blocking APIs

59 HopsFS

60 HopsFS
- Customizable and Scalable Metadata
- High throughput for read and write operations
- NameNode failover time 5 seconds (vs ~1 minute for HDFS)

61 Request Handling (Apache HDFS vs HopsFS) [Two diagrams: Apache HDFS NameNode Request Handling; HopsFS NameNode Request Handling]

62 Fine-Grained Locking, Transactional Updates NDB gives us the READ_COMMITTED isolation level, which is not strong enough. We implemented Serializability for FS operations using implicit locking in the DAG and row-level locking in NDB. [Hakimzadeh, Peiro, Dowling, Scaling HDFS with a Strongly Consistent Relational Model for Metadata, DAIS 2014]

63-67 Preventing Deadlocks and Starvation [Diagram built up over slides 63 to 67. Labels: read; mv; block_report; /user/jdowling/dna.bam] Solution: all request threads for inode operations traverse the FS hierarchy in the same order, acquiring locks in the same order. Block-level operations have to follow the same order.
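
The same-order rule can be sketched in a few lines of Python. Everything named here (`lock_order`, `locked`, the in-process lock registry) is hypothetical and only illustrates the idea; it takes an exclusive lock on every ancestor, which is much coarser than the row-level locking in NDB that the slides describe.

```python
import threading
from contextlib import contextmanager

_locks = {}                          # path -> lock (stand-in for row locks)
_registry_guard = threading.Lock()   # protects the registry itself


def _lock_for(path):
    with _registry_guard:
        return _locks.setdefault(path, threading.Lock())


def lock_order(path):
    """Return every ancestor of path plus path itself, root first."""
    parts = [p for p in path.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


@contextmanager
def locked(*paths):
    """Lock all needed inodes in one global order: by depth, then by name."""
    needed = sorted(
        {p for path in paths for p in lock_order(path)},
        key=lambda p: (0 if p == "/" else p.count("/"), p),
    )
    held = []
    try:
        for p in needed:
            lock = _lock_for(p)
            lock.acquire()
            held.append(lock)
        yield needed
    finally:
        for lock in reversed(held):
            lock.release()
```

Because a `mv` touching two paths and a `read` touching one both sort their lock sets the same way, no two threads can each hold a lock the other is waiting for.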

68 Per Transaction Cache Experimentation revealed many roundtrips to the database per transaction. Cache intermediate transaction results at NameNodes. We also use Memcached at each NameNode to cache mappings of: path->{inode/blocks/replicas}
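
A per-transaction cache of this kind can be sketched as a small read-through map. The class name `TransactionCache` and its `fetch` callback are assumptions made for illustration; they are not taken from the HopsFS code base.

```python
class TransactionCache:
    """Remembers rows read during one transaction so repeats skip the database."""

    def __init__(self, fetch):
        self._fetch = fetch      # function: key -> row (one database roundtrip)
        self._rows = {}
        self.roundtrips = 0      # counted only to make the saving visible

    def get(self, key):
        if key not in self._rows:
            self.roundtrips += 1
            self._rows[key] = self._fetch(key)
        return self._rows[key]

    def put(self, key, row):
        """Stage an update locally; a real commit would flush staged rows in a batch."""
        self._rows[key] = row
```

For example, with `db = {"inode:7": "dna.bam"}` and `cache = TransactionCache(db.__getitem__)`, calling `cache.get("inode:7")` twice touches `db` only once. The cache is meant to be discarded when the transaction ends, so it never serves rows across transactions.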

69 Sometimes, Transactions Just Ain't Enough
- Large Subtree Operations with millions of Inodes can't be executed in a single Transaction, due to the low timeouts for Transactions (real-time).
- Subtree Operations: 4-phase Protocol. Sacrifices Atomicity, but keeps Isolation and Consistency.
- Batch operations and multithreading for performance.
- Failed NameNodes handled transparently.
- Leases used to handle failed clients.

70 Leader Election using the Database (NDB) We need a leader NameNode to coordinate replication and lease management. Use NDB as shared memory for Leader Election. No more Zookeeper, yay!
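
The slide does not spell out the election protocol, so the following is only one way a shared table could be used for it, not necessarily what Hops does. In this assumed scheme each NameNode periodically increments its own counter row, and the leader is the smallest node id whose counter advanced since the previous look. `SharedTable` and `elect` are hypothetical names, and a dictionary stands in for the NDB table.

```python
class SharedTable:
    """Stands in for a database table that every NameNode can read and update."""

    def __init__(self):
        self.counters = {}  # node id -> heartbeat counter

    def heartbeat(self, node_id):
        # In a real database this read-modify-write would run in a transaction.
        self.counters[node_id] = self.counters.get(node_id, 0) + 1

    def snapshot(self):
        return dict(self.counters)


def elect(previous, current):
    """Return the smallest node id whose counter advanced between two snapshots."""
    alive = [n for n, c in current.items() if c > previous.get(n, 0)]
    return min(alive) if alive else None
```

If node 1 stops heartbeating, its counter stops advancing and the next comparison of snapshots hands leadership to node 2 without any external coordination service.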

71 HopsFS Internal Protocol Scalability On 100PB+ clusters, internal protocols make up most of the network traffic for HDFS. Block Reporting and Exiting Safe Mode: Batching and work stealing.

72 HopsFS Write Performance [Chart] 1 Gbit Network. Nodes: 12-core Xeon 2.8 GHz. 2-Node NDB Cluster.

73 HopsFS Read Performance [Chart] 1 Gbit Network. Nodes: 12-core Xeon 2.8 GHz. 2-Node NDB Cluster.

74-76 HopsFS Erasure Coding
- HDFS 2.x Triple Replication (300%)
- 2x Replication + XOR (220%)
- Reed-Solomon (140%)
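
The three percentages can be reproduced with simple arithmetic. The stripe parameters below (10 data blocks per stripe, 1 XOR parity block, 4 Reed-Solomon parity blocks) are assumptions chosen because they yield the slide's numbers; the slides themselves do not state them.

```python
def overhead_percent(data_blocks, parity_blocks, replication):
    """Raw storage used per unit of user data, as a percentage."""
    stored = (data_blocks + parity_blocks) * replication
    return round(100 * stored / data_blocks)


# Assumed parameters, not taken from the slides:
triple = overhead_percent(data_blocks=1, parity_blocks=0, replication=3)
xor_2x = overhead_percent(data_blocks=10, parity_blocks=1, replication=2)
reed_solomon = overhead_percent(data_blocks=10, parity_blocks=4, replication=1)
```

Under these assumptions `triple`, `xor_2x`, and `reed_solomon` come out as 300, 220, and 140, matching the three figures above.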

77 HopsFS Erasure Coding [Two charts: Data durability with Triple Replication; Data durability with Reed-Solomon]

78 Comparison with HDFS-RAID

79 We did the same for YARN

80 Apache Hadoop YARN HA/Scaleout Limitations [Diagram labels: Clients; Zookeeper; Primary RM; Standby RM; NM x5]
- The Resource Manager (RM) is a bottleneck.
- Zookeeper throughput is not high enough to persist all RM state.
- The standby Resource Manager can only recover partial state.
- All running jobs must be restarted.
- RM state is not queryable.

81 Hops YARN: Present [Diagram labels: Client; NDB x3; Scheduler; RT x2; NM x5]
- NDB is faster than Zookeeper; the complete state of the system is stored in the database.
- Provides transparent failover.
- The standby Resource Manager is replaced by several Resource Trackers (RT).
- Node Manager heartbeats are handled by the Resource Trackers.
- Reducing the load on the Scheduler allows more Node Managers and/or more frequent heartbeats.

82 Hops YARN: Future [Diagram labels: Client; NDB x3; RM x2; NM x5]
- The RM is a State-Machine.
- Almost no session state to manage.
- Transparent failover working.

83 Hops YARN
- HA-YARN (done)
- Distributed Resource Tracker Service (ongoing)
- Make YARN more interactive (ongoing): Reduce NodeManager Heartbeat Time

84-86 Hops-Hadoop: Exabyte-Scale Hadoop [Diagram built up over slides 84 to 86. Labels: NDB x4; NN x3; RM x3; HDFS; YARN; DN + NM x12]

87 The Hops Stack Continued

88-90 Bringing Data People Together
- Data Owners: Metadata, Ingestion; Non-programmers
- Data Scientists: Data analysts; Programmers
[Stack diagram: HopsHub; Spark, Flink, Adam, Cuneiform; Hops-YARN; Hops-HDFS; Karamel/PaaS]

91 Perimeter Security and Multi-Tenancy HopsHub - Project-level RBAC Hadoop trusted proxy - Analytics Plugin Framework Adam, Cuneiform, Spark, Flink, MR - REST APIs Network Isolation Kerberos LIMS Related Hadoop Security Projects Knox, Sentry, Rhino LDAP

92-94 Projects for Multi-Tenancy; Activity Trails [Screenshots built up over slides 92 to 94. Labels: Project; Global Activity Trail]

95 Project Membership

96-98 File Browser (Iceberg) [Screenshot label: HDFS Files]

99-103 Upload Data
- Overcome 3 GB browser upload limit
- Automated Ingestion of Data (Apache Flume)

104 Run Cuneiform Workflows on YARN

105 PaaS support with Chef/Karamel Support for EC2, Vagrant, Bare Metal.

106 Conclusions
- Hops will be the first European distribution of Hadoop when released. First beta release coming in Q
- Lots of ideas for future work: Tighter Spark, Flink integration; BiobankCloud support