Alexey Vovchenko, Ph.D. (Candidate of Technical Sciences), Senior Researcher, CMC MSU, IPI RAN



42 [Figure: Zettabytes, Petabytes; sharding a table (columns Id, Fn, Ln, Addr; rows "1, Fred, Jones, Liberty, NY" and "2, John, Smith, ??????") across shards A, B, C]

43 122+ NoSQL Database Offerings Today! 4 Dominant Flavors

The Buzz:
- Regular machine failure, data center outages, and network service interruptions happen frequently
- Need is higher volume, fewer features
- Existing RDBMS do not automatically manage the distribution of data over available hardware
- Sharding solutions over RDBMS introduce large overhead
- High-scale RDBMS too expensive for increased data volume
- Need for a flexible data model
- Need for a low-latency, low-overhead API to access data
- Need to scale out on cheap commodity hardware
- Increased use of distributed analytics

44 Simple Key Value Stores
- Simplest NoSQL store; provides low-latency writes but only single key/value access
- Stores data as a hash table of keys where every key maps to an opaque binary object
- Easily scales across many machines; does not support other data types
- Use cases: apps that require massive amounts of simple data (sensor, web ops), apps that require rapidly changing data (stock quotes), caching
- Examples: Memcached, Dynamo

Document Stores
- Represent rich, hierarchical data structures, reducing the need for multi-table joins
- Structure of the documents need not be known a priori, can be variable, and can evolve instantly, but queries can understand the contents of the document
- Applications: rapid ingest and delivery for evolving schemas and web-based objects
- Examples: MongoDB, CouchDB (Couchbase)

Column Family
- Manages structured data, with multiple-attribute access
- Columns are grouped together in column families/groups
- Each storage block contains data from only one column/column set to provide data locality for hot columns
- Column groups are defined a priori, but a variable schema is supported within a column group
- Scales using replication and multi-node distribution for high availability and easy failover
- Optimized for writes (writes faster than reads)
- Applications: high-throughput verticals (activity feeds, message queues), caching, web ops
- Examples: HBase, Cassandra, BigTable, Amazon Dynamo

Graph Store
- Uses nodes, relationships between nodes, and key-value properties
- Accesses data using graph traversal, navigating from start nodes to related nodes according to graph algorithms
- Faster for associative data sets
- Uses a schema-less, bottom-up model for capturing ad hoc and rapidly changing data
- Common model: RDF
- Applications: storing and reasoning on complex and connected data, e.g. inferencing applications in healthcare, government, telecom, and oil; performing closure on social networking graphs
- Examples: Neo4j, DB2
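The key-value flavor is the easiest of the four to sketch. The following is a minimal illustration of "a hash table of keys mapping to opaque binary objects" spread over several machines; it is not how Memcached or Dynamo are implemented. The class name `ShardedKV`, the three in-memory "nodes", and the choice of MD5 modulo the node count are all assumptions made for the example.

```python
import hashlib


class ShardedKV:
    """Toy key-value store: opaque byte values, hash-partitioned over N nodes."""

    def __init__(self, num_nodes=3):
        # Each "node" is just an in-memory dict in this sketch.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # A stable hash of the key decides which node owns it.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.nodes)

    def put(self, key, value: bytes):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        # Single-key access only: there is no query on the value's contents.
        return self.nodes[self._node_for(key)].get(key)


store = ShardedKV(num_nodes=3)
store.put("sensor:42", b"\x00\x17")   # hypothetical sensor reading
store.put("quote:IBM", b"190.25")     # hypothetical stock quote
print(store.get("quote:IBM"))         # b'190.25'
```

The store never interprets the value, which is why this model scales out easily but offers no access path other than the key.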

77 [Figure: a table with rows A, B, C and columns A, B, C grouped into Family 1 and Family 2; cells hold typed values (Integer, Long, Huge URL) with timestamped versions (Timestamp1, Timestamp2)]

Note: Column Families contain Columns with time-stamped versions. Columns only exist when inserted (i.e. sparse).
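The two properties in the note, sparseness and timestamped versions, can be modeled with nested dictionaries. This is only a sketch of the logical data model, not of HBase's storage format; the function names `put` and `get`, the row and column names, and the integer timestamps are assumptions.

```python
from collections import defaultdict

# table[row][(family, qualifier)] -> {timestamp: value}
# A cell exists only once something has been put there (sparse).
table = defaultdict(lambda: defaultdict(dict))


def put(row, family, qualifier, value, ts):
    """Store one version of a cell under the given timestamp."""
    table[row][(family, qualifier)][ts] = value


def get(row, family, qualifier, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    versions = table.get(row, {}).get((family, qualifier), {})
    return sorted(versions.items(), reverse=True)[:max_versions]


put("row-a", "family1", "col_a", 1, ts=1)
put("row-b", "family1", "col_b", 10, ts=1)
put("row-b", "family1", "col_b", 20, ts=2)   # newer version of the same cell

print(get("row-b", "family1", "col_b"))                  # [(2, 20)]
print(get("row-b", "family1", "col_b", max_versions=2))  # [(2, 20), (1, 10)]
print(get("row-a", "family1", "col_b"))                  # [] : never inserted
```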

80
Feature | HBase | RDBMS
Data Layout | A sparse, distributed, persistent multidimensional sorted map | Row or column oriented
Transactions | ACID support on single row only | Yes
Query Language | get/put/scan only, unless combined with Hive or other technology | SQL
Security | Authentication/Authorization | Authentication/Authorization
Indexes | Row key only or special table | Yes
Throughput | Millions of queries per sec | Thousands of queries per sec
Maximum Database Size | PBs | TBs
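To make the "get/put/scan only" and "row key only" entries concrete, here is a small in-memory sketch of a sorted map keyed by row key. It illustrates the access pattern, not the HBase client API; the class `SortedTable`, its method signatures, and the sample rows are assumptions for the example.

```python
import bisect


class SortedTable:
    """Toy sorted map: rows are reachable only through their row key."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        return self._rows.get(row_key)

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= key < stop_row, in key order."""
        lo = bisect.bisect_left(self._keys, start_row)
        hi = bisect.bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


t = SortedTable()
for name in ["smith", "johnson", "adams"]:
    t.put(name, "info:lname", name.title())

print(t.get("adams"))                        # {'info:lname': 'Adams'}
print([k for k, _ in t.scan("j", "t")])      # ['johnson', 'smith']
```

Anything that cannot be phrased as a key lookup or a key-range scan, such as "all rows whose last name is Smith" when the key is something else, needs a full scan or a separately maintained index table.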

81 Given this sample RDBMS table (SSN is the primary key):
SSN | Last Name | First Name | Account Number | Type of Account | Timestamp
 | Smith | John | abcd1234 | Checking |
 | Johnson | Michael | wxyz1234 | Checking |
 | Johnson | Michael | aabb1234 | Checking |

82
Row key | Value (CF, Column, Version, Cell)
 | info: { lastname : Smith, firstname : John } acct: { checking : abcd1234 }
 | info: { lastname : Johnson, firstname : Michael } acct: { checking : checking :

83 info Column Family
Row Key | Column Key | Timestamp | Cell Value
 | info:fname | | John
 | info:lname | | Smith
 | info:fname | | Michael
 | info:lname | | Johnson

acct Column Family
Row Key | Column Key | Timestamp | Cell Value
 | acct:checking | | abcd
 | acct:checking | | wxyz
 | acct:checking | | aabb1234

[Figure: structure of a Key/Value: the Key consists of Row, Column Family, Column Qualifier, and Timestamp, followed by the Value]
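The step from a logical row to individual key/value cells can be sketched as follows. Each cell carries the full key (row, column family, column qualifier, timestamp) next to its value, and cells are ordered by row, family, and qualifier ascending with the newest timestamp first. The row key `row1`, the integer timestamps, and the second account number are hypothetical values chosen for the example; the names `KeyValue`, `to_cells`, and `sort_key` are likewise assumptions, not HBase classes.

```python
from typing import NamedTuple


class KeyValue(NamedTuple):
    row: str
    family: str
    qualifier: str
    timestamp: int
    value: str


def to_cells(row_key, families, timestamp):
    """Flatten {family: {qualifier: value}} into one KeyValue per cell."""
    return [
        KeyValue(row_key, family, qualifier, timestamp, value)
        for family, columns in families.items()
        for qualifier, value in columns.items()
    ]


def sort_key(kv):
    # Ascending row, family, qualifier; descending timestamp (newest first).
    return (kv.row, kv.family, kv.qualifier, -kv.timestamp)


cells = to_cells(
    "row1",
    {"info": {"fname": "John", "lname": "Smith"}, "acct": {"checking": "abcd1234"}},
    timestamp=1,
)
# A later write to the same cell adds a version instead of overwriting it.
cells += to_cells("row1", {"acct": {"checking": "zzzz0000"}}, timestamp=2)

ordered = sorted(cells, key=sort_key)
for kv in ordered:
    print(kv)
```

Because every cell repeats its row key, family, and qualifier, long names cost storage on every value, which is one reason short family and qualifier names are commonly recommended.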

84 [Figure: structure of a Key/Value: the Key consists of Row, Column Family, Column Qualifier, and Timestamp, followed by the Value]

85 [Figure: HBase components: HBase API, Region Servers, Master, HFile, MemStore, Write-Ahead Log, ZooKeeper, HDFS]

96 The data is sharded. Each shard contains all the data in a key range.

[Figure: logical view of a table with rows keyed A..Z, auto-sharded into regions Keys:[A-D], Keys:[E-H], Keys:[I-M], Keys:[N-R], Keys:[S-V], Keys:[W-Z], distributed over Region Server 1, Region Server 2, and Region Server 3]
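Because each region owns a contiguous key range, finding the region for a row key is a binary search over the sorted region start keys. The sketch below uses the six key ranges from the slide; the assignment of regions to the three region servers is hypothetical, and `locate` is a name invented for the example rather than an HBase call.

```python
import bisect

# Start keys of the six regions from the slide, in sorted order.
REGION_START_KEYS = ["A", "E", "I", "N", "S", "W"]

# Hypothetical assignment of those regions to region servers.
REGION_SERVERS = [
    "Region Server 1", "Region Server 1",
    "Region Server 3", "Region Server 2",
    "Region Server 2", "Region Server 3",
]


def locate(row_key):
    """Return (region start key, server) for the region holding row_key."""
    # The owning region is the last one whose start key is <= row_key.
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    if idx < 0:
        raise KeyError(f"no region covers {row_key!r}")
    return REGION_START_KEYS[idx], REGION_SERVERS[idx]


print(locate("Adams"))     # ('A', 'Region Server 1')
print(locate("Johnson"))   # ('I', 'Region Server 3')
print(locate("Smith"))     # ('S', 'Region Server 2')
```

Keys are compared as plain strings here, so the example assumes row keys that start with an upper-case letter, as on the slide.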


103-104 [Figure: a client and a region server; the region server holds an HLog and regions; each region contains stores, and each store has a MemStore and StoreFiles backed by HFiles]
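The components named in this figure fit together roughly as follows: a write is appended to the log first, then buffered in the MemStore, and a full MemStore is flushed to a new immutable file; a read checks the MemStore and then the files from newest to oldest. The sketch below is a simplified model under those assumptions. The class `Store`, its `flush_threshold` parameter, and the use of plain dicts in place of HFiles are invented for the example.

```python
class Store:
    """Toy model of one store: a write-ahead log, a MemStore, and flushed files."""

    def __init__(self, flush_threshold=2):
        self.wal = []           # stands in for the HLog
        self.memstore = {}      # in-memory buffer of recent writes
        self.store_files = []   # immutable sorted snapshots (stand-ins for HFiles)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # 1. log first, so the write can be replayed
        self.memstore[key] = value      # 2. then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. persist the MemStore as a new sorted, immutable file and clear it
        self.store_files.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        # Newest data wins: MemStore first, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):
            if key in store_file:
                return store_file[key]
        return None


s = Store(flush_threshold=2)
s.put("a", "1")
s.put("b", "2")    # second write reaches the threshold and triggers a flush
s.put("a", "3")    # newer value for "a" now lives only in the MemStore

print(s.get("a"), s.get("b"))   # 3 2
print(len(s.store_files))       # 1
```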


107 [Figure: client access paths into an HBase cluster. External gateway clients: Thrift, REST, and Avro clients connecting through Thrift, REST, and Avro servers. Batch clients: MapReduce, Hive, Pig, Jaql. External API clients: Java client, JRuby, AsyncHBase, Thrift client via Thrift server. Clients reach the region servers through HTable; each region server holds an HLog and stores with a MemStore and StoreFiles (HFiles).]


110 [Figure: structure of a Key/Value: the Key consists of Row, Column Family, Column Qualifier, and Timestamp, followed by the Value]


114 An Example

    //Create an HBase Table
    import hbase(*);
    T = hbasestring('test', schema { key: string, f1?: {*:string}, f2?: {*:string}, f3?: {*:string} }, replace=true);

    //Define the Rows
    data = [
      { key: "1", f1: { Name: "Bruce Brown"}, f2: { Address: "BigData Blvd, Washington DC 20001" }, f3: { "brownb@us.ibm.com"} },
      { key: "2", f1: { Name: "John Smith"}, f3: { "jsmith@us.ibm.com"} }
    ];

    //Put the data into the HBase Table
    data -> write(T);

Speeds Development of HBase Applications!


116 InfoSphere BigInsights
- Visualization & Exploration
- Development Tools
- Advanced Engines
- Connectors
- Workload Optimization
- Administration & Security
- IBM-certified Apache Hadoop


118
- HBase GUI for queries
- Integration with BigSheets for advanced analytics
- Advanced support and analytics through Jaql


120 [Figure: DB2 / InfoSphere Warehouse / Netezza tools and apps; DB2 / InfoSphere Warehouse; BigInsights (HBase); HDFS]


123
Process | Heap | Description
NameNode | 8 GB | About 1 GB of heap for every 100 TB of raw data stored, or per million files/inodes
Secondary NameNode | 8 GB | Applies the edits in memory, and therefore needs about the same amount as the NameNode
JobTracker | 2 GB | Moderate requirements
HBase Master | 4 GB | Usually lightly loaded, moderate requirements only
DataNode | 1 GB | Moderate requirements
TaskTracker | 1 GB | Moderate requirements
HBase RegionServer | 12 GB | Majority of available memory, while leaving enough room for the operating system (for the buffer cache) and for the Task Attempt processes
Task Attempts | 1 GB (ea.) | Multiply by the maximum number you allow for each
ZooKeeper | 1 GB | Moderate requirements
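The figures in this table combine by simple addition. The sketch below works through two cases: the total heap on a worker node and the NameNode rule of thumb. It assumes that a worker runs a DataNode, a TaskTracker, and an HBase RegionServer side by side, which the table does not state; the six task attempts and the 800 TB of raw storage are hypothetical inputs, and the function names are invented for the example.

```python
# Heap sizes in GB, copied from the table above.
HEAP_GB = {
    "NameNode": 8,
    "Secondary NameNode": 8,
    "JobTracker": 2,
    "HBase Master": 4,
    "DataNode": 1,
    "TaskTracker": 1,
    "HBase RegionServer": 12,
    "Task Attempt": 1,
    "ZooKeeper": 1,
}


def worker_heap_gb(max_task_attempts):
    """Heap for DataNode + TaskTracker + RegionServer + all task attempts."""
    fixed = (
        HEAP_GB["DataNode"]
        + HEAP_GB["TaskTracker"]
        + HEAP_GB["HBase RegionServer"]
    )
    return fixed + max_task_attempts * HEAP_GB["Task Attempt"]


def namenode_heap_gb(raw_storage_tb):
    """Rule of thumb from the table: about 1 GB of heap per 100 TB of raw data."""
    return raw_storage_tb / 100


print(worker_heap_gb(max_task_attempts=6))   # 1 + 1 + 12 + 6 = 20
print(namenode_heap_gb(800))                 # 8.0
```

The worker total covers Java heap only; the operating system and its buffer cache need additional physical memory on top of it.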


125
- NoSQL -- Your Ultimate Guide to the Non-Relational Universe!
- Brewer's CAP Theorem, posted by Julian Browne, January 11
- HBase schema design: HBase in Action by Nick Dimiduk, Amandeep Khurana
