NoSQL Database Basics
Course Notes in Transparency Format
Cloud Computing MIRI (CLC-MIRI), UPC Master in Innovation & Research in Informatics, Spring 2013
Jordi Torres, UPC - BSC, www.jorditorres.eu
Hadoop: standard storage mechanism
  The Hadoop Distributed File System (HDFS) is the standard storage mechanism for Hadoop.
Hadoop Distributed File System (HDFS)
  Fault tolerance: assuming that failure will happen allows HDFS to run on commodity hardware.
  Streaming data access: HDFS is written with batch processing in mind and emphasizes high throughput rather than random access to data.
  Extreme scalability: current versions of HDFS scale to petabytes.
  Portability: HDFS is portable across platforms.
Hadoop: standard storage mechanism
Hadoop Distributed File System (HDFS)
  Write-once-read-many: most HDFS applications need a write-once-read-many access model for files. By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.
  Moving computation is cheaper than moving data (locality of computation): because of the data volume, it is often much faster to move the program near the data, and HDFS has features to facilitate this.
Hadoop: standard storage mechanism
  Starting point: http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html
Hadoop: standard storage mechanism
HDFS interface
  The interface is similar to that of regular filesystems.
  HDFS can only store and retrieve data, not index it, so simple random access to data is not possible.
  Solution: higher-level layers such as HBase have been created to provide finer-grained functionality to Hadoop deployments.
  [Figure: MapReduce, HBase, and HDFS stack]
HBase, the Hadoop DB
  Creates indexes, which gives fast random access to its content.
  Modeled after Google's BigTable: a column-oriented database designed to store massive amounts of data.
  Uses HDFS as its storage system.
  Belongs to the NoSQL universe, similar to Cassandra and Hypertable, among others.
  [Figure: MapReduce, HBase, and HDFS stack]
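To make the indexing point concrete, here is a minimal sketch of a sorted, sparse table in plain Python. It is not the HBase client API: the class `TinyTable`, its methods, and the `family:qualifier` column names are hypothetical and only illustrate why a lookup by row key does not need a scan of the underlying files.

```python
import bisect


class TinyTable:
    """Toy in-memory table: sorted row keys, each row a sparse dict of columns.

    Hypothetical illustration only; this is not the HBase client API.
    """

    def __init__(self):
        self._keys = []   # row keys, kept sorted (used for range scans)
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        # Random access: go straight to the row by key, no scan needed.
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start_key, stop_key):
        # Range scan over the sorted keys: start inclusive, stop exclusive.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = TinyTable()
table.put("user#002", "info:name", "Bob")
table.put("user#001", "info:name", "Alice")
table.put("user#001", "stats:visits", 7)

print(table.get("user#001", "info:name"))
print([key for key, _ in table.scan("user#001", "user#003")])
```

Rows only store the columns that were actually written, which is what makes the table sparse.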
HBase versus HDFS (a brief comparison)
  HDFS is optimized for: large files; sequential access (high throughput); append only.
  Use HDFS for fact tables that are mostly append-only and require sequential full-table scans.
  HBase is optimized for: small records (but many of them); random access; atomic record updates.
  Use HBase for dimension lookup tables that are updated frequently and require random low-latency lookups.
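HDFS: an example
  A given file is broken down into blocks (default size: 64 MB).
  [Figure: a file split into blocks 1 to 5]
HDFS: an example
  The blocks are then replicated across the cluster (default replication factor: 3).
  [Figure: blocks 1 to 5 spread over five nodes holding {1, 3, 5}, {2, 3, 4}, {1, 3, 4}, {2, 4, 5}, and {1, 2, 5}, so each block is stored three times]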
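The two steps of the example can be sketched in a few lines of Python. The block size and replication factor are the defaults quoted on the slides; the round-robin placement, the function names, and the node names are assumptions made for illustration. The real HDFS placement policy is different (for instance, it takes racks into account).

```python
BLOCK_SIZE = 64 * 1024 * 1024   # default block size from the slides (64 MB)
REPLICATION = 3                 # default replication factor from the slides


def num_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file (integer ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size


def place_replicas(n_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy, for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for b in range(n_blocks):
        placement[b + 1] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


file_size = 300 * 1024 * 1024            # a 300 MB file
blocks = num_blocks(file_size)           # 300 / 64 = 4.69, rounded up to 5
layout = place_replicas(blocks, ["node-a", "node-b", "node-c", "node-d", "node-e"])

for block_id, holders in layout.items():
    print(block_id, holders)
```

With five blocks, five nodes, and three replicas, every node ends up holding three blocks, as in the figure, and losing any single node still leaves two copies of every block.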
Resource management and scheduling
  A given job is broken down into tasks, and the tasks are scheduled to run as close to the data as possible.
  Optimized for: batch processing, failure recovery.
  [Figure: tasks placed on five nodes that hold blocks {2, 3, 4}, {1, 3, 5}, {2, 4, 5}, {1, 3, 4}, and {1, 2, 5}]
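A minimal sketch of scheduling tasks close to the data, assuming one task per block and a fixed number of free task slots per node. The function `schedule`, the greedy policy, and the node names `n1` to `n5` are hypothetical; the block locations follow the figure, with the nodes numbered in the order they appear there. This is not the Hadoop scheduler.

```python
def schedule(tasks, block_locations, free_slots):
    """Greedy locality-aware assignment (illustrative, not Hadoop's scheduler).

    tasks: list of block ids, one task per block.
    block_locations: block id -> list of nodes holding a replica.
    free_slots: node -> number of tasks it can still accept.
    Returns {block id: (node, is_local)}.
    """
    slots = dict(free_slots)      # copy so the caller's dict is untouched
    assignment = {}
    for block in tasks:
        # Prefer a node that already stores the block (data-local task).
        local = [n for n in block_locations[block] if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fall back to any node with a free slot; the data must travel.
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free slots left")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# n1 holds {2, 3, 4}, n2 {1, 3, 5}, n3 {2, 4, 5}, n4 {1, 3, 4}, n5 {1, 2, 5}.
locations = {
    1: ["n2", "n4", "n5"],
    2: ["n1", "n3", "n5"],
    3: ["n1", "n2", "n4"],
    4: ["n1", "n3", "n4"],
    5: ["n2", "n3", "n5"],
}
plan = schedule([1, 2, 3, 4, 5], locations, {n: 1 for n in ["n1", "n2", "n3", "n4", "n5"]})
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```

Because every block has three replicas, the scheduler has three chances to find a free node that already stores the data. Replication therefore helps locality as well as fault tolerance.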
Common characteristics of NoSQL: shared-nothing systems
  [Figure: three architectures compared: shared RAM (CPUs on a bus), shared disk (SAN), and shared nothing (each CPU with its own RAM and disk, connected by a LAN)]
  Shared-nothing systems have proven to be the most cost-effective and flexible.
  Source: http://www.slideshare.net/couchbase/webinar-making-sense-of-nosql-applying-nonrelational-databases-to-business-needs?ref=http://www.slideshare.net/slideshow/embed_code/18124982?rel=0
Common characteristics of NoSQL: distributed models
  Master-slave: requests go to the master node; a standby master node is used only if the primary master fails.
  Peer-to-peer: requests can go to any node; peer-to-peer models do not have standby nodes that sit idle.
  Source: http://www.slideshare.net/couchbase/webinar-making-sense-of-nosql-applying-nonrelational-databases-to-business-needs?ref=http://www.slideshare.net/slideshow/embed_code/18124982?rel=0
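The standby behaviour of the master-slave model can be sketched as follows. The classes `Server` and `MasterStandbyCluster` are hypothetical, and failure is simulated with a flag; a real system would detect failures with heartbeats or timeouts over the network.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class MasterStandbyCluster:
    """All requests go to the master; the standby is used only after a failure."""

    def __init__(self, master, standby):
        self.master = master
        self.standby = standby

    def handle(self, request):
        try:
            return self.master.handle(request)
        except ConnectionError:
            # Promote the standby to master and retry the request once.
            self.master, self.standby = self.standby, self.master
            return self.master.handle(request)


cluster = MasterStandbyCluster(Server("master"), Server("standby"))
first = cluster.handle("r1")      # served by the primary master
cluster.master.alive = False      # simulate a failure of the primary
second = cluster.handle("r2")     # served by the promoted standby
print(first)
print(second)
```

Until the failure, the standby does no useful work, which is the idle capacity that peer-to-peer models avoid.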
Common characteristics of NoSQL: move queries to the nodes
  Queries work best if they run on the local node that has the data.
  Source: http://www.slideshare.net/couchbase/webinar-making-sense-of-nosql-applying-nonrelational-databases-to-business-needs?ref=http://www.slideshare.net/slideshow/embed_code/18124982?rel=0
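A minimal sketch of moving the query to the data: the query function is sent to every node, runs against that node's local records, and only the small partial results are combined. The `Node` class, the log records, and `count_errors` are hypothetical, and the nodes are called one after another in a single process, whereas a real cluster would run them in parallel over the network.

```python
class Node:
    def __init__(self, name, records):
        self.name = name
        self.records = records   # data stored locally on this node

    def run(self, query):
        # The query travels to the node; only its result travels back.
        return query(self.records)


def count_errors(records):
    """Example query: count the log records with level ERROR."""
    return sum(1 for r in records if r["level"] == "ERROR")


nodes = [
    Node("n1", [{"level": "INFO"}, {"level": "ERROR"}]),
    Node("n2", [{"level": "ERROR"}, {"level": "ERROR"}, {"level": "WARN"}]),
    Node("n3", []),
]

partials = [node.run(count_errors) for node in nodes]   # one small number per node
total = sum(partials)                                   # combined at the caller
print(partials, total)
```

The records themselves never leave their node; only one integer per node crosses the network.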
Alternatives to HBase/HDFS?
  Cassandra: an Apache project that originated at Facebook and is now in production at many large-scale websites (also at BSC).
  Hypertable: created at Zvents and spun out as an open source project.
  Both are scalable column-store databases that follow the pattern of BigTable, similar to HBase.
  [Figure: MapReduce on top of Cassandra; MapReduce on top of Hypertable]
And dozens more
  http://nosql-database.org: List of NoSQL Databases [currently 150]
NoSQL
  The concept has gained momentum in recent years.
  Today it is a mature and efficient alternative that can help us solve problems of scalability and performance (e.g. online applications with thousands of concurrent users and millions of hits a day).
NoSQL on Google Trends
  Source: http://www.google.com/trends/explore#q=nosql
Different types of NoSQL systems
  Distributed key-value systems: Amazon's S3 and Dynamo key-value stores, Voldemort (LinkedIn), Cassandra (Facebook)
  Column-based systems: BigTable (Google), HBase, Cassandra
  Document-based systems: CouchDB, MongoDB
  Graph databases: Neo4j
Common themes
  Horizontal scalability
  Clever use of hashing and caching
  Parallel execution of queries: move queries to the data, not the other way around
  Share resources when possible (example: the memcached protocol)
  Use simple interfaces when possible: put, get, delete
  Source: Kelly-McCreary & Associates, LLC, http://www.slideshare.net/couchbase/webinar-making-sense-of-nosql-applying-nonrelational-databases-to-business-needs?ref=http://www.slideshare.net/slideshow/embed_code/18124982?rel=0
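Two of these themes, hashing and a simple put/get/delete interface, fit in one short sketch. The class `ShardedStore` is hypothetical, and each shard is just an in-process dictionary standing in for a separate node. Keys are assigned to shards by hashing modulo the shard count, which is the simplest possible scheme.

```python
import hashlib


class ShardedStore:
    """Toy key-value store partitioned by key hash (illustration only)."""

    def __init__(self, n_shards):
        # Each dict stands in for the local storage of one node.
        self._shards = [{} for _ in range(n_shards)]

    def _shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._shards)
        return self._shards[index]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)

    def delete(self, key):
        """Remove the key; return True if it existed."""
        shard = self._shard_for(key)
        if key in shard:
            del shard[key]
            return True
        return False


store = ShardedStore(n_shards=4)
store.put("user:1", "Alice")
store.put("user:2", "Bob")
print(store.get("user:1"))
print(store.delete("user:2"))
print(store.get("user:2"))
```

Every operation touches exactly one shard, so adding shards adds capacity (horizontal scalability). With plain modulo hashing, changing the number of shards remaps most keys, which is why systems such as Dynamo and Cassandra use consistent hashing instead.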