SQL vs. NoSQL
Adapted from slides by Dr. Jennifer Widom, Stanford
Traditional Databases
- SQL = traditional relational DBMS
- Hugely popular among data analysts
- Widely adopted for transaction systems and business intelligence
- Not every data management/analysis problem is best solved exclusively using a traditional relational DBMS
- Does not scale linearly beyond a few nodes because of the need to provide a shared-memory abstraction, in contrast to recent MPP databases (e.g., Greenplum, AsterDB, Hive, HadoopDB)
Greenplum: an MPP Database
- Builds on the open source database PostgreSQL (non-parallel, single-threaded)
- Utilizes a shared-nothing, MPP architecture
- Automatic data partitioning across multiple segment servers, each owning a portion of the data
- Parallel query optimizer that considers the cost of moving data between nodes
- Execution plans include relational database operations as well as parallel motion operations
Greenplum Architecture
[Figure: Greenplum architecture diagram]
NoSQL: The Name
- NoSQL = not only using a traditional relational DBMS and SQL
- It does not mean "No SQL" (don't use the SQL language)
What Does a DBMS Do Well?
A Database Management System (DBMS) provides efficient, reliable, convenient, and safe multi-user storage of and access to massive amounts of consistent and persistent data.
NoSQL Systems
Alternative to traditional relational DBMS
+ Flexible schema
+ Quicker/cheaper to set up
+ Massive scalability
+ Relaxed consistency (where consistency means all nodes see the same data at the same time), giving higher performance and availability
Drawbacks: no declarative query language, so more programming; relaxed consistency, so fewer guarantees
Transactions: ACID Properties
- Atomic: all of the work in a transaction completes (commit) or none of it completes.
- Consistent: a transaction transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
- Isolated: the results of any changes made during a transaction are not visible until the transaction has committed.
- Durable: the results of a committed transaction survive failures.
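The atomicity and consistency properties above can be illustrated with a small sketch using Python's built-in sqlite3 module. The accounts table, the names alice and bob, and the transfer() helper are hypothetical and chosen only for illustration; the CHECK constraint stands in for "consistency is defined in terms of constraints".

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on commit."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            # Credit first so that a failing debit has something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; nothing was applied.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # valid transfer, commits
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, rolls back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

In the second call the credit to bob succeeds before the debit fails, so the rollback is what keeps bob's balance unchanged: either both updates are visible or neither is.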
CAP Theorem
- Consistency: all nodes see the same data at the same time
- Availability: a guarantee that every request receives a response about whether it was successful or failed
- Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system
CAP Theorem: a distributed system cannot satisfy all three of these guarantees at the same time.
BASE Transactions
- Basically Available
- Soft state (i.e., cached data)
- Eventually consistent
Characteristics:
- Weak consistency: stale data is OK
- Availability first
- Sacrifice consistency to be immediately responsive, even if a failure prevents the first-tier service from connecting to some needed resource
- Criticized as increasing the complexity of distributed software applications
Examples of NoSQL Systems
- MapReduce framework (e.g., Hadoop, Hive)
- Key-value stores (e.g., HBase, BigTable, BerkeleyDB)
- Document stores (e.g., CouchDB, MongoDB)
- Graph database systems (e.g., Neo4j)
Data Manipulation and Analysis
[Figure labels, in order: Data Retrieval; Document store; Key-value store; Join, Group By, Aggregation, Recursive queries; Schema and complex XQL queries; Hive and Graph DBs; Statistical machine learning methods; Natural language processing; Clustering and classification; UDFs, MapReduce programs]
Example #1: Web Log Analysis
Each record: UserID, URL, timestamp, additional-info
Task: count the number of records for
- Each UserID
- Each URL
- Each timestamp
- A certain construct appearing in additional-info
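The counting tasks above can be sketched in a few lines of plain Python. The sample records and the helper names count_by() and count_containing() are made up for illustration, and "a certain construct" is assumed here to mean a plain substring of additional-info.

```python
from collections import Counter

# Hypothetical sample records: (UserID, URL, timestamp, additional-info)
LOG = [
    ("u1", "/home", 1001, "ref=ad"),
    ("u2", "/home", 1001, "ref=search"),
    ("u1", "/about", 1002, "ref=ad"),
    ("u3", "/home", 1003, ""),
]

def count_by(records, field):
    """Count records per distinct value of one field (0=UserID, 1=URL, 2=timestamp)."""
    return Counter(record[field] for record in records)

def count_containing(records, construct):
    """Count records whose additional-info contains the given substring."""
    return sum(1 for record in records if construct in record[3])

per_user = count_by(LOG, 0)
per_url = count_by(LOG, 1)
per_timestamp = count_by(LOG, 2)
with_ad = count_containing(LOG, "ref=ad")
```

Each count needs only one pass over the records and no schema, which is why this kind of task fits systems without a relational data model.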
Example #1: Web Log Analysis (continued)
Each record: UserID, URL, timestamp, additional-info
Separate records: UserID, name, age, gender, ...
Task: find the average age of users accessing each URL
Other tasks?
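This task needs a join between the log records and the user records, which is where a declarative query language helps. A minimal sketch using Python's built-in sqlite3 module follows; the table names, column names, and sample rows are assumptions made for illustration. Note that this query averages over accesses, so a user who visits a URL twice is counted twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (user_id TEXT, url TEXT, ts INTEGER, info TEXT)")
conn.execute(
    "CREATE TABLE users (user_id TEXT PRIMARY KEY, name TEXT, age INTEGER, gender TEXT)"
)

# Hypothetical sample data.
conn.executemany(
    "INSERT INTO log VALUES (?, ?, ?, ?)",
    [
        ("u1", "/home", 1001, ""),
        ("u2", "/home", 1002, ""),
        ("u3", "/about", 1003, ""),
        ("u1", "/about", 1004, ""),
    ],
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [("u1", "Ann", 20, "F"), ("u2", "Bob", 30, "M"), ("u3", "Cy", 40, "M")],
)

# Join each access to its user, then average the ages per URL.
rows = conn.execute(
    "SELECT l.url, AVG(u.age) "
    "FROM log AS l JOIN users AS u ON u.user_id = l.user_id "
    "GROUP BY l.url ORDER BY l.url"
).fetchall()
```

Without a join operator, the same result requires hand-written code that looks up each log record's user, which is the "more programming" cost noted for NoSQL systems.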
Example #2: Social Network Graph
Each record: UserID 1, UserID 2
Separate records: UserID, name, age, gender, ...
Task: find all friends of friends of friends of friends of a given user
Other tasks?
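One way to read this task is as four repeated expansions of a "frontier" of users, which corresponds to a four-way self-join in a relational system. The sketch below assumes friendship is symmetric and returns everyone reachable by a walk of exactly k steps (so the starting user can reappear); the function names and the sample pairs are hypothetical.

```python
from collections import defaultdict

def build_adjacency(pairs):
    """Turn (UserID 1, UserID 2) records into a symmetric adjacency map."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def k_hop_friends(adjacency, user, k=4):
    """Users reachable from `user` by following exactly k friendship edges."""
    frontier = {user}
    for _ in range(k):
        # Replace the frontier with all friends of everyone currently in it.
        frontier = {
            friend
            for person in frontier
            for friend in adjacency.get(person, ())
        }
    return frontier

# Hypothetical chain of friendships: a - b - c - d - e
adjacency = build_adjacency([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
result = k_hop_friends(adjacency, "a", k=4)
```

Graph database systems are designed to make this kind of multi-hop traversal direct, rather than expressing it as repeated joins or recursive queries.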
Example #3: Wikipedia Pages
- Large collection of documents
- Combination of structured and unstructured data
Task: retrieve the introductory paragraph of all pages about U.S. presidents before 1900
Other tasks?
MapReduce Framework
- Originally from Google; Hadoop is the open source implementation
- No data model; data is stored in files
- User provides specific functions: map() and reduce()
- System provides the data processing glue, fault tolerance, and scalability
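The division of labor described above can be sketched in plain Python: the user writes map_fn() and reduce_fn(), and a driver does the grouping in between. This is a single-process illustration only, not Hadoop's API; run_mapreduce() and the sample records are hypothetical, and a real framework would distribute the work and handle failures.

```python
from collections import defaultdict

def map_fn(record):
    """User-provided map(): emit (URL, 1) for each web log record."""
    user_id, url, timestamp, info = record
    yield url, 1

def reduce_fn(key, values):
    """User-provided reduce(): sum the counts emitted for one key."""
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Stand-in for the system's 'glue': map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

# Hypothetical records: (UserID, URL, timestamp, additional-info)
records = [
    ("u1", "/home", 1001, ""),
    ("u2", "/home", 1002, ""),
    ("u1", "/about", 1003, ""),
    ("u3", "/home", 1004, ""),
]
counts = run_mapreduce(records, map_fn, reduce_fn)
```

Because each map call sees one record and each reduce call sees one key, the framework is free to run many of them in parallel on different nodes.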
MapReduce Framework (continued)
Schemas and declarative queries are missed, so higher-level layers were added:
- Hive: schemas and a SQL-like query language
- Pig: more imperative, but with relational operators
- Both compile to a workflow of Hadoop (MapReduce) jobs
Key-Value Stores
- Extremely simple interface
- Data model: (key, value) pairs
- Operations: Insert(key,value), Fetch(key), Update(key), Delete(key)
Implementation: efficiency, scalability, fault tolerance
- Records distributed to nodes based on key
- Replication
- Single-record transactions, eventual consistency
Key-Value Stores (continued)
- Some allow (non-uniform) columns within the value
- Some allow Fetch on a range of keys
Example systems: Google BigTable, Amazon Dynamo, Cassandra, Voldemort, HBase, ...
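The interface and the "records distributed to nodes based on key" idea can be sketched as a toy in-memory store. The class name, the choice of SHA-256 for key hashing, and the decision to give update() a value argument are assumptions for illustration; replication and eventual consistency are deliberately left out.

```python
import hashlib

class KeyValueStore:
    """Toy key-value store that partitions records across nodes by key hash."""

    def __init__(self, num_nodes=3):
        # Each "node" is just a dict here; a real system would use separate servers.
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # A stable hash, so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def insert(self, key, value):
        node = self._node_for(key)
        if key in node:
            raise KeyError(f"key already exists: {key}")
        node[key] = value

    def fetch(self, key):
        return self._node_for(key).get(key)

    def update(self, key, value):
        node = self._node_for(key)
        if key not in node:
            raise KeyError(f"no such key: {key}")
        node[key] = value

    def delete(self, key):
        self._node_for(key).pop(key, None)

store = KeyValueStore(num_nodes=3)
store.insert("user:1", "Ann")
store.insert("user:2", "Bob")
store.update("user:2", "Robert")
store.delete("user:1")
```

Every operation touches exactly one node, which is what makes single-record operations cheap to scale; anything spanning several keys has to be assembled by the application.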
Document Stores
- Like key-value stores, except the value is a document
- Data model: (key, document) pairs
- Document: JSON, XML, or other semistructured formats
- Basic operations: Insert(key,document), Fetch(key), Update(key), Delete(key)
- Also Fetch based on document contents
Example systems: CouchDB, MongoDB, SimpleDB, ...
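Fetching by document contents is what separates a document store from a plain key-value store. The sketch below models documents as Python dicts and applies it to the Wikipedia example; the DocumentStore class, its find() method, and the field names kind, took_office, and intro are all hypothetical, and "before 1900" is assumed to refer to the year a president took office.

```python
class DocumentStore:
    """Toy document store: (key, document) pairs with content-based fetch."""

    def __init__(self):
        self.documents = {}

    def insert(self, key, document):
        self.documents[key] = document

    def fetch(self, key):
        return self.documents.get(key)

    def find(self, predicate):
        """Return keys of documents whose contents satisfy `predicate`, sorted."""
        return sorted(
            key for key, doc in self.documents.items() if predicate(doc)
        )

pages = DocumentStore()
pages.insert("George_Washington", {
    "kind": "president", "took_office": 1789,
    "intro": "George Washington was the first president of the United States.",
})
pages.insert("Abraham_Lincoln", {
    "kind": "president", "took_office": 1861,
    "intro": "Abraham Lincoln was the 16th president of the United States.",
})
pages.insert("Theodore_Roosevelt", {
    "kind": "president", "took_office": 1901,
    "intro": "Theodore Roosevelt was the 26th president of the United States.",
})
# Flexible schema: this document has no took_office field at all.
pages.insert("PostgreSQL", {"kind": "software", "intro": "PostgreSQL is a database."})

early = pages.find(
    lambda doc: doc.get("kind") == "president" and doc.get("took_office", 9999) < 1900
)
intros = [pages.fetch(key)["intro"] for key in early]
```

The query only works because the pages carry structured fields alongside the free text; for purely unstructured pages the same task would need text analysis instead.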