Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12
What is a NoSQL Database?
1. A key/value store
   - Basic index manager, no complete query language
   - E.g. Google BigTable, Amazon Dynamo
2. A DBMS where MapReduce is used instead of queries
   - Manual programs iterate over entire data sets
   - E.g. Hadoop, CouchDB, Dynamo
3. A MapReduce engine with a query language on top
   - HIVE on top of Hadoop provides HiveQL
   - Provides non-procedural data analytics (SELECT ... FROM ... GROUP BY) without detailed programming
   - Executed in batch as parallel Hadoop jobs
NoSQL Characteristics
- Highly distributed and parallel architectures
  - Typically run in data centers
  - This is similar to parallel databases!
- High scalability achieved by compromising consistency
  - Eventual consistency, or perhaps never consistency
  - Puts the burden of handling consistency on the programmer!
  => Mainly suitable for applications not needing consistency
MapReduce
- Parallel batch processing using MapReduce
  - Many NoSQL databases use MapReduce for parallel batch processing of data stored in data centers
  - Highly scalable implementation of parallel batch processing: the same (e.g. Java, JavaScript, Python) program is run over large amounts of data stored in different files
  - Based on a scalable file system (e.g. HDFS)
- The MapReduce function:
  - Applies a (costly) user function, the mapper, producing key/value pairs in parallel on many nodes accessing files in a cluster
  - Applies a user aggregate function on the key/value pairs produced by the mapper
  - Very similar to GROUP BY in SQL!
- Read the reference article on MapReduce: http://dl.acm.org/citation.cfm?id=1327452.1327492
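The GROUP BY analogy can be made concrete with a minimal single-machine sketch in Python (the function names are illustrative, not any real MapReduce API):

```python
from collections import defaultdict

def mapper(line):
    # Map: parse one input record and emit (key, value) pairs
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Reduce: aggregate all values for one key, like COUNT/SUM in GROUP BY
    return (key, sum(values))

def mapreduce(lines):
    # Shuffle: group the mapper output by key before reducing
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

counts = mapreduce(["to be or not", "to be"])
# counts["to"] == 2
```

In a real system the mapper calls run in parallel on many nodes and the shuffle moves data between them; here the grouping dictionary stands in for both.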
[Figure: MapReduce dataflow — data files are read by parallel Map tasks, whose output is split by Partition functions, merged by parallel Reduce tasks, and written by an output writer to a result file]
MapReduce stages
- Input reader
  - System component that reads files from a scalable file system (e.g. HDFS) and sends them to map functions applied in parallel
- Map function
  - Applied in parallel on many different files
  - Parses input file data from HDFS
  - Does some (expensive) computation
  - Emits key/value pairs as result
  - Result stored by the MapReduce system as a file
- Partition function (optional)
  - Partitions output key/value pairs from the map function into groups of key/value pairs to be reduced in parallel
  - Usually hash partitioning
- Reduce function
  - Iterates over a set of key/value pairs to produce a reduced set of key/value pairs stored in the file system
  - C.f. aggregate functions
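The four stages above can be imitated in a single-process Python sketch, where hash partitioning routes each key to one of the reducer buckets (NUM_REDUCERS and all function names are illustrative, not part of any real framework):

```python
from collections import defaultdict

NUM_REDUCERS = 2  # hypothetical number of parallel reduce tasks

def word_mapper(line):
    # Map function: parse the input and emit key/value pairs
    for word in line.split():
        yield (word, 1)

def partition(key, num_partitions):
    # Partition function: hash partitioning decides which reducer gets a key
    return hash(key) % num_partitions

def sum_reducer(key, values):
    # Reduce function: fold all values for one key into a single pair
    return (key, sum(values))

def run(lines):
    # Input reader + map phase: route each emitted pair to a reducer bucket
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for line in lines:
        for key, value in word_mapper(line):
            buckets[partition(key, NUM_REDUCERS)][key].append(value)
    # Reduce phase: each bucket could be processed on a separate node
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            k, v = sum_reducer(key, values)
            result[k] = v
    return result
```

Because each key is hashed to exactly one bucket, the reducers never see overlapping keys and can run independently, which is what makes the reduce phase parallelizable.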
HIVE
- SQL-like query language HiveQL supported on top of Hadoop
- Developed by Facebook, now an Apache project
- Queries very often involve grouping and statistics
- Queries are compiled into MapReduce jobs
- Handles 80% of MapReduce analysis tasks at Facebook
- Much easier to use than raw MapReduce coding
  - Substantial improvement of programming productivity
  - Non-programmers can analyze data
Hive architecture
HiveQL example
Log file:
2012-02-03 20:26:41 SampleClass3 [TRACE] verbose detail for id 1527353937
2012-02-03 20:26:41 SampleClass2 [TRACE] verbose detail for id 191364434 Java.lang.Exception:
2012-02-03 20:27:10 SampleClass7 [ERROR] incorrect format for id 8276745
2012-02-03 20:29:38 SampleClass1 [DEBUG] detail for id 903114158

HiveQL:
CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string,
                  t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs;
SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4;

Result:
[TRACE] 2
[DEBUG] 1
HiveQL example (continued)
Log file: same as on the previous slide.

HiveQL:
SELECT t5 AS sev, COUNT(*) AS cnt FROM logs WHERE t5 LIKE '[%' GROUP BY t5;

Result:
[ERROR] 1

SELECT L.sev, SUM(L.cnt) FROM (
  SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4
  UNION ALL
  SELECT t5 AS sev, COUNT(*) AS cnt FROM logs WHERE t5 LIKE '[%' GROUP BY t5) L
GROUP BY L.sev;

Result:
[TRACE] 2
[DEBUG] 1
[ERROR] 1

=> 3 MapReduce jobs!
Wordcount in SQL
Alt 1, assume the words of the documents are stored in a table:
CREATE TABLE docs(id INTEGER, word VARCHAR(20),
                  PRIMARY KEY(id, word));
The query becomes:
SELECT d.word, COUNT(d.word)
FROM docs d
GROUP BY d.word;
Problem: DOCS is a table, not stored in a file
Alt 2, use a user-defined table function to access the documents in a file:
SELECT d.word, COUNT(d.word)
FROM mydocuments('C:/mydocuments') AS d
GROUP BY d.word;
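Alt 2's table function can be mimicked in Python: a generator yields one word "row" at a time straight from the files, and Counter plays the role of the GROUP BY/COUNT (mydocuments is a hypothetical name taken from the slide, not a real API):

```python
from collections import Counter

def mydocuments(paths):
    # Hypothetical table function: stream one row per word in each file,
    # so no table needs to be loaded (and no index built) before querying
    for path in paths:
        with open(path) as f:
            for line in f:
                for word in line.split():
                    yield word

def wordcount(paths):
    # Counter(...) corresponds to SELECT word, COUNT(word) ... GROUP BY word
    return Counter(mydocuments(paths))
```

Streaming the rows lazily from the file is the point: like a MapReduce input reader, the query consumes raw files directly instead of a preloaded table.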
HiveQL vs raw MapReduce
- Raw MapReduce: a Java (Python, C++, etc.) program does map and reduce
- Very common use of MapReduce: statistics collection over files (count, sum, stdev, etc.)
- HiveQL handles basic statistics: 80% of applications
- When advanced statistics are not supported in HiveQL (or SQL):
  - Alt 1: user-defined aggregate functions in HIVE (UDAFs)
    - Can also be reused in other queries
  - Alt 2: raw MapReduce
    - Code may be complicated
    - Code cannot be used in queries
Reading raw data files to an RDBMS
- Major point with MapReduce: save the time needed to load the database and build indexes
- Accessing a CSV (Comma Separated Values) file
  - Can treat a CSV file as a relational table
- Modern DBMSs have bulk load facilities
  - Avoid using INSERT to bulk load
  - At least do not commit after each INSERT (MySQL autocommit is the default)
  - Bulk load speed approaches file copy time
    - Orders of magnitude faster than naive inserts
  - Automatic parallel bulk loading
  - Binary and formatted bulk loading supported
- Indexes are optional and can be built afterwards
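The "one transaction for the whole batch, index afterwards" advice can be illustrated with sqlite3 from the Python standard library (the DOCS schema is borrowed from the wordcount slide; real DBMSs' dedicated bulk-load utilities are faster still):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs(id INTEGER, word TEXT)")

rows = [(1, "hello"), (1, "world"), (2, "hello")]

# Bulk insert inside a single transaction: one commit for the whole batch,
# instead of one commit per INSERT (the autocommit trap mentioned above)
with conn:
    conn.executemany("INSERT INTO docs VALUES (?, ?)", rows)

# Build the index only after the data is loaded, so the loader
# does not pay the index-maintenance cost on every row
conn.execute("CREATE INDEX docs_word ON docs(word)")
```

Per-row commits force a durable write for each row; batching amortizes that cost over the whole load, which is why bulk loading approaches file copy speed.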
Relational Databases vs MapReduce
- Modern RDBMSs have user-defined table functions and aggregate functions
  - Can read data from files using UDFs
  - User-defined aggregate functions are incremental and parallelized
- RDBMSs have indexing and high parallelism to provide scalability
  - Notice that it takes time to build indexes when the database is loaded
  => May slow down database loading considerably
- When is HIVE with MapReduce better than RDBMSs?
  - One-shot queries where no indexing is needed
  - Massively parallel, very expensive brute-force computations
    - Embarrassingly parallel computations
  - Google F1: use a highly distributed DBMS to select the input to brute-force MapReduce jobs