Lecture Data Warehouse Systems Eva Zangerle SS 2013
PART C: Novel Approaches in DW (NoSQL and MapReduce)
Stonebraker on Data Warehouses
- Star and snowflake schemas are a good idea in the DW world.
- C-Stores will dominate the DW market over time, replacing row stores.
- The vast majority of data warehouses are not candidates for main or flash memory.
- Massively parallel processor systems will be omnipresent in this market.
- "No knobs" is the only thing that makes any sense.
- Appliances should be software only.
Stonebraker on Data Warehouses (cont.)
- Hybrid workloads are not optimized by one-size-fits-all systems.
- Essentially all DW installations want high availability.
- The DBMS should support online reprovisioning.
- Virtualization often has performance problems in a DBMS world.
Big Data
Estimates of the International Data Corporation:
- 2006: 0.18 zettabytes stored electronically (1 zettabyte = 1 billion terabytes)
- 2011: 1.8 zettabytes
- 2012: 2.7 zettabytes
Data from the Internet Archive, social networks (photos, posts), the LHC, the NYSE, sensor networks, machine logs, etc.
Problem:
- Storage capacities of hard drives have increased, but access speeds have not kept up.
- New kinds of data (e.g. graph data) cannot be squeezed into an RDBMS.
NoSQL Introduction
Aim of NoSQL: develop databases that target data volumes in the terabyte/petabyte scale (Web 2.0 age); relational systems are hard to scale out on commodity hardware.
Characteristics (often):
- Non-relational
- Scale horizontally (scale-out)
- Open source (not strictly: Amazon SimpleDB)
- Schema-free (cf. ALTER TABLE)
- Replication
- Consistency model is BASE / eventually consistent (often not ACID)
- Simple API: complex queries often not possible, CRUD operations (create, read, update, delete)
NoSQL Definition (taken from www.nosql-database.org)
NoSQL Introduction (2)
- The time of the one-size-fits-all database is over.
- Alternative to the relational database model; the separation is non-trivial (relational SQL vs. NoSQL, e.g. HadoopDB).
- Advantages for specific use cases: scaling, application development, operating costs.
- Open source as a central element; no top-down approach; de facto standards (Hadoop).
- Nowadays interpreted as "Not only SQL" (Facebook); different flavours, but is it really new?
Early Forms of NoSQL
- Key/value databases: Database manager (1979) stores data by use of a key in buckets (hashing).
- Document-oriented systems: IBM Lotus Notes (1984) stores user documents (groupware system).
- Storage as key/value pairs: Berkeley DB (1991).
- Column-oriented systems: Sybase IQ (1996).
The distribution of these systems was limited compared to RDBMS.
NOSQL - a Short History (1)
- 1998: the word "NoSQL" is used for the first time (relational model, but no SQL API).
- 2004: Google's MapReduce & BigTable systems, GFS.
- Until 2005: NoSQL-like systems (db4o (objects), Neo4j (graphs)).
- 2006-2009: development of standard NoSQL systems (HBase, CouchDB, MongoDB, Redis, etc.).
- Since 2009: increasing popularity, the term NoSQL is widely used as a conjunction of all non-relational trends.
See http://nosql-databases.org
NOSQL Overview (1)
NoSQL core systems:
- Key/value stores
- (Wide) column stores
- Document stores
- Graph databases
Soft NoSQL systems:
- Object databases
- XML databases
- Grid databases
- Many more non-relational databases
NOSQL Overview (2)
Key/value stores:
- Mostly easy-to-use key/value schema; simple schema
- Queries limited
- Not only strings as values (sets, lists)
- Examples: Amazon Dynamo, Redis, Voldemort
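As a minimal sketch of the simple CRUD-style API of such a store, the following Python snippet uses the redis-py client against a locally running Redis server; host, port, and all key names are illustrative assumptions, not part of the lecture material.

    import redis

    # Connect to a local Redis server (assumed default host/port).
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Create / update: a plain string value and a list value under two keys.
    r.set("user:42:name", "Alice")                 # string value
    r.rpush("user:42:followers", "bob", "carol")   # values need not be plain strings

    # Read
    print(r.get("user:42:name"))                   # -> "Alice"
    print(r.lrange("user:42:followers", 0, -1))    # -> ["bob", "carol"]

    # Delete
    r.delete("user:42:name")

Note that a query such as "find all users named Alice" is not directly expressible: access goes through the key, which is what "queries limited" means above.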
NOSQL Overview (3)
Wide column stores / column families:
- Idea from the 80s: column-oriented storage
- More I/O-efficient for read-only queries (compression, aggregation) -> data warehouses
- C-Store (VLDB 2005) -> Vertica -> HP; Sybase IQ
- Column stores nowadays: HBase, Cassandra, Hypertable
- One key for many key/value pairs (grouping into column families)
- Mixture of the column-based approach and key/value systems
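To illustrate why a columnar layout helps read-only aggregation, here is a small pure-Python sketch; the table contents and column names are made up for illustration.

    # Row-oriented layout: each record is stored together.
    rows = [
        {"order_id": 1, "customer": "A", "amount": 10.0},
        {"order_id": 2, "customer": "B", "amount": 25.5},
        {"order_id": 3, "customer": "A", "amount": 7.25},
    ]

    # Column-oriented layout: each attribute is stored contiguously.
    columns = {
        "order_id": [1, 2, 3],
        "customer": ["A", "B", "A"],
        "amount":   [10.0, 25.5, 7.25],
    }

    # Aggregating one attribute touches every full row in the row layout ...
    total_row_layout = sum(r["amount"] for r in rows)

    # ... but only a single contiguous array in the column layout
    # (on disk this means far less I/O, and the column compresses well).
    total_col_layout = sum(columns["amount"])

    assert total_row_layout == total_col_layout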
NOSQL Overview (4)
Document-based stores:
- Storage of structured data collections (JSON, YAML, RDF)
- Nested key/value pairs
- CouchDB (storage as JSON), MongoDB (BSON), Riak
Graph databases:
- Management of graph or tree structures
- Graph structure: hyperlink structure, PageRank, shortest paths (social networks)
- Property graphs ("Alice (23) knows Bob (46)")
- Problems of relational databases (recursion) can be avoided
- Neo4j, Sones GraphDB (native graph databases)
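A minimal sketch of storing and querying a nested document, using the pymongo driver against a locally running MongoDB instance; the database, collection, and field names are illustrative assumptions.

    from pymongo import MongoClient

    # Connect to a local MongoDB server (assumed default host/port).
    client = MongoClient("localhost", 27017)
    db = client["lecture_demo"]

    # A document is a nested key/value structure; no schema has to be declared.
    db.people.insert_one({
        "name": "Alice",
        "age": 23,
        "knows": [{"name": "Bob", "age": 46}],   # nested documents model the relation
    })

    # Query by a field inside the document.
    print(db.people.find_one({"name": "Alice"}))

In a native graph database such as Neo4j, the "knows" relation would instead be stored as a first-class edge, which keeps multi-hop traversals (e.g. shortest paths) cheap.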
NoSQL Basics: Map/Reduce and the CAP Theorem
Map/Reduce
"MapReduce: Simplified Data Processing on Large Clusters" (2004), a framework developed at Google by Dean & Ghemawat.
- Processing of large amounts of data (terabytes up to petabytes), concurrent computations (e.g. inverted indexes)
- Hides everything from end users:
- Automatic parallelization and distribution
- No deadlocks, no side effects (computation on copies of the original data)
- Fault tolerance for software and hardware
- I/O scheduling (disk scheduling)
- Monitoring
MapReduce Basics
Based on two constructs, map & reduce, primitives known from functional languages like Lisp / ML. Input and output are key/value pairs:
- map(k1, v1) -> list(k2, v2)   (generates a set of intermediate key/value pairs)
- reduce(k2, list(v2)) -> list(v2)   (merges all values belonging to the same key)
No side effects -> parallel execution possible.
MapReduce approach (figure)
A first example: Wordcount
Count the occurrences of each word in a document.

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        Emit(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(key, result);
A first example (figure; example taken from http://tarnbarford.net/journal/mapreduce-on-mongo)
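For comparison, here is a minimal self-contained Python sketch that simulates the same wordcount job on a single machine; the driver groups intermediate pairs by key, which is what the framework's shuffle phase does. Function and variable names are illustrative, not part of any MapReduce API.

    from collections import defaultdict

    def map_fn(doc_name, doc_contents):
        # Emit one (word, 1) pair per word occurrence.
        return [(word, 1) for word in doc_contents.split()]

    def reduce_fn(word, counts):
        # Merge all counts emitted for the same word.
        return [(word, sum(counts))]

    def run_job(documents):
        # "Shuffle": group intermediate values by key before reducing.
        grouped = defaultdict(list)
        for name, contents in documents.items():
            for key, value in map_fn(name, contents):
                grouped[key].append(value)
        results = []
        for key, values in grouped.items():
            results.extend(reduce_fn(key, values))
        return dict(results)

    docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog the end"}
    print(run_job(docs))   # e.g. {'the': 3, 'quick': 1, ...}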
Data flow (1)
1. Distribute the input data to different map processes.
2. Parallel execution of the map processes (provided by the user!).
3. Save intermediate results; wait until all map processes have finished.
4. Start the reduce processes (provided by the user!); one reduce process for every set of intermediate results.
5. Save the final results.
Dataflow improved (1)
- Input is divided into chunks of 16-64 MB.
- Master and workers: the master assigns data, the workers do the job.
- Each map worker gets an input chunk and periodically writes its intermediate results to the local filesystem, split into R different partitions.
- Each reduce worker fetches its data over RPC, sorts and groups it (several keys per reduce worker), applies the reduce function, and saves its output; the result is stored in R output files.
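Which of the R partitions an intermediate pair lands in is typically decided by hashing its key, so that one reduce worker sees every value for a given key. A small Python sketch of that partitioning step; R and the use of Python's built-in hash are illustrative assumptions, not the exact Google implementation.

    R = 4  # number of reduce tasks / output partitions (assumed)

    def partition(key, num_reducers=R):
        # Within one run, equal keys hash to the same partition, so a single
        # reduce worker receives all values for that key. Real frameworks use
        # a stable hash function across machines.
        return hash(key) % num_reducers

    intermediate = [("fox", 1), ("the", 1), ("dog", 1), ("the", 1)]
    partitions = {i: [] for i in range(R)}
    for key, value in intermediate:
        partitions[partition(key)].append((key, value))

    print(partitions)  # each partition would be written to its own local file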
Dataflow improved (2) (figure)
Master details
- A master machine directs the job.
- Assigns chunks to map tasks (each map task stores its results in intermediate files).
- Scheduling across machines, inter-machine communication.
- Notifies reduce tasks where to find their data (RPC).
- Reassignment of tasks on machine errors.
CAP Theorem & BASE
CAP Theorem
Consistency is a major requirement for database systems: ACID (Atomicity, Consistency, Isolation, Durability). Things changed with Web 2.0: consistency vs. parallel access (horizontal scaling -> replication).
CAP theorem:
- Consistency
- Availability
- Partition tolerance
Only 2 of the 3 properties can be achieved at the same time in a distributed database (Eric Brewer, 2000, Principles of Distributed Computing); the conjecture was later formally proven.
CAP Theorem (2)
- Consistency: consistent state of a distributed system after a transaction has finished; the system is consistent if and only if all replicated nodes have received the update of the previous transaction.
- Availability: acceptable reaction time (depends on the system).
- Partition tolerance: if a node fails, the whole system is still available (redundancy).
CAP Theorem (3)
Example: a distributed DB (simplified)
- K1 is responsible for writes on data item D0, K2 is responsible for reads of D0.
- A new write arrives: D0 changes into state D1; K2 receives the update and the new state over a synchronization mechanism.
- NOW: the network between K1 and K2 goes down.
- Block K1? (the system only reacts after full synchronization, i.e. block certain writes and eventually bring the system down)
- Or loosen consistency in favour of availability?
- The right choice depends on the use case.
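A toy Python sketch of this trade-off: two replicas, a write node K1 and a read node K2, and a flag simulating the broken network link. Whether the write is rejected (favouring consistency) or accepted locally (favouring availability, risking stale reads) is a policy choice; all names and the policy flag are illustrative.

    class Replica:
        def __init__(self):
            self.value = "D0"

    k1, k2 = Replica(), Replica()   # K1 handles writes, K2 handles reads
    network_up = False              # simulate the partition
    POLICY = "availability"         # or "consistency"

    def write(new_value):
        if not network_up and POLICY == "consistency":
            raise RuntimeError("write rejected: replica unreachable")
        k1.value = new_value        # availability: accept the write locally
        if network_up:
            k2.value = new_value    # synchronize the read replica

    write("D1")
    print("K2 reads:", k2.value)    # stale "D0" while the partition lasts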
CAP Theorem (4) (figure)
BASE
An alternative consistency model, all about availability:
- Consistency is less important than availability.
- Optimistic behaviour: not consistent after every transaction.
- Eventual consistency: consistency is reached after a certain amount of time; especially relevant for systems with many replicated nodes.
- There is a spectrum of possibilities (not either ACID or BASE).
- Keep this in mind when using a NoSQL DB.
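A minimal Python sketch of eventual consistency: writes are applied on one replica immediately and propagated to the others asynchronously, so readers may briefly see old values, but all replicas converge once the pending updates have been delivered. The replica count and function names are illustrative.

    replicas = [{"x": 0} for _ in range(3)]   # three replicas of the same data
    pending = []                              # updates not yet propagated

    def write(key, value):
        replicas[0][key] = value              # applied immediately on one replica
        pending.append((key, value))          # others will learn about it later

    def propagate():
        # Background anti-entropy step: ship pending updates to all replicas.
        while pending:
            key, value = pending.pop(0)
            for r in replicas[1:]:
                r[key] = value

    write("x", 42)
    print([r["x"] for r in replicas])   # e.g. [42, 0, 0]  -> temporarily inconsistent
    propagate()
    print([r["x"] for r in replicas])   # [42, 42, 42]     -> eventually consistent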