Lecture Data Warehouse Systems




Lecture Data Warehouse Systems Eva Zangerle SS 2013

PART C: Novel Approaches in DW
NoSQL and MapReduce

Stonebraker on Data Warehouses
- Star and snowflake schemas are a good idea in the DW world.
- C-Stores will dominate the DW market over time, replacing row stores.
- The vast majority of data warehouses are not candidates for main or flash memory.
- Massively parallel processor systems will be omnipresent in this market.
- "No knobs" is the only thing that makes any sense.
- Appliances should be software only.

Stonebraker on Data Warehouses (2)
- Hybrid workloads are not optimized by "one size fits all".
- Essentially all DW installations want high availability.
- DBMS should support online reprovisioning.
- Virtualization often has performance problems in a DBMS world.

Big Data
- Estimates of the International Data Corporation
  - 2006: 0.18 zettabytes stored electronically (1 zettabyte = 1 billion terabytes)
  - 2011: 1.8 zettabytes
  - 2012: 2.7 zettabytes
- Data from the Internet Archive, social networks (photos, posts), LHC, NYSE, sensor networks, machine logs, etc.
- Problem
  - Storage capacities of hard drives have increased
  - Access speeds have not kept up
  - New data needs that cannot be squeezed into an RDBMS (e.g. graph data)

NoSQL Introduction
- Aim of NoSQL
  - Develop databases that target data volumes at terabyte / petabyte scale (Web 2.0 age)
  - Relational systems are hard to scale on commodity hardware
- Characteristics (often)
  - Non-relational
  - Scale horizontally (scale-out)
  - Open source (not strictly, of course: Amazon SimpleDB)
  - Schema-free (cf. ALTER TABLE)
  - Replication
  - Consistency model is BASE / eventually consistent (often not ACID)
  - Simple API: complex queries are often not possible, only CRUD operations (create, read, update, delete)

NoSQL Definition
Taken from www.nosql-database.org

NoSQL Introduction (2)
- The time of the one-size-fits-all database is over
- Alternative to the relational database model
- Separation is non-trivial (relational SQL vs. NoSQL)
  - HadoopDB
- Advantages for specific use cases
  - Scaling, application development, operating costs
- Open source as a central element
  - No top-down approach
  - De facto standards (Hadoop)
- Nowadays interpreted as "Not only SQL"
  - Facebook
- Different flavours, new?

Early Forms of NoSQL
- Key/value database
  - Database manager (1979), stores data by use of a key in buckets (hashing)
- Document-oriented systems
  - IBM Lotus Notes (1984), stores user documents, groupware system
- Storage as key/value pairs
  - Berkeley DB (1991)
- Column-oriented systems
  - Sybase IQ (1996)
- Distribution of these systems was limited compared to RDBMS

NOSQL - a Short History (1)
- 1998: the word NoSQL is used for the first time (relational model, but no SQL API)
- 2004: Google's Map/Reduce & BigTable system, GFS
- Until 2005: NoSQL-like systems (Db4o (objects), Neo4j (graphs))
- 2006-2009: development of standard NoSQL systems (HBase, CouchDB, MongoDB, Redis, etc.)
- Since 2009: increasing popularity, the term NoSQL is widely used
  - Umbrella term for all non-relational trends
- http://nosql-databases.org

NOSQL Overview (1)
- NoSQL core systems
  - Key/value stores
  - (Wide) column stores
  - Document stores
  - Graph databases
- Soft NoSQL systems
  - Object databases
  - XML databases
  - Grid databases
  - Many more non-relational databases

NOSQL Overview (3)
- Key/value stores
  - Mostly easy-to-use key/value schema
  - Simple schema
  - Queries limited
  - Values are not only strings (sets, lists)
  - Amazon Dynamo, Redis, Voldemort
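The key/value model with its CRUD-only interface can be sketched in a few lines. This is a toy in-memory illustration, not the API of Dynamo, Redis or Voldemort; the class name `KeyValueStore` and the key naming scheme are assumptions made for the example.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key/value store exposing only CRUD operations."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def create(self, key: str, value: Any) -> None:
        if key in self._data:
            raise KeyError(f"key already exists: {key}")
        self._data[key] = value

    def read(self, key: str) -> Optional[Any]:
        # Lookup by key is the only query; there is no filtering by value.
        return self._data.get(key)

    def update(self, key: str, value: Any) -> None:
        if key not in self._data:
            raise KeyError(f"unknown key: {key}")
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()
store.create("user:1:name", "Alice")                 # string value
store.create("user:1:tags", {"admin", "dev"})        # set value
store.create("user:1:visits", ["2013-03-01"])        # list value
store.update("user:1:name", "Alice B.")
store.delete("user:1:visits")
```

Anything beyond a lookup by key (for example "all users tagged admin") would have to be implemented by the application, which is the sense in which queries are limited.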

NOSQL Overview (2)
- Wide column stores / column families
  - Idea from the 80s: column-oriented storage
  - More I/O efficient for read-only queries (compression, aggregation)
  - Data warehouses: C-Store (VLDB 2005) -> Vertica -> HP, Sybase IQ
  - Column stores nowadays: HBase, Cassandra, Hypertable
  - One key for many key/value pairs (grouping into column families)
  - Mixture of the column-based approach and key/value systems
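A rough sketch of the "one key for many key/value pairs" idea follows. It models a table as row key -> column family -> column -> value using nested dictionaries; the helper names `put` and `column_values` and the sample orders are assumptions for illustration and do not mirror the HBase or Cassandra APIs.

```python
from collections import defaultdict
from typing import Any, Dict, List

# row key -> column family -> column name -> value
table: Dict[str, Dict[str, Dict[str, Any]]] = defaultdict(lambda: defaultdict(dict))


def put(row_key: str, family: str, column: str, value: Any) -> None:
    """Store one cell; rows need not share the same set of columns."""
    table[row_key][family][column] = value


def column_values(family: str, column: str) -> List[Any]:
    """Scan a single column across all rows, skipping rows that lack it."""
    return [
        families[family][column]
        for families in table.values()
        if column in families.get(family, {})
    ]


put("order:1", "customer", "name", "Alice")
put("order:1", "sales", "amount", 120.0)
put("order:2", "customer", "name", "Bob")
put("order:2", "sales", "amount", 80.0)
put("order:2", "sales", "discount", 5.0)  # only this row has a discount

total_amount = sum(column_values("sales", "amount"))
```

An aggregation such as `total_amount` touches only the `sales:amount` cells, which hints at why column-oriented layouts suit read-only analytical queries.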

NOSQL Overview (4)
- Document-based stores
  - Storage of structured data collections: JSON, YAML, RDF
  - Nested key/value pairs
  - CouchDB (storage as JSON), MongoDB (BSON), Riak
- Graph databases
  - Management of graph or tree structures
  - Graph structure: hyperlink structure, PageRank, shortest paths (social networks)
  - Property graphs ("Alice (23) knows Bob (46)")
  - Problems of relational databases (recursion) can be avoided
  - Neo4j, Sones GraphDB (native graph databases)
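The property graph example and the shortest-path use case can be illustrated with a small sketch. Nodes carry properties, edges carry a label, and a breadth-first search follows "knows" edges without any recursive joins. The node "carol" and the function name `shortest_path` are hypothetical additions for the example; this is plain Python, not Neo4j's query interface.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Node id -> properties ("Alice (23)", "Bob (46)"; Carol is a made-up third node).
nodes: Dict[str, Dict[str, object]] = {
    "alice": {"name": "Alice", "age": 23},
    "bob": {"name": "Bob", "age": 46},
    "carol": {"name": "Carol", "age": 31},
}

# Node id -> outgoing edges as (label, target node id).
edges: Dict[str, List[Tuple[str, str]]] = {
    "alice": [("knows", "bob")],
    "bob": [("knows", "carol")],
    "carol": [],
}


def shortest_path(start: str, goal: str, label: str = "knows") -> Optional[List[str]]:
    """Breadth-first search along edges with the given label."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for edge_label, neighbour in edges.get(node, []):
            if edge_label == label and neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # goal is not reachable


path = shortest_path("alice", "carol")
```

In a relational schema the same question needs a self-join per hop or a recursive query, which is the recursion problem the slide refers to.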

NoSQL basics
- Map/Reduce
- CAP Theorem

Map/Reduce
- "MapReduce: Simplified Data Processing on Large Clusters" (2004)
  - Framework developed at Google, Dean & Ghemawat
- Processing of large amounts of data
  - Terabytes up to petabytes
  - Concurrent computations
  - Inverted indexes
- Hide everything from end users
  - Automatic parallelization and distribution
  - No deadlocks, no side effects (computation on copies of the original data)
  - Fault tolerance for software and hardware
  - I/O scheduling (disk scheduling)
  - Monitoring

MapReduce basics
- Based on two constructs: Map & Reduce
  - Primitives present in functional languages like Lisp / ML
- Input and output are key/value pairs
  - map (k1,v1) -> list(k2,v2) (generates a set of intermediate key/value pairs)
  - reduce (k2,list(v2)) -> list(v2) (merges the values of the same key)
- No side effects -> parallel execution possible
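The two signatures can be written down as a small single-process sketch. The driver `map_reduce` below only groups intermediate pairs by key and is an assumption for illustration, not the Google or Hadoop implementation; the inverted index mapper and reducer are likewise made-up examples.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")


def map_reduce(
    inputs: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],   # map (k1,v1) -> list(k2,v2)
    reducer: Callable[[K2, List[V2]], List[V2]],           # reduce (k2,list(v2)) -> list(v2)
) -> Dict[K2, List[V2]]:
    """Sequential stand-in for the framework: map, group by key, reduce."""
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    return {k2: reducer(k2, values) for k2, values in groups.items()}


def index_mapper(doc_name: str, contents: str) -> Iterable[Tuple[str, str]]:
    """Emit (word, document name) for every word in the document."""
    for word in contents.lower().split():
        yield word, doc_name


def index_reducer(word: str, doc_names: List[str]) -> List[str]:
    """Merge all document names of one word into a sorted, duplicate-free list."""
    return sorted(set(doc_names))


docs = [("d1", "big data systems"), ("d2", "data warehouse systems")]
inverted_index = map_reduce(docs, index_mapper, index_reducer)
```

Because neither function touches shared state, the calls to `mapper` (and, after grouping, to `reducer`) could run in any order or in parallel.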

MapReduce approach

A first example
- Wordcount: count words in a document

  Map (String key, String value)
    // key: doc name
    // value: doc contents
    For each word w in value:
      Emit (w, "1");

  Reduce (String key, Iterator values)
    // key: a word
    // values: a list of counts
    int result = 0;
    For each v in values:
      result += ParseInt(v);
    Emit (result);
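A runnable version of the word count pseudocode might look as follows. The grouping step in `word_count` stands in for the framework and runs in a single process; the function names and the two sample documents are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(doc_name: str, contents: str) -> Iterator[Tuple[str, str]]:
    """Map: emit (word, "1") for every word in the document."""
    for word in contents.split():
        yield word, "1"


def reduce_counts(word: str, counts: Iterable[str]) -> str:
    """Reduce: sum the emitted counts of one word."""
    result = 0
    for v in counts:
        result += int(v)
    return str(result)


def word_count(documents: Dict[str, str]) -> Dict[str, str]:
    """Run map over all documents, group by word, then reduce each group."""
    intermediate: Dict[str, List[str]] = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_words(name, contents):
            intermediate[word].append(count)
    return {word: reduce_counts(word, counts) for word, counts in intermediate.items()}


counts = word_count({"a.txt": "to be or not to be", "b.txt": "to do"})
```

Counts are passed as strings to mirror the `ParseInt` call in the pseudocode; plain integers would work just as well.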

A first example
Example taken from http://tarnbarford.net/journal/mapreduce-on-mongo

Data flow (1)
1. Distribute the input data to different map processes
2. Parallel execution of the map processes (provided by the user!)
3. Save intermediate results
   - Wait until all processes are finished
4. Start reduce processes (provided by the user!)
   - One reduce process for every set of intermediate results
5. Save final results
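The five steps can be traced in a small sketch that uses a thread pool in place of separate map and reduce processes. The pool, the function names and the word count payload are assumptions for illustration; a real framework distributes the work across machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def map_task(chunk: str) -> List[Tuple[str, int]]:
    """User-provided map function: one (word, 1) pair per word."""
    return [(word, 1) for word in chunk.split()]


def reduce_task(item: Tuple[str, List[int]]) -> Tuple[str, int]:
    """User-provided reduce function: sum the values of one key."""
    key, values = item
    return key, sum(values)


def run_job(chunks: List[str], workers: int = 2) -> Dict[str, int]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Steps 1 and 2: distribute the chunks and run the map tasks in parallel.
        # list() blocks until every map task has finished (the barrier in step 3).
        map_outputs = list(pool.map(map_task, chunks))

        # Step 3: save the intermediate results, grouped by key.
        intermediate: Dict[str, List[int]] = defaultdict(list)
        for output in map_outputs:
            for key, value in output:
                intermediate[key].append(value)

        # Step 4: one reduce task per set of intermediate results.
        results = dict(pool.map(reduce_task, list(intermediate.items())))

    # Step 5: the final results (returned here instead of written to disk).
    return results


job_result = run_job(["a b a", "b c"])
```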

Dataflow improved (1)
- Input is divided into chunks of 16-64 MB
- Master and workers
  - The master assigns data, the workers do the job
- Map gets an input chunk
  - Writes intermediate results periodically to the local filesystem, into R different partitions
- Reducer gets its data over RPC
  - Sorting and grouping (several keys per reduce worker)
  - Applies the reduce function
  - Results are saved in R output files
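The role of the R partitions can be sketched as follows. Each map task splits its output into R buckets by hashing the key, and reducer r collects bucket r from every map task, sorts it and groups equal keys. Using `zlib.crc32` as the partition function, in-memory lists instead of local files, and line-based chunks instead of 16-64 MB chunks are all assumptions made to keep the example small and deterministic.

```python
import zlib
from itertools import groupby
from typing import Dict, List, Tuple

R = 3  # number of reduce partitions / output files

Pair = Tuple[str, int]


def partition(key: str, r: int = R) -> int:
    """Deterministic partition function: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % r


def split_into_chunks(records: List[str], chunk_size: int) -> List[List[str]]:
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk: List[str]) -> List[List[Pair]]:
    """Map one chunk and return R buckets (stand-ins for R intermediate files)."""
    buckets: List[List[Pair]] = [[] for _ in range(R)]
    for line in chunk:
        for word in line.split():
            buckets[partition(word)].append((word, 1))
    return buckets


def reduce_partition(r: int, map_outputs: List[List[List[Pair]]]) -> Dict[str, int]:
    """Fetch bucket r from every map task, sort, group by key, apply reduce."""
    pairs = sorted(pair for output in map_outputs for pair in output[r])
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(pairs, key=lambda pair: pair[0])
    }


lines = ["a b", "b c", "c a", "a"]
map_outputs = [map_chunk(chunk) for chunk in split_into_chunks(lines, 2)]
output_files = [reduce_partition(r, map_outputs) for r in range(R)]
```

Since the partition depends only on the key, all values of one key end up at the same reducer, and one reducer may handle several keys.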

Dataflow improved (2)

Master details
- Maintain a master machine directing the job
  - Assigns chunks to map tasks (map tasks store their output in intermediate files)
  - Scheduling across machines
  - Inter-machine communication
  - Notifies reduce tasks where to find the data (RPC)
  - Reassignment of tasks (machine errors)
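Task assignment and reassignment can be illustrated with a heavily simplified sketch. The master hands chunks to workers in round-robin order, drops a worker once it fails, puts the failed chunk back into the queue, and records where each intermediate result lives so reducers could be told. The function `run_master`, the round-robin policy and the simulated failure set are assumptions for the example, not the scheduling logic of a real MapReduce master.

```python
from collections import deque
from typing import Dict, List, Set


def run_master(chunks: List[str], workers: List[str], failing_workers: Set[str]) -> Dict[str, str]:
    """Return a map chunk -> worker holding that chunk's intermediate file."""
    alive = list(workers)
    pending = deque(chunks)
    locations: Dict[str, str] = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no healthy workers left")
        chunk = pending.popleft()
        worker = alive[turn % len(alive)]
        turn += 1
        if worker in failing_workers:
            # Machine error: forget the worker and reassign the chunk later.
            alive.remove(worker)
            pending.append(chunk)
            continue
        # The map task succeeded; remember where reducers can find its output.
        locations[chunk] = worker
    return locations


locations = run_master(
    chunks=["c0", "c1", "c2", "c3"],
    workers=["w1", "w2", "w3"],
    failing_workers={"w2"},
)
```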

CAP Theorem & BASE

CAP Theorem
- Consistency is a major requirement for database systems
  - ACID (Atomicity, Consistency, Isolation, Durability)
- Things changed with Web 2.0
  - Consistency & parallel access (horizontal scaling -> replication)
- CAP theorem
  - Consistency
  - Availability
  - Partition tolerance (= fault tolerance)
  - Only 2 of the 3 properties can be achieved in a distributed database (Eric Brewer, 2000, Principles of Distributed Computing)
  - Proven theorem

CAP Theorem (2)
- Consistency
  - Consistent state of a distributed system after a transaction has finished
  - Consistent if and only if all replicated nodes have been updated by the previous transaction
- Availability
  - Acceptable reaction time
  - Depends on the system
- Partition tolerance
  - If a node fails, the whole system is still available
  - Redundancy

CAP Theorem (3)
- Example: distributed DB (simplified)
  - K1 is responsible for writes on data D0
  - K2 is responsible for reads of D0
- New write
  - D0 changes into state D1 after the write
  - K2 receives the update and the new state (via a synchronisation mechanism)
- NOW: network down
  - Block K1? (the system only reacts after full synchronisation, blocks certain writes, or is eventually brought down)
  - Loosen consistency for availability?
  - Depends on the use case
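The K1/K2 example can be played through in a small simulation. In the mode called "CP" here, writes are refused while the network is down, so K2 never serves stale data; in the mode called "AP", K1 keeps accepting writes and K2 answers with the last state it saw. The class `ReplicatedPair` and the mode names are assumptions for illustration only.

```python
class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.value = "D0"


class ReplicatedPair:
    """K1 takes the writes, K2 serves the reads; updates are synchronised from K1 to K2."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.k1 = Node("K1")
        self.k2 = Node("K2")
        self.network_up = True

    def write(self, value: str) -> bool:
        if not self.network_up and self.mode == "CP":
            return False  # block the write: consistency is kept, availability is lost
        self.k1.value = value
        if self.network_up:
            self.k2.value = value  # synchronisation succeeds
        return True  # in AP mode the write is accepted even without synchronisation

    def read(self) -> str:
        return self.k2.value


cp = ReplicatedPair("CP")
cp.write("D1")
cp.network_up = False
cp_accepted = cp.write("D2")   # refused while partitioned

ap = ReplicatedPair("AP")
ap.write("D1")
ap.network_up = False
ap_accepted = ap.write("D2")   # accepted, but K2 still holds D1
```

After the partition, `cp.read()` still returns the last agreed state, while `ap.read()` returns a stale value until K1 and K2 can synchronise again.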

CAP Theorem (4)

Alternative consistency model
- All about availability: BASE
  - Consistency is less important than availability
  - Optimistic behaviour
  - Not consistent after every transaction
- Eventual consistency
  - Consistency is reached after a certain amount of time
  - Especially for systems with a lot of replicated nodes
- Spectrum of possibilities (not either ACID or BASE)
  - Keep in mind when using a NoSQL DB
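Eventual consistency can be made concrete with a minimal sketch. A write is acknowledged as soon as one replica has it, reads on other replicas may return old data, and a later synchronisation round brings all replicas to the newest version. The version counter, the "highest version wins" rule and the explicit `sync` call are assumptions chosen for simplicity; real systems use mechanisms such as background anti-entropy or vector clocks.

```python
from typing import List, Optional


class Replica:
    def __init__(self) -> None:
        self.value: Optional[str] = None
        self.version = 0

    def apply(self, value: Optional[str], version: int) -> None:
        # Only newer versions overwrite the local state.
        if version > self.version:
            self.value, self.version = value, version


class EventuallyConsistentStore:
    def __init__(self, replica_count: int) -> None:
        self.replicas: List[Replica] = [Replica() for _ in range(replica_count)]
        self.clock = 0

    def write(self, value: str) -> None:
        """Acknowledge the write after updating a single replica."""
        self.clock += 1
        self.replicas[0].apply(value, self.clock)

    def read(self, replica_index: int) -> Optional[str]:
        return self.replicas[replica_index].value

    def sync(self) -> None:
        """Propagate the newest version to every replica."""
        newest = max(self.replicas, key=lambda replica: replica.version)
        for replica in self.replicas:
            replica.apply(newest.value, newest.version)


ec_store = EventuallyConsistentStore(replica_count=3)
ec_store.write("D1")
stale_read = ec_store.read(2)      # not yet propagated
ec_store.sync()
fresh_read = ec_store.read(2)      # consistent after the synchronisation round
```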