Big Data Management (BDM), Autumn 2013. Povl Koch, September 16, 2013 (slides dated 15-09-2013)





Overview: Today's program
1. A few more practical details about this course
2. Chapter 7 in NoSQL Distilled
3. Introduction to the first database (DB1): MongoDB, Chapter 9 in NoSQL Distilled
4. Feedback on exercise 4 (selected data set)
5. New exercise 5

Part 1: Practical details
A few more practical details about this course

Course Homepage
- ITU Intranet: http://www.itu.dk/courses/sbdm/e2013/
- Course announcements
- Use it for exercises and TA help
- First dataset selected: GitHub Archive or Instagram
- The three databases selected:
  - MongoDB: http://www.mongodb.org/
  - Hadoop: hadoop.apache.org/
  - Neo4j: http://www.neo4j.org/
- Intro article: http://martinfowler.com/articles/nosql-intro-original.pdf

Teaching Assistants
- Two teaching assistants for now
- André Aike Baars <aaba@itu.dk>
- Ashley Philip Davison-White <ashw@itu.dk>

Course overview (only preliminary for the next 4 weeks)
Lecture 1, Aug. 26
- Topics covered: Overview of course. Course details. Big Data use cases. Data Centers. Relational vs. Nonrelational. Exercise 1: Research open datasets. Exercise 2: Storage technologies.
- Literature: NoSQL Distilled chapter 1
Lecture 2, Sep. 2
- Topics covered: Aggregate data models, graph databases, differences from relational. Selection of Data Set 1 (DS1). Exercise 3: Experiments with DS1.
- Literature: NoSQL Distilled chapters 2-3
Lecture 3, Sep. 9
- Topics covered: Distribution models, consistency, version stamps. Exercise 4: More experiments with DS1.
- Literature: NoSQL Distilled chapters 4-6
Lecture 4, Sep. 16
- Topics covered: MongoDB introduction, basics, and Map-Reduce. Exercise 5: Map-Reduce on DS1.
- Literature: NoSQL Distilled chapters 7 and 9

Course overview (only preliminary for the next 4 weeks)
Lecture 5, Sep. 23
- Topics covered: Key-Value Stores. Exercise 6: Experiment with Key-Values. Exercise 7: Data Set 2.
- Literature: NoSQL Distilled chapter 8
Oct. 7
- External lecturer: Hadoop (IBM, Søren Ravn)
Nov. 4
- External lecturer: IBM-Vestas case (IBM, Claus Samuelsen)
Also
- Trying to get a Microsoft lecturer (maybe analytics)
- Philippe Bonnet (currently at INRIA Paris) will join for a lecture on security and big data

Part 2: NoSQL Distilled
NoSQL Distilled Chapter 7

Central database server vs. cluster
[Figure: single database server compared with a database cluster. Labels: stored procedures, server, amount of data?, local processing, client]

Map-Reduce
- Map-Reduce is inspired by functional programming languages
- [Figure labels: aggregate data structure, key, value]
- The map function is applied independently to each single record => easily parallelizable
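As a minimal sketch of a map function, assuming hypothetical order aggregates (the field names `items`, `product`, `quantity` and `price`, and all values, are illustrative and not taken from the slides), each order can be mapped on its own to key-value pairs:

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical order aggregates; all field names and values are assumptions.
ORDERS: List[Dict] = [
    {"id": 1, "items": [
        {"product": "puerh", "quantity": 8, "price": 3.0},
        {"product": "dragonwell", "quantity": 12, "price": 2.0},
    ]},
    {"id": 2, "items": [
        {"product": "puerh", "quantity": 2, "price": 3.0},
    ]},
]


def map_order(order: Dict) -> Iterator[Tuple[str, Dict]]:
    """Emit one (product, summary) pair per line item of a single order."""
    for item in order["items"]:
        yield item["product"], {
            "quantity": item["quantity"],
            "revenue": item["quantity"] * item["price"],
        }


# Each call reads only one order, so the calls could run on different nodes.
pairs = [pair for order in ORDERS for pair in map_order(order)]
print(pairs)
```

Because `map_order` never looks at more than one order, nothing has to be shared between the calls, which is what makes the map step easy to parallelize.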

Map-Reduce
- Reduce function aggregates the key-value pairs
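A minimal sketch of the reduce step, assuming the map output is a list of hypothetical (product, summary) pairs: the pairs are grouped by key, and the reduce function collapses all values for one key into a single value. The helper names `group_by_key` and `reduce_product` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical map output; names and numbers are assumptions.
PAIRS: List[Tuple[str, Dict]] = [
    ("puerh", {"quantity": 8, "revenue": 24.0}),
    ("dragonwell", {"quantity": 12, "revenue": 24.0}),
    ("puerh", {"quantity": 2, "revenue": 6.0}),
]


def group_by_key(pairs: List[Tuple[str, Dict]]) -> Dict[str, List[Dict]]:
    """Collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_product(key: str, values: List[Dict]) -> Dict:
    """Aggregate all values for a single key into one summary."""
    return {
        "quantity": sum(v["quantity"] for v in values),
        "revenue": sum(v["revenue"] for v in values),
    }


totals = {
    key: reduce_product(key, values)
    for key, values in group_by_key(PAIRS).items()
}
print(totals)
```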

Map-Reduce
- Multiple reducers can run in parallel
- Partitions
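One way partitioning could work, sketched under the assumption that keys are assigned to reducers by a stable hash (the slides do not say how a real framework assigns them): every pair with the same key lands in the same partition, so each partition can be reduced independently.

```python
import zlib
from typing import Dict, List, Tuple

NUM_REDUCERS = 3  # assumed number of reducers

# Hypothetical map output: (product, quantity) pairs.
PAIRS: List[Tuple[str, int]] = [
    ("puerh", 8), ("dragonwell", 12), ("puerh", 2), ("genmaicha", 5),
]


def partition_for(key: str, num_reducers: int) -> int:
    """Pick a partition; crc32 is stable across runs, unlike str hash()."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition(pairs: List[Tuple[str, int]], num_reducers: int) -> List[List[Tuple[str, int]]]:
    """Split the pairs so that each key lives in exactly one partition."""
    partitions: List[List[Tuple[str, int]]] = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions


def reduce_partition(pairs: List[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the quantities per key within one partition."""
    totals: Dict[str, int] = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


partitions = partition(PAIRS, NUM_REDUCERS)
results = [reduce_partition(p) for p in partitions]  # could run in parallel
merged = {key: total for result in results for key, total in result.items()}
print(merged)
```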

Map-Reduce
- Reducing data transfer: a combining reducer must produce output in the same format as its input
- Combiners can begin before map functions have completed

Map-Reduce
- Not all reduce functions can be combined
- What would a combinable reduce function look like?
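As a small illustration with made-up numbers: a sum is combinable, because reducing partial results gives the same answer as reducing all values at once. A reducer that counts its input values is not, because its output (a count) no longer has the same meaning as its input (a record to be counted).

```python
from typing import List


def reduce_sum(values: List[int]) -> int:
    """Combinable: a sum of partial sums equals the sum of everything."""
    return sum(values)


def reduce_count(values: List[int]) -> int:
    """Not combinable as written: it counts inputs, whatever they are."""
    return len(values)


# Hypothetical quantities held on two different nodes.
node_a = [8, 2]
node_b = [12, 5]

sum_direct = reduce_sum(node_a + node_b)
sum_combined = reduce_sum([reduce_sum(node_a), reduce_sum(node_b)])

count_direct = reduce_count(node_a + node_b)
count_combined = reduce_count([reduce_count(node_a), reduce_count(node_b)])

print(sum_direct, sum_combined)      # the two sums agree
print(count_direct, count_combined)  # the two counts disagree
```

Counting becomes combinable once the map function emits a 1 per record and the reducer sums those ones.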

Map-Reduce
Limitations of the Map-Reduce framework
- Map functions can only work on one aggregate
- Reduce functions can only operate on a single key

Map-Reduce
- Example of calculating averages

Map-Reduce
- Example of counting number of orders
- [Figure annotation: generated by map function]
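The averaging and counting examples can be sketched together, assuming hypothetical order data. An average cannot be combined directly, so the map function emits a total and a count of 1 per line item; the reducer sums both, and the average is computed only at the very end.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical orders; field names and values are assumptions.
ORDERS: List[Dict] = [
    {"id": 1, "items": [
        {"product": "puerh", "quantity": 8},
        {"product": "dragonwell", "quantity": 12},
    ]},
    {"id": 2, "items": [
        {"product": "puerh", "quantity": 2},
    ]},
]


def map_order(order: Dict) -> Iterator[Tuple[str, Dict]]:
    for item in order["items"]:
        # "count": 1 is generated by the map function, not stored in the order.
        yield item["product"], {"total": item["quantity"], "count": 1}


def reduce_pairs(values: List[Dict]) -> Dict:
    """Sums of totals and counts are both combinable."""
    return {
        "total": sum(v["total"] for v in values),
        "count": sum(v["count"] for v in values),
    }


grouped: Dict[str, List[Dict]] = {}
for order in ORDERS:
    for key, value in map_order(order):
        grouped.setdefault(key, []).append(value)

reduced = {key: reduce_pairs(values) for key, values in grouped.items()}
averages = {key: r["total"] / r["count"] for key, r in reduced.items()}
print(reduced)
print(averages)
```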

Map-Reduce
- Two-stage map-reduce example

Map-Reduce
- First, monthly sales of a product
- [Figure annotation: composite key]

Map-Reduce
- Second, reduce to product per year
- [Figure annotations: new composite key; +1; no record being emitted for 2009]

Map-Reduce
- Lastly, merge of records
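The three steps above can be sketched as follows. This is one reading of the slides, with made-up sales figures: stage 1 totals sales per (product, year, month); stage 2 re-keys the prior year's records with year + 1 so that they meet the current year's records, emits nothing for older years, and the final reduce merges the partial records. The comparison year `TARGET_YEAR` and all names are assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical line items: (product, year, month, quantity).
SALES = [
    ("puerh", 2011, 12, 300),
    ("puerh", 2011, 12, 250),
    ("puerh", 2010, 12, 450),
    ("puerh", 2009, 12, 400),
]
TARGET_YEAR = 2011  # assumed year to compare against the year before

Key = Tuple[str, int, int]


def reduce_by_key(pairs: Iterable[Tuple[Key, object]], reducer: Callable) -> Dict:
    """Group the pairs by key and apply the reducer to each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}


# Stage 1: monthly sales per product, using a composite key.
def map_monthly(sale: Tuple[str, int, int, int]) -> Iterator[Tuple[Key, int]]:
    product, year, month, quantity = sale
    yield (product, year, month), quantity


monthly = reduce_by_key((p for s in SALES for p in map_monthly(s)), sum)


# Stage 2: build partial comparison records under a new composite key.
def map_comparison(key: Key, quantity: int) -> Iterator[Tuple[Key, Dict]]:
    product, year, month = key
    if year == TARGET_YEAR:
        yield (product, year, month), {"quantity": quantity}
    elif year == TARGET_YEAR - 1:
        # Shift the prior year by +1 so both records share one key.
        yield (product, year + 1, month), {"prior_quantity": quantity}
    # Any other year (here 2009) emits no record.


def merge_records(records: List[Dict]) -> Dict:
    """Final reduce: merge the incomplete records for one key."""
    merged: Dict = {}
    for record in records:
        merged.update(record)
    if "quantity" in merged and "prior_quantity" in merged:
        merged["increase"] = merged["quantity"] - merged["prior_quantity"]
    return merged


comparison = reduce_by_key(
    (p for key, qty in monthly.items() for p in map_comparison(key, qty)),
    merge_records,
)
print(comparison)
```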

Map-Reduce
Working with map-reduce
- Any programming language: Java, etc.
- Specialized programming languages:
  - Apache Pig (spinout from Hadoop)
  - Hive with SQL-like syntax

Map-Reduce
Summary
- The map function reduces an aggregate to key-value pairs
- Map functions only read a single aggregate at a time, which gives good parallelism
- Reduce functions take many key-value pairs and give a single output
- Reduce functions only work on a single key, so they can easily be parallelized
- Map-reduce steps can be chained, and intermediate results can be stored

Part 3: MongoDB
MongoDB introduction, NoSQL Distilled chapter 9

MongoDB
- Differences between Oracle database and MongoDB
- [Figure annotation: must be unique]

MongoDB
- Different data structures in the same collection ("table")
- [Figure annotation: array]
- Max doc size: 16 MB

MongoDB
Features: replica sets
- [Figure annotation: assigned by user, 0 1000]
- Every write can specify how many nodes must acknowledge the write, e.g., majority (WriteConcern)
- Every read can specify whether it may be served by a slave node, i.e., slaveOk
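To make "majority" concrete, here is a plain arithmetic sketch. It is not MongoDB's API, and it assumes that every member of the replica set counts toward the majority, which is a simplification.

```python
def majority(num_members: int) -> int:
    """Smallest number of members that is more than half of the set."""
    return num_members // 2 + 1


def write_acknowledged(acks: int, required: int) -> bool:
    """A write counts as successful once enough members have confirmed it."""
    return acks >= required


for members in (3, 4, 5):
    print(members, "members -> majority of", majority(members))
```

With three members, a majority write succeeds as soon as two members have confirmed it, so it survives the loss of any single node.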

MongoDB
Uses of replica sets
- Data redundancy
- Automated failover
- Read scaling
- Disaster recovery

MongoDB
Features: Transactions
- Not possible in the traditional way
- Writes can be atomic transactions per document only
- From MongoDB documentation: "Write operations are atomic on the level of a single document: no single write operation can atomically affect more than one document or more than one collection. When a single write operation modifies multiple documents, the operation as a whole is not atomic, and other operations may interleave. The modification of a single document, or record, is always atomic, even if the write operation modifies multiple subdocuments within the single record."

MongoDB
Scaling: making the database handle more READ load
- When a node joins the replica set, it automatically gets synchronized

MongoDB
Scaling: making the database handle more WRITE load
- Sharding/specialization based on a selected field or compound field that exists in all documents, e.g., first name

MongoDB
Two types of shard keys
- Range key:
- Hash key:
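The difference between the two key types can be sketched in plain Python. This is a conceptual illustration only, not MongoDB's actual routing: the split points, the number of shards and the use of SHA-256 are all assumptions.

```python
import bisect
import hashlib
from typing import List

NUM_SHARDS = 3                            # assumed number of shards
RANGE_BOUNDARIES: List[str] = ["h", "p"]  # assumed split points: a-g, h-o, p-z


def range_shard(first_name: str) -> int:
    """Range key: neighbouring key values end up on the same shard."""
    return bisect.bisect_right(RANGE_BOUNDARIES, first_name.lower())


def hash_shard(first_name: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash key: key values are spread out regardless of their order."""
    digest = hashlib.sha256(first_name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


names = ["Anna", "Anders", "Karl", "Zoe"]
range_placement = {name: range_shard(name) for name in names}
hash_placement = {name: hash_shard(name) for name in names}
print(range_placement)
print(hash_placement)
```

A range key keeps similar values together, which suits range queries but can concentrate load on one shard. A hash key tends to spread writes evenly, at the cost of scattering range queries across shards.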

MongoDB
- Find, update and set documents in the collection inventory

MongoDB
- Aggregation/map-reduce in MongoDB example: the code

MongoDB
- Aggregation/map-reduce in MongoDB example: the evaluation

MongoDB
Use cases
Good
- Event logging
- Content management / blogging
- Web analytics, real time analytics
- E-commerce application
Bad
- Complex transactions spanning multiple operations
- Varying aggregate structures

Exercise for today
Experiment with MongoDB

Exercise 5: Experiment with MongoDB
Experiments with MongoDB and your dataset
Your CEO has returned home from a conference where he has heard about Map-Reduce and how well it scales. Based on your selected data set (Instagram or GitHub Archive), you decide to make some experiments with MongoDB:
- Consider how to distribute your data among multiple nodes on a single site. To optimize analysis/read operations, what principle would you use (range- or hash-based distribution), and on what keys?
- Decide on 3 different analyses of your data where map-reduce would be well suited, and describe what the query, map and reduce functions would be.