Big Data Management (BDM), Autumn 2013. Povl Koch, September 2, 2013 (slides dated 01-09-2013)




Overview: today's program
1. A few more practical details about this course
2. Chapters 2 & 3 in NoSQL Distilled
3. Selection of data set 1 (DS1): exercise 1
4. Walkthrough of exercise 2 (storage technologies)
5. New exercise 3

Part 1: Practical details
A few more practical details about this course

Course homepage
- ITU Intranet
- http://www.itu.dk/courses/sbdm/e2013/
- Course announcements

Teaching assistants
Two teaching assistants for now:
- André Aike Baars <aaba@itu.dk>
- Ashley Philip Davison-White <ashw@itu.dk>

Course overview (only preliminary, for the next 4 weeks)
Lecture 1, Aug. 26
- Topics covered: Overview of course. Course details. Big Data use cases. Data centers. Relational vs. non-relational. Exercise 1: Research open datasets. Exercise 2: Storage technologies.
- Literature: NoSQL Distilled, chapter 1
Lecture 2, Sep. 2
- Topics covered: Aggregate data models, graph databases, differences from relational. Selection of Data Set 1 (DS1). Exercise 3: Experiments with DS1.
- Literature: NoSQL Distilled, chapters 2-3
Lecture 3, Sep. 9
- Topics covered: Distribution models, consistency, version stamps. Exercise 4: More experiments with DS1.
- Literature: NoSQL Distilled, chapters 4-6
Lecture 4, Sep. 16
- Topics covered: Map-Reduce. Exercise 5: Map-Reduce on DS1.
- Literature: NoSQL Distilled, chapter 7

Course overview, continued (only preliminary)
Lecture 5, Sep. 23
- Topics covered: Key-value stores. Exercise 6: Experiment with key-values. Exercise 7: Data Set 2.
- Literature: NoSQL Distilled, chapter 8

Part 2: NoSQL Distilled, Chapters 2 & 3

Relational data model
- Tuples/rows (simple types)
- Relation

Aggregate data models
Domain-Driven Design's "aggregate" data models:
- Collection of related objects
- Complex data types: lists, etc.
- Treated as a unit for data manipulation
- Unit for consistency; updates with atomic operations
- Easier to handle in a cluster
- Natural unit for replication
- Easier for application programmers

Relational model: example of e-commerce order in UML (NoSQL Distilled, p. 15)

Relational model: example of e-commerce order (NoSQL Distilled, p. 15)

Aggregate data model: example of e-commerce order in aggregate model, JSON notation (NoSQL Distilled, p. 16)

Aggregate data model: example of e-commerce order in aggregate model (NoSQL Distilled, p. 18)
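The figures for these slides are not part of the transcript. As a rough illustration of what an order aggregate in JSON notation can look like, here is a minimal Python sketch. All field names and values are hypothetical and are not taken from the book's figure.

```python
import json

# Hypothetical order aggregate; every field name and value is illustrative.
order = {
    "id": 99,
    "customer_id": 1,
    "order_items": [
        {"product_id": 27, "name": "keyboard", "price_cents": 4500, "quantity": 1},
        {"product_id": 31, "name": "cable", "price_cents": 500, "quantity": 2},
    ],
    "shipping_address": {"city": "Copenhagen"},
    "payments": [{"method": "card", "amount_cents": 5500}],
}


def order_total_cents(aggregate):
    """Sum line totals using only data nested inside the aggregate."""
    return sum(i["price_cents"] * i["quantity"] for i in aggregate["order_items"])


# The whole aggregate is written and read back as a single unit.
serialized = json.dumps(order)
restored = json.loads(serialized)
print(order_total_cents(restored))
```

The point of the sketch is that items, address and payments travel together with the order, so computing the order total needs no join against other records.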

Aggregate orientation
Consequences of aggregate orientation:
- The choice of aggregates is crucial to how easy it is to access data later
- Aggregate-ignorant databases make it easy to look at data in different ways
- Aggregate-aware databases make data easier to distribute in a cluster
- Transactions are still important to keep consistency
- In the relational database model, ACID transactions may span multiple tables, which can be difficult in a cluster environment
- Consistency is easier in the aggregate data model; some models still offer ACID properties per aggregate
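To illustrate why aggregates are easy to distribute, the sketch below places each aggregate on exactly one node based on a hash of its ID. Hash-based placement is an assumption chosen for illustration; the slides do not prescribe a placement scheme, and the node names and helper functions are hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster nodes

# Each node holds a plain dict of aggregate ID -> aggregate.
cluster = {node: {} for node in NODES}


def node_for(aggregate_id):
    """Pick a node from a stable hash of the aggregate ID."""
    digest = hashlib.sha256(aggregate_id.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def put(aggregate_id, aggregate):
    """Store the whole aggregate on a single node."""
    cluster[node_for(aggregate_id)][aggregate_id] = aggregate


def get(aggregate_id):
    """Read the aggregate back from the one node that can hold it."""
    return cluster[node_for(aggregate_id)].get(aggregate_id)


put("order:1", {"customer_id": 7, "items": ["keyboard"]})
put("order:99", {"customer_id": 1, "items": ["cable", "mouse"]})
print(node_for("order:99"), get("order:99"))
```

Because an aggregate never spans nodes in this sketch, an update to one order touches one node only, which is what makes per-aggregate atomicity practical in a cluster.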

Document databases
- Document databases can be queried by key or by any field in the document
- Example of document database: MongoDB
- Key-value databases only hold a key and its associated value:
  order_1_name -> Martin
  order_99_customer -> 1
(NoSQL Distilled, p. 16)
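The difference between the two access styles can be sketched with in-memory Python structures. This is not the API of MongoDB or of any real key-value store; the helper names and the sample documents are assumptions made for illustration.

```python
# Key-value style: the store only understands keys; values are opaque.
kv_store = {
    "order_1_name": "Martin",
    "order_99_customer": "1",
}


def kv_get(key):
    """Only lookup by exact key is possible."""
    return kv_store.get(key)


# Document style: the store can look inside each document.
documents = {
    "order_1": {"name": "Martin", "customer": 1},
    "order_99": {"name": "Anna", "customer": 1},
}


def find(field, value):
    """Return the IDs of all documents whose field equals value."""
    return sorted(doc_id for doc_id, doc in documents.items()
                  if doc.get(field) == value)


print(kv_get("order_1_name"))
print(find("customer", 1))
```

In the key-value version, finding all orders for customer 1 would require the application to know or scan every key, while the document version can answer it from the document contents.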

Column-family store
Structure in a column-family store (inspired by BigTable)
- Figure labels: "Accessed together", "Customer-ID 1234"
- Do not think of this as a row but as an aggregate
- Difficult question: how much was sold over the last two weeks?
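A minimal sketch of the structure, modelled as nested Python dicts (row key, then column family, then columns). Real column-family stores hold simple values per column; storing a small dict per order column is a simplification, and all names, dates and totals are hypothetical.

```python
from datetime import date, timedelta

# row key -> column family -> column -> value (all data is illustrative)
rows = {
    "1234": {
        "profile": {"name": "Martin", "billing_city": "Chicago"},
        "orders": {
            "order-1": {"date": date(2013, 8, 25), "total": 120},
            "order-2": {"date": date(2013, 7, 1), "total": 80},
        },
    },
    "5678": {
        "profile": {"name": "Anna", "billing_city": "Copenhagen"},
        "orders": {
            "order-7": {"date": date(2013, 8, 30), "total": 50},
        },
    },
}


def get_family(row_key, family):
    """Cheap access: one row key, one column family, read together."""
    return rows[row_key][family]


def sold_between(start, end):
    """Expensive access: must scan the orders family of every row."""
    return sum(order["total"]
               for row in rows.values()
               for order in row.get("orders", {}).values()
               if start <= order["date"] <= end)


today = date(2013, 9, 2)
recent_sales = sold_between(today - timedelta(days=14), today)
print(get_family("1234", "profile"), recent_sales)
```

Reading everything about customer 1234 is a single lookup, whereas the two-week sales question cuts across all customer aggregates, which is why the slide calls it difficult.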

Graph databases
Example of graph databases (figure labels: edges, node)
- Small records with complex interconnections
- Example: Neo4j
- Example query: What do Anna and Barbara both like?
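The example query can be sketched over a small in-memory edge list. This is plain Python rather than Neo4j's query language, and the nodes, relation names and helper function are hypothetical.

```python
# Each edge is (source node, relation, target node); data is illustrative.
edges = [
    ("Anna", "likes", "Refactoring"),
    ("Anna", "likes", "NoSQL Distilled"),
    ("Barbara", "likes", "NoSQL Distilled"),
    ("Barbara", "likes", "Databases"),
    ("Anna", "friend", "Barbara"),
]


def neighbours(node, relation):
    """Follow all outgoing edges of one relation type from a node."""
    return {target for source, rel, target in edges
            if source == node and rel == relation}


# "What do Anna and Barbara both like?" is a set intersection.
both_like = neighbours("Anna", "likes") & neighbours("Barbara", "likes")
print(both_like)
```

A graph database keeps such edges as first-class records, so traversals like this do not need the joins a relational schema would use.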

Schemaless databases
Schema vs. schemaless
Schema-based databases:
- Easy to format data for reports and web presentation
- Difficult to accommodate new data, although SQL actually allows tables to be changed
Schemaless databases (key-value, document, column-store, graph):
- Easy to add new data; best for non-uniform data
- But formatting becomes more difficult; there is an implicit schema anyway
- Can be very difficult to deal with both old data and new data
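The "implicit schema" point can be illustrated with two versions of the same kind of record. In this sketch the field names and the helper are hypothetical; the assumption is that newer records split a single name field into two.

```python
# Old records store one "name" field; newer ones split it in two.
old_doc = {"id": 1, "name": "Anna Jensen"}
new_doc = {"id": 2, "first_name": "Barbara", "last_name": "Holm"}


def display_name(doc):
    """The reading code carries the schema: it must know both shapes."""
    if "first_name" in doc and "last_name" in doc:
        return f"{doc['first_name']} {doc['last_name']}"
    if "name" in doc:
        return doc["name"]
    raise ValueError(f"document {doc.get('id')} matches no known shape")


for doc in (old_doc, new_doc):
    print(display_name(doc))
```

The database accepts both shapes without complaint, but every piece of code that reads the data has to handle old and new records, which is where the difficulty moves to.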

Data access: modelling/optimizing for data access

Column store vs. graph versions (figures: column store version, graph version)

Part 3: Selection of data set 1 (DS1), exercise 1

Received student proposals
1. Instagram http://instagram.com/developer/realtime/
2. Crime of Chicago https://data.cityofchicago.org/public-safety/crimes-2001-to-present/ijzp-q8t2
3. OpenStreetMap http://www.openstreetmap.org/
4. Transport for London http://www.tfl.gov.uk/businessandpartners/syndication/16493.aspx#17615
5. 1000 genomes http://www.1000genomes.org/
- Any kind of weather dataset
6. Datasift social data, multiple datasets http://datasift.com/
7. Facebook stream https://developers.facebook.com/docs/reference/api/realtime/
8. Twitter https://dev.twitter.com/docs/streaming-apis
9. Sloan Digital Sky Survey http://www.sdss.org/dr6/index.html
10. GitHub Archive http://www.githubarchive.org
11. Wikipedia, especially for political purposes http://dumps.wikimedia.org/enwiki/
- Airline ticket prices

Proposal 1: Instagram
http://instagram.com/developer/realtime/

Proposal 2: Crime of Chicago
https://data.cityofchicago.org/public-Safety/Crimes-2001-to-present/ijzp-q8t2

Proposal 3: OpenStreetMap
http://www.openstreetmap.org/
http://wiki.openstreetmap.org/wiki/databases_and_data_access_apis

Proposal 4: Transport for London
http://www.tfl.gov.uk/businessandpartners/syndication/16493.aspx#17615

Proposal 5: 1000 genomes
http://www.1000genomes.org/

Proposal 6: Datasift social data
http://datasift.com/

Proposal 7: Facebook

Proposal 8: Twitter
https://dev.twitter.com/docs/streaming-apis

Proposal 9: Sloan Digital Sky Survey
http://www.sdss.org/dr6/index.html

Proposal 10: GitHub Archive
http://www.githubarchive.org

Proposal 11: Wikipedia
http://live.dbpedia.org/

Short list
Most likely to succeed:
- Instagram
- Transport for London
- Twitter
- GitHub
Selected data set: ?

Part 4: Feedback on exercise 2
Storage technologies: some takeaways
- Big data in the size of 128 petabytes is expensive and requires a significant amount of space and power
- Some details, compared to a 7,200 RPM hard drive:
  - 15,000 RPM drives: 10x more expensive, low capacity
  - SATA SSD: 10x more expensive, compact, OK performance and low power
  - PCIe SSD (high end): very fast, but also very costly (150x)
  - SD cards: very compact, but slow
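The cost multipliers above can be turned into a back-of-the-envelope comparison. This sketch assumes the multipliers apply per unit of capacity, which the slide does not state, and the baseline price per terabyte is a purely hypothetical parameter, not a figure from the course.

```python
# Cost multipliers relative to a 7,200 RPM hard drive, from the takeaways.
# Assumption: the multipliers apply per terabyte of capacity.
RELATIVE_COST = {
    "7,200 RPM hard drive": 1,
    "15,000 RPM drive": 10,
    "SATA SSD": 10,
    "PCIe SSD": 150,
}

TARGET_TB = 128 * 1000  # 128 petabytes expressed in (decimal) terabytes


def total_cost(baseline_price_per_tb, technology):
    """Raw media cost for the full capacity, ignoring servers and power."""
    return TARGET_TB * baseline_price_per_tb * RELATIVE_COST[technology]


HYPOTHETICAL_PRICE_PER_TB = 50  # illustrative baseline, in any currency

for technology in RELATIVE_COST:
    print(technology, total_cost(HYPOTHETICAL_PRICE_PER_TB, technology))
```

Whatever baseline is plugged in, the ratios stay the same, so the sketch mainly shows how quickly the faster technologies dominate the budget at this scale.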

Exercise 2: Hard disk vs. Solid State Drives

Exercise for today: first data set exercise

Exercise 3: Analysis of Data Set 1
Quick analysis of Data Set 1 (DS1)
Your CEO has now shifted focus away from building the 128 petabyte datacenter due to budget constraints. Instead, the CEO asks you to analyze further the data set that the company's Advisory Group (the BDM class) has selected today. Please write up a new recommendation for the CEO about:
- What are the specific big data characteristics of DS1?
- What is the data structure, and will it fit well with relational, key-value, column store, or graph database systems? (An overall recommendation is OK.)
- What other data sources would be relevant to combine DS1 with?