CPS 516: Data-intensive Computing Systems. Instructor: Shivnath Babu TA: Zilong (Eric) Tan

Similar documents
CPS 216: Advanced Database Systems (Data-intensive Computing Systems) Shivnath Babu

NewSQL: Towards Next-Generation Scalable RDBMS for Online Transaction Processing (OLTP) for Big Data Management

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Connecting the Dots. The Progress DataDirect Vision & Roadmap. Brad Wright Product Management October 7, 2013

Big Data Technologies Compared June 2014

Lecture Data Warehouse Systems

Big Data and Its Impact on the Data Warehousing Architecture

Hurtownie Danych i Business Intelligence: Big Data

Cloud Scale Distributed Data Storage. Jürmo Mehine

Structured Data Storage

Performance and Scalability Overview

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Database Usage in the Public and Private Cloud: Choices and Preferences

Data Analytics Infrastructure

Big Data and Data Science: Behind the Buzz Words

Peninsula Strategy. Creating Strategy and Implementing Change

Preparing Your Data For Cloud

Performance and Scalability Overview

The NoSQL Ecosystem, Relaxed Consistency, and Snoop Dogg. Adam Marcus MIT CSAIL

Cloud & Big Data a perfect marriage? Patrick Valduriez

Crazy NoSQL Data Integration with Pentaho

Big Data Technologies. Prof. Dr. Uta Störl Hochschule Darmstadt Fachbereich Informatik Sommersemester 2015

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Database Revolution: Old SQL, NewSQL, NoSQL Huh? Michael Bowers April 9, 2013 v2.9

INTRODUCTION TO CASSANDRA

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

Cloud Big Data Architectures

A Study of NoSQL and NewSQL databases for data aggregation on Big Data

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

CISC 432/CMPE 432/CISC 832 Advanced Database Systems

Big Data. Lyle Ungar, University of Pennsylvania

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Composite Software Data Virtualization Turbocharge Analytics with Big Data and Data Virtualization

NoSQL Data Base Basics

BIG DATA APPLIANCES. July 23, TDWI. R Sathyanarayana. Enterprise Information Management & Analytics Practice EMC Consulting

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

Zynga Analytics Leveraging Big Data to Make Games More Fun and Social

Big Data and Industrial Internet

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Big Data Database Revenue and Market Forecast,

So What s the Big Deal?

Stream Processing on Demand for Lambda Architectures

BIG DATA: STORAGE, ANALYSIS AND IMPACT GEDIMINAS ŽYLIUS

MongoDB in the NoSQL and SQL world. Horst Rechner Berlin,

Tackling The Challenges of Big Data Big Data Storage. Tackling The Challenges of Big Data Big Data Storage. History Lesson. Michael Stonebraker

Big Data and Advanced Analytics Technologies and Use Cases"

Big Data Multi-Platform Analytics (Hadoop, NoSQL, Graph, Analytical Database)

Logistics. Database Management Systems. Chapter 1. Project. Goals for This Course. Any Questions So Far? What This Course Cannot Do.

(1) Required: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, Kimball 2013 ISBN-13:

Database Internals (Overview)

Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes

Enterprise Operational SQL on Hadoop Trafodion Overview

Real Time Big Data Processing

Marko Grobelnik Jozef Stefan Institute Ljubljana, Slovenia

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

High performance analytics. Benchmark results.

Challenges for Data Driven Systems

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Open source large scale distributed data management with Google s MapReduce and Bigtable

Integrating Big Data into the Computing Curricula

THE AGE OF BIG DATA. Chula DataScience

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

How To Scale Out Of A Nosql Database

NoSQL Systems for Big Data Management

Step by Step: Big Data Technology. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 25 August 2015

Applications for Big Data Analytics

5 Database technology Trends. Guy Harrison, Executive Director, Information Management R&D

MapReduce with Apache Hadoop Analysing Big Data

Open Source Technologies on Microsoft Azure

Introduction to NOSQL

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Hadoop. for Oracle database professionals. Alex Gorbachev Calgary, AB September 2013

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

How To Write A Database Program

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Large-Scale Data Processing

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores

6.S897 Large-Scale Systems

TABLE OF CONTENTS 1 Chapter 1: Introduction 2 Chapter 2: Big Data Technology & Business Case 3 Chapter 3: Key Investment Sectors for Big Data

BIG DATA TECHNOLOGIES, USE CASES, AND RESEARCH ISSUES

Transcription:

CPS 516: Data-intensive Computing Systems Instructor: Shivnath Babu TA: Zilong (Eric) Tan

The World of Big Data ebay had 6.5 PB of user data + 50 TB/day in 2009 From http://www.umiacs.umd.edu/~jimmylin/

From: http://www.cs.duke.edu/smdb10/

The World of Big Data ebay had 6.5 PB of user data + 50 TB/day in 2009 How much do they have now? See http://en.wikipedia.org/wiki/big_data Also see: http://wikibon.org/blog/big-data-statistics/ From http://www.umiacs.umd.edu/~jimmylin/

FOX AUDIENCE NETWORK Greenplum parallel DB 42 Sun X4500s ( Thumper ) each with: 48 500GB drives 16GB RAM 2 dual-core Opterons Big and growing 200 TB data (mirrored) Fact table of 1.5 trillion rows Growing 5TB per day 4-7 Billion rows per day Also extensive use of R and Hadoop Yahoo! runs a 4000 node Hadoop cluster (probably the largest). Overall, there are 38,000 nodes running Hadoop at Yahoo! From: http://db.cs.berkeley.edu/jmh/ As reported by FAN, Feb, 2009

How many female WWF fans under the age of 30 visited the Toyota community over the last 4 days and saw a Class A ad? How are these people similar to those that visited Nissan? From: http://db.cs.berkeley.edu/jmh/ Open-ended question about statistical densities (distributions)

No One Size Fits All Philosophy Operational Non-relational Analytic Hadoop Mapr Dryad MapReduce Online MapReduce Brisk GridGrain In-memory Platfora Druid Spark Shark Hadapt Teradata Aster RainStor SAP HANA EMC Greenplum HP Vertica Dremel Relational SAP Sybase IQ IBM Netezza Infobright Impala Calpont Oracle IBM DB2 SQL Server PostgreSQL MySQL NoSQL DataStax Enterprise Neo4J LevelDB Hypertable DEX -as-a-service Cassandra HBase OrientDB Amazon RDS ClearDB Big tables Graph Riak Google Cloud SQL FathomDB AppEngine NuvolaBase SQL Azure Database.com Datastore Redis SimpleDB DynamoDB -as-a-service NewSQL New databases Voldemort SQLFire VoltDB MongoHQ Cloudant -as-a-service MemSQL CouchBase StormDB Drizzle NuoDB Key value MongoDB Document Xeround GenieDB Clustrix SchoonerSQL RavenDB Storage engines Solr ScaleDB Clustering/sharding ScaleArc HyperDex Lucene Xapian Lotus Notes Tokutek MySQL Cluster ScaleBase Continuent Search ElasticSearch Sphinx Streaming Esper Gigascope STREAM DejaVu MarkLogic InterSystems Storm Borealis Oracle CEP DataCell SQLStream Versant Starcounter Muppet S4 StreamBase Truviso Progress Sap Sybase ASE EnterpriseDB An extension of the figure given in http://blogs.the451group.com/information_management/2012/11/02/updated-database-landscape-graphic

What we will cover in class Scalable data processing Parallel query plans and operators Systems based on MapReduce Scalable key-value stores Processing rapid, high-speed data streams Principles of query processing Indexes Query execution plans and operators Query optimization Data storage Databases Vs. Filesystems (Google/Hadoop Distributed FileSystem) Data layouts (row-stores, column-stores, partitioning, compression) Concurrency control and fault tolerance/recovery Consistency models for data (ACID, BASE, Serializability) Write-ahead logging

Course Logistics Web pages: Course home page will be at Duke, and everything else will be on github Grading: Three exams: 10 (Feb) + 15 (March) + 25 (April) = 50% Project: 10 (Jan 21) + 10 (Feb 1 Feb 21) + 10 (Feb 22 March 10) + 20 (March 11 April 15) = 50% Books: No one single book Hadoop: The Definitive Guide, by Tom White Database Systems: The Complete Book, by H. Garcia- Molina, J. D. Ullman, and J. Widom

Project Part 0: Due in 2 Weeks For every single system listed in the Data Platforms Map, give as a list of succinct points: Strengths (with numbered references) Weaknesses (with numbered references) References (can be articles, blog posts, research papers, white papers, your own assessment, ) Your own thoughts only. Don t plagiarize. List every source of help. We will enforce honor code strictly. Submit on github (md format) into repository given by Zilong Outcomes: (a) Score out of 10; (b) Project leader selection.

Project Parts 1, 2, 3 Shivnath/Zilong will work with project leaders to assign one system per project. Will also try to have one mentor per project Each student will join one project. Project starts Feb 1 Part 1: Feb 1 Feb 21 Install system Develop an application workload to exercise the system Run workload and give demo and report Part 2: Feb 22 March 15 Identify system logs/metrics and other data that will help you understand deeply how the system is running the workload Collect and send the data to a Kafka/MySQL/ElasticSearch routing and storage system set up by Shivnath/Zilong. Give demo and report Part 3: March 16 to April 15 Analyze and visualize the data to bring out some nontrivial aspects of the system related to what we learn in class. Give demo and report

Primer on DBMS and SQL

Data Management User/Application Query Query Data Query DataBase Management System (DBMS)

Example: At a Company Query 1: Is there an employee named Nemo? Query 2: What is Nemo s salary? Query 3: How many departments are there in the company? Query 4: What is the name of Nemo s department? Query 5: How many employees are there in the Accounts department? Employee ID Name DeptID Salary 10 Nemo 12 120K 20 Dory 156 79K 40 Gill 89 76K 52 Ray 34 85K Department ID Name 12 IT 34 Accounts 89 HR 156 Marketing

DataBase Management System (DBMS) High-level Query Q Answer DBMS Translates Q into best execution plan for current conditions, runs plan Data

Example: Store that Sells Cars Owners of Honda Accords who are <= 23 years old Make Model OwnerID ID Name Age Honda Accord 12 12 Nemo 22 Honda Accord 156 156 Dory 21 Join (Cars.OwnerID = Owners.ID) Cars Filter (Make = Honda and Model = Accord) Make Model OwnerID Honda Accord 12 Toyota Camry 34 Mini Cooper 89 Honda Accord 156 Owners Filter (Age <= 23) ID Name Age 12 Nemo 22 34 Ray 42 89 Gill 36 156 Dory 21

DataBase Management System (DBMS) Keeps data safe and correct despite failures, concurrent updates, online processing, etc. High-level Query Q DBMS Data Answer Translates Q into best execution plan for current conditions, runs plan