CS435 Introduction to Big Data



Final Exam

Date: May 11, 6:20 PM - 8:20 PM
Location: CSB 130
Closed book, NO cheat sheets

Topics covered
*Note: The final exam is NOT comprehensive.

1. NoSQL
   - Impedance mismatch
   - Scale-up vs. scale-out
   - Polyglot persistence
   - Consistency

2. Column-family storage systems (BigTable)
   - Data model of BigTable
   - 3-level hierarchical lookup scheme for tablets
   - Read/write operation
   - Data compaction in BigTable
   - Data compression in BigTable (What is the two-pass compression scheme?)
   - Bloom filter in BigTable

3. Key-value storage systems (Dynamo)
   - Partitioning (consistent hashing)
   - Chord protocol
   - Vector clocks
   - Data versioning
   - Sloppy quorum
   - Hinted handoff
   - Merkle tree
   - Ring membership
   - Logical partitioning

4. Data flow management (Pig)
   - Data types and cast
   - Relational operations
   - Skew reducing for order
   - Replicated, skewed, and merge join
   - Controlling execution
   - Algebraic interface

5. Data exchange model (RESTful web service)
   - 4 major HTTP methods for REST CRUD
   - Idempotent request
   - Managing errors

Sample Questions

Question A.

    daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, low:float, close:float, volume:int, adj_close:float);
    rough = foreach daily generate volume * close;

In the Apache Pig script above, Pig will internally cast volume to a float, i.e. (float) volume. (True/False)

Question B. Consider that your software joins the following:
(1) File A: Airport IDs (e.g., DEN and LAX) and information (e.g., address and capacity) (15 MB)
(2) File B: Complete dataset of the flight schedules and the flight logs per airport for the last 30 years (500 GB)

If you use Apache Pig for this job, which type of join implementation would perform best?

Answer: Fragment-replicate join (Pig's replicated join). File A is small enough to be held in memory while File B is streamed.
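The idea behind the join in Question B can be sketched outside Pig. The following is a minimal illustration in Python, not Pig's actual implementation: the small relation is loaded into an in-memory hash table, and the large relation is streamed past it once. The function name `replicated_join` and the sample rows are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def replicated_join(
    small: Iterable[Tuple[str, str]],
    large: Iterable[Tuple[str, str]],
) -> Iterator[Tuple[str, str, str]]:
    """Join two (key, value) relations on key, holding only `small` in memory."""
    # Build phase: the small relation (File A) becomes a hash table.
    lookup: Dict[str, List[str]] = {}
    for key, info in small:
        lookup.setdefault(key, []).append(info)

    # Probe phase: the large relation (File B) is read once, record by record,
    # so it never has to fit in memory or be shuffled by key.
    for key, record in large:
        for info in lookup.get(key, []):
            yield (key, info, record)


# Hypothetical sample rows standing in for File A and File B.
airports = [("DEN", "Denver, CO"), ("LAX", "Los Angeles, CA")]
flights = [
    ("DEN", "flight 101"),
    ("LAX", "flight 202"),
    ("DEN", "flight 303"),
    ("SFO", "flight 404"),  # no matching airport row, so it is dropped
]

joined = list(replicated_join(airports, flights))
for row in joined:
    print(row)
```

In a distributed setting, each worker would hold its own copy of the small relation and process one fragment of the large one, which is why the small input must fit in memory.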

Question C. Suppose that you build a course content service (e.g., the Canvas system) using a RESTful web service. Users and services communicate via Canvas RESTful interfaces. The features that your service provides include:

Feature 1: Create a course
Feature 2: Create a thread for a discussion board
Feature 3: Delete a thread of a discussion board
Feature 4: Add a comment to an ongoing discussion thread of a discussion board

C-1. Which HTTP method is most suitable to build Feature 4 as a RESTful service?
a. GET
b. PUT
c. DELETE
d. POST

C-2. Which HTTP method is most suitable to build Feature 3 as a RESTful service?
a. GET
b. PUT
c. DELETE
d. POST
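The "idempotent request" topic can be illustrated with a small in-memory model. This is only a sketch under assumptions: the class `DiscussionBoard` and its method names are hypothetical stand-ins for server-side handlers, not part of any Canvas API. It shows the usual HTTP convention that repeating a DELETE leaves the server in the same state, while repeating a POST creates an additional resource each time.

```python
from typing import Dict, List


class DiscussionBoard:
    """Hypothetical in-memory stand-in for a discussion-board resource."""

    def __init__(self) -> None:
        self.threads: Dict[str, List[str]] = {}

    def create_thread(self, thread_id: str) -> None:
        # Helper used to set up state for the demonstration.
        self.threads[thread_id] = []

    def post_comment(self, thread_id: str, text: str) -> int:
        # POST-like handler: every call appends a new comment,
        # so sending the same request twice changes the state twice.
        self.threads[thread_id].append(text)
        return len(self.threads[thread_id]) - 1

    def delete_thread(self, thread_id: str) -> None:
        # DELETE-like handler: after the first call the thread is gone;
        # repeating the call leaves the state unchanged (idempotent).
        self.threads.pop(thread_id, None)


board = DiscussionBoard()
board.create_thread("t1")
board.create_thread("t2")

# The same POST sent twice produces two comments.
board.post_comment("t1", "Is the exam comprehensive?")
board.post_comment("t1", "Is the exam comprehensive?")
comments_after_posts = len(board.threads["t1"])

# The same DELETE sent twice has the same effect as sending it once.
board.delete_thread("t2")
state_after_first_delete = dict(board.threads)
board.delete_thread("t2")
state_after_second_delete = dict(board.threads)

print(comments_after_posts)
print(state_after_first_delete == state_after_second_delete)
```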

Question D. Suppose that there is a DHT ID circle with an identifier space of size 2^m, where m = 3. The DHT uses the Chord protocol, and the ID space spans 0 to 2^m - 1. Initially, there is only one storage node, A (id = 3), on the identifier ring of the DHT.

D-1. Create a finger table for node A. (Specify .start, .interval, and .successor for each entry.)

Answer: For a node with ID n, the k-th entry starts at (n + 2^(k-1)) mod 2^m, for k = 1, ..., m.

  .start   .interval   .successor
  4        [4,5)       A
  5        [5,7)       A
  7        [7,3)       A

D-2. Assume that two new machines, B and C, have joined the DHT in the following order: B (with id = 0), then C (with id = 5). Create finger tables for these nodes. If needed, modify the finger table at node A.

Answers:

At node A (id = 3)

  .start   .interval   .successor
  4        [4,5)       C
  5        [5,7)       C
  7        [7,3)       B

At node B (id = 0)

  .start   .interval   .successor
  1        [1,2)       A
  2        [2,4)       A
  4        [4,0)       C

At node C (id = 5)

  .start   .interval   .successor
  6        [6,7)       B
  7        [7,1)       B
  1        [1,5)       A
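The finger-table construction in Question D can be checked mechanically. The sketch below assumes the standard Chord definitions: entry k of node n starts at (n + 2^(k-1)) mod 2^m, its interval ends where the next entry starts, and its successor is the first live node at or after the start, wrapping around the ring. The function names `successor` and `finger_table` are illustrative, not taken from any library.

```python
from typing import List, Tuple

Finger = Tuple[int, Tuple[int, int], int]  # (start, (interval_lo, interval_hi), successor_id)


def successor(ident: int, node_ids: List[int]) -> int:
    """Return the first node ID at or after `ident`, wrapping around the ring."""
    ring = sorted(node_ids)
    for node in ring:
        if node >= ident:
            return node
    return ring[0]


def finger_table(node_id: int, node_ids: List[int], m: int) -> List[Finger]:
    """Build the m-entry Chord finger table for `node_id` on a ring of size 2^m."""
    size = 2 ** m
    table = []
    for k in range(m):
        start = (node_id + 2 ** k) % size
        end = (node_id + 2 ** (k + 1)) % size  # the last interval ends at node_id itself
        table.append((start, (start, end), successor(start, node_ids)))
    return table


names = {3: "A", 0: "B", 5: "C"}

# D-1: node A (id = 3) is alone on the ring, so every finger points back to A.
alone = finger_table(3, [3], m=3)

# D-2: after B (id = 0) and C (id = 5) have joined.
ring = [3, 0, 5]
for node_id in ring:
    print("At node", names[node_id])
    for start, (lo, hi), succ in finger_table(node_id, ring, m=3):
        print(f"  {start}  [{lo},{hi})  {names[succ]}")
```

Recomputing every table from the full membership list is a simplification for study purposes; in the actual protocol a joining node learns its fingers by querying existing nodes.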

Question E. Suppose that a Dynamo cluster maintains Merkle trees (one per data partition) to synchronize replicas. For a single data partition, the total number of data blocks stored at each of the replication servers is 4,096 = 2^12. Assume that one data block has been corrupted at one of the replication servers. The degree of replication of this system is 3. To find the corrupted data, what is the maximum number of comparisons, and why?

Answer:

First, compare the roots of the hash trees to find the replication server with the corrupted block:

  V1 = are_these_the_same(merkle_tree_A(root), merkle_tree_B(root))   ---(1)
  V2 = are_these_the_same(merkle_tree_B(root), merkle_tree_C(root))   ---(2)

  If V1 is true, C contains the corrupted block.
  If V1 is false and V2 is true, A contains the corrupted block.
  If V1 is false and V2 is false, B contains the corrupted block.

Next, take the server with the corrupted data block (assume that it is A) and a server without corruption (B or C). Compare the hash values from the root (this comparison has already been done in the previous step) down to the leaf. With 2^12 leaves, there are 12 levels below the root, and at most two child hashes are compared at each level. Therefore, at most 2 x 12 = 24 comparisons are made between the two trees.   ---(3)

By (1), (2), and (3), the maximum number of comparisons is 2 + 24 = 26.
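The counting argument in Question E can be reproduced with a small simulation. This is a sketch under assumptions: SHA-256 is used as the hash function, the blocks are synthetic byte strings, and the helper names (`build_tree`, `locate_replica`, `find_corrupt_block`) are hypothetical. The descent compares both children at every level, which matches the worst case counted in the answer; an implementation that stops after the first mismatch at each level would need fewer comparisons.

```python
import hashlib
from typing import List, Tuple

Tree = List[List[bytes]]  # levels[0] holds the leaf hashes, levels[-1] holds the root


def build_tree(blocks: List[bytes]) -> Tree:
    """Build a binary Merkle tree over `blocks` (length must be a power of two)."""
    level = [hashlib.sha256(b).digest() for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
        levels.append(level)
    return levels


def locate_replica(root_a: bytes, root_b: bytes, root_c: bytes) -> str:
    """Two root comparisons identify the replica holding the single corrupted block."""
    v1 = root_a == root_b
    v2 = root_b == root_c
    if v1:
        return "C"
    return "A" if v2 else "B"


def find_corrupt_block(good: Tree, bad: Tree) -> Tuple[int, int]:
    """Walk from the root to a leaf; return (leaf index, number of comparisons)."""
    comparisons = 0
    index = 0
    for depth in range(len(good) - 2, -1, -1):
        left, right = 2 * index, 2 * index + 1
        left_differs = good[depth][left] != bad[depth][left]
        right_differs = good[depth][right] != bad[depth][right]
        comparisons += 2
        index = left if left_differs else right
        assert left_differs or right_differs  # exactly one subtree holds the bad block
    return index, comparisons


blocks = [f"block-{i}".encode() for i in range(4096)]
corrupted = list(blocks)
corrupted[1234] = b"corrupted data"  # arbitrary position chosen for the demo

tree_a = build_tree(corrupted)  # replica A holds the bad block
tree_b = build_tree(blocks)
tree_c = build_tree(blocks)

replica = locate_replica(tree_a[-1][0], tree_b[-1][0], tree_c[-1][0])
index, tree_comparisons = find_corrupt_block(tree_b, tree_a)
total = 2 + tree_comparisons

print(replica, index, total)
```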