CS435 Introduction to Big Data
Final Exam

Date: May 11, 6:20PM - 8:20PM
Location: CSB 130
Closed book, NO cheat sheets

Topics covered
*Note: Final exam is NOT comprehensive.

1. NoSQL
   - Impedance mismatch
   - Scale-up vs. scale-out
   - Polyglot persistence
   - Consistency

2. Column-family storage systems (BigTable)
   - Data model of BigTable
   - 3-level hierarchical lookup scheme for tablets
   - Read/write operations
   - Data compaction in BigTable
   - Data compression in BigTable (What is the two-pass compression scheme?)
   - Bloom filter in BigTable

3. Key-value storage systems (Dynamo)
   - Partitioning (consistent hashing)
   - Chord protocol
   - Vector clocks
   - Data versioning
   - Sloppy quorum
   - Hinted handoff
   - Merkle tree
   - Ring membership
   - Logical partitioning

4. Data flow management (Pig)
   - Data types and cast
   - Relational operations
   - Skew reducing for order
   - Replicated, skewed, and merge join
   - Controlling execution
   - Algebraic interface

5. Data exchange model (RESTful web service)
   - 4 major HTTP methods for REST
   - CRUD
   - Idempotent request
   - Managing errors

Sample Questions

Question A.

    daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
                date:chararray, open:float, low:float, close:float,
                volume:int, adj_close:float);
    rough = foreach daily generate volume*close;

In the above Apache Pig script, Pig will change volume to (float)volume internally. (True/False)

Question B.
Consider that your software joins the following:
(1) File A: Airport IDs (e.g., DEN and LAX) and information (e.g., address and capacity) (15 MB)
(2) File B: Complete dataset of the flight schedules and the flight logs per airport for the last 30 years (500 GB)

If you use Apache Pig for this job, which type of join implementation would perform the best?

Answer: Fragment-replicate join (Pig's replicated join). File A (15 MB) is small enough to be held in memory, so it can be replicated to every map task and joined against fragments of File B without a reduce phase.
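
The idea behind the answer to Question B can be sketched in a few lines of Python. This is only an illustration of the map-side hash join technique, not Pig's actual implementation: the function name replicated_join, the field names, and the sample rows are all made up for the example. The small input is loaded into an in-memory lookup table, and the large input is streamed past it one row at a time.

```python
def replicated_join(small_rows, large_rows, small_key, large_key):
    """Map-side hash join sketch: small_rows must fit in memory."""
    # Build the in-memory lookup table from the small input (File A).
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)

    # Stream the large input (File B); it is never fully materialized.
    for row in large_rows:
        for match in lookup.get(row[large_key], []):
            yield {**match, **row}


# Hypothetical sample data standing in for File A and File B.
airports = [
    {"airport_id": "DEN", "capacity": 6},
    {"airport_id": "LAX", "capacity": 4},
]
flights = iter([
    {"flight": "UA100", "airport_id": "DEN"},
    {"flight": "AA200", "airport_id": "LAX"},
    {"flight": "DL300", "airport_id": "JFK"},  # no matching airport row
])

joined = list(replicated_join(airports, flights, "airport_id", "airport_id"))
for record in joined:
    print(record)
```

Because every worker holds its own copy of the small table, no shuffle or reduce step is needed, which is why this strategy suits a 15 MB input joined against a 500 GB input.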

Question C.
Suppose that you build a course content service (e.g., the Canvas system) using a RESTful web service. Users and services communicate via Canvas RESTful interfaces. The features that your service provides include:

Feature 1: Create a course
Feature 2: Create a thread for a discussion board
Feature 3: Delete a thread of a discussion board
Feature 4: Add a comment to an ongoing discussion thread of a discussion board

C-1. Which HTTP method is most suitable to build Feature 4 as a RESTful service?
   a. GET
   b. PUT
   c. DELETE
   d. POST

C-2. Which HTTP method is most suitable to build Feature 3 as a RESTful service?
   a. GET
   b. PUT
   c. DELETE
   d. POST
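
To make the idea of an idempotent request concrete, here is a small in-memory sketch in Python. It is not the Canvas API: the class DiscussionBoard, its method names, and the URL patterns in the docstrings are hypothetical, and it follows one common REST convention (POST to create or append, DELETE to remove). The point is that repeating a POST changes the server state again, while repeating a DELETE leaves the state unchanged.

```python
class DiscussionBoard:
    """Toy in-memory stand-in for a discussion-board service (hypothetical)."""

    def __init__(self):
        self.threads = {}
        self.next_id = 1

    def post_thread(self, title):
        """POST /threads : creates a new thread on every call."""
        thread_id = self.next_id
        self.next_id += 1
        self.threads[thread_id] = {"title": title, "comments": []}
        return 201, thread_id

    def post_comment(self, thread_id, text):
        """POST /threads/{id}/comments : appends a comment on every call."""
        if thread_id not in self.threads:
            return 404, None
        self.threads[thread_id]["comments"].append(text)
        return 201, len(self.threads[thread_id]["comments"])

    def delete_thread(self, thread_id):
        """DELETE /threads/{id} : after the first call, repeats change nothing."""
        existed = self.threads.pop(thread_id, None) is not None
        return (204 if existed else 404), None


board = DiscussionBoard()
_, tid = board.post_thread("Final exam questions")

# POST is not idempotent: sending the same request twice adds two comments.
board.post_comment(tid, "Is the exam comprehensive?")
board.post_comment(tid, "Is the exam comprehensive?")
comment_count = len(board.threads[tid]["comments"])

# DELETE is idempotent: the state after one call equals the state after two.
first_status, _ = board.delete_thread(tid)
state_after_first = dict(board.threads)
second_status, _ = board.delete_thread(tid)
state_after_second = dict(board.threads)

print(comment_count, first_status, second_status)
```

Note that idempotency is about the resulting server state, not the response: in this sketch the second DELETE returns a different status code than the first, yet the stored data is the same after both calls.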

Question D.
Suppose that there is a DHT ID circle with an identifier space of size 2^m where m = 3. The DHT uses the Chord protocol and the ID space spans 0 to (2^m - 1). Initially, there is only one storage node A (id = 3) on the identifier ring of the DHT.

D-1. Create a finger table for node A. (Specify .start, .interval, and .successor for each entry.)

Answer:

   start   interval   successor
   4       [4,5)      A
   5       [5,7)      A
   7       [7,3)      A

D-2. Assume that two new machines, B and C, have joined the DHT in the following order: B (with id = 0), then C (with id = 5). Create finger tables for these nodes. If needed, modify the finger table at node A.

Answers:

At Node A
   start   interval   successor
   4       [4,5)      C
   5       [5,7)      C
   7       [7,3)      B

At Node B
   start   interval   successor
   1       [1,2)      A
   2       [2,4)      A
   4       [4,0)      C

At Node C (start = (5 + 2^(k-1)) mod 8 for k = 1, 2, 3)
   start   interval   successor
   6       [6,7)      B
   7       [7,1)      B
   1       [1,5)      A
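
The finger tables in Question D can be checked mechanically. The sketch below assumes the standard Chord definition, in which entry k of node n starts at (n + 2^(k-1)) mod 2^m, covers the interval up to (n + 2^k) mod 2^m, and points to the first node at or after the start on the ring. The function names successor and finger_table are chosen for this example only.

```python
def successor(ident, node_ids):
    """First node whose id is >= ident, wrapping around the ring."""
    ring = sorted(node_ids)
    for node_id in ring:
        if node_id >= ident:
            return node_id
    return ring[0]  # wrapped past the largest id


def finger_table(n, node_ids, m):
    """Return a list of (start, (start, end), successor) for node n."""
    size = 2 ** m
    table = []
    for k in range(1, m + 1):
        start = (n + 2 ** (k - 1)) % size
        end = (n + 2 ** k) % size  # for k == m this wraps back to n
        table.append((start, (start, end), successor(start, node_ids)))
    return table


M = 3
names = {3: "A", 0: "B", 5: "C"}

# D-1: node A is alone on the ring.
table_a_alone = finger_table(3, [3], M)

# D-2: B (id 0) and C (id 5) have joined.
ring_ids = [3, 0, 5]
tables = {names[n]: finger_table(n, ring_ids, M) for n in ring_ids}

for name, table in tables.items():
    print("At Node", name)
    for start, (lo, hi), succ in table:
        print(f"  {start}  [{lo},{hi})  {names[succ]}")
```

The successor function here scans a sorted list, which is fine for a three-node example; a real Chord node would locate successors by routing through finger tables rather than by knowing every node id.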

Question E.
Suppose that a Dynamo cluster maintains Merkle trees (per data partition) to synchronize replicas. For a single data partition, the total number of data blocks stored at each of the replication servers is 4,096 = 2^12. Assume that one data block has been corrupted at one of the replication servers. The degree of replication of this system is 3. To find the corrupted data, what is the maximum number of comparisons? And why?

Answer:
Compare the roots of the hash trees to find the replication server with the corrupted block:

   V1 = are_these_the_same(merkle_tree_A(root), merkle_tree_B(root))   ---(1)
   V2 = are_these_the_same(merkle_tree_B(root), merkle_tree_C(root))   ---(2)

If V1 is true, C contains the corrupted block.
If V1 is false and V2 is true, A contains the corrupted block.
If V1 is false and V2 is false, B contains the corrupted block.

Consider a server with a corrupted data block (assume that it is A) and a server without corruption (B or C). Now compare the hash values from the root (this has been done already in the previous step) down to the leaf. A tree over 2^12 leaf blocks has 12 levels below the root, and at each level both child hashes are compared. Therefore, there are at most 2 x 12 = 24 comparisons between the trees.   ---(3)

By (1), (2), and (3), the maximum number of comparisons is 2 + 24 = 26.
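
The count in Question E can be reproduced with a small simulation. This is a sketch under stated assumptions, not Dynamo's implementation: it uses SHA-256 from Python's hashlib, a binary tree over 4,096 made-up blocks, exactly one corrupted block on replica A, and it compares both children at every level, which matches the 2 x 12 counting used in the answer. The names build_tree and find_corrupt_block, and the corrupted index 2500, are chosen for illustration.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return levels of hashes: levels[0] is [root], levels[-1] the leaves."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    levels.reverse()
    return levels


def find_corrupt_block(good, bad):
    """Walk down from the root, comparing both children at each level.

    Assumes exactly one leaf differs. Returns (leaf_index, comparisons).
    """
    comparisons = 0
    index = 0
    for depth in range(1, len(good)):
        children = (2 * index, 2 * index + 1)
        pairs = [(c, good[depth][c] != bad[depth][c]) for c in children]
        comparisons += 2
        index = next(c for c, differs in pairs if differs)
    return index, comparisons


blocks = [f"block-{i}".encode() for i in range(4096)]
corrupted = list(blocks)
corrupted[2500] = b"garbage"  # hypothetical corruption on replica A

trees = {
    "A": build_tree(corrupted),
    "B": build_tree(blocks),
    "C": build_tree(blocks),
}

# Steps (1) and (2): two root comparisons identify the faulty replica.
v1 = trees["A"][0] == trees["B"][0]
v2 = trees["B"][0] == trees["C"][0]
culprit = "C" if v1 else ("A" if v2 else "B")

# Step (3): descend against any healthy replica.
healthy = "B" if culprit != "B" else "C"
index, below_root = find_corrupt_block(trees[healthy], trees[culprit])
total = 2 + below_root

print(culprit, index, below_root, total)
```

A descent that compares only one child per level and infers the other would need fewer comparisons, so the figure produced here is an upper bound for this particular counting convention.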