Big Data Management (BDM) Autumn 2013 Povl Koch September 16, 2013 15-09-2013 1
Overview: Today's program
1. A few more practical details about this course
2. Chapter 7 in NoSQL Distilled
3. Introduction to the first database (DB1): MongoDB, Chapter 9 in NoSQL Distilled
4. Feedback on exercise 4 (selected data set)
5. New exercise 5
Part 1: Practical details. A few more practical details about this course
Course Homepage
- ITU Intranet http://www.itu.dk/courses/sbdm/e2013/
- Course announcements
- Use it for exercises and TA help
- First dataset selected: GitHub Archive or Instagram
- The three databases selected:
  MongoDB - http://www.mongodb.org/
  Hadoop - hadoop.apache.org/
  Neo4j - http://www.neo4j.org/
- Intro article: http://martinfowler.com/articles/nosql-intro-original.pdf
Teaching Assistants
Two teaching assistants for now:
- André Aike Baars <aaba@itu.dk>
- Ashley Philip Davison-White <ashw@itu.dk>
Course overview: only preliminary for the next 4 weeks
- Lecture 1, Aug. 26. Topics covered: Overview of course. Course details. Big Data use cases. Data Centers. Relational vs. Nonrelational. Exercise 1: Research open datasets. Exercise 2: Storage technologies. Literature: NoSQL Distilled chapter 1
- Lecture 2, Sep. 2. Topics covered: Aggregate data models, graph databases, differences from relational. Selection of Data Set 1 (DS1). Exercise 3: Experiments with DS1. Literature: NoSQL Distilled chapter 2-3
- Lecture 3, Sep. 9. Topics covered: Distribution models, consistency, version stamps. Exercise 4: More experiments with DS1. Literature: NoSQL Distilled chapter 4-6
- Lecture 4, Sep. 16. Topics covered: MongoDB introduction, basics, and Map-Reduce. Exercise 5: Map-Reduce on DS1. Literature: NoSQL Distilled chapter 7 and 9
Course overview (continued): only preliminary for the next 4 weeks
Lecture dates: 5 (Sep. 23), Oct. 7, Nov. 4
Topics covered:
- Key-Value Stores. Exercise 6: Experiment with Key-Values. Exercise 7: Data Set 2
- External lecturer: Hadoop (IBM, Søren Ravn)
- External lecturer: IBM-Vestas case (IBM, Claus Samuelsen)
Literature: NoSQL Distilled chapter 8
Also trying to get a Microsoft lecturer (maybe analytics). Philippe Bonnet (currently at INRIA Paris) will join for a lecture on security and big data.
Part 2: NoSQL Distilled Chapter 7
Central database server vs. cluster
[Figure: a single database server and a database cluster, each serving a client. Labels: stored procedures, local processing, amount of data?]
Map-Reduce
- Map-Reduce is inspired by functional programming languages
- The map function takes an aggregate data structure and emits key-value pairs
- The map function is applied independently to each single record => easily parallelizable
Map-Reduce: the reduce function aggregates the key-value pairs that share a key into a single output
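The map and reduce steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not MongoDB or Hadoop code; the order documents, their field names, and the helper `map_reduce` are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical order aggregates; the field names are assumptions for illustration.
orders = [
    {"id": 1, "items": [{"product": "tea", "quantity": 2},
                        {"product": "mug", "quantity": 1}]},
    {"id": 2, "items": [{"product": "tea", "quantity": 3}]},
]

def map_order(order):
    """Map: look at one aggregate only and emit (key, value) pairs."""
    for item in order["items"]:
        yield item["product"], item["quantity"]

def reduce_quantities(key, values):
    """Reduce: collapse all values of one key into a single output."""
    return key, sum(values)

def map_reduce(aggregates, map_fn, reduce_fn):
    groups = defaultdict(list)
    for aggregate in aggregates:        # each call is independent of the others
        for key, value in map_fn(aggregate):
            groups[key].append(value)   # group the emitted values by key
    return dict(reduce_fn(key, values) for key, values in groups.items())

totals = map_reduce(orders, map_order, reduce_quantities)
print(totals)
```

Because `map_order` never looks at more than one order, the loop over aggregates could be spread over many machines without changing the result.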
Map-Reduce: partitions. Multiple reducers can run in parallel, each working on its own partition of the keys
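A minimal sketch of partitioning, assuming the mapper output is a list of (key, value) pairs and that keys are assigned to partitions by a stable hash. The function names and the choice of `zlib.crc32` are assumptions for illustration; a real framework picks its own partitioner.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mapper output: (product, quantity) pairs.
pairs = [("tea", 2), ("mug", 1), ("tea", 3), ("pot", 4), ("mug", 5)]

def partition_of(key, n_partitions):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % n_partitions

def partition_pairs(pairs, n_partitions):
    """Send every pair with the same key to the same partition."""
    partitions = [defaultdict(list) for _ in range(n_partitions)]
    for key, value in pairs:
        partitions[partition_of(key, n_partitions)][key].append(value)
    return partitions

def reduce_partition(partition):
    """One reducer: sums the values of every key in its own partition."""
    return {key: sum(values) for key, values in partition.items()}

def parallel_reduce(pairs, n_partitions=3):
    partitions = partition_pairs(pairs, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_results = list(pool.map(reduce_partition, partitions))
    merged = {}
    for result in partial_results:
        merged.update(result)  # a key lives in exactly one partition, so no clashes
    return merged

result = parallel_reduce(pairs)
print(result)
```

The reducers never need to talk to each other, because all values for a given key end up in one partition.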
Map-Reduce: reducing data transfer
- A combining reducer must give the same output format as its input format
- Combiners can begin before the map functions have completed
Map-Reduce: not all reduce functions can be combined. What would a combinable reduce function look like?
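One way to approach the question is to check whether reducing partial results gives the same answer as reducing everything at once. The sketch below assumes two nodes with made-up values; `reduce_with_combiner` is a hypothetical helper, not part of any framework. Summing passes the check, a naive mean does not.

```python
def sum_reduce(values):
    return sum(values)

def naive_mean_reduce(values):
    return sum(values) / len(values)

def reduce_with_combiner(reduce_fn, node_outputs):
    """Run reduce_fn on each node first (the combiner), then on the partial results."""
    partials = [reduce_fn(values) for values in node_outputs]
    return reduce_fn(partials)

# Hypothetical values for one key, spread over two nodes.
node_a = [1, 2, 3]
node_b = [10]
everything = node_a + node_b

sum_direct = sum_reduce(everything)
sum_combined = reduce_with_combiner(sum_reduce, [node_a, node_b])

mean_direct = naive_mean_reduce(everything)
mean_combined = reduce_with_combiner(naive_mean_reduce, [node_a, node_b])

print(sum_direct, sum_combined)    # the two sums agree
print(mean_direct, mean_combined)  # the two means differ
```

The sum works because its output (a number) has the same format as its input and the operation does not care how the values are grouped. The mean of means loses the information about how many values each node saw.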
Map-Reduce: limitations of the Map-Reduce framework
- Map functions can only work on one aggregate
- Reduce functions can only operate on a single key
Map-Reduce: example of calculating averages
Map-Reduce: example of counting the number of orders (figure annotation: generated by map function)
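Counting and averaging can be sketched together, assuming the map function emits a count of 1 alongside each quantity so that the reducer only ever has to add things up. The line items, field names, and the `finalize` step are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical order line items.
line_items = [
    {"product": "tea", "quantity": 2},
    {"product": "tea", "quantity": 4},
    {"product": "mug", "quantity": 1},
]

def map_line(line):
    # The count field does not exist in the data; the map function generates it.
    return line["product"], {"count": 1, "quantity": line["quantity"]}

def reduce_stats(values):
    # Output has the same shape as the input, so this reducer is combinable.
    return {
        "count": sum(v["count"] for v in values),
        "quantity": sum(v["quantity"] for v in values),
    }

def finalize(stats):
    # The average is only computed at the very end, from the two sums.
    return stats["quantity"] / stats["count"]

groups = defaultdict(list)
for line in line_items:
    key, value = map_line(line)
    groups[key].append(value)

stats = {key: reduce_stats(values) for key, values in groups.items()}
averages = {key: finalize(s) for key, s in stats.items()}
print(stats)
print(averages)
```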
Map-Reduce: two-stage map-reduce example
Map-Reduce: first, monthly sales of a product (figure annotation: composite key)
Map-Reduce: second, reduce to product per year (figure annotations: new composite key; +1; no record being emitted for 2009)
Map-Reduce: lastly, merge of records
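The three steps can be sketched as two chained map-reduce stages. The figures from the slides are not reproduced here, so the details are assumptions: the sales records and field names are made up, and the second stage is assumed to compare 2011 with 2010 per product and month, emitting nothing for other years such as 2009.

```python
from collections import defaultdict

def run_stage(records, map_fn, reduce_fn):
    """Generic stage: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [(key, reduce_fn(values)) for key, values in groups.items()]

# Hypothetical sales records.
sales = [
    {"product": "tea", "year": 2011, "month": 1, "quantity": 5},
    {"product": "tea", "year": 2011, "month": 1, "quantity": 3},
    {"product": "tea", "year": 2010, "month": 1, "quantity": 4},
    {"product": "tea", "year": 2009, "month": 1, "quantity": 9},
]

# Stage 1: monthly sales of a product, using a composite key.
def map_monthly(sale):
    yield (sale["product"], sale["year"], sale["month"]), sale["quantity"]

def reduce_monthly(quantities):
    return sum(quantities)

# Stage 2: new composite key (product, month); years are assumed for the example.
CURRENT_YEAR, PRIOR_YEAR = 2011, 2010

def map_compare(record):
    (product, year, month), quantity = record
    if year == CURRENT_YEAR:
        yield (product, month), {"current": quantity, "prior": 0}
    elif year == PRIOR_YEAR:
        yield (product, month), {"current": 0, "prior": quantity}
    # Any other year (e.g. 2009): no record is emitted.

def reduce_compare(values):
    # Merge the records for one (product, month) key.
    return {
        "current": sum(v["current"] for v in values),
        "prior": sum(v["prior"] for v in values),
    }

monthly = run_stage(sales, map_monthly, reduce_monthly)
comparison = dict(run_stage(monthly, map_compare, reduce_compare))
print(monthly)
print(comparison)
```

The output of the first stage is ordinary data, so it could be stored and reused by other second stages.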
Map-Reduce: working with map-reduce
- Any programming language: Java, etc.
- Specialized programming languages: Apache Pig (spinout from Hadoop), Hive with SQL-like syntax
Map-Reduce: summary
- The map function maps an aggregate to key-value pairs
- Map functions only read a single aggregate at a time, so they parallelize well
- Reduce functions take many key-value pairs and give a single output
- Reduce functions only work on a single key, so they can easily be parallelized
- Map-reduce stages can be chained and intermediate results can be stored
Part 3: MongoDB. MongoDB introduction, NoSQL Distilled chapter 9
MongoDB: differences between Oracle database and MongoDB (figure annotation: must be unique)
MongoDB: different data structures in the same collection ("table") (figure annotations: array; max doc size: 16 MB)
MongoDB features: replica sets
- Figure annotation: assigned by user, 0 1000
- Every write can specify how many nodes it must be written to, e.g., a majority (WriteConcern)
- Every read can specify whether it may be served by a slave node (slaveOk)
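What "majority" means for a write can be sketched without a database. The snippet below is plain Python, not the MongoDB driver API; the function names and the `w` parameter are assumptions that loosely mirror how a write concern is expressed.

```python
def majority(n_members):
    """Smallest number of members that is more than half of the replica set."""
    return n_members // 2 + 1

def write_acknowledged(acks, n_members, w="majority"):
    """Return True once enough members have confirmed the write.

    w is either the string "majority" or a fixed number of members.
    """
    required = majority(n_members) if w == "majority" else int(w)
    return acks >= required

# Hypothetical replica set with 5 members.
print(majority(5))
print(write_acknowledged(acks=2, n_members=5))        # not yet a majority
print(write_acknowledged(acks=3, n_members=5))        # majority reached
print(write_acknowledged(acks=1, n_members=5, w=1))   # only one member required
```

A higher requirement makes writes safer against node failure but slower, since the client waits for more members.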
MongoDB: uses of replica sets
- Data redundancy
- Automated failover
- Read scaling
- Disaster recovery
MongoDB features: transactions
- Not possible in the traditional way
- Writes can be atomic transactions per document only
From MongoDB documentation: "Write operations are atomic on the level of a single document: no single write operation can atomically affect more than one document or more than one collection. When a single write operation modifies multiple documents, the operation as a whole is not atomic, and other operations may interleave. The modification of a single document, or record, is always atomic, even if the write operation modifies multiple subdocument within the single record."
MongoDB scaling: making the database handle more READ load. When a new node joins the replica set, it automatically gets synchronized
MongoDB scaling: making the database handle more WRITE load. Sharding/specialization based on a selected field or compound field that exists in all documents, e.g., first name
MongoDB: two types of shard keys: range key and hash key
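The difference between the two shard key types can be sketched in plain Python, using first name as the key. This is not how MongoDB is configured; the boundaries, the shard count, and the use of MD5 as the hash are assumptions chosen for the example.

```python
import bisect
import hashlib

N_SHARDS = 3

# Assumed range boundaries: shard 0 holds names < "i",
# shard 1 holds "i" <= name < "r", shard 2 holds names >= "r".
RANGE_BOUNDARIES = ["i", "r"]

def range_shard(first_name):
    """Range key: neighbouring key values end up on the same shard."""
    return bisect.bisect_right(RANGE_BOUNDARIES, first_name.lower())

def hash_shard(first_name):
    """Hash key: key values are scattered evenly, neighbours are separated."""
    digest = hashlib.md5(first_name.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % N_SHARDS

for name in ["Anna", "Anne", "Povl", "Søren"]:
    print(name, range_shard(name), hash_shard(name))
```

A range key keeps range queries (all names starting with "A") on one shard but can create hot spots. A hash key spreads load evenly but turns a range query into a query against every shard.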
MongoDB: find, update and set documents in the collection inventory
MongoDB: aggregation/map-reduce in MongoDB, example: the code
MongoDB: aggregation/map-reduce in MongoDB, example: the evaluation
MongoDB use cases
Good:
- Event logging
- Content management / blogging
- Web analytics, real-time analytics
- E-commerce applications
Bad:
- Complex transactions spanning multiple operations
- Varying aggregate structures
Exercise for today: Experiment with MongoDB
Exercise 5: Experiment with MongoDB
Experiments with MongoDB and your dataset
Your CEO has returned home from a conference where he has heard about Map-Reduce and how well it scales. Based on your selected data set (Instagram or GitHub Archive), you decide to make some experiments with MongoDB:
- Consider how to distribute your data among multiple nodes on a single site to optimize analysis/read operations. What principle would you use (range- or hash-based distribution), and on what keys?
- Decide on 3 different analyses of your data for which map-reduce would be well suited, and describe what the query, map and reduce functions would be.