Big Data Management (BDM) Autumn 2013 Povl Koch September 2, 2013 01-09-2013 1
Overview: Today's program
1. A few more practical details about this course
2. Chapters 2 & 3 in NoSQL Distilled
3. Selection of data set 1 (DS1), exercise 1
4. Walkthrough of exercise 2 (storage technologies)
5. New exercise 3
Part 1: Practical details
A few more practical details about this course
Course Homepage ITU Intranet http://www.itu.dk/courses/sbdm/e2013/ Course announcements
Teaching Assistants
Two teaching assistants for now:
- André Aike Baars <aaba@itu.dk>
- Ashley Philip Davison-White <ashw@itu.dk>
Course overview (only preliminary for the next 4 weeks)
Lecture 1, Aug. 26
- Topics: Overview of course. Course details. Big Data use cases. Data centers. Relational vs. non-relational.
- Exercises: Exercise 1: Research open datasets. Exercise 2: Storage technologies.
- Literature: NoSQL Distilled chapter 1
Lecture 2, Sep. 2
- Topics: Aggregate data models, graph databases, differences from relational. Selection of Data Set 1 (DS1).
- Exercises: Exercise 3: Experiments with DS1.
- Literature: NoSQL Distilled chapters 2-3
Lecture 3, Sep. 9
- Topics: Distribution models, consistency, version stamps.
- Exercises: Exercise 4: More experiments with DS1.
- Literature: NoSQL Distilled chapters 4-6
Lecture 4, Sep. 16
- Topics: Map-Reduce
- Exercises: Exercise 5: Map-Reduce on DS1.
- Literature: NoSQL Distilled chapter 7
Lecture 5, Sep. 23
- Topics: Key-Value Stores
- Exercises: Exercise 6: Experiment with Key-Values. Exercise 7: Data Set 2.
- Literature: NoSQL Distilled chapter 8
Part 2: NoSQL Distilled Chapters 2 & 3
Relational Data Model
- Relation
- Tuples/rows (simple types)
Aggregate Data Models
Domain-Driven Design's AGGREGATE data models
- Collection of related objects
- Complex data types: lists, etc.
- Treated as a unit for data manipulation
- Unit for consistency; updated with atomic operations
- Easier to handle in a cluster
- Natural unit for replication
- Easier for application programmers
Relational model: Example of e-commerce order in UML (NoSQL Distilled p. 15)
Relational model: Example of e-commerce order (NoSQL Distilled p. 15)
Aggregate data model: Example of e-commerce order in aggregate model, JSON notation (NoSQL Distilled p. 16)
Aggregate data model: Example of e-commerce order in aggregate model (NoSQL Distilled p. 18)
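The difference between the two models can be sketched in a few lines of Python. This is only an illustration: the field names, ids, and prices below are hypothetical and are not the book's example. The relational version keeps flat rows in separate tables linked by ids, while the aggregate version nests the order lines inside the order so the whole order is read and written as one unit.

```python
# Hypothetical e-commerce order; all field names and values are illustrative.

# Relational style: flat rows of simple types in separate tables, linked by ids.
orders = [{"id": 99, "customer_id": 1}]
order_items = [
    {"order_id": 99, "product_id": 27, "price_cents": 3245},
    {"order_id": 99, "product_id": 31, "price_cents": 1000},
]

# Aggregate style: one nested unit that is stored and fetched as a whole.
order_aggregate = {
    "id": 99,
    "customer_id": 1,
    "order_items": [
        {"product_id": 27, "price_cents": 3245},
        {"product_id": 31, "price_cents": 1000},
    ],
}


def order_total_relational(order_id):
    """Join-like scan: find the rows in order_items that belong to the order."""
    return sum(row["price_cents"] for row in order_items if row["order_id"] == order_id)


def order_total_aggregate(order):
    """Everything needed is already inside the aggregate."""
    return sum(item["price_cents"] for item in order["order_items"])
```

Both functions return the same total; the point is that the aggregate version never has to look outside the single order object.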
Aggregate orientation
Consequences of aggregate orientation
- The choice of aggregates is crucial to how easy it is to access data later
- Aggregate-ignorant databases make it easy to look at data in different ways
- Aggregate-aware databases make data easier to distribute in a cluster
- Transactions are still important for keeping data consistent
- In the relational model, an ACID transaction may span multiple tables, which can be difficult in a cluster environment
- Consistency is easier in aggregate data models; some still offer ACID properties per aggregate
Document Databases
- Document databases can be queried by key or by any field in the document
- Example of a document database: MongoDB
- Key-value databases only hold a key and its associated value:
  order_1_name -> Martin
  order_99_customer -> 1
NoSQL Distilled p. 16
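A minimal sketch of the access difference, using plain Python dictionaries as stand-ins for the two kinds of store. The two key-value entries come from the slide; the documents and the helper names `kv_get` and `find_documents` are hypothetical and do not correspond to any real database API.

```python
# Key-value store: the value is opaque to the database,
# so the only possible lookup is by key.
kv_store = {
    "order_1_name": "Martin",
    "order_99_customer": 1,
}


def kv_get(key):
    """Look up a value by its exact key; returns None if the key is absent."""
    return kv_store.get(key)


# Document store: the database can see inside each document,
# so it can answer queries on any field. Documents below are hypothetical.
documents = {
    "order_1": {"name": "Martin", "customer": 7},
    "order_99": {"name": "Anna", "customer": 1},
}


def find_documents(field, value):
    """Return the ids of all documents whose field equals value."""
    return sorted(doc_id for doc_id, doc in documents.items() if doc.get(field) == value)
```

For example, `kv_get("order_1_name")` needs the full key, while `find_documents("customer", 1)` finds matching orders without knowing their ids in advance.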
Column-Family Store
- Structure of a column-family store (inspired by BigTable)
- [Figure: row for Customer-ID 1234, with groups of columns that are accessed together]
- Do not think of this as a row but as an aggregate
- Difficult question: how much was sold over the last two weeks?
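Why the "last two weeks" question is difficult can be sketched with nested dictionaries standing in for a column-family store. Only the customer id 1234 comes from the slide; the second customer, the column family names, the orders, and the helper functions are hypothetical. Reading one customer's column family is a single lookup, but a question that cuts across customers has to visit every row.

```python
from datetime import date

# row key -> column family -> columns. All data below is hypothetical.
rows = {
    "1234": {
        "profile": {"name": "Martin", "city": "Copenhagen"},
        "orders": {
            "order_1": {"date": date(2013, 8, 25), "total": 120},
            "order_2": {"date": date(2013, 7, 1), "total": 80},
        },
    },
    "5678": {
        "profile": {"name": "Anna", "city": "Aarhus"},
        "orders": {
            "order_3": {"date": date(2013, 8, 30), "total": 50},
        },
    },
}


def get_family(customer_id, family):
    """Easy case: one row key, one column family, read together."""
    return rows[customer_id][family]


def total_sold_between(start, end):
    """Hard case: the data is organised per customer, so every row is scanned."""
    total = 0
    for families in rows.values():
        for order in families["orders"].values():
            if start <= order["date"] <= end:
                total += order["total"]
    return total
```

With this layout, `total_sold_between(date(2013, 8, 19), date(2013, 9, 2))` walks all customers even though only a few orders fall inside the window.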
Graph Databases
- Small records with complex interconnections
- [Figure: example graph of nodes connected by edges]
- Example: Neo4j
- Example query: What do Anna and Barbara both like?
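The example query can be sketched as a traversal over labelled edges. This is plain Python, not Neo4j or its query language; the edge list and the liked items are hypothetical, and only the names Anna and Barbara come from the slide.

```python
# Each edge is (source node, label, destination node). Hypothetical data.
edges = [
    ("Anna", "friend", "Barbara"),
    ("Anna", "likes", "NoSQL Distilled"),
    ("Anna", "likes", "Refactoring"),
    ("Barbara", "likes", "NoSQL Distilled"),
    ("Barbara", "likes", "Database Refactoring"),
]


def neighbours(node, label):
    """Follow all outgoing edges with the given label from one node."""
    return {dst for src, lbl, dst in edges if src == node and lbl == label}


def liked_by_both(a, b):
    """Nodes reachable via a 'likes' edge from both a and b."""
    return neighbours(a, "likes") & neighbours(b, "likes")
```

The query is two short traversals plus a set intersection; in a relational schema the same question would typically need a self-join on a "likes" table.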
Schemaless Databases
Schema vs. schemaless
Schema-based databases
- Easy to format data for reports and web presentation
- Difficult to accommodate new data, although SQL does allow tables to be changed
Schemaless databases (key-value, document, column-store, graph)
- Easy to add new data; best for non-uniform data
- But formatting becomes more difficult, and there is an implicit schema anyway
- Can be very difficult to deal with both old data and new data
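The "implicit schema" point can be illustrated with a small sketch. The records and field names are hypothetical: an older version of an application stored a single `name` field, a newer version stores `first_name` and `last_name`. The database accepts both, so the knowledge of both layouts ends up in the code that reads the data.

```python
# Two records written by different (hypothetical) versions of an application.
old_record = {"id": 1, "name": "Anna Jensen"}
new_record = {"id": 2, "first_name": "Barbara", "last_name": "Nielsen"}


def display_name(record):
    """The schema lives here: the reader must know every layout ever written."""
    if "name" in record:
        # Old layout: one combined field.
        return record["name"]
    # New layout: separate first and last name fields.
    return f'{record["first_name"]} {record["last_name"]}'
```

Every further change to the record layout adds another branch like this, which is why handling old and new data together can become difficult.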
Data access: Modelling/optimizing for data access
Column Store vs. Graph versions: Column Store version, Graph version
Part 3: Selection of data set 1 (DS1), exercise 1
Received student proposals
1. Instagram http://instagram.com/developer/realtime/
2. Crime of Chicago https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
3. OpenStreetMap http://www.openstreetmap.org/
4. Transport for London http://www.tfl.gov.uk/businessandpartners/syndication/16493.aspx#17615
5. 1000 genomes http://www.1000genomes.org/
- Any kind of weather dataset
6. Datasift social data, multiple datasets http://datasift.com/
7. Facebook stream https://developers.facebook.com/docs/reference/api/realtime/
8. Twitter https://dev.twitter.com/docs/streaming-apis
9. Sloan Digital Sky Survey http://www.sdss.org/dr6/index.html
10. GitHub Archive http://www.githubarchive.org
11. Wikipedia, esp. political purposes http://dumps.wikimedia.org/enwiki/
- Airline ticket prices
Proposal 1: Instagram http://instagram.com/developer/realtime/
Proposal 2: Crime of Chicago https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
Proposal 3: OpenStreetMap http://www.openstreetmap.org/ http://wiki.openstreetmap.org/wiki/databases_and_data_access_apis
Proposal 4: Transport for London http://www.tfl.gov.uk/businessandpartners/syndication/16493.aspx#17615
Proposal 5: 1000 genomes http://www.1000genomes.org/
Proposal 6: Datasift social data http://datasift.com/
Proposal 7: Facebook
Proposal 8: Twitter https://dev.twitter.com/docs/streaming-apis
Proposal 9: Sloan Digital Sky Survey http://www.sdss.org/dr6/index.html
Proposal 10: GitHub Archive http://www.githubarchive.org
Proposal 11: Wikipedia http://live.dbpedia.org/
Short list: most likely to succeed
- Instagram
- Transport for London
- Twitter
- GitHub
Selected data set: ?
Part 4: Feedback on exercise 2
Storage technologies: some takeaways
- Big data at the scale of 128 petabytes is expensive and requires a significant amount of space and power
- Some details, compared to a 7,200 RPM hard drive:
  - 15,000 RPM drives: 10x more expensive, low capacity
  - SATA SSD: 10x more expensive, compact, OK performance, low power
  - High-end PCIe SSD: very fast, but also very costly (150x)
  - SD cards: very compact, but slow
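A rough back-of-the-envelope sketch of how these takeaways scale to 128 petabytes. The cost multipliers are the ones listed above; everything else is an assumption made for illustration: that the multipliers apply per unit of capacity, that 1 PB = 1000 TB, and that the drive capacity and baseline cost passed in are placeholder values rather than real prices. Redundancy, enclosures, and power are ignored.

```python
import math

# Cost relative to a 7,200 RPM hard drive, taken from the takeaways above.
# Assumption: the multiplier applies per unit of capacity.
RELATIVE_COST = {
    "7,200 RPM HDD": 1,
    "15,000 RPM HDD": 10,
    "SATA SSD": 10,
    "PCIe SSD": 150,
}

TB_PER_PB = 1000  # decimal units; an assumption


def drives_needed(total_pb, drive_tb):
    """Number of drives of drive_tb terabytes needed to hold total_pb petabytes."""
    return math.ceil(total_pb * TB_PER_PB / drive_tb)


def relative_cost(baseline_cost, technology):
    """Scale a baseline (7,200 RPM) cost by the technology's multiplier."""
    return baseline_cost * RELATIVE_COST[technology]
```

With a hypothetical 4 TB drive, `drives_needed(128, 4)` gives the raw drive count before any redundancy, and `relative_cost` shows how quickly the budget grows when the same capacity is moved to faster media.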
Exercise 2: Hard disk vs. Solid State Drives
Exercise for today: First data set exercise
Exercise 3: Analysis of Data Set 1
Quick analysis of Data Set 1 (DS1)
Your CEO has now shifted focus away from building the 128-petabyte datacenter due to budget constraints. Instead, the CEO asks you to analyze further the data set that the company's Advisory Group (the BDM class) has selected today. Please write up a new recommendation for the CEO about:
- What are the specific big data characteristics of DS1?
- What is the data structure, and will it fit well with relational, key-value, column store, or graph database systems? (An overall recommendation is OK.)
- What other data sources would be relevant to combine DS1 with?