Big Data Management (BDM), Autumn 2013, Povl Koch, November 11,


1 Big Data Management Big Data Management (BDM) Autumn 2013 Povl Koch November 11,

2 Overview Today's program: 1. A few more practical details about this course. 2. Recap from last time (Google Bigtable). 3. Hadoop. 4. NoSQL Distilled chapters. Exercise. Practice exam-like exercise feedback.

3 Part 1: Practical details. A few more practical details about this course

4 Course Homepage ITU Intranet Course announcements Use it for exercises and TA help First dataset selected: GitHub Archive or Instagram Second dataset selected: London Traffic Information The three databases selected: MongoDB - Hadoop - hadoop.apache.org/ Neo4j - Intro article:

5 Teaching Assistants Two teaching assistants for now André Aike Baars Ashley Philip Davison-White

6 Course overview Past schedule Lecture Topics covered Literature 1 Aug Sep. 2 3 Sep. 9 4 Sep. 16 Overview of course. Course details. Big Data use cases. Data Centers. Relational vs. Nonrelational. Exercise 1: Research open datasets. Exercise 2: Storage technologies. Aggregate data models, graph databases, differences from relational. Selection of Data Set 1 (DS1). Exercise 3: Experiments with DS1. Distribution models, consistency, version stamps. Exercise 4: More experiments with DS1. MongoDB introduction, basics, and Map-Reduce. Exercise 5: Map-Reduce on DS1. NoSQL Distilled chapter 1 NoSQL Distilled chapter 2-3 NoSQL Distilled chapter 4-6 NoSQL Distilled chapter 7 and

7 Course overview Past schedule Lecture Topics covered Literature 5 Sep Sep Oct. 7 Oct Oct Oct. 28 Key-Value Stores. Exercise 6: Analysis with Key-Values. Graph databases. Exercise 7: Experiment graph database. Exercise 8: Data Set 2. NoSQL Distilled chapter 8 NoSQL Distilled chapter 11 External lecturer: Hadoop (IBM, Søren Ravn) NoSQL Distilled chapter 10 Autumn vacation Big Data examples, external lecturer Hanne Breddam Google Bigtable, VoltDB N/A Bigtable: A Distributed Storage System for Structured Data

8 Course overview Preliminary (MAY CHANGE) Lecture Topics covered Literature 10 Nov Nov Nov Nov Dec Dec. 9 External lecturer: IBM-Vestas case (IBM, Claus Samuelsen). Schema migrations, polyglot persistence, beyond NoSQL, choosing your database. Privacy and Big Data (Philippe Bonnet). Data Mining 1. Data Mining 2. Follow-up, practice exam. NoSQL Distilled chapter TBD Data Mining: Concepts and Techniques, chapters 2-3 Data Mining: Concepts and Techniques, chapter 4, maybe 5?

9 Part 2: Recap from last time Recap from last time

10 RECAP: Bigtable Data model. The row key is an arbitrary string. Access to column data in a row is atomic. Creation is implicit when storing data. Rows are sorted lexicographically. Rows with lexicographically similar keys are typically stored on the same machine or on a small number of machines. Columns are grouped into families, with an optional qualifier (family:optional-qualifier).
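The data model above can be pictured as a sorted map from row key to column to timestamped values. The following is a minimal plain-Java sketch of that idea, not Bigtable's API: the class `BigtableModelSketch`, its method names, and the use of `synchronized` as a stand-in for per-row atomicity are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of a Bigtable-style table (not a real client API).
class BigtableModelSketch {
    // row key -> column name "family:qualifier" -> timestamp (newest first) -> value
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    // Rows and columns are created implicitly on first write.
    // synchronized is only a toy stand-in for atomic access to a row.
    synchronized void put(String row, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, TreeMap<Long, String>>())
            .computeIfAbsent(family + ":" + qualifier,
                             c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the most recent value of a cell, or null if the cell does not exist.
    synchronized String getLatest(String row, String family, String qualifier) {
        TreeMap<String, TreeMap<Long, String>> columns = rows.get(row);
        if (columns == null) return null;
        TreeMap<Long, String> versions = columns.get(family + ":" + qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Row keys come back in lexicographic order because the outer map is sorted.
    synchronized List<String> rowKeys() {
        return new ArrayList<>(rows.keySet());
    }
}
```

Because the outer map is sorted, keys such as `com.cnn.www` and `com.cnn.www/sports` end up next to each other, which is what lets similar keys be served from the same machine.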

11 RECAP: Bigtable Writing to Bigtable Atomic write

12 RECAP: Bigtable Reading from Bigtable. A scan can be restricted to only produce anchors whose columns match the regular expression anchor:*.cnn.com, or to only produce anchors whose timestamps fall within ten days of the current time.
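The two scan restrictions can be sketched as a filter over the cells of one row. This is an illustration in plain Java, not the Bigtable scanner API: `ScanFilterSketch`, `Cell`, and `scan` are hypothetical names, and the slide's pattern is written as the Java regular expression `anchor:.*\.cnn\.com`, since Java needs `.*` where the slide shows `*`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ScanFilterSketch {
    // One cell version of a row. Hypothetical type, for illustration only.
    static final class Cell {
        final String column;
        final long timestampMillis;
        final String value;

        Cell(String column, long timestampMillis, String value) {
            this.column = column;
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    // Keeps the cells whose column name matches columnRegex and whose
    // timestamp is at most maxAgeMillis older than nowMillis.
    static List<Cell> scan(List<Cell> rowCells, String columnRegex, long nowMillis, long maxAgeMillis) {
        Pattern pattern = Pattern.compile(columnRegex);
        List<Cell> result = new ArrayList<>();
        for (Cell cell : rowCells) {
            boolean columnMatches = pattern.matcher(cell.column).matches();
            boolean recentEnough = nowMillis - cell.timestampMillis <= maxAgeMillis;
            if (columnMatches && recentEnough) {
                result.add(cell);
            }
        }
        return result;
    }
}
```

Note that this sketch applies both restrictions at once, whereas the slide presents them as alternatives; passing `".*"` as the pattern or `Long.MAX_VALUE` as the maximum age switches one of them off.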

13 RECAP: Bigtable MapReduce Bigtable can be both input and output of MapReduce jobs

14 Bigtable Architecture A tablet is a set of consecutive rows of a table and is the unit of distribution and load balancing
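Since a tablet is a range of consecutive rows, finding the server for a row key amounts to finding the tablet whose start key is the greatest one not exceeding that key. The sketch below shows only this range lookup; `TabletLocatorSketch` and the server names are hypothetical, and the real system's location lookup is more involved.

```java
import java.util.Map;
import java.util.TreeMap;

class TabletLocatorSketch {
    // Tablet start row key (inclusive) -> server name. Each tablet covers all
    // row keys from its start key up to, but not including, the next start key.
    private final TreeMap<String, String> tabletStartToServer = new TreeMap<>();

    void addTablet(String startRow, String server) {
        tabletStartToServer.put(startRow, server);
    }

    // Finds the tablet with the greatest start key that is <= rowKey.
    String serverFor(String rowKey) {
        Map.Entry<String, String> tablet = tabletStartToServer.floorEntry(rowKey);
        if (tablet == null) {
            throw new IllegalArgumentException("no tablet covers row " + rowKey);
        }
        return tablet.getValue();
    }
}
```

Moving a tablet to another machine for load balancing then only changes one entry in this map; the rows themselves keep their keys.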

15 Part 3 Part 3: Hadoop

16 Hadoop. Hadoop is an open source framework modeled on Google's MapReduce and the Google File System; the counterpart of Google's Bigtable is the related HBase project. Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data. Hadoop YARN: A framework for job scheduling and cluster resource management. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

17 Hadoop Related Hadoop projects. Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner. Avro: A data serialization system. Cassandra: A scalable multi-master database with no single points of failure. Chukwa: A data collection system for managing large distributed systems. HBase: A scalable, distributed database that supports structured data storage for large tables. Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. Mahout: A scalable machine learning and data mining library. Pig: A high-level data-flow language and execution framework for parallel computation. ZooKeeper: A high-performance coordination service for distributed applications.

18 Hadoop A multi-node Hadoop cluster

19 Hadoop Quick history at Yahoo. 2004: Initial versions of what is now the Hadoop Distributed Filesystem and Map-Reduce implemented by Doug Cutting and Mike Cafarella. December 2005: Hadoop runs reliably on 20 nodes. February 2006: Apache Hadoop project officially started to support the stand-alone development of MapReduce and HDFS. April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours. May 2006: Yahoo! set up a Hadoop research cluster of 300 nodes. Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark). October 2006: Research cluster reaches 600 nodes.

20 Hadoop Quick history at Yahoo. December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours. January 2007: Research cluster reaches 900 nodes. April 2007: Research clusters: 2 clusters of 1000 nodes. April 2008: Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes. October 2008: Loading 10 terabytes of data per day on to research clusters. March: clusters with a total of 24,000 nodes. April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).

21 Hadoop Hadoop structure [diagram: Secondary Name node, Primary Name node, JobTracker, and Data nodes 1 to N, each data node running map and reduce tasks]

22 Hadoop Roles in Hadoop cluster. Data nodes: store the actual file blocks on disk; do not store entire files; report block info to the NameNode; receive instructions from the NameNode. Primary Name Node: bookkeeper for HDFS; single point of failure! Does not store data or run jobs; manages the DataNodes. Secondary Name Node: keeps a snapshot of the NameNode; helps minimize downtime and data loss if the NameNode fails.

23 Hadoop Roles in Hadoop cluster. JobTracker: partitions tasks across the HDFS cluster; restarts failed tasks on different nodes; speculative execution. TaskTracker: tracks individual map and reduce tasks; reports progress to the JobTracker.

24 Hadoop Example of Hadoop on Amazon Web Services. EC2 instances are compute nodes (running Map/Reduce). Storage options: (1) HDFS on EC2 nodes; (2) HDFS on EC2 nodes, loading data from S3; (3) native S3 (bypasses HDFS).

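25 Hadoop Example of word counting in MapReduce (old org.apache.hadoop.mapred API). The Map class header and its two fields were cut off on the slide and are restored here from the standard Hadoop WordCount example; both classes are nested inside an enclosing job class that is not shown.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Mapper: emits <word, 1> for every whitespace-separated token of an input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for one word and emits <word, total>.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }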

26 Hadoop MapReduce example

    function Map is
        input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
        for each social.person record in the K1 batch do
            let Y be the person's age
            let N be the number of contacts the person has
            produce one output record <Y,N>
        repeat
    end function

    function Reduce is
        input: age (in years) Y
        for each input record <Y,N> do
            Accumulate in S the sum of N
            Accumulate in C the count of records so far
        repeat
        let A be S/C
        produce one output record <Y,A>
    end function

Equivalent in SQL?
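The pseudocode computes the average number of contacts per age. Below is a plain-Java simulation of the same map, shuffle, and reduce steps on in-memory data; it does not use Hadoop, and `AverageContactsSketch` and the `{age, contacts}` array layout are assumptions for illustration. If `social.person` had a hypothetical `contacts` column, the SQL counterpart would be roughly `SELECT age, AVG(contacts) FROM social.person GROUP BY age`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AverageContactsSketch {
    // Map: for one batch of person records {age, contacts}, emit one <Y, N> pair per person.
    static List<int[]> map(List<int[]> batch) {
        List<int[]> emitted = new ArrayList<>();
        for (int[] person : batch) {
            emitted.add(new int[] {person[0], person[1]});
        }
        return emitted;
    }

    // Reduce: for one age Y, accumulate the sum S and the count C, then return A = S / C.
    static double reduce(List<Integer> contactCounts) {
        long s = 0;
        long c = 0;
        for (int n : contactCounts) {
            s += n;
            c++;
        }
        return (double) s / c;
    }

    // Driver: runs map on every batch, groups the pairs by age (the shuffle step
    // that the framework performs between map and reduce), then reduces each group.
    static Map<Integer, Double> run(List<List<int[]>> batches) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (List<int[]> batch : batches) {
            for (int[] pair : map(batch)) {
                grouped.computeIfAbsent(pair[0], y -> new ArrayList<>()).add(pair[1]);
            }
        }
        Map<Integer, Double> averages = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> group : grouped.entrySet()) {
            averages.put(group.getKey(), reduce(group.getValue()));
        }
        return averages;
    }
}
```

In a real cluster the map calls run in parallel, one per batch, and the grouping by age happens across machines; the sketch keeps everything in one process to show only the data flow.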

27 Part 4 Part 4: NoSQL Distilled chapters (Schema migrations, polyglot persistence, beyond NoSQL, choosing your database)

28 Schema migrations Green field - sequence of migrations applied to a database Version numbers
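A version-numbered migration sequence can be sketched as a small runner that applies every migration newer than the version already recorded. This is a toy model under stated assumptions: `MigrationRunnerSketch` is a hypothetical class, migrations are plain strings that are collected rather than executed, and the current version lives in a field where a real tool would keep it in the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MigrationRunnerSketch {
    // Version number -> migration statement, kept sorted by version.
    private final TreeMap<Integer, String> migrations = new TreeMap<>();
    // Highest version applied so far (a real tool would store this in the database).
    private int currentVersion = 0;

    void register(int version, String statement) {
        migrations.put(version, statement);
    }

    // Applies all migrations with a version above currentVersion, in ascending
    // order, and returns the statements that were applied.
    List<String> migrateToLatest() {
        List<String> applied = new ArrayList<>();
        for (Map.Entry<Integer, String> m : migrations.tailMap(currentVersion, false).entrySet()) {
            applied.add(m.getValue()); // a real runner would execute the statement here
            currentVersion = m.getKey();
        }
        return applied;
    }

    int currentVersion() {
        return currentVersion;
    }
}
```

Running the same sequence against an empty database and against one that is already at version 2 brings both to the same final schema, which is the point of numbering the migrations.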

29 Schema migrations Green field - Example: adding a field with a migration tool. Big Data?

30 Schema migrations Legacy database migrations

31 Schema migrations Changing fname to fullname using SQL trigger When to drop the trigger?

32 Schema migrations Changes in document database

33 Schema migrations Adding discountedprice and changing price to fullprice

34 Schema migrations Incremental migration using application code Also possible with schemaversion field
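Incremental migration in application code means that each document is upgraded when the application reads it, rather than in one bulk pass. A minimal sketch, assuming documents are held as `Map<String, Object>` and that documents without a `schemaversion` field are at version 1; the class name and the version numbers are hypothetical, and only the `price` to `fullprice` rename is shown.

```java
import java.util.HashMap;
import java.util.Map;

class LazyDocumentMigrationSketch {
    static final int CURRENT_SCHEMA_VERSION = 2;

    // Returns a copy of the document in the current shape. The caller is
    // expected to save the result back, so each document is migrated once.
    static Map<String, Object> upgrade(Map<String, Object> document) {
        Map<String, Object> upgraded = new HashMap<>(document);
        // Assumption: documents written before versioning carry no schemaversion field.
        int version = (Integer) upgraded.getOrDefault("schemaversion", 1);
        if (version < 2) {
            // Version 2 renames price to fullprice.
            Object price = upgraded.remove("price");
            upgraded.put("fullprice", price);
            upgraded.put("schemaversion", CURRENT_SCHEMA_VERSION);
        }
        return upgraded;
    }
}
```

Until every document has been read and saved at least once, old and new shapes coexist in the store, so the upgrade code has to stay in place during that transition period.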

35 Schema migrations Transition period of schema changes Mobile applications?

36 Schema migrations Graph databases. Just changing the type of edges may render the application useless and is expensive. Instead, create new edges and remove the old ones when they are no longer used. Changing properties of nodes works the same way as for document databases.

37 Polyglot persistence All data in same database

38 Polyglot persistence Different database technologies for different data

39 Polyglot persistence Example of polyglot persistence

40 Polyglot persistence Wrapping data stores into services

41 Polyglot persistence Even more services Anything missing?

42 Polyglot persistence Example with structures search

43 Polyglot persistence Choosing the right technology. More skills needed: application programming, monitoring, optimization, backup, recovery, updates, etc. Also licensing, tools, upgrades, drivers, auditing, security. IBM example:

44 Beyond NoSQL Other storage technologies Event sourcing

45 Beyond NoSQL Event sourcing stores each event Snapshots persist the state
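The relationship between the event log and snapshots can be sketched as follows. The example uses a hypothetical account balance as the application state, with each event being a signed amount; `EventSourcedAccountSketch` and its methods are illustrative names, not taken from the course material.

```java
import java.util.ArrayList;
import java.util.List;

class EventSourcedAccountSketch {
    // Every change is stored as an event (here: a signed amount) and never updated in place.
    private final List<Long> events = new ArrayList<>();
    // State persisted by the most recent snapshot.
    private long snapshotBalance = 0;
    // Number of events already folded into that snapshot.
    private int snapshotEventCount = 0;

    void record(long amount) {
        events.add(amount);
    }

    // Persists the current state so later reads need not replay the whole log.
    void takeSnapshot() {
        snapshotBalance = currentBalance();
        snapshotEventCount = events.size();
    }

    // Current state = last snapshot + the events recorded after it.
    long currentBalance() {
        long balance = snapshotBalance;
        for (int i = snapshotEventCount; i < events.size(); i++) {
            balance += events.get(i);
        }
        return balance;
    }

    // Rebuilds the state from the full event log, ignoring snapshots.
    long replayFromScratch() {
        long balance = 0;
        for (long amount : events) {
            balance += amount;
        }
        return balance;
    }
}
```

The snapshot is only an optimization: replaying the full log must always give the same state, which is what makes the event log the authoritative record.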

46 Beyond NoSQL Event sourcing broadcast to multiple display nodes

47 Beyond NoSQL Other approaches: filesystems; version control systems (history, branches = different views); XML databases; object databases.

48 Choosing your database Considerations: programmer productivity; data-access performance; sticking with the default / easier acceptance; hedging your bets / encapsulating DB code / services.

49 Part 5: Exercise 10 Exercise

50 Exercise 10 Exercise 10 - Privacy Case Study (Philippe Bonnet warmup). At ITU, we are now able to non-intrusively monitor electricity consumption for each appliance (e.g., laptop/pc, screen, coffee machine) throughout the university (e.g., in ailes offices, in the atrium PCs, in the auditorium and classes). Each measurement is potentially logged and should be used to promote a culture of energy efficiency and flexible energy consumption. We want to design a system to make sense of these measurements (potentially in correlation with other meaningful data sources). We also want to preserve strong privacy guarantees for the employees and the students. For example, it should not be possible for ANYONE to check whether an employee or a student is at work at any given point in time, on any given day. How would you design such a system?