Thilina Buddhika April 6, 2015
Agenda
- Course Logistics
- Quiz 8 Review
- Giga Sort - FAQ
- Census Data Analysis - Introduction
- Implementing Custom Data Types in Hadoop
Course Logistics HW3-PC Component 1 (Giga Sort) is due Wednesday, April 8th by 5:00 p.m.
Quiz 8 Review
Quiz 08 Review
1. The number of reducers in a MapReduce job is not governed by the size of the input. [True/False]
2. Consider a MapReduce job with 1000 mappers and 100 reducers. Each mapper generates 100 partitions of its intermediate output space. [True/False]
3. Since HDFS stores data in 64 MB blocks (by default), the average space lost to internal fragmentation for a given file is 32 MB, i.e. half the block size. [True/False]
4. Increasing the block size to, say, 512 MB in HDFS can reduce the degree of concurrency in processing. [True/False]
5. In HDFS Federation, the system namespace is shared between multiple namenodes. [True/False]
6. In HDFS Federation, the block pool storage is partitioned between multiple namenodes. [True/False]
7. In HDFS High Availability, individual datanodes can choose to send block reports to either one of the namenodes. [True/False]
8. The size of the available main memory at a namenode can potentially impact the size and performance of the entire file system. [True/False]
9. Data flow traffic in HDFS passes through the namenode. [True/False]
10. Consider a large file managed by HDFS with a replication level of 3. It is possible that at a particular instant, blocks comprising this file have a replication factor of 2 or 4 due to failures. [True/False]
Setting Up HDFS - FAQs
- Bind exceptions: mostly due to port conflicts.
  - Identify the conflicting port, and make sure you have not used the same port twice in your configurations.
  - Look out for any hanging processes spawned by you that have not been killed.
  - For YARN, consult the default yarn-site.xml at https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, search for the conflicting port, and identify the corresponding property. Override that property in your yarn-site.xml. The same procedure applies to hdfs-site.xml.
  - Bind exceptions can also occur when trying to start the namenode/resource manager from a different host.
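As an illustration of the override procedure above, the snippet below changes the ResourceManager's client port (whose default, 8032, comes from yarn-default.xml). The hostname and port shown are placeholders, not values from the assignment:

```xml
<configuration>
  <!-- Overrides yarn.resourcemanager.address (default port 8032 in yarn-default.xml).
       The host and port below are illustrative; choose a free port on your machine. -->
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>mymachine.example.com:9032</value>
  </property>
</configuration>
```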
Setting Up HDFS - FAQs
- Accessing the shared cluster:
  - Make sure to run your own cluster in parallel to the shared cluster.
  - There is a typo in the core-site.xml provided for the client configuration: it has an extra <property> element at the end of the file. Please remove it.
  - You should override the HADOOP_CONF_DIR variable only when you run the MapReduce program.
  - Do NOT use the shared cluster for debugging; there is a limit on the number of concurrent jobs that can be handled at a given time.
Giga Sort - FAQs
- There are duplicates in the input set; you should preserve them.
- All the numbers are positive.
- What should be included in the submission?
  - hash (root value of the hash tree)
  - source
  - Ant build file / Makefile (working)
  - ReadMe: CSU ID, input file name (unsorted-0, ...), number of reducers used (16 or 32), additional notes
  - File name: LASTNAME_FIRSTNAME_HW3_PC.tar
- You should be able to submit the second component after this Friday.
HW3-PC - Analyzing 1990 US census data
US Census Dataset
- The input data set is comprised of a collection of files.
- Each file contains a set of flat records.
- A record contains a set of fields; a field can be an identification field or a data field.
US Census Dataset
- Each record contains 9610 ASCII characters.
- A record is broken down into two segments of 4805 characters each.
- The first 300 characters in each segment contain identification/geographic information; the layout of these 300 characters is identical across both segments.
- The logical record number uniquely identifies each record (6 digits starting at index 19).
- The logical part number uniquely identifies a segment within a record (4 digits starting at index 25).
- The total number of record segments (4 digits starting at index 29).
US Census Dataset
- Process only records with a summary level of 100.
- The state identification code is a 2-character code (CO, NY, CA, etc.).
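A minimal sketch of extracting the identification fields listed above, assuming the slide's offsets are 0-based character indices (the widths line up: 19+6 = 25 and 25+4 = 29, which matches the stated start positions). The class name is hypothetical:

```java
// Sketch: pulling identification fields out of one census record segment.
// Field positions and widths are those given on the slides, assumed 0-based.
public class CensusSegmentParser {

    // Logical record number: 6 digits starting at index 19.
    public static int logicalRecordNumber(String segment) {
        return Integer.parseInt(segment.substring(19, 25).trim());
    }

    // Logical part number: 4 digits starting at index 25.
    public static int logicalPartNumber(String segment) {
        return Integer.parseInt(segment.substring(25, 29).trim());
    }

    // Total number of record segments: 4 digits starting at index 29.
    public static int totalSegments(String segment) {
        return Integer.parseInt(segment.substring(29, 33).trim());
    }
}
```

In a MapReduce job, this parsing would typically happen in the mapper, applied to each segment's first 300 identification characters.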
US Census Dataset - Mini Datasets
- Single data file for Arkansas: http://www.cs.colostate.edu/~cs455/hw3-pc-sample-data/stf1bxak.f01
- Minimum data set for 5 states: http://www.cs.colostate.edu/~cs455/hw3-pc-sample-data/census-mini-dataset.tar.gz
Developing your own data types
- You are not restricted to the primitive data types supported by Hadoop; you can implement composite data types.
- A custom type should implement the Writable interface, i.e. the write and readFields methods.
- If the type is going to be used as a key, it must also define a sort order: implement the WritableComparable interface (which combines Writable and Comparable) and its compareTo method.
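The points above can be sketched with a hypothetical composite type pairing a state code with a count. In a real job the class would declare `implements org.apache.hadoop.io.WritableComparable<StateCount>`; that declaration is omitted here so the sketch compiles without the Hadoop jars, but the write/readFields/compareTo methods below are exactly the ones Hadoop invokes:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical composite type: a 2-character state code plus a count.
// A real Hadoop key would declare:
//   implements org.apache.hadoop.io.WritableComparable<StateCount>
public class StateCount implements Comparable<StateCount> {
    private String state; // 2-character state code, e.g. "CO"
    private long count;

    public StateCount() { }  // Hadoop instantiates types via a no-arg constructor
    public StateCount(String state, long count) {
        this.state = state;
        this.count = count;
    }

    // Serialize the fields, in a fixed order, to the output stream.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(state);
        out.writeLong(count);
    }

    // Deserialize the fields in exactly the order they were written.
    public void readFields(DataInput in) throws IOException {
        state = in.readUTF();
        count = in.readLong();
    }

    // Required for key types: defines the sort order used during the shuffle.
    @Override
    public int compareTo(StateCount other) {
        int byState = state.compareTo(other.state);
        return byState != 0 ? byState : Long.compare(count, other.count);
    }

    public String getState() { return state; }
    public long getCount() { return count; }
}
```

Note that readFields must consume the fields in the same order and with the same types that write produced them; a mismatch corrupts every record that follows in the stream.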
Developing your own data types - Examples Custom Type - http://www.cs.colostate.edu/~cs455/ examples/bookmetricinfo.java Comparable Custom Type - http://www.cs.colostate. edu/~cs455/examples/comparableaggregatevalue.java
Wrap Up Questions?