CS455 - Lab 10. Thilina Buddhika. April 6, 2015




Agenda
- Course Logistics
- Quiz 8 Review
- Giga Sort - FAQ
- Census Data Analysis - Introduction
- Implementing Custom Data Types in Hadoop

Course Logistics
HW3-PC Component 1 (Giga Sort) is due Wednesday, April 8th, by 5:00 p.m.

Quiz 8 Review

Quiz 08 Review
1. The number of reducers in a MapReduce job is not governed by the size of the input. [True/False]
2. Consider a MapReduce job with 1000 mappers and 100 reducers. Each mapper generates 100 partitions of its intermediate output space. [True/False]
3. Since HDFS stores data in 64 MB blocks (by default), the average space lost to internal fragmentation for a given file is 32 MB, i.e. half the block size. [True/False]

Quiz 08 Review
4. Increasing the block size to, say, 512 MB in HDFS can reduce the degree of concurrency in processing. [True/False]
5. In HDFS Federation, the system namespace is shared between multiple namenodes. [True/False]
6. In HDFS Federation, the block pool storage is partitioned between multiple namenodes. [True/False]
7. In HDFS High Availability, individual datanodes can choose to send block reports to either one of the namenodes. [True/False]

Quiz 08 Review
8. The size of the available main memory at a namenode can potentially impact the size and performance of the entire file system. [True/False]
9. Data flow traffic in HDFS passes through the namenode. [True/False]
10. Consider a large file managed by HDFS with a replication level of 3. It is possible that, at a particular instant, blocks comprising this file have a replication factor of 2 or 4 due to failures. [True/False]

Setting Up HDFS - FAQs
Bind exceptions: mostly due to port conflicts.
- Identify the conflicting port. Make sure you have not used the same port twice in your configurations.
- Look out for any hanging processes spawned by you that were not killed.
- For YARN: consult the default yarn-site.xml at https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, search for the conflicting port, identify the corresponding property, and override that property in yarn-site.xml. The same procedure applies to hdfs-site.xml.
- Another cause: trying to start the namenode/resource manager from a different host.

Setting Up HDFS - FAQs
Accessing the shared cluster:
- Make sure to run your own cluster in parallel to the shared cluster.
- There is a typo in the core-site.xml provided for the client configuration: it has an extra <property> element at the end of the file. Please remove it.
- You should override the HADOOP_CONF_DIR property only when you run the MapReduce program.
- Do NOT use the shared cluster for debugging. There is a limit on the number of concurrent jobs that can be handled at a given time.

Giga Sort - FAQs
- There are duplicates in the input set. You should preserve the duplicates.
- All the numbers are positive.
- What should be included in the submission?
  - hash (root value of the hash tree)
  - source
  - Ant build file / Makefile (working)
  - ReadMe, containing: CSU ID, input file name (unsorted-0, ...), number of reducers used (16 or 32), and additional notes
  - File name: LASTNAME_FIRSTNAME_HW3_PC.tar
- You should be able to submit the second component after this Friday.

HW3-PC - Analyzing 1990 US census data

US Census Dataset
The input data set comprises a collection of files. Each file contains a set of flat records. A record contains a set of fields. A field can be either an identification field or a data field.

US Census Dataset
- Each record contains 9610 ASCII characters.
- A record is broken down into two segments of 4805 characters each.
- The first 300 characters of each segment contain identification/geographic information. The layout of these 300 characters is identical across both segments.
- The logical record number uniquely identifies each record (6 digits starting at index 19).
- The logical part number uniquely identifies a segment within a record (4 digits starting at index 25).
- The total number of record segments is also recorded (4 digits starting at index 29). (A parsing sketch follows below.)
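
Because these identification fields sit at fixed character positions, they can be pulled out of each segment with plain substring arithmetic. The sketch below is a minimal illustration, assuming the indices quoted above are 0-based offsets into the raw segment text; the class and field names are made up for this example and are not part of the assignment.

```java
// Minimal sketch for extracting the identification fields described above.
// Assumes the quoted indices (19, 25, 29) are 0-based offsets into the raw
// segment text; class and field names are illustrative only.
public class CensusSegmentIds {
    public final int logicalRecordNumber; // 6 digits starting at index 19
    public final int logicalPartNumber;   // 4 digits starting at index 25
    public final int totalSegments;       // 4 digits starting at index 29

    public CensusSegmentIds(String segment) {
        this.logicalRecordNumber = Integer.parseInt(segment.substring(19, 25).trim());
        this.logicalPartNumber   = Integer.parseInt(segment.substring(25, 29).trim());
        this.totalSegments       = Integer.parseInt(segment.substring(29, 33).trim());
    }
}
```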

US Census Dataset
- Process only records with a summary level of 100.
- The state identification code is a 2-character code (CO, NY, CA, etc.). (See the mapper sketch below.)
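
In a MapReduce job, these two rules usually become a filter in the mapper plus an output key based on the state code. The sketch below is only an outline under stated assumptions: the SUMMARY_LEVEL_OFFSET and STATE_CODE_OFFSET constants are hypothetical placeholders (the actual character positions must come from the census data dictionary), and the summary level is assumed to be a 3-character field.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Outline only: the two *_OFFSET constants are hypothetical placeholders;
// substitute the actual character positions from the census data dictionary.
public class SummaryLevel100Mapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int SUMMARY_LEVEL_OFFSET = 0; // placeholder; assumed 3-character field
    private static final int STATE_CODE_OFFSET = 0;    // placeholder; 2-character field

    private final Text stateKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String segment = value.toString();
        String summaryLevel = segment.substring(SUMMARY_LEVEL_OFFSET, SUMMARY_LEVEL_OFFSET + 3);
        if (!"100".equals(summaryLevel)) {
            return; // keep only summary level 100, as required above
        }
        stateKey.set(segment.substring(STATE_CODE_OFFSET, STATE_CODE_OFFSET + 2));
        context.write(stateKey, value); // group segments by state code
    }
}
```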

US Census Dataset - Mini Datasets
- Single data file for Arkansas: http://www.cs.colostate.edu/~cs455/hw3-pc-sample-data/stf1bxak.f01
- Minimum data set for 5 states: http://www.cs.colostate.edu/~cs455/hw3-pc-sample-data/census-mini-dataset.tar.gz

Developing your own data types
- You are not restricted to the primitive data types supported by Hadoop; you can implement composite data types.
- A custom type should implement the Writable interface: implement the write and readFields methods.
- If it is going to be used as a key, implement the Comparable interface in addition to Writable and implement the compareTo method (see the sketch below).
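
As a concrete illustration of the points above, here is a minimal sketch of a composite key type. WritableComparable is Hadoop's combination of the Writable and Comparable interfaces; the class and field names are invented for this sketch, not taken from the assignment.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Minimal sketch of a composite key type; names are illustrative only.
public class StatePartKey implements WritableComparable<StatePartKey> {
    private String stateCode = "";
    private int logicalPart;

    public StatePartKey() { }  // Hadoop needs a no-arg constructor for deserialization

    public StatePartKey(String stateCode, int logicalPart) {
        this.stateCode = stateCode;
        this.logicalPart = logicalPart;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(stateCode);   // serialize fields in a fixed order
        out.writeInt(logicalPart);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        stateCode = in.readUTF();  // deserialize in exactly the same order
        logicalPart = in.readInt();
    }

    @Override
    public int compareTo(StatePartKey other) {
        int cmp = stateCode.compareTo(other.stateCode);
        return (cmp != 0) ? cmp : Integer.compare(logicalPart, other.logicalPart);
    }

    @Override
    public int hashCode() {        // used by the default HashPartitioner
        return stateCode.hashCode() * 31 + logicalPart;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof StatePartKey)) return false;
        StatePartKey k = (StatePartKey) o;
        return logicalPart == k.logicalPart && stateCode.equals(k.stateCode);
    }
}
```

Note that readFields must consume fields in exactly the order write produced them, and a stable hashCode matters because the default HashPartitioner uses it to route keys to reducers.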

Developing your own data types - Examples
- Custom type: http://www.cs.colostate.edu/~cs455/examples/bookmetricinfo.java
- Comparable custom type: http://www.cs.colostate.edu/~cs455/examples/comparableaggregatevalue.java

Wrap Up
Questions?