L1: Introduction to Hadoop



L1: Introduction to Hadoop
Feng Li <feng.li@cufe.edu.cn>
School of Statistics and Mathematics
Central University of Finance and Economics
Revision: December 1, 2014

Today we are going to learn...
1 General Information
2 The Big Data background
3 What is Hadoop?
4 Install Hadoop
5 Exercises

Feng Li (SAM.CUFE.EDU.CN) Parallel Computing for Big Data 2 / 19

General Information I

Instructor: Feng Li <feng.li@cufe.edu.cn>
Language: The course is taught in Chinese, but all assignments and examinations are handed out in English; students are free to answer in either English or Chinese.
Reception hours: Questions about this course are most welcome during lectures. They can also be asked after Thursday's lecture or via email.
Literature:
- Holmes, Alex. Hadoop in Practice. Manning Publications Co., 2012.
- White, Tom. Hadoop: The Definitive Guide, Third Edition. O'Reilly Media, Inc., 2012.

General Information II

Lecture notes and updated news are available at http://feng.li/teaching/big-data-computing/
Workload: This is a difficult course, and I suggest you study for at least the equivalent of the lecture hours after each lecture to meet the minimal requirements of the exam.
Assignments and examinations: Three sets of take-home group assignments (40% of the total course score).

The Big Data background

Big data is commonly characterized by:
- Volume: the quantity of data
- Variety: the category of data
- Velocity: the speed at which data is generated
- Variability: the inconsistency of data
- Veracity: the quality of the data

Big data brings with it two fundamental challenges: how to store and work with voluminous data sizes, and, more important, how to understand data and turn it into a competitive advantage.

Discussion: How do we estimate the average height of adolescents nationwide?

Hadoop fills a gap in the market by effectively storing and providing computational capabilities over substantial amounts of data. It is a distributed system made up of a distributed filesystem, and it offers a way to parallelize and execute programs on a cluster of machines.

What is Hadoop? I

Hadoop is a platform that provides both distributed storage and computational capabilities. Hadoop proper is a distributed master-slave architecture that consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computation.

What is Hadoop? II

Figure: The Hadoop architecture

A Brief History of Hadoop

Hadoop was created by Doug Cutting. At the time, Google had published papers describing its novel distributed filesystem, the Google File System (GFS), and MapReduce, a computational framework for parallel processing. The successful implementation of these papers' concepts resulted in the Hadoop project.

Who uses Hadoop?
- Facebook uses Hadoop, Hive, and HBase for data warehousing and real-time application serving.
- Twitter uses Hadoop, Pig, and HBase for data analysis, visualization, social graph analysis, and machine learning.
- Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email antispam, ad optimization...
- eBay, Samsung, Rackspace, J.P. Morgan, Groupon, LinkedIn, AOL, Last.fm...

Core Hadoop components: HDFS I

HDFS is the storage component of Hadoop. It is a distributed file system with two logical components: the NameNode and the DataNodes. HDFS replicates each file a configured number of times, is tolerant of both software and hardware failure, and automatically re-replicates the data blocks that were stored on failed nodes. HDFS is not designed to work well with random reads over small files, due to its optimization for sustained throughput.
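The block-and-replica idea can be sketched in a few lines of Python. This is illustrative only: the block size mirrors a common HDFS default, and the round-robin placement is a simplified assumption, not HDFS's actual rack-aware policy.

```python
# Illustrative sketch (not Hadoop code): how a file is split into
# fixed-size blocks and each block is assigned to several DataNodes.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # the default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_id, length) pairs covering a file of `file_size` bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Naive round-robin placement: each block goes to `replication` nodes."""
    placement = {}
    for block_id, _ in blocks:
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
nodes = ["dn1", "dn2", "dn3", "dn4"]
print(len(blocks))                    # 3 blocks: 128 MB + 128 MB + 44 MB
print(place_replicas(blocks, nodes))  # which DataNodes hold each block
```

If one of the four nodes fails, every block still has two live replicas elsewhere, which is exactly what lets the NameNode re-replicate the lost copies in the background.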

Core Hadoop components: HDFS II

Figure: The HDFS architecture

Core Hadoop components: MapReduce I

MapReduce is a batch-based, distributed computing framework. It allows you to parallelize work over a large amount of raw data. This type of work, which could take days or longer using conventional serial programming techniques, can be reduced to minutes using MapReduce on a Hadoop cluster. MapReduce allows the programmer to focus on addressing business needs rather than getting tangled up in distributed-system complications. MapReduce does not lend itself to use cases that need real-time data access.

Core Hadoop components: MapReduce II

Figure: The MapReduce workflow.

The role of the programmer in Hadoop? I

The programmer just needs to define map and reduce functions. The map function outputs key/value tuples, which are processed by reduce functions to produce the final output. The power of MapReduce lies between the map output and the reduce input, in the shuffle and sort phases.
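As a sketch of this programming model, here is word count in plain Python (not the Hadoop Java API), with the shuffle-and-sort phase simulated in a single process; the function names are mine, not Hadoop's:

```python
from collections import defaultdict

# Word count expressed as map and reduce functions, with the
# shuffle-and-sort phase simulated in plain Python. This mimics the
# MapReduce contract; on a real cluster, Hadoop runs these steps
# distributed across many machines.

def map_fn(line):
    """Emit a (word, 1) tuple for every word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Sum all the counts that the shuffle grouped under one word."""
    return (word, sum(counts))

def run_job(lines):
    # Shuffle and sort: group every value emitted by the mappers by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: one call per distinct key, in sorted key order.
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]

result = run_job(["the quick brown fox", "the lazy dog", "the fox"])
print(result)  # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

Note that the programmer wrote only `map_fn` and `reduce_fn`; everything inside `run_job` is what the framework provides, which is exactly the division of labor described above.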

The role of the programmer in Hadoop? II

Figure: The map function pseudo code.

The role of the programmer in Hadoop? III

Figure: The MapReduce shuffle and sort.

Figure: The MapReduce architecture.

The Hadoop modes

- The standalone mode: You do not need to start any Hadoop daemons. Instead, just call /Hadoop-directory/bin/hadoop, which executes a Hadoop operation as a single Java process. This is the default mode, requires no extra configuration, and is recommended for testing purposes.
- The pseudo-distributed mode: You configure Hadoop so that all the daemons run on a single host. A separate Java Virtual Machine (JVM) is spawned for each Hadoop component or daemon, like a mini cluster on a single machine.
- The fully distributed mode: Hadoop is distributed across multiple machines, with dedicated hosts configured for the individual Hadoop components; separate JVM processes are present for all daemons.

Install Hadoop in pseudo-distributed mode I

Prerequisites:
- Linux OS
- JDK
- A dedicated Hadoop system user

Configuring SSH:
- Install the OpenSSH server
- Configure the keys

The configuration files are located at hadoop/etc/hadoop/
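For example, pointing Hadoop at a local single-node HDFS is done in two of the files under hadoop/etc/hadoop/. This is a minimal sketch for a pseudo-distributed setup: the property names are standard Hadoop configuration keys, but the port 9000 is just a commonly used choice, so check the documentation for your Hadoop version.

```xml
<!-- hadoop/etc/hadoop/core-site.xml: minimal pseudo-distributed setup.
     fs.defaultFS tells Hadoop clients where the NameNode listens. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

```xml
<!-- hadoop/etc/hadoop/hdfs-site.xml: a single node cannot hold
     three replicas, so lower the replication factor to 1. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Lowering dfs.replication to 1 matters because the default of 3 would leave every block permanently under-replicated on a one-machine cluster.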

Assignment (I)

- Install Hadoop on your own computer (pseudo-distributed mode).
- Try some commands to upload, download, copy, copy from local, and move files in HDFS. If you run into problems, start with the help documentation.
- Familiarize yourself with the Hadoop administrative system (configuration files, logs, HTTP interfaces...).
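The file operations above map onto the `hdfs dfs` command family. A hedged sketch, assuming a running pseudo-distributed cluster; the paths and file names are placeholders of mine, and older releases spell the same commands `hadoop fs`:

```shell
hdfs dfs -mkdir -p /user/feng                     # create a directory in HDFS
hdfs dfs -put data.txt /user/feng/                # upload a local file
hdfs dfs -copyFromLocal data.txt /tmp/            # same effect as -put
hdfs dfs -cp /user/feng/data.txt /tmp/copy.txt    # copy within HDFS
hdfs dfs -mv /tmp/copy.txt /tmp/moved.txt         # move/rename within HDFS
hdfs dfs -get /user/feng/data.txt ./download.txt  # download to local disk
hdfs dfs -ls /user/feng                           # list a directory
hdfs dfs -help                                    # the built-in help document
```

`hdfs dfs -help` is the "help document" mentioned in the assignment: it lists every supported file command with its options.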