Lecture 32: Big Data
1. The Big Data problem
2. Why the excitement about big data
3. What is MapReduce
4. What is Hadoop
5. Get started with Hadoop


Big Data Problems
Data explosion: data from users on social networks, data from mobile devices, data from the Internet of Things.
What is big data? Data in large quantities (terabytes and beyond), captured at a rapid rate, structured or unstructured, stored or held on many machines and in many locations, or some combination of the above.
Problems? Such data is difficult or costly to capture, store, manage, process, and mine using traditional methods.

Why all the excitement
There are many factors contributing to the hype around Big Data:
The scale and difficulty of the problems
The appearance of cost-effective, practical solutions
Expectations around the Internet of Things
Bringing compute and storage together on commodity hardware, i.e. cloud computing
Price/performance: the Hadoop big data technology provides significant cost savings along with significant performance improvements
Linear scalability: every parallel technology makes claims about scaling up
Full access to unstructured data: a highly scalable data store with a good parallel programming model had been a challenge for the industry for some time, until MapReduce systems such as Hadoop appeared

MapReduce
A programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of two core procedures: Map() and Reduce().
Map() performs filtering and sorting (such as sorting students by first name into queues, one queue for each name).
Reduce() performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
A "MapReduce system" (also called an "infrastructure" or "framework") runs the various tasks in parallel, manages all communication and data transfers between the parts of the system, and provides redundancy and fault tolerance.
A well-established open-source MapReduce system is Apache Hadoop.

Solving the word count example with MapReduce
Word count problem
Input: several text files, or one big file
Output: the words and their frequencies
E.g.
file00: Hello World Bye World
file01: Hello Hadoop Goodbye Hadoop
Solving the problem with the MapReduce scheme: the master assigns one Map() task per input file, each Map() emits the word counts for its file, and Reduce() merges the counts across files.
file00 -> Map() -> <Bye, 1>, <Hello, 1>, <World, 2>
file01 -> Map() -> <Goodbye, 1>, <Hadoop, 2>, <Hello, 1>
Reduce() -> <Bye, 1>, <Goodbye, 1>, <Hadoop, 2>, <Hello, 2>, <World, 2>
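To make the data flow above concrete, here is a minimal sketch of the same word count written as two small Python scripts in the style of Hadoop Streaming (covered in the ecosystem components later in this lecture): the mapper reads text from standard input and emits <word, 1> pairs, and the reducer receives the pairs grouped by word and sums the counts. The file names mapper.py and reducer.py are illustrative, not files from the lecture or the lab.

```python
#!/usr/bin/env python
# mapper.py -- hypothetical name; emits a "<word>\t1" line for every word
# read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- hypothetical name; the framework (or a manual `sort`)
# delivers the pairs grouped by word, so equal words arrive on adjacent
# lines and a running sum per word is enough
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

On a single machine the same logic can be checked with a pipeline such as cat file00 file01 | python mapper.py | sort | python reducer.py, where the sort step stands in for Hadoop's shuffle phase and the output matches the Reduce() column shown above.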

Hadoop architecture

MapReduce layer
JobTracker: manages MapReduce jobs; hands out tasks to the slave nodes, schedules them, monitors them, and re-executes failed tasks. There is exactly one JobTracker in each cluster.
TaskTracker: a slave that carries out map and reduce tasks; usually one runs alongside each DataNode.


HDFS layer
NameNode: manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster.
DataNode: holds the application's input/output data files and runs the map and reduce programs.
Client: an application launcher that creates a MapReduce job from the application-specific input data files and map and reduce programs.
Hadoop launches the application from the client program: it splits the data files into input chunks and assigns the chunks and programs to DataNodes.


Hadoop ecosystem

Components in the Hadoop ecosystem
The Apache Hadoop project has two core components:
1. the file store, called the Hadoop Distributed File System (HDFS)
2. the programming framework, called MapReduce
Other components:
1. Hadoop Streaming: a utility that enables MapReduce code in any language: C, Perl, Python, C++, Bash, etc. The bundled examples include a Python mapper and an AWK reducer.
2. Hive and Hue: Hive converts SQL into MapReduce jobs; Hue provides a browser-based graphical interface for doing Hive work.
3. Pig: a higher-level programming environment for MapReduce coding. The Pig language is called Pig Latin.
4. Sqoop: provides bi-directional data transfer between Hadoop and relational databases.

5. Oozie: manages Hadoop workflows.
6. HBase: a super-scalable key-value store. It works very much like a persistent hash map.
7. FlumeNG: a real-time loader for streaming data into Hadoop.
8. Whirr: cloud provisioning for Hadoop. You can start a cluster in just a few minutes with a very short configuration file.
9. Mahout: machine learning for Hadoop; used for predictive analytics and other advanced analysis.
10. Fuse: makes HDFS look like a regular file system, so you can use ls, rm, cd, and other commands on HDFS data.
11. ZooKeeper: manages synchronization for the cluster.

Get started with Hadoop
Cloud computing and big data lab. Lab tasks:
1. Create a private cloud with Ubuntu virtual machines
2. Install and test Hadoop: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
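As a rough orientation for the lab, the Hadoop 1.x single-node (pseudo-distributed) setup only needs a few properties before formatting HDFS and starting the daemons. The values below are a minimal sketch assuming the defaults from the Hadoop 1.2.1 single-node documentation (localhost, ports 9000 and 9001, replication factor 1); adjust the hostnames and ports for your own virtual machines, and note that later Hadoop releases based on YARN use different property names.

```xml
<!-- conf/core-site.xml: where clients and DataNodes find the NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: a single node cannot hold three replicas of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: where TaskTrackers find the JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

After editing these files, the sequence in the 1.x documentation is to format the NameNode and start the daemons with the scripts in bin/ (hadoop namenode -format, then start-all.sh), and then verify the installation by running the word count example from the tutorial linked above.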

Hadoop business
For executives: Hadoop is an Apache open-source software project designed to get value from the incredible volume, velocity, and variety of data about your organization. Use the data instead of throwing most of it away.
For technical managers: an open-source suite of software that mines the structured and unstructured Big Data about your company. It integrates with your existing Business Intelligence ecosystem.
Legal: an open-source suite of software that is packaged and supported by multiple suppliers. Please see the Resources section regarding IP indemnification.
Engineering: a massively parallel, shared-nothing, Java-based map-reduce execution environment. Think hundreds to thousands of computers working on the same problem, with built-in failure resilience. Projects in the Hadoop ecosystem provide data loading, higher-level languages, automated cloud deployment, and other capabilities.
Security: a Kerberos-secured software suite.