Data Science Analytics & Research Centre
Agenda

Big Data
- Overview
- Characteristics
- Applications & Use Cases

HDFS
- Hadoop Distributed File System (HDFS) Overview
- HDFS Architecture
- Data replication
- Node types
- JobTracker / TaskTracker
- HDFS Data Flows
- HDFS Limitations

Hadoop
- Hadoop Overview
- Inputs & Outputs
- Data Types
- What is MapReduce (MR)
- Example
- Functionalities of MR
- Speculative Execution
- Hadoop Streaming
- Hadoop Job Scheduling
Big Data
- Overview
- Characteristics
- Applications & Use Cases
- Data Footprint & Time Horizon
- Technology Adoption Lifecycle
[Figure: Data footprint & time horizon. Source data (core ERP & legacy applications & data warehouse; unstructured web / telemetry big data on Hadoop etc.) holds detailed events/facts at GB-to-PB scale over long horizons (3, 5, 10 years). It feeds analytic marts & cubes with hourly, daily, weekly, monthly and quarterly aggregates, and finally highly summarized visualization & dashboards consumed in real time or near real time. Consumption latency ranges from real time through daily and monthly to yearly.]
Financial Services
- Detect fraud
- Model and manage risk
- Improve debt recovery rates
- Personalize banking/insurance products

Healthcare
- Optimal treatment pathways
- Remote patient monitoring
- Predictive modeling for new drugs
- Personalized medicine

Retail
- In-store behavior analysis
- Cross-selling
- Optimize pricing, placement, design
- Optimize inventory and distribution
Web / Social / Mobile
- Location-based marketing
- Social segmentation
- Sentiment analysis
- Price comparison services

Government
- Reduce fraud
- Segment populations, customize actions
- Support open data initiatives
- Automate decision making

Manufacturing
- Design to value
- Crowd-sourcing
- Digital factory for lean manufacturing
- Improve service via product sensor data
Hadoop Distributed File System (HDFS)
- Overview
- HDFS Architecture
- Data replication
- Node types
- JobTracker / TaskTracker
- HDFS Data Flows
- HDFS Limitations
HDFS is Hadoop's own implementation of a distributed file system. It is coherent, provides the usual facilities of a file system, implements ACLs, and supports a subset of the usual UNIX commands for accessing and querying the filesystem.

- Large block size: the default is 64 MB, with 128 MB recommended, so that seek time stays small relative to data transfer over the network. Very large files are therefore the ideal workload; see the sketch after this list.
- Streaming data access: a write-once, read-many-times architecture. Since files are large, total time to read the data matters more than the seek time to the first record.
- Commodity hardware: HDFS is designed to run on commodity hardware that may fail, and it is capable of handling such failures.

E.g., a 420 MB file with a 128 MB block size is split as: 128 MB + 128 MB + 128 MB + 36 MB.
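As a quick illustration of the block arithmetic above, here is a minimal Java sketch; the helper method and constants are ours for illustration, not part of the HDFS API:

    public class BlockSplit {
        // Hypothetical helper: lay out a file of fileSize bytes into
        // fixed-size blocks (all full-size except possibly the last one).
        static long[] splitIntoBlocks(long fileSize, long blockSize) {
            int n = (int) ((fileSize + blockSize - 1) / blockSize); // ceiling division
            long[] blocks = new long[n];
            for (int i = 0; i < n; i++) {
                long offset = (long) i * blockSize;
                blocks[i] = Math.min(blockSize, fileSize - offset);
            }
            return blocks;
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024L;
            // 420 MB file, 128 MB blocks -> prints 128, 128, 128, 36 (in MB)
            for (long b : splitIntoBlocks(420 * mb, 128 * mb)) {
                System.out.println(b / mb + " MB");
            }
        }
    }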
[Figure: Data replication. A client issues Create/Complete calls for File 1 (blocks B1, B2, B3) to the Namenode; each block is stored on three of the Datanodes n1-n4, which are spread across Rack 1, Rack 2, and Rack 3, so that replicas of a block land on more than one rack.]
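The replication factor is a per-file attribute. Below is a minimal sketch of setting it through the org.apache.hadoop.fs.FileSystem API, assuming a reachable cluster; the file path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // Create a file with an explicit replication factor of 3,
            // a 4 KB write buffer, and a 128 MB block size.
            Path file = new Path("/user/demo/file1.txt"); // hypothetical path
            FSDataOutputStream out =
                fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024);
            out.writeUTF("hello hdfs");
            out.close();

            // The replication factor can also be changed after the fact.
            fs.setReplication(file, (short) 2);
            fs.close();
        }
    }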
HDFS Data Flows
- Read flow: the client asks the Namenode for the block locations of a file, then streams each block directly from the nearest Datanode holding a replica.
- Write flow: the client asks the Namenode to allocate blocks, then streams data through a pipeline of Datanodes, one per replica.
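A minimal sketch of both flows through the FileSystem API (the path is hypothetical): the client code looks like ordinary stream I/O, while the framework handles the Namenode lookups and Datanode pipelines underneath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/user/demo/flow.txt"); // hypothetical path

            // Write flow: the Namenode allocates blocks; bytes are pushed
            // through a Datanode replication pipeline as we write.
            FSDataOutputStream out = fs.create(p);
            out.writeBytes("line 1\nline 2\n");
            out.close();

            // Read flow: the Namenode returns block locations; bytes are
            // streamed directly from the closest Datanode for each block.
            FSDataInputStream in = fs.open(p);
            BufferedReader r = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
            r.close();
            fs.close();
        }
    }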
HDFS shell commands (1 of 2):

- cat: copies source paths to stdout. Syntax: hadoop dfs -cat URI [URI ...]
- chgrp: changes the group association of files; with -R, makes the change recursively through the directory structure. Syntax: hadoop dfs -chgrp [-R] GROUP URI [URI ...]
- chmod: changes the permissions of files; with -R, recursive. Syntax: hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
- chown: changes the owner of files; with -R, recursive. Syntax: hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
- copyFromLocal: similar to put, except that the source is restricted to a local file reference. Syntax: hadoop dfs -copyFromLocal <localsrc> URI
- copyToLocal: similar to get, except that the destination is restricted to a local file reference. Syntax: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
- cp: copies files from source to destination. Syntax: hadoop dfs -cp URI [URI ...] <dest>
- du: displays the aggregate length of files contained in a directory, or the length of a file if the path is just a file. Syntax: hadoop dfs -du URI [URI ...]
- dus: displays a summary of file lengths. Syntax: hadoop dfs -dus <args>
- expunge: empties the trash. Syntax: hadoop dfs -expunge
- get: copies files to the local file system. Syntax: hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>
- getmerge: concatenates the files in source into a single destination local file. Syntax: hadoop dfs -getmerge <src> <localdst> [addnl]
- ls (or lsr): for a file, returns stat on the file; for a directory, returns the list of its direct children (lsr recurses). Syntax: hadoop dfs -ls <args>
HDFS shell commands (2 of 2):

- mkdir: takes path URIs as arguments and creates directories. Syntax: hadoop dfs -mkdir <paths>
- moveFromLocal: similar to put, except that the local source is deleted after it is copied. Syntax: hadoop dfs -moveFromLocal <localsrc> <dst>
- mv: moves files from source to destination. Syntax: hadoop dfs -mv URI [URI ...] <dest>
- put: copies a single src, or multiple srcs, from the local file system to the destination filesystem. Syntax: hadoop dfs -put <localsrc> ... <dst>
- rm (or rmr): deletes files specified as args; rm only deletes files and empty directories, rmr deletes recursively. Syntax: hadoop dfs -rm URI [URI ...]
- setrep: changes the replication factor of a file; the -R option recursively changes the replication factor of files within a directory. Syntax: hadoop dfs -setrep [-R] <rep> <path>
- stat: returns the stat information on the path. Syntax: hadoop dfs -stat URI [URI ...]
- tail: displays the last kilobyte of the file to stdout; with -f, follows the file as it grows. Syntax: hadoop dfs -tail [-f] URI
- test: checks a path: -e, whether the file exists; -z, whether the file is zero length; -d, whether the path is a directory. Syntax: hadoop dfs -test -[ezd] URI
- text: takes a source file and outputs the file in text format. Syntax: hadoop dfs -text <src>
- touchz: creates a file of zero length. Syntax: hadoop dfs -touchz URI [URI ...]
HDFS Limitations
- Low-latency data access: HDFS is not optimized for low-latency access; it trades latency for higher data throughput.
- Lots of small files: with a 64 MB block size, many small files waste blocks and inflate the Namenode's memory requirements, since the Namenode holds every file's and block's metadata in memory (see the rough calculation below).
- Multiple writers and arbitrary modification: HDFS has no support for multiple concurrent writers; a file is written by a single writer, and writes only append at the end of the file.
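To see why small files hurt, here is a back-of-the-envelope Java sketch. The commonly cited rule of thumb is on the order of 150 bytes of Namenode heap per namespace object (file, directory, or block); treat that figure as an assumption, not an exact number:

    public class SmallFilesCost {
        public static void main(String[] args) {
            long bytesPerObject = 150;    // assumed rule of thumb, not exact
            long files = 10_000_000L;     // ten million small files
            // Each small file costs at least two namespace objects:
            // one file entry plus one block entry.
            long heap = files * 2 * bytesPerObject;
            System.out.printf("~%.1f GB of Namenode heap%n", heap / 1e9);
            // For comparison: the same data packed into large files with
            // full 64 MB blocks needs orders of magnitude fewer objects.
        }
    }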
Hadoop
- Overview
- Inputs & Outputs
- Data Types
- What is MR
- Example
- Functionalities of MR
- Speculative Execution
- How Hadoop runs MR
- Hadoop Streaming
- Hadoop Job Scheduling
Hadoop is a framework that provides open-source libraries for distributed computing using a simple MapReduce interface and its own distributed filesystem, HDFS. It facilitates scalability and takes care of detecting and handling failures.
Hadoop release series:
- 1.0.X: current stable version, 1.0 release
- 1.1.X: current beta version, 1.1 release
- 2.X.X: current alpha version
- 0.23.X: similar to 2.X.X but missing NameNode HA
- 0.22.X: does not include security
- 0.20.203.X: old legacy stable version
- 0.20.X: old legacy version
- Risk Modeling: how a business or industry can better understand its customers and market.
- Customer Churn Analysis: why companies really lose customers.
- Recommendation Engine: how to predict customer preferences.
- Ad Targeting: how to increase campaign efficiency.
- Point-of-Sale Transaction Analysis: targeting promotions to make customers buy.
- Predicting Network Failure: using machine-generated data to identify trouble spots.
- Threat Analysis: detecting threats and fraudulent activity.
- Trade Surveillance: helping businesses spot the rogue trader.
- Search Quality: delivering more relevant search results to customers.
What is MapReduce (MR)
- A framework introduced by Google.
- Processes vast amounts of data (multi-terabyte data-sets) in parallel.
- Achieves high performance on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- Splits the input data-set into independent chunks.
- Sorts the outputs of the maps, which are then input to the reduce tasks.
- Takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
The MapReduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

Input and output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine/sort -> <k2, List(v2)> -> reduce -> <k3, v3> (output)
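As a small illustration, the <k2, v2> and <k3, v3> types are declared on the job configuration. This sketch uses the classic org.apache.hadoop.mapred API that the later WordCount example also uses; the class and job name are placeholders:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    public class TypeSetup {
        public static void main(String[] args) {
            JobConf conf = new JobConf(TypeSetup.class);
            conf.setJobName("type-demo"); // hypothetical job name

            // <k2, v2>: intermediate map output types must be Writable,
            // and the key must be WritableComparable so it can be sorted.
            conf.setMapOutputKeyClass(Text.class);
            conf.setMapOutputValueClass(IntWritable.class);

            // <k3, v3>: final reduce output types.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
        }
    }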
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Hadoop has the Writable interface to support serialization. The following predefined implementations of WritableComparable are available:
1. IntWritable
2. LongWritable
3. DoubleWritable
4. VLongWritable: variable size, stores as much as needed (1-9 bytes of storage).
5. VIntWritable: less used, as its range is covered by VLongWritable.
6. BooleanWritable
7. FloatWritable
8. BytesWritable
9. NullWritable
10. MD5Hash
11. ObjectWritable
12. GenericWritable

Apart from the above, there are four Writable collection types:
1. ArrayWritable
2. TwoDArrayWritable
3. MapWritable
4. SortedMapWritable
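When none of the predefined types fits, a custom type can implement WritableComparable itself. A minimal sketch follows; the PairWritable name and fields are ours, for illustration only:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key: a pair of longs, usable as a map output key.
    public class PairWritable implements WritableComparable<PairWritable> {
        private long first;
        private long second;

        public PairWritable() {}                   // required no-arg constructor

        public PairWritable(long first, long second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(first);                  // serialize fields in order
            out.writeLong(second);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first = in.readLong();                 // deserialize in the same order
            second = in.readLong();
        }

        @Override
        public int compareTo(PairWritable o) {     // defines the framework's sort order
            int c = Long.compare(first, o.first);
            return c != 0 ? c : Long.compare(second, o.second);
        }
    }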
WordCount Mapper and Reducer (type flow: <K1, V1> -> <K2, V2> -> <K2, List(V2)> -> <K3, V3>)

Mapper class:

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);          // emit <word, 1> for every token
        }
    }

Reducer class:

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();         // sum the counts for this word
        }
        output.collect(key, new IntWritable(sum));
    }
Sample input:

    $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
    Hello World Bye World
    $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
    Hello Hadoop Goodbye Hadoop

Run the application:

    $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Mapper implementation (lines 18-25 of WordCount.java):
The first map emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
The second map emits: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Combiner implementation (line 46):
Output of the first map: <Bye, 1> <Hello, 1> <World, 2>
Output of the second map: <Goodbye, 1> <Hadoop, 2> <Hello, 1>

Reducer implementation (lines 29-35):
Output of the job: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
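For completeness, here is a sketch of the driver that the run command above invokes, reconstructed along the lines of the classic WordCount v1.0 tutorial (treat it as a reconstruction, not the slide author's exact code). It assumes the mapper and reducer from the previous slide are present as inner classes named Map and Reduce:

    package org.myorg;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCount {
        // static inner classes Map and Reduce from the previous slide go here

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);          // <k3>
            conf.setOutputValueClass(IntWritable.class); // <v3>

            conf.setMapperClass(Map.class);              // mapper shown earlier
            conf.setCombinerClass(Reduce.class);         // combiner reuses the reducer
            conf.setReducerClass(Reduce.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }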
Speculative Execution
- A way of coping with varying individual machine performance: the same input can be processed multiple times in parallel, to exploit differences in machine capabilities.
- The Hadoop platform schedules redundant copies of the remaining tasks across several nodes that do not have other work to perform.

Configuration properties (both default to true):
- mapred.map.tasks.speculative.execution: if true, multiple instances of some map tasks may be executed in parallel.
- mapred.reduce.tasks.speculative.execution: if true, multiple instances of some reduce tasks may be executed in parallel.
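Both flags can also be set per job. A small sketch using the old-API JobConf setters (the job class is a placeholder):

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculationConfig {
        public static void main(String[] args) {
            JobConf conf = new JobConf(SpeculationConfig.class);

            // Convenience setters for the two properties in the table above.
            conf.setMapSpeculativeExecution(true);
            conf.setReduceSpeculativeExecution(false); // e.g. reducers with side effects

            // Equivalent to setting the raw property name directly:
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        }
    }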
Hadoop Streaming
- A utility that comes with the Hadoop distribution.
- Allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
        -reducer /bin/wc \
        -jobconf mapred.reduce.tasks=2
Hadoop Job Scheduling

Default Scheduler
- A single, priority-based queue of jobs.
- Scheduling tries to balance the map and reduce load on all TaskTrackers in the cluster.

Capacity Scheduler
- Within a queue, jobs with higher priority get access to the queue's resources before jobs with lower priority.
- To prevent one or more users from monopolizing a queue's resources, each queue enforces a limit on the percentage of resources allocated to a single user at any given time, if there is competition for them.

Fair Scheduler
- Multiple queues (pools) of jobs, sorted FIFO or by fairness limits.
- Each pool is guaranteed a minimum capacity, and excess capacity is shared among all jobs using a fairness algorithm.
- The scheduler tries to ensure that, over time, all jobs receive roughly the same share of resources.
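The scheduler is a pluggable JobTracker component. Below is a sketch of selecting the Fair Scheduler programmatically; in practice this property normally lives in mapred-site.xml, the property names follow the Hadoop 1.x fair-scheduler docs, and the allocation-file path is hypothetical, so treat the exact strings as assumptions for your version:

    import org.apache.hadoop.conf.Configuration;

    public class SchedulerConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Plug the Fair Scheduler into the JobTracker
            // (normally placed in mapred-site.xml; shown here for illustration).
            conf.set("mapred.jobtracker.taskScheduler",
                     "org.apache.hadoop.mapred.FairScheduler");

            // Pool allocations (minimum shares, etc.) live in a separate file.
            conf.set("mapred.fairscheduler.allocation.file",
                     "/etc/hadoop/fair-scheduler.xml"); // hypothetical path
        }
    }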
Thank you!!

Data Science Analytics & Research Centre