Research Laboratory. Java Web Crawler & Hadoop MapReduce. Anri Morchiladze & Bachana Dolidze. Supervisor: Nodar Momtselidze


1. Java Web Crawler: Description, Java Code
2. MapReduce: Overview, Example of a MapReduce program, Code & Run, Walk-Through

class Mapper
    method Map(docid a, doc d)
        for all term t in doc d do
            Emit(term t, count 1)

class Reducer
    method Reduce(term t, counts [c1, c2, ...])
        sum <- 0
        for all count c in counts [c1, c2, ...] do
            sum <- sum + c
        Emit(term t, count sum)

Crawler A web crawler (also known as a web spider or web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering.

Description Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a web site, such as checking links or validating HTML code. In addition, crawlers can be used to gather specific types of information from web pages, such as harvesting e-mail addresses (usually for spam).

Our Aim 1. Our aim is to search for a specified word in web pages. 2. We used the Jsoup library (http://jsoup.org/) and its commands for this.

Java Code 1. Using Eclipse. 2. Add the Jsoup library to the project. 3. You can change the web site URL or the search word in this program and see the results. 4. The results are kept in a file (see the sketch below).
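The crawler source itself is not reproduced in these slides; the following is only a minimal sketch of the Jsoup-based approach described above. The start URL, search word, output file name and page limit are illustrative placeholders, not the project's actual values.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SimpleCrawler {

    public static void main(String[] args) throws IOException {
        String startUrl = "http://example.com/";   // placeholder start page
        String searchWord = "hadoop";              // placeholder search word
        int maxPages = 50;                         // stop after this many pages

        Set<String> visited = new HashSet<>();
        Queue<String> frontier = new LinkedList<>();
        frontier.add(startUrl);

        try (PrintWriter out = new PrintWriter(new FileWriter("results.txt"))) {
            while (!frontier.isEmpty() && visited.size() < maxPages) {
                String url = frontier.poll();
                if (!visited.add(url)) {
                    continue; // already crawled
                }
                try {
                    // Fetch and parse the page with Jsoup
                    Document doc = Jsoup.connect(url).get();

                    // Record the page if its visible text contains the word
                    if (doc.body().text().toLowerCase().contains(searchWord)) {
                        out.println(url);
                    }

                    // Enqueue absolute links found on the page
                    for (Element link : doc.select("a[href]")) {
                        String next = link.attr("abs:href");
                        if (next.startsWith("http") && !visited.contains(next)) {
                            frontier.add(next);
                        }
                    }
                } catch (IOException e) {
                    // Skip pages that fail to download or parse
                }
            }
        }
    }
}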

MapReduce Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.

Overview The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master. The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. Input and Output types of a MapReduce job: (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Example Before we jump into the details, let's walk through an example MapReduce application to get a flavour for how they work. WordCount is a simple application that counts the number of occurrences of each word in a given input set. This works with a local standalone, pseudo-distributed or fully-distributed Hadoop installation (Single Node Setup).
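The WordCount.java listing is not included in these slides. For reference, here is a sketch along the lines of the classic WordCount v1.0 from the Hadoop MapReduce tutorial (old org.apache.hadoop.mapred API); the line numbers quoted in the walk-through below refer to the original tutorial listing and match this sketch only approximately.

package org.myorg;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Mapper: split each input line into tokens and emit (word, 1)
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts for each word
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit the job and wait for it to finish
        JobClient.runJob(conf);
    }
}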

Code & Run Usage: Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:
$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
Assuming that:
/usr/joe/wordcount/input - input directory in HDFS
/usr/joe/wordcount/output - output directory in HDFS

Code & Run Sample text files as input:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Output
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

Walk-Through The WordCount application is quite straightforward. The Mapper implementation (lines 14-26), via the map method (lines 18-25), processes one line at a time, as provided by the specified TextInputFormat (line 49). It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of < <word>, 1>.
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

Walk-Through WordCount also specifies a combiner (line 46). Hence, the output of each map is passed through the local combiner (which is the same as the Reducer, as per the job configuration) for local aggregation, after being sorted on the keys.
The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>

Walk-Through The Reducer implementation (lines 28-36), via the reduce method (lines 29-35), just sums up the values, which are the occurrence counts for each key (i.e. the words in this example). Thus the output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
The run method specifies various facets of the job, such as the input/output paths (passed via the command line), key/value types, input/output formats etc., in the JobConf. It then calls JobClient.runJob (line 55) to submit the job and monitor its progress.

Electrical Consumption
Year  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec  Avg
1979   23   23    2   43   24   25   26   26   26   26   25   26   25
1980   26   27   28   28   28   30   31   31   31   30   30   30   29
1981   31   32   32   32   33   34   35   36   36   34   34   34   34
1984   39   38   39   39   39   41   42   43   40   39   38   38   40
1985   38   39   39   39   39   41   41   41   00   40   39   39   45

If the above data is given as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on. This is a walkover for programmers when the number of records is small: they simply write the logic to produce the required output and pass the data to the application they have written. But think of the data representing the electrical consumption of all the large-scale industries of a particular state since its formation. When we write applications to process such bulk data, they take a lot of time to execute, and there is heavy network traffic when we move the data from its source to the network server, and so on. To solve these problems, we have the MapReduce framework; a possible mapper and reducer for the maximum-usage question are sketched below.
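The following is only a sketch in the same old-API style as the WordCount example above; the class names are illustrative and the JobConf driver is omitted because it mirrors the WordCount driver. It assumes each input line holds a year followed by twelve monthly readings and an average, separated by whitespace: the mapper emits the largest monthly reading for each year, and the reducer keeps the maximum per key.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxUsage {

    // Mapper: for each line "year jan feb ... dec avg",
    // emit (year, largest monthly reading on that line).
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            String year = tok.nextToken();
            int max = Integer.MIN_VALUE;
            for (int month = 0; month < 12 && tok.hasMoreTokens(); month++) {
                max = Math.max(max, Integer.parseInt(tok.nextToken()));
            }
            output.collect(new Text(year), new IntWritable(max));
        }
    }

    // Reducer: keep the largest value seen for each year.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text year, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int max = Integer.MIN_VALUE;
            while (values.hasNext()) {
                max = Math.max(max, values.next().get());
            }
            output.collect(year, new IntWritable(max));
        }
    }
}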

Thank you for your attention!