CS242 PROJECT Presented by Moloud Shahbazi Spring 2015
AGENDA Project Overview Data Collection Indexing Big Data Processing
PROJECT - PART 1
1.1 Data Collection: 5 GB < data size < 10 GB
  Deliverables:
  - Document: each partner's contribution, design, obstacles, limitations
  - Implementation: runnable code + usage instructions
1.2 Indexing
  Deliverables:
  - Document: index fields, text analyzer choices, runtime of indexing, limitations
  - Implementation: runnable code + usage instructions
PROJECT - PART 2
2.1 Hadoop
  - Description of the Hadoop jobs
  - Indexes built by Hadoop
  - Runtime of index construction
  - Ranking with Hadoop
  - Obstacles, limitations
  - Code + usage instructions
2.2 Web Interface (minimal is enough)
  - Option for Lucene/Hadoop
  - Integration with the Hadoop index
  - Lucene QueryParser integration
  - Limitations
  - Code + usage instructions
  - Demo screenshots
DATA COLLECTION RESOURCES
Shiwen's slides for CS172: http://www.cs.ucr.edu/~schen064/cs172/
  Covers robots.txt, jsoup, threading, XML, JSP and Servlet examples.
Twitter Streaming API: https://dev.twitter.com/docs/streaming-apis
Twitter Blog: https://blog.twitter.com/2013/drinking-from-the-streaming-api
Hosebird Client (hbc): https://github.com/twitter/hbc
Twitter4J: http://twitter4j.org/en/index.html
  Examples: http://twitter4j.org/en/code-examples.html#streami
APACHE LUCENE
http://lucene.apache.org/core/
http://lucene.apache.org/pylucene/
http://www.lucenetutorial.com/
Version 5.0.0 demo: https://lucene.apache.org/core/5_0_0/demo/src-html/org/apache/lucene/demo/IndexFiles.html
Note: use only Lucene Core or PyLucene; Solr is not acceptable.
LUCENE
Definition: a Java library for full-text indexing and search.
Concepts:
  - Document: the unit of indexing and search, e.g., a web page, tweet, or image.
  - Field: a named piece of a Document (e.g., title, body).
Functionality:
  - Indexing: create a Document from the data and add it with an IndexWriter.
  - Searching: create a Query and run it with an IndexSearcher.
CONSTRUCTING LUCENE INDEX
http://lucene.apache.org/core/5_0_0/core/org/apache/lucene/index/IndexWriter.html
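A minimal sketch of opening an on-disk index with Lucene 5.0; the index path and the choice of StandardAnalyzer are illustrative, and the statements assume a surrounding method that may throw IOException:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Open (or create) an index directory on disk and obtain an IndexWriter for it.
    Directory dir = FSDirectory.open(Paths.get("index"));           // "index" is an example path
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setOpenMode(OpenMode.CREATE_OR_APPEND);                   // create new or append to an existing index
    IndexWriter writer = new IndexWriter(dir, config);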
ADD NEW DOCUMENT
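Continuing from the IndexWriter above, a sketch of building a Document and adding it to the index; the field names and values are illustrative:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    Document doc = new Document();
    doc.add(new StringField("id", "42", Field.Store.YES));                 // stored, not tokenized (e.g., a tweet id)
    doc.add(new TextField("body", "example tweet text", Field.Store.YES)); // analyzed full-text field
    writer.addDocument(doc);
    writer.close();   // commit and release the index lock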
SEARCH LUCENE INDEX
https://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/IndexSearcher.html
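A sketch of querying the index with IndexSearcher and the classic QueryParser (from the lucene-queryparser module); the field name and query string are illustrative:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    Query query = new QueryParser("body", new StandardAnalyzer()).parse("big data");
    TopDocs hits = searcher.search(query, 10);                 // top 10 results
    for (ScoreDoc sd : hits.scoreDocs) {
        Document d = searcher.doc(sd.doc);                     // retrieve the stored fields
        System.out.println(sd.score + "  " + d.get("id"));
    }
    reader.close();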
QUERY SEMANTICS
http://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/query.html
http://www.lucenetutorial.com/lucene-query-syntax.html
Keyword matching, wildcard matching, proximity matching, range search, boosts
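A few example query strings in Lucene's query syntax; the field names are illustrative, see the tutorial above for the full grammar:

    hadoop lucene                       keyword matching
    hado*                               wildcard matching
    "big data"~5                        proximity matching (terms within 5 positions)
    date:[20150101 TO 20150401]         range search on a field
    hadoop^2 lucene                     boost the term "hadoop"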
APACHE HADOOP RESOURCES
http://hadoop.apache.org/docs/current/index.html
http://hadooptutorials.co.in/tutorials/hadoop/hadoop-fundamentals.html
WHAT IS IT? https://hadoop.apache.org/#what+is+apache+hadoop%3f
BASE COMPONENTS
HDFS: fault-tolerant storage for big data
  - Replication on multiple machines
  - Parallel I/O
  - I/O balancing
MapReduce: distributed system for processing large quantities of data
  - Parallel processing
  - Data-aware job scheduling
  - Failure handling (tracking/restarting)
HOW?
Data is initially divided into directories and files. Files are split into uniformly sized blocks of 64 MB or 128 MB (preferably 128 MB), and the blocks are distributed across the cluster nodes for processing. HDFS, which sits on top of each node's local file system, supervises this processing and replicates blocks to handle hardware failures. The framework also takes care of:
  - Checking that the code executed successfully
  - Performing the sort that takes place between the map and reduce stages
  - Sending the sorted data to a particular machine
  - Writing the debugging logs for each job
HDFS (diagram: where blocks are stored and how they are tracked)
HDFS COMMUNICATIONS
HDFS COMMAND LINE
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
hadoop fs [-fs <local file system URI>] [-conf <configuration file>] [-D <property=value>]
  [-ls <path>] [-lsr <path>] [-du <path>] [-dus <path>] [-mv <src> <dst>] [-cp <src> <dst>]
  [-rm <src>] [-rmr <src>] [-put <localsrc> ... <dst>] [-copyFromLocal <localsrc> ... <dst>]
  [-moveFromLocal <localsrc> ... <dst>] [-get [-ignoreCrc] [-crc] <src> <localdst>]
  [-getmerge <src> <localdst> [addnl]] [-cat <src>] [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
  [-moveToLocal <src> <localdst>] [-mkdir <path>] [-report] [-setrep [-R] [-w] <rep> <path/file>]
  [-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>] [-tail [-f] <path>] [-text <path>]
  [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...] [-chown [-R] [OWNER][:[GROUP]] PATH...]
  [-chgrp [-R] GROUP PATH...] [-count [-q] <path>] [-help [cmd]]
WORKING WITH HDFS
Start here:      hadoop dfs -ls /users/eruiz009
Copy to HDFS:    hadoop dfs -copyFromLocal tweets.txt /users/eruiz009
Check file:      hadoop dfs -ls /users/eruiz009/tweets.txt
Tail:            hadoop dfs -tail /users/eruiz009/tweets.txt
Move to local:   hadoop dfs -copyToLocal /users/eruiz009/tweets.txt .
Remove:          hadoop dfs -rmr /users/eruiz009/tweets.txt
HDFS JAVA API
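A sketch of the same workflow through the HDFS Java API; the paths are illustrative, and Configuration picks up the cluster settings (core-site.xml) from the classpath:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);                     // the default (distributed) file system
    Path local = new Path("tweets.txt");                      // local file to upload
    Path remote = new Path("/users/eruiz009/tweets.txt");     // destination in HDFS
    fs.copyFromLocalFile(local, remote);
    // Read the file back from HDFS, one line at a time
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(remote)));
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
    in.close();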
HADOOP MAP-REDUCE
MAPREDUCE EXAMPLE
CREATING A MAP-REDUCE JOB
Input and output types of a MapReduce job:
  (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
The right level of parallelism for maps is around 10-100 maps per node, and each map takes one block:
  if file size = 5 GB and block size = 128 MB, then #blocks >= 5 GB / 128 MB = 40
Recommended file size on HDFS: slightly less than the block size.
OUR HADOOP CLUSTER
Server 1: NameNode, Secondary NameNode, JobTracker
Servers 2 to 6: DataNode, TaskTracker
CREATING A MAP-REDUCE JOB USING JAVA API (WORD COUNT EXAMPLE) https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
CUSTOMIZING A MAP-REDUCE JOB
  - Data types for keys and values
  - Input format
  - Output format
  - Partitioning of the mapper output
  - Combiners: process mapper output in memory before it is sent to the reducers
These can all be set on the Job object, as shown in the sketch below.
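A sketch of these settings through the org.apache.hadoop.mapreduce Job API; every call is optional (the classes shown are the framework defaults), and the intermediate types must match the Mapper/Reducer signatures:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    Job job = Job.getInstance(new Configuration(), "custom job");
    job.setInputFormatClass(TextInputFormat.class);        // how input files are split into records
    job.setOutputFormatClass(TextOutputFormat.class);      // how reducer output is written
    job.setPartitionerClass(HashPartitioner.class);        // which reducer each intermediate key goes to
    job.setMapOutputKeyClass(Text.class);                  // intermediate key/value types
    job.setMapOutputValueClass(IntWritable.class);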
MAPPER
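A minimal word-count Mapper in the org.apache.hadoop.mapreduce API, following the tutorial linked above; the class name is illustrative:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (token, 1) for every token in the input line
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }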
REDUCER
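A matching word-count Reducer sketch; again, the class name is illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts emitted by the mappers (and by the combiner, if one is configured)
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }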
JOB CONFIGURATION & RUN
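A sketch of the driver that configures and runs the job, assuming the WordCountMapper and WordCountReducer classes above; the combiner simply reuses the reducer, and input/output paths come from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);   // combiner reuses the reducer
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory or file
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }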
RUNNING
The job can read from local data storage (not using HDFS) or from HDFS; example invocations are sketched below.
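Assuming the job is packaged as wordcount.jar with the WordCount driver class above (the jar name and paths are illustrative):

    # local file system (not using HDFS): pass file:// URIs for input and output
    hadoop jar wordcount.jar WordCount file:///home/user/tweets file:///home/user/wc-output
    # HDFS: pass paths on the default (distributed) file system
    hadoop jar wordcount.jar WordCount /users/eruiz009/tweets.txt /users/eruiz009/wc-output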
DEBUGGING A MAP-REDUCE JOB
Use the log files created by Hadoop:
  /var/log/hadoop/job_*.xml                 -> job configuration XML logs
  /var/log/hadoop/userlogs/attempt_*/stderr -> standard error logs
  /var/log/hadoop/userlogs/attempt_*/stdout -> standard output logs
  /var/log/hadoop/userlogs/attempt_*/syslog -> log4j logs
Refer to stderr for uncaught exceptions, bad file paths, etc.
Note that the exact paths may vary with the cluster setup.
MAP-REDUCE FRIENDLY PATTERNS
Counting, summing, sorting: log analysis, data querying, ETL
Collating: inverted index, Facebook friends in common, ETL
Filtering, parsing and validation: log analysis, ETL, data validation, data querying
Distributed task execution: numerical analysis, performance testing
Not so basic: graph propagation, distinct values, cross-validation
ANTI-MAP-REDUCE PATTERNS
Too much caching
  Problem: out-of-memory errors. Solution: chain jobs, e.g., M1 -> R1 -> M2 -> R2 -> ...
Large input records
  Problem: a record cannot fit into memory. Solution: partition the record / catch the error.
Too many input files
  Problem: thousands of files strain the NameNode. Solution: combine the files into one large file.
Overwhelming external data sources
  It is possible to kill an external SQL server with requests from a large cluster.
MAP-REDUCE TIPS
Work with a small subset of the data during development.
Custom FileWriters must write to the job's working directory (PWD).
Watch out for malformed input and parsing errors.
Be particular about Hadoop's version and configuration when porting code across clusters.
Take care with dynamic memory allocation.
Questions?