
Does Hadoop only support Java? Surely we can't redesign everything from scratch; can Hadoop work with existing (legacy) software? Can Hadoop work with databases? Yes, the developers have heard this feedback... 30

Hadoop Programming (Part 2): Hadoop Subprojects. Hadoop Common: The common utilities that support the other Hadoop subprojects. HDFS: A distributed file system that provides high-throughput access to application data. MapReduce: A software framework for distributed processing of large data sets on compute clusters. 31

Hadoop Subprojects. Chukwa: A data collection system for managing large distributed systems. HBase: A scalable, distributed database that supports structured data storage for large tables. Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. Pig: A high-level data-flow language and execution framework for parallel computation. ZooKeeper: A high-performance coordination service for distributed applications. 32

The Hadoop Ecosystem. [Layer diagram: Pig, Chukwa, Hive, and HBase on top of MapReduce, HDFS, and ZooKeeper, all built on Hadoop Core (Hadoop Common), with Avro alongside.] Source: Hadoop: The Definitive Guide. 33

Avro Avro is a data serialization system. It provides: Rich data structures. A compact, fast, binary data format. A container file, to store persistent data. Remote procedure call (RPC). Simple integration with dynamic languages. For more detail, please check the official documentation: http://avro.apache.org/docs/current/ 34
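
As a quick illustration, here is a minimal Python sketch that writes and reads records through an Avro container file. The schema, field names, and file name are made up for this example, and the module layout shown (avro.schema.parse, DataFileWriter, DatumWriter) follows the classic avro Python package; check the documentation above for the version you actually install.

import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Hypothetical record schema, defined inline as JSON.
schema = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
    ],
}))

# Write records into a compact, binary Avro container file.
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": None})
writer.close()

# Read them back; the schema travels inside the file itself.
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()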

ZooKeeper http://hadoop.apache.org/zookeeper/ ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. 35
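
To make the idea concrete, here is a minimal sketch using kazoo, a third-party Python ZooKeeper client that is not part of these slides; the ensemble address and the znode paths are assumptions for illustration only.

from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is an assumption for this sketch).
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Publish a piece of shared configuration under a znode path.
zk.ensure_path('/myapp/config')
if zk.exists('/myapp/config/db_url'):
    zk.set('/myapp/config/db_url', b'jdbc:mysql://db1:3306/prod')
else:
    zk.create('/myapp/config/db_url', b'jdbc:mysql://db1:3306/prod')

# Any other client in the cluster can now read (and watch) the same value.
value, stat = zk.get('/myapp/config/db_url')
print(value.decode(), stat.version)

zk.stop()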

Pig http://hadoop.apache.org/pig/ Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties: ease of programming, optimization opportunities, and extensibility.

Hive http://hadoop.apache.org/hive/ Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files. HiveQL is based on SQL and enables users familiar with SQL to query this data.
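
As a hedged sketch of what HiveQL looks like from a client program, the example below uses PyHive, a third-party Python client for HiveServer2 that is not mentioned in the slides; the connection details and the page_views table are assumptions for illustration.

from pyhive import hive

# Connect to HiveServer2 (host, port, and database are assumptions).
conn = hive.Connection(host='localhost', port=10000, database='default')
cursor = conn.cursor()

# HiveQL looks like SQL; this aggregates page views per day from a table stored on HDFS.
cursor.execute("""
    SELECT view_date, COUNT(*) AS views
    FROM page_views
    GROUP BY view_date
""")
for view_date, views in cursor.fetchall():
    print(view_date, views)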

HBase HBase is a distributed column-oriented database built on top of HDFS. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate on top of the Hadoop distributed file system (HDFS) or Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability. Integrated into the Hadoop map-reduce platform and paradigm.
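
The following sketch shows the kind of row and column-family access HBase offers, using happybase, a third-party Python client that talks to HBase through its Thrift gateway; the gateway address, table name, and column family are assumptions for this example.

import happybase

# Connect through the HBase Thrift gateway (assumed to run on localhost:9090).
connection = happybase.Connection('localhost', port=9090)

# Create a table with one column family, then write and read a row.
if b'webtable' not in connection.tables():
    connection.create_table('webtable', {'contents': dict()})

table = connection.table('webtable')
table.put(b'com.example/index.html', {b'contents:html': b'<html>...</html>'})

row = table.row(b'com.example/index.html')
print(row[b'contents:html'])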

Chukwa http://hadoop.apache.org/chukwa/ Chukwa is an open source data collection system for monitoring large distributed systems. It is built on top of HDFS and the Map/Reduce framework, and includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results in order to make the best use of the collected data.

Mahout http://mahout.apache.org/ Mahout is a collection of scalable machine learning libraries implemented on top of Apache Hadoop using the map/reduce paradigm. Mahout currently includes collaborative filtering (user- and item-based recommenders), K-Means and Fuzzy K-Means clustering, Mean Shift clustering, and more.

Does Hadoop only support Java? Although the Hadoop framework is implemented in Java™, Map/Reduce applications need not be written in Java. Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Hadoop Pipes is a SWIG-compatible C++ API to implement Map/Reduce applications (non JNI™ based).

Hadoop Pipes (C++, Python) Hadoop Pipes allows C++ code to use Hadoop DFS and map/reduce. The C++ interface is "swigable" so that interfaces can be generated for Python and other scripting languages. For more detail, check the API documentation of org.apache.hadoop.mapred.pipes. You can also find example code at hadoop-*/src/examples/pipes. For the Pipes C++ WordCount example, see: http://wiki.apache.org/hadoop/c++wordcount 42

Hadoop Streaming Hadoop Streaming is a utility which allows users to create and run Map-Reduce jobs with any executables (e.g. Unix shell utilities) as the mapper and/or the reducer. It is useful when you need to run existing programs written in shell script, Perl, or even PHP. Note: both the mapper and the reducer are executables that read input from STDIN (line by line) and emit output to STDOUT. For more detail, check the official documentation of Hadoop Streaming.
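
For example, a word-count job can be written as two small Python scripts that follow exactly this STDIN/STDOUT contract; the file names mapper.py and reducer.py are just illustrative.

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from STDIN.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py: input lines arrive sorted by key, so counts for one word are adjacent.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The two scripts would be made executable (chmod a+x mapper.py reducer.py) and submitted much like the shell-script examples on the following slides, e.g. hadoop jar hadoop-streaming.jar -input input -output output -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py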

Running Hadoop Streaming
jazz@hadoop:~$ hadoop jar hadoop-streaming.jar help
Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \
       $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>                DFS input file(s) for the Map step
  -output   <path>                DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>   The streaming command to run
  -combiner <JavaClassName>       Combiner has to be a Java class
  -reducer  <cmd|JavaClassName>   The streaming command to run
  -file     <file>                File/dir to be shipped in the Job jar file
  -dfs      <h:p>|local           Optional. Override DFS configuration
  -jt       <h:p>|local           Optional. Override JobTracker configuration
  -additionalconfspec specfile    Optional.
  -inputformat  TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName  Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  More...

Hadoop Streaming Example (1)
hadoop:~$ hadoop fs -rmr input output
hadoop:~$ hadoop fs -put /etc/hadoop/conf input
hadoop:~$ hadoop jar hadoop-streaming.jar -input input -output output -mapper /bin/cat -reducer /usr/bin/wc
45

Hadoop Streaming Example (2)
hadoop:~$ echo "sed -e \"s/ /\n/g\" | grep ." > streamingmapper.sh
hadoop:~$ echo "uniq -c | awk '{print \$2 \"\t\" \$1}'" > streamingreducer.sh
hadoop:~$ chmod a+x streamingmapper.sh
hadoop:~$ chmod a+x streamingreducer.sh
hadoop:~$ hadoop fs -put /etc/hadoop/conf input
hadoop:~$ hadoop jar hadoop-streaming.jar -input input -output output -mapper streamingmapper.sh -reducer streamingreducer.sh -file streamingmapper.sh -file streamingreducer.sh
46