The Hadoop Eco System Shanghai Data Science Meetup


The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space

Overview
What is this talk about?
- Giving an overview of the Hadoop Ecosystem and related Apache projects
- Showing the architecture/functionality of some projects
- Illustrating the combination of different projects based on a simple example
The intention of this talk is to give an overview of the Hadoop Ecosystem for beginners.

Example
During this talk we will illustrate the usage of some components of the Hadoop Ecosystem based on the following web application: a web server serves a page with a search form, each search request is transmitted to the web server using AJAX (HTTP GET /), and the goal is to analyze the most frequent search terms entered in the web form.

Example
Data Storage and Communication
- Apache HTTP: Provides the basic website with the search form
- HDFS: Hadoop distributed filesystem for log data storage
- Flume: Connector between the Apache webserver and the Hadoop Ecosystem
- Kafka: Distributed messaging system
- HBase: NoSQL database for persistent storage
Data Analysis and Management
- Map/Reduce: Estimate frequent search terms
- Hive: Perform map/reduce jobs using a SQL-based query language
- Zookeeper: Centralized service for maintaining configuration information and synchronization

Example
Store web access logs for big data analysis: the Apache log files are copied to persistent storage, e.g.
$> scp /var/log/apache/ log

HDFS Hadoop Distributed File System
Supported operations: Write, Delete, Append, Read (no Update)
- Name node: Stores all metadata; clients issue metadata operations against it
- Data nodes: Store each HDFS block in one file and replicate blocks among each other; clients perform read/write operations directly against them
- Blocks: Default size 64 MB
http://hadoop.apache.org/
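As a quick illustration of the block size figure above, the number of blocks a file occupies is a simple ceiling division. The sketch below assumes the 64 MB default named on the slide; the class and method names are made up for illustration.

```java
public class BlockCount {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default block size

    // number of HDFS blocks needed for a file of the given size (ceiling division)
    static long blocksFor(long fileSizeBytes) {
        if (fileSizeBytes == 0) return 0;
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(blocksFor(200L * 1024 * 1024)); // a 200 MB file needs 4 blocks
    }
}
```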

Example Run HDFS
Setup Hadoop:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Format the filesystem:
$> /usr/local/bin/hdfs namenode -format
Start HDFS:
$> /usr/local/sbin/start-dfs.sh

Example
Store web access logs to HDFS for big data analysis: copy the log files from the web server, then load them into HDFS.
$> scp /var/log/apache/ log
$> hadoop fs -copyFromLocal log

Example
Simplify the movement of web access logs to HDFS: instead of copying by hand, the files should move from the Apache log into HDFS automatically.

Flume
Distributed service for collecting, aggregating, and moving large amounts of streaming event data. An agent receives events from a source (here: the Apache log), buffers them in a channel, and delivers them through a sink (here: to HDFS).
http://flume.apache.org/

Example Flume Configuration
agent.sources = mysource
agent.channels = mychannel
agent.sinks = mysink
agent.sources.mysource.type = avro
agent.sources.mysource.bind = localhost
agent.sources.mysource.port = 10000
agent.sources.mysource.channels = mychannel
agent.sinks.mysink.type = hdfs
agent.sinks.mysink.channel = mychannel
agent.sinks.mysink.hdfs.path = hdfs://localhost:9000/flume
agent.channels.mychannel.type = memory
agent.channels.mychannel.capacity = 100

Example Run Flume
Start Flume:
$> flume-ng agent --conf /usr/local/conf -f /usr/local/conf/flume-conf.properties -n agent
The agent receives AVRO events from the Flume client attached to the Apache log and forwards them through source, channel, and sink.

Example Run Flume
Pipe HTTP log entries to the Flume client by adding the following line in the Apache httpd configuration:
CustomLog "|flume-ng avro-client --conf /usr/local/conf -H localhost -p 10000" combined

Example
Result after a few search requests: a sequence file with the HTTP requests in HDFS.
/tmp/hadoop-user/dfs/data/current/BP-92512059-127.0.1.1-1446363295938/current/finalized/subdir0/subdir0/blk_1073741825
SEQ^F!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable^@^@^@^@^@^@<BA>^Hp<A0><FD> G^_5<95>[<9C><B7>Y?<B0>^@^@^@<A5>^@^@^@^H^@^@^AP<C1><F9><C1><9C>^@^@^@<99>::1 - - [01/Nov/2015:15:36:00 +0800] "GET / HTTP/1.1" 200 792 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.3.0"^@^@^@<A5>^@^@^@^H^@^@^AP<C1><F9><C1><9E>^@^@^@<99>::1 - - [01/Nov/2015:15:36:09 +0800] "GET / HTTP/1.1" 200 792 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.3.0

Example
Analyze the web access log data stored on HDFS to estimate the frequent search terms.

Map/Reduce
Main execution framework for distributed parallel data processing. 2 phases:
- Map: Map values to key/value pairs
- Reduce: Aggregate key/value pairs
In the example, the map and reduce phases run over the log data that Flume delivered to HDFS.
http://hadoop.apache.org/

Map/Reduce
What is map/reduce?
- Programming paradigm for processing large data sets across multiple servers
- Composed of a Map and a Reduce procedure
- Scalable and fault-tolerant
Data flows from (key, value) pairs through Map to intermediate key/value pairs, which are grouped by key and aggregated by Reduce into output (key, value) pairs.
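The map → group-by-key → reduce flow just described can be sketched in a few lines of plain Java. This toy word count runs in a single process and uses none of the Hadoop API, so it only illustrates the paradigm; all names are made up for illustration.

```java
import java.util.*;

public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit a (word, 1) pair for every word
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Shuffle phase: group the emitted values by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce phase: aggregate each key's values into a sum
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hadoop hdfs", "hadoop hive")));
        // {hadoop=2, hdfs=1, hive=1}
    }
}
```

In real Hadoop the three phases run on different machines and the shuffle moves data over the network; here they are just three loops.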

Map/Reduce Architecture
The input data is divided into input splits. The input format provides a reader for each split, which feeds a mapper process. The intermediate pairs are partitioned, shuffled, and sorted, then passed to the reducer processes, whose output format writes the output data. The driver defines the mapper, the reducer, and the input/output formats.

Map/Reduce Mapper Process
A mapper process can contain 3 parts:
- Mapper: Maps incoming key/value pairs to new key/value pairs
- Combiner: Combines key/values with the same key (a mini-reducer)
- Partitioner: Partitions key/value pairs among the reducer processes (default: hash partitioner)
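The default hash partitioner mentioned above routes each key with a simple modulo rule. The sketch below reproduces that logic in plain Java; the class and method names are illustrative, not Hadoop's.

```java
public class HashPartitionDemo {
    // route a key to one of numReduceTasks reducers, as a hash partitioner does
    static int partitionFor(String key, int numReduceTasks) {
        // mask the sign bit so negative hashCodes still yield a valid partition
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String k : new String[]{"hadoop", "hive", "hbase"})
            System.out.println(k + " -> reducer " + partitionFor(k, 3));
    }
}
```

The important property is determinism: the same key always lands on the same reducer, which is what lets the reduce phase see all values for a key together.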

Example Perform Map-Step
public static class Map extends Mapper<LongWritable, BytesWritable, > {
  public void map(LongWritable key, BytesWritable value, Context context) {
    String line = new String(value.getBytes());
    Text word = new Text();
    word.set(line.split(" ")[6]);
    context.write( );
  }
}
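Why index 6 in split(" ")? In an Apache combined-format log line, the seventh whitespace-separated token is the request path. The sample line below is an assumed example matching the log output shown earlier; the class name is illustrative.

```java
public class LogField {
    // returns the request-path token of an Apache combined-format log line
    static String requestPath(String logLine) {
        return logLine.split(" ")[6];
    }

    public static void main(String[] args) {
        String line = "::1 - - [01/Nov/2015:15:36:00 +0800] "
                + "\"GET /index.html?q=hdfs HTTP/1.1\" 200 792 \"-\" \"Mozilla/5.0\"";
        // tokens 0..5 are ip, identd, user, the two timestamp halves, and "GET
        System.out.println(requestPath(line)); // /index.html?q=hdfs
    }
}
```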

Example Perform Reduce-Step
public static class Reduce extends Reducer<, > {
  public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write( );
  }
}

Example Driver
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "searchcount");
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass( );
  job.setReducerClass( );
  job.setInputFormatClass(SequenceFileInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);
}

Example Run Map/Reduce
Start the job:
$> /usr/local/bin/hadoop jar hadoop-example.jar searchcount hdfs://localhost:9000/flume hdfs://localhost:9000/result
15/11/03 10:18:03 INFO mapred.LocalJobRunner: Waiting for map tasks
15/11/03 10:18:03 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/flume/FlumeData.1446516967102:0+1111
15/11/03 10:18:03 INFO mapred.LocalJobRunner: reduce task executor complete.
15/11/03 10:18:04 INFO mapreduce.Job: map 100% reduce 100%
15/11/03 10:18:04 INFO mapreduce.Job: Job job_local1804904239_0001 completed successfully

Example
Analyze the web access log data stored on HDFS using a SQL-based language: a SELECT ... FROM ... WHERE query is translated into map and reduce steps.

Hive
Run Hive queries in HiveQL (HQL), a dialect of SQL (influenced by MySQL). Hive takes care of converting these queries into a series of jobs for execution on the Hadoop cluster. Users can create:
- User Defined Functions (UDF)
- User Defined Aggregation Functions (UDAF)
- User Defined Table Functions (UDTF)
http://hive.apache.org/

Hive - Components
https://cwiki.apache.org/confluence/display/hive/design

Hive - Components
- UI: submits the query
- Driver: receives the query
- Compiler: parses and does semantic analysis of the query (plans jobs)
- Metastore: stores all table info and column types
- Execution Engine: manages the execution of jobs
https://cwiki.apache.org/confluence/display/hive/design

Example Hive Schema
CREATE EXTERNAL TABLE apache_log (
  ip STRING,
  identd STRING,
  user STRING,
  finishtime STRING,
  request STRING,
  status STRING,
  size STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
WITH SERDEPROPERTIES (
  'serialization.format'='org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
  'quote.delim'='(" \\[ \\])',
  'field.delim'=' ',
  'serialization.null.format'='-')
STORED AS SEQUENCEFILE
LOCATION 'hdfs://path/to/apache/files/';

Example Hive Query
SELECT parse_url(concat("http://www.some_example.com", split(request, ' ')[1]), 'QUERY', 'q') AS query,
       count(*) AS co
FROM apache_log
GROUP BY parse_url(concat("http://www.some_example.com", split(request, ' ')[1]), 'QUERY', 'q');
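What the HiveQL above computes can be mimicked in plain Java: prepend a host so the request path parses as a URL, then extract the value of query parameter q. The host is the placeholder from the slide; the class and method are illustrative, not Hive's implementation of parse_url.

```java
import java.net.URI;

public class QueryParam {
    // extract a query parameter from a request path, as parse_url(..., 'QUERY', name) does
    static String queryParam(String requestPath, String name) {
        String query = URI.create("http://www.some_example.com" + requestPath).getQuery();
        if (query == null) return null; // no query string at all
        for (String kv : query.split("&")) {
            String[] parts = kv.split("=", 2);
            if (parts[0].equals(name)) return parts.length > 1 ? parts[1] : "";
        }
        return null; // parameter not present
    }

    public static void main(String[] args) {
        System.out.println(queryParam("/index.html?q=hadoop", "q")); // hadoop
    }
}
```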

Example Run Hive
By default, Hive sets the following values for Hadoop variables:
- hadoop.bin.path ($HADOOP_HOME/bin/hadoop): The location of the hadoop script, which is used to submit jobs to Hadoop when submitting through a separate JVM
- hadoop.config.dir ($HADOOP_HOME/conf): The location of the configuration directory of the Hadoop installation
Start Hadoop, then start Hive:
$> $HIVE_HOME/bin/hive

Example
Store the frequent search terms in a database: the processing result computed from the HDFS data is written to a persistent database.

HBase
BigTable architecture supporting a loose schema. Clients ask the HMaster for the data location and then read/write data directly from/to the region servers, which store their files in HDFS.
- Memstore: In-memory data cache
- WAL: Write-ahead log to record all changes
- HFile: Specialized HDFS file format
http://hbase.apache.org/

HBase - Structure
- Row: Uninterpreted byte key (rows are lexicographically sorted)
- Column family: Group of columns
- Cell: {row, column, version} identifies exactly one cell value
{
  row: {
    column family: {
      t1: column family:column name = value
      t2: column family:column name = value
    }
  }
}
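Because rows are sorted lexicographically on the raw key bytes, "row10" sorts before "row2" — worth remembering when designing row keys. A TreeSet over ASCII strings shows the same ordering (an illustrative sketch, not the HBase API; for ASCII keys, Java string order matches byte order).

```java
import java.util.*;

public class RowOrder {
    // mirror HBase's lexicographic row ordering for ASCII row keys
    static List<String> hbaseOrder(String... rowKeys) {
        return new ArrayList<>(new TreeSet<>(Arrays.asList(rowKeys)));
    }

    public static void main(String[] args) {
        System.out.println(hbaseOrder("row2", "row10", "row1")); // [row1, row10, row2]
    }
}
```

This is why numeric row keys are often zero-padded or otherwise encoded: plain decimal strings do not sort in numeric order.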

Example With Map/Reduce
Start HBase:
$> /usr/local/bin/start-hbase.sh
Create the table:
$> hbase shell
hbase(main)> create 'SearchCount', 'cf'

Example Perform Reduce-Step
public static class Reduce extends TableReducer<, > {
  public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(sum));
    context.write(null, );
  }
}

Example Driver
public static void main(String[] args) throws Exception {
  Configuration conf = HBaseConfiguration.create(); // HBase already provides a job configuration
  Job job = new Job(conf, "searchcount");
  FileInputFormat.addInputPath(job, new Path(args[0]));
  TableMapReduceUtil.initTableReducerJob("SearchCount", Reduce.class, job);
  job.waitForCompletion(true);
}

Example Query HBase
List tables:
hbase(main)> list
TABLE
SearchCount
=> ["SearchCount"]
Scan the table:
hbase(main)> scan 'SearchCount'
ROW                    COLUMN+CELL
/index.html?q=hdfs     column=cf:count, timestamp=1, value=\x00\x00\x00\x01
/index.html?q=hadoop   column=cf:count, timestamp=2, value=\x00\x00\x00\x04

HBase With Hive
Use the Hive HBase Integration to store the processing result into HBase:
CREATE TABLE result(...)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
TBLPROPERTIES ('hbase.table.name' = 'SearchCount');

Example
Distribute the analysis of web access log data: a messaging system between Flume and the consuming applications allows several applications to analyze the data independently.

Kafka
Kafka is a distributed, partitioned, replicated commit log service. Producers write messages into the Kafka cluster; consumers read them from it.
http://kafka.apache.org/

Kafka
A topic is divided into partitions: producers write to the partitions of a topic, and each consumer reads from one or more partitions.

Kafka Storage
Within a partition, messages are appended to the active segment, reads can occur at any offset, and old segments are eventually deleted. The segment list maps base offsets to segment files: e.g. the file topic/34477849968.kafka holds messages 34477849968 through 35551592051, the next segment starts at offset 35551592052, and so on up to the active segment (82796232652 through 83869974631).
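A read at a given offset resolves to the segment with the largest base offset not exceeding it; a sorted map's floor lookup captures that rule. The base offsets below come from the segment list above; the class itself is an illustrative sketch, not Kafka's implementation.

```java
import java.util.TreeMap;

public class SegmentLookup {
    // find the segment file holding the message at the given offset
    static String segmentFor(TreeMap<Long, String> segments, long offset) {
        // the segment whose base offset is the largest one <= the requested offset
        return segments.floorEntry(offset).getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, String> segments = new TreeMap<>();
        for (long base : new long[]{34477849968L, 35551592052L, 82796232652L})
            segments.put(base, "topic/" + base + ".kafka");

        long offset = 35551591806L; // a message inside the first segment
        System.out.println(segmentFor(segments, offset)); // topic/34477849968.kafka
    }
}
```

Naming each segment file after its base offset is what makes this lookup possible without any extra index over segments.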

Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Clients connect to an ensemble of servers, one of which acts as the leader.
http://zookeeper.apache.org/

Zookeeper - Structure
- Structured like a filesystem (e.g. /, /app1, /app1/p1, /app1/p2, /app2, /app2/p1)
- Each node can have a value and multiple children
- Clients can register for changes on nodes

YARN
- Resource Manager: Overall manager
  - Scheduler: Allocates resources
  - Applications Manager: Handles job submissions
- Node Manager: Per-machine framework agent, runs containers
- App Master: Negotiates appropriate resource containers
- Container: Bundle of memory, CPU, disk, network, etc.

Anything else?
What was not covered by this talk: Spark, Cassandra, Mahout, Pig, ...
http://projects.apache.org/