The Cloud Computing Era and Ecosystem. Phoenix Liau, Technical Manager


Three Major Trends to Change the World: Cloud Computing, Big Data, Mobile

Mobility and Personal Cloud My World! My Way!

What is Personal Cloud? "Consumers use smartphones, media tablets, TVs, PCs, and other connected devices to seamlessly store, sync, stream, and share content over the network." -- Gartner. Multiple Screens, Diverse Platforms: Sync, Store, Stream, Share

User Behaviors (Data source: Gartner)

Personal Cloud Challenge Gartner estimates that consumers spent about US$2.2 trillion on digital technology products and services in 2012, nearly 10% of an average household's disposable income. By 2015, global consumer spending on connected-device services and content delivery will reach US$2.8 trillion. But if you are a developer: 80% of developers do not earn enough to sustain a standalone business; 59% cannot recoup the money they invested in development; 63% of developers' apps get fewer than 50,000 downloads; 75% of developers earn US$5,000 or less per app.

Personal Cloud Opportunity Business Model: social network, location-based, mobile-powered "Services", not just "Apps"

Cloud Computing - What Comes to Mind?

Cloud Computing - What Comes to Mind? Virtualization Social Media SaaS Cloud Computing Hadoop Internet Storage Mobility

The NIST Definition of Cloud Computing: Essential Characteristics, Service Models, Deployment Models. An "as-a-service" business model that uses Internet technologies to deliver scalable and elastic IT-related capabilities to users.

It's About the Ecosystem Structured and semi-structured data, Cloud Computing, Enterprise Data Warehouse, SaaS, PaaS, IaaS generate Big Data, lead to Business Insights, and create Competition, Innovation, Productivity

Top Cloud Computing Predictions for 2012 In 2012, 80% of new commercial enterprise apps will be deployed on cloud platforms. -- IDC Amazon Web Services will exceed $1 billion in cloud services business in 2012, with Google's Enterprise business to follow within 18 months. -- IDC By 2015, low-cost cloud services will cannibalize up to 15% of top outsourcing players' revenue. -- Gartner By 2016, 40% of enterprises will make proof of independent security testing a precondition for using any type of cloud service. -- Gartner At year-end 2016, more than 50% of Global 1000 companies will have stored customer-sensitive data in the public cloud. -- Gartner An estimated more than 20% of organizations have already begun to selectively store their customer-sensitive data in a hybrid architecture.

The Need for Business Agility on Infrastructure Business Owner: We got a mission from the CEO to introduce a new SaaS service. Developers: Just getting the infrastructure to develop on is so slow! Operations: How do we get the hardware, manage the app, and deliver the SLA in production? We need to: get capacity now; get software stacks deployed; simulate production. Once in production, we need to: plan capacity for the app; place it on Tier 1 capacity; provision the app server, web server, and database; set up the load balancer; set up the firewall; set up data protection; set up management; manage the app.

Cloud Infrastructure Landscape Cloud infrastructure As-a-Product and As-a-Service: Virtualization, Storage, and a lot more

Cloud IaaS Providers Cloud IaaS is the future of outsourced hosting. Every company that offers Web hosting is being forced to evolve its business; on-demand, pay-as-you-go capability has become the norm. Primary customers: 1. Traditional Web hosting customers 2. Companies interested in what cloud computing can do for their business.

Cloud Infrastructure Management A central management solution for enterprises to manage servers across multiple public clouds, private clouds, or hybrid clouds. Solution Providers:

CIOs' Concerns about Cloud Computing Concerns that drive the cloud adoption strategy: Security and Compliance; Performance and SLAs; Availability and Data Protection; Intellectual Property

Private Cloud Infrastructure Options Enterprise Private Cloud options: Commercial option: VMware vCloud (with VMware-based virtualization). Open-source option: CloudStack, OpenStack, Eucalyptus (with Xen-based virtualization). PROs: infrastructure is dedicated and owned by the organization, and thus more secure. CONs: does not give the enterprise the full benefits of the cloud: true elasticity and CapEx/OpEx elimination. Virtual Private Data Center (VPDC) from a Public Cloud service provider. Options: Amazon (Virtual Private Cloud), Savvis, Rackspace, Terremark. PROs: meets business agility and flexibility requirements at relatively low cost; better security than Public Cloud; the enterprise can access servers in the VPDC via a secured tunnel; dedicated hardware for a single customer under a VPDC offering is available from some vendors (e.g. Amazon EC2 Dedicated Instances); enterprise-level managed services are offered by most vendors. CONs: isolation occurs at the network layer; information stored in a VPDC may still share the actual servers with other companies' data.

Hybrid Cloud Infrastructure A 2011 survey of 500 CIOs across the UK, France, Germany, Spain, and Benelux highlights: 16% have company-wide implementations of cloud computing to date; 60% believe that the Cloud will be their most significant IT operating method by 2014; 21% prefer the Hybrid Cloud method. Hybrid Cloud balances the security strengths of a private cloud with the lower costs and elasticity of a public cloud service, while maintaining business agility. Example: traffic bursting for newly introduced services like SafeSync. Hybrid Cloud: Private Cloud (security) bridging to Public Cloud (lower cost, elasticity)

Comparison of Different Cloud Deployment Models: a radar chart (scale 0 to 5, with an average line) comparing Private Cloud, Hybrid Cloud, and Public Cloud on ROI, TCO, Security, Performance, and Elasticity.

What is Big Data? A set of files. A database. A single file.

The Data-Driven World Modern systems have to deal with far more data than was the case in the past. Organizations are generating huge amounts of data, and that data has inherent value and cannot be discarded. Examples: Yahoo!: over 170 PB of data; Facebook: over 30 PB; eBay: over 5 PB. Many organizations are generating data at a rate of terabytes per day.

What is the problem? Traditionally, computation has been processor-bound. For decades, the primary push was to increase the computing power of a single machine: faster processor, more RAM. Distributed systems evolved to allow developers to use multiple machines for a single job. At compute time, data is copied to the compute nodes.

What is the problem? Getting the data to the processors becomes the bottleneck. Quick calculation: typical disk data transfer rate: 75 MB/sec. Time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
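The quick calculation above can be checked with a few lines (a sketch assuming 100 GB = 100,000 MB and a sustained rate of 75 MB/sec; class and method names are illustrative):

```java
// Back-of-the-envelope check of the disk bottleneck quoted above.
public class TransferTime {
    // Transfer time in minutes for a payload (in MB) at a given disk rate (MB/sec).
    static double minutes(double megabytes, double mbPerSec) {
        return megabytes / mbPerSec / 60.0;
    }

    public static void main(String[] args) {
        // The slide's example: 100 GB at a typical 75 MB/sec disk rate
        System.out.printf("100 GB at 75 MB/sec: %.1f minutes%n", minutes(100_000, 75));
    }
}
```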

What is the problem? Failure of a component may cost a lot. What do we need when a job fails? Failure may result in a graceful degradation of application performance, but the entire system should not completely fail; it should not result in the loss of any data; and it should not affect the outcome of the job.

Big Data Solutions by Industries The most common problems Hadoop can solve

Threat Analysis/Trade Surveillance Challenge: Detecting threats in the form of fraudulent activity or attacks Large data volumes involved Like looking for a needle in a haystack Solution with Hadoop: Parallel processing over huge datasets Pattern recognition to identify anomalies i.e., threats Typical Industry: Security, Financial Services

Big Data Use Case Smart Protection Network Challenge Information accessibility and transparency problems for threat researcher due to the size and source of data (volume, variety and velocity) Size of Data Overall Data Data sources: 20+ Data fields: 1000+ Daily new records: 23 Billion+ Daily new data size: 4TB+ SPN Smart Feedback Feedback components: 26 Data fields : 300+ Daily new file counts: 6 Million+ Daily new records: 90 Million+ Daily new data size: 261GB+


Recommendation Engine Challenge: using user data to predict which products to recommend. Solution with Hadoop: a batch processing framework that allows execution in parallel over large datasets; collaborative filtering: collecting taste information from many users and utilizing it to predict what similar users like. Typical Industry: ISP, Advertising

Hadoop!

The Apache Hadoop project was inspired by Google's MapReduce and Google File System papers. It is an open-sourced, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware. Open Source Software + Commodity Hardware = IT Cost Reduction

Hadoop Concepts Distribute the data as it is initially stored in the system Individual nodes can work on data local to those nodes Users can focus on developing applications.

Hadoop Components Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) and the MapReduce software framework. There are many other projects based around core Hadoop, often referred to as the Hadoop Ecosystem: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc. Ecosystem layers: Hue (Web Console), Mahout (Data Mining), Oozie (Job Workflow & Scheduling), Zookeeper (Coordination), Sqoop/Flume (Data Integration), MapReduce Runtime (Distributed Programming Framework), Pig/Hive (Analytical Language), HBase (Column NoSQL DB), all on top of the Hadoop Distributed File System (HDFS).

Hadoop Components: HDFS HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster. Two roles in HDFS: NameNode: records metadata. DataNode: stores data.

How Files Are Stored: Example NameNode holds metadata for the data files DataNodes hold the actual blocks Each block is replicated three times on the cluster

HDFS: Points To Note When a client application wants to read a file: It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on It then communicates directly with the DataNodes to read the data
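The NameNode/DataNode division of labor described above can be sketched as a toy model (illustrative names only, not the real HDFS API): the metadata maps play the NameNode, block contents would live on the DataNodes, and a read is a metadata lookup followed by direct contact with the listed DataNodes.

```java
import java.util.*;

// Toy model of the NameNode/DataNode split: the "NameNode" maps hold only
// metadata (file -> block ids -> datanode replicas); each block is
// replicated three times, as on the slide above. All names are illustrative.
public class MiniHdfs {
    static final int REPLICATION = 3;
    Map<String, List<Integer>> fileBlocks = new HashMap<>();     // NameNode: file -> blocks
    Map<Integer, List<String>> blockLocations = new HashMap<>(); // NameNode: block -> datanodes
    List<String> dataNodes = Arrays.asList("dn1", "dn2", "dn3", "dn4");
    int nextBlock = 0;

    void store(String file, int numBlocks) {
        List<Integer> blocks = new ArrayList<>();
        for (int i = 0; i < numBlocks; i++) {
            int id = nextBlock++;
            blocks.add(id);
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++)   // place 3 replicas round-robin
                replicas.add(dataNodes.get((id + r) % dataNodes.size()));
            blockLocations.put(id, replicas);
        }
        fileBlocks.put(file, blocks);
    }

    // A client read first asks the "NameNode" which blocks make up the file and
    // where they live; it would then talk to those DataNodes directly.
    List<List<String>> locate(String file) {
        List<List<String>> out = new ArrayList<>();
        for (int b : fileBlocks.get(file)) out.add(blockLocations.get(b));
        return out;
    }

    public static void main(String[] args) {
        MiniHdfs fs = new MiniHdfs();
        fs.store("/logs/a.txt", 2);
        System.out.println(fs.locate("/logs/a.txt"));
    }
}
```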

Hadoop Components: MapReduce MapReduce is a method for distributing a task across multiple nodes. It works like a Unix pipeline: cat input | grep | sort | uniq -c | cat > output, corresponding to Input | Map | Shuffle & Sort | Reduce | Output.

Features of MapReduce Automatic parallelization and distribution. Automatic re-execution on failure. Locality optimizations. MapReduce abstracts all the housekeeping away from the developer, who can concentrate simply on writing the Map and Reduce functions.

Example: word count Word count is challenging over massive amounts of data: using a single compute node would be too time-consuming, and the number of unique words can easily exceed the RAM. MapReduce breaks complex tasks down into smaller elements which can be executed in parallel. More nodes mean faster processing.

Word Count Example Map input key: byte offset, value: line. Map output key: word, value: count. Reduce output key: word, value: sum of counts. Sample input: 0: The cat sat on the mat; 22: The aardvark sat on the sofa
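The Map, Shuffle & Sort, and Reduce phases for these two sample lines can be traced in plain Java (an in-memory sketch of the data flow, not Hadoop API code; all names here are illustrative):

```java
import java.util.*;

// In-memory walk-through of Map -> Shuffle & Sort -> Reduce for word count.
public class WordCountFlow {
    static SortedMap<String, Integer> count(List<String> lines) {
        // Map phase: emit (word, 1) for every token in every line
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines)
            for (String w : line.toLowerCase().split("\\s+"))
                emitted.add(new AbstractMap.SimpleEntry<>(w, 1));

        // Shuffle & sort: group the emitted values by key (word)
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : emitted)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce phase: sum the grouped counts for each word
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList(
                "The cat sat on the mat",
                "The aardvark sat on the sofa")));
    }
}
```

In a real cluster the three phases run on different nodes; here they are sequential, but the key/value shapes match the slide.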

The Hadoop Ecosystem

Growing Hadoop Ecosystem The term Hadoop is taken to be the combination of HDFS and MapReduce There are numerous other projects surrounding Hadoop Typically referred to as the Hadoop Ecosystem Zookeeper Hive and Pig HBase Flume Other Ecosystem Projects Sqoop Oozie Hue Mahout

The Ecosystem is the System Hadoop has become the kernel of the distributed operating system for Big Data No one uses the kernel alone A collection of projects at Apache

Relation Map Hue (Web Console) Mahout (Data Mining) Oozie (Job Workflow & Scheduling) Zookeeper (Coordination) Sqoop/Flume (Data integration) MapReduce Runtime (Dist. Programming Framework) Pig/Hive (Analytical Language) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)

Zookeeper: Coordination Framework

What is ZooKeeper? A centralized service for maintaining configuration information and providing distributed synchronization; a set of tools to build distributed applications that can safely handle partial failures. ZooKeeper was designed to store coordination data: status information, configuration, location information.

Why use ZooKeeper? Manage configuration across nodes Implement reliable messaging Implement redundant services Synchronize process execution

ZooKeeper Architecture All servers store a copy of the data (in memory). A leader is elected at startup. Two roles: leader and follower. Followers service clients; all updates go through the leader. Update responses are sent when a majority of servers have persisted the change. HA support.
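The majority rule above can be made concrete with a small sketch (this only shows the quorum arithmetic, not ZooKeeper's actual replication protocol): an ensemble of n servers needs n/2 + 1 acknowledgements before an update is confirmed, so 2f+1 servers tolerate f failures.

```java
// Quorum arithmetic behind "a majority of servers have persisted the change".
public class Quorum {
    // Smallest majority of an ensemble of the given size.
    static int quorum(int ensembleSize) { return ensembleSize / 2 + 1; }

    // How many servers can fail while a majority still survives.
    static int toleratedFailures(int ensembleSize) { return ensembleSize - quorum(ensembleSize); }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7})
            System.out.printf("ensemble=%d quorum=%d tolerates=%d failures%n",
                    n, quorum(n), toleratedFailures(n));
    }
}
```

This is why ensembles are usually sized with an odd number of servers: going from 3 to 4 servers raises the quorum without tolerating any additional failures.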

HBase: Column NoSQL DB

Structured-data vs Raw-data

HBase is an Apache open source project inspired by Google's Bigtable: a non-relational, distributed database written in Java, coordinated by ZooKeeper.

Row & Column Oriented

HBase Data Model Cells are versioned. Table rows are sorted by row key. A region is a row range [start-key, end-key).
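A sketch of why sorted row keys matter: finding the region that serves a row becomes a floor lookup over the region start keys. This uses a plain TreeMap as a stand-in for HBase's region metadata (illustrative only, not the HBase client API):

```java
import java.util.*;

// Toy region lookup: rows are sorted by row key, and a "region" is a
// contiguous row range [start-key, end-key). Region i starts at its start key
// and ends where the next region begins. All names are illustrative.
public class RegionLookup {
    static String regionFor(TreeMap<String, Integer> regionStarts, String rowKey) {
        // The serving region is the one with the greatest start key <= rowKey.
        Map.Entry<String, Integer> e = regionStarts.floorEntry(rowKey);
        return "region-" + e.getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> regionStarts = new TreeMap<>();
        regionStarts.put("", 0);   // first region starts at the empty key
        regionStarts.put("g", 1);
        regionStarts.put("t", 2);
        System.out.println(regionFor(regionStarts, "aardvark"));
        System.out.println(regionFor(regionStarts, "mole"));
        System.out.println(regionFor(regionStarts, "zebra"));
    }
}
```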

Architecture Master Server (HMaster): assigns regions to RegionServers; monitors the health of RegionServers. RegionServers: contain regions and handle client read/write requests.

HBase Workflow

When to use HBase You need random, low-latency access to the data. The application has a variable schema where each row is slightly different, columns are added over time, and most columns are NULL in each row.

Flume / Sqoop: Data Integration Framework

What's the problem with data collection? Data collection is currently a priori and ad hoc. A priori: you decide what you want to collect ahead of time. Ad hoc: each kind of data source goes through its own collection path.

(and how can it help?) A distributed data collection service that efficiently collects, aggregates, and moves large amounts of data. Fault tolerant, with many failover and recovery mechanisms. A one-stop solution for data collection of all formats.

Flume: High-Level Overview Logical Node Source Sink

Architecture: basic diagram, one master controlling multiple nodes

Architecture: multiple masters controlling multiple nodes

An example flow

Flume / Sqoop: Data Integration Framework

Sqoop Easy, parallel database import/export. What do you want to do? Import data from an RDBMS into HDFS; export data from HDFS back into an RDBMS.

What is Sqoop A suite of tools that connect Hadoop and database systems Import tables from databases into HDFS for deep analysis Export MapReduce results back to a database for presentation to end-users Provides the ability to import from SQL databases straight into your Hive data warehouse

How Sqoop helps The Problem Structured data in traditional databases cannot be easily combined with complex data stored in HDFS Sqoop (SQL-to-Hadoop) Easy import of data from many databases to HDFS Generate code for use in MapReduce applications

Sqoop - import process

Sqoop - export process Exports are performed in parallel using MapReduce

Why Sqoop? Its JDBC-based implementation works with many popular database vendors. Auto-generation of tedious user-side code: write MapReduce applications that work with your data, faster. Integration with Hive allows you to stay in a SQL-based environment.

Sqoop Jobs Job management options, e.g.: sqoop job --create myjob -- import --connect xxxxxxx --table mytable

Pig / Hive: Analytical Language

Why Hive and Pig? Although MapReduce is very powerful, it can also be complex to master Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code Many organizations have programmers who are skilled at writing code in scripting languages Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce Hive was initially developed at Facebook, Pig at Yahoo!

Hive Developed by Facebook. What is Hive? An SQL-like interface to Hadoop: data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop, using MapReduce for execution and HDFS for storage. Hive Query Language: basic SQL (SELECT, FROM, JOIN, GROUP BY), equi-joins, multi-table insert, multi-group-by, batch queries. Example: SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid

Pig Initiated by Yahoo!. A high-level scripting language (Pig Latin) that processes data one step at a time. Simple for writing MapReduce programs: easy to understand, easy to debug. Example: A = LOAD 'a.txt' AS (id, name, age, ...); B = LOAD 'b.txt' AS (id, address, ...); C = JOIN A BY id, B BY id; STORE C INTO 'c.txt';

Hive vs. Pig Language: Hive uses HiveQL (SQL-like); Pig uses Pig Latin, a scripting language. Schema: Hive has table definitions stored in a metastore; in Pig, a schema is optionally defined at runtime. Programmatic access: Hive offers JDBC and ODBC; Pig offers PigServer.

WordCount Example Input: Hello World Bye World / Hello Hadoop Goodbye Hadoop. For the given sample input, the map emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1> <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>. The reduce just sums up the values and emits: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>.

WordCount Example in MapReduce

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

WordCount Example in Pig A = LOAD 'wordcount/input' USING PigStorage AS (token:chararray); B = GROUP A BY token; C = FOREACH B GENERATE group, COUNT(A) AS count; DUMP C;

WordCount Example in Hive CREATE TABLE wordcount (token STRING); LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount; SELECT token, count(*) FROM wordcount GROUP BY token;

Oozie: Job Workflow & Scheduling

What is Oozie? A Java web application: Oozie is a workflow scheduler for Hadoop, a crond for Hadoop. Job 1, Job 2, Job 3, Job 4, Job 5

Why use Oozie instead of just cascading jobs one after another? Major flexibility: start, stop, suspend, and re-run jobs. Oozie allows you to restart from a failure: you can tell Oozie to restart a job from a specific node in the graph or to skip specific failed nodes.

High-Level Architecture A Web Service API on a Tomcat web app, backed by a database that stores workflow definitions and currently running workflow instances, including instance states and variables, and talking to Hadoop/Pig/HDFS.

How is it triggered? Time: execute your workflow every 15 minutes (00:15, 00:30, 00:45, 01:00). Time and data: materialize your workflow every hour, but only run it when the input data is ready (does the input data exist in Hadoop at 01:00, 02:00, 03:00, 04:00?).

Example Workflow

Oozie use criteria You need to launch, control, and monitor jobs from your Java apps (Java Client API / command-line interface). You need to control jobs from anywhere (Web Service API). You have jobs that you need to run every hour, day, or week. You need to receive notification when a job is done (email when a job is complete).

Hue: Web Console

Hue (Hadoop User Experience), developed by Cloudera as an Apache-licensed open source project. HUE is a web UI for the Hadoop platform and a platform for building custom applications with a nice UI library.

Hue HUE comes with a suite of applications File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files. Job Browser: View jobs, tasks, counters, logs, etc. Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.

Hue: File Browser UI

Hue: Beeswax UI

Mahout: Data Mining

What is Mahout? A machine-learning tool: distributed and scalable machine learning algorithms on the Hadoop platform, making it easier and faster to build intelligent applications.

Why Mahout? The current state of ML libraries: they lack community, lack documentation and examples, lack scalability, and are research oriented.

Mahout scales Scales to large datasets: Hadoop MapReduce implementations that scale linearly with data. Scales to support your business case: Mahout is distributed under the commercially friendly Apache Software License. Scalable community: vibrant, responsive, and diverse.

Mahout's four use cases Recommendation mining: takes users' behavior and finds items a specified user might like. Clustering: takes e.g. text documents and groups them based on related document topics. Classification: learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the appropriate category. Frequent itemset mining: takes a set of item groups (e.g. terms in a query session, shopping cart contents) and identifies which individual items typically appear together.

Use case example Predict what a user likes based on his or her historical behavior and the aggregate behavior of people similar to him or her.
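The "people similar to you" idea can be sketched in a few lines (a minimal illustration, not Mahout's API; the Jaccard scoring, the data, and all names here are assumptions for the example): score each candidate item by how similar its owners' item sets are to the target user's.

```java
import java.util.*;

// Tiny user-based recommender sketch: recommend the item whose owners
// look most like the target user (Jaccard overlap of item sets).
public class TinyRecommender {
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }

    // Returns the item (not already owned by `user`) with the highest
    // similarity-weighted score across all other users.
    static String recommend(Map<String, Set<String>> baskets, String user) {
        Set<String> mine = baskets.get(user);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Set<String>> other : baskets.entrySet()) {
            if (other.getKey().equals(user)) continue;
            double sim = jaccard(mine, other.getValue());
            for (String item : other.getValue())
                if (!mine.contains(item))
                    scores.merge(item, sim, Double::sum);
        }
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).orElse(null);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> baskets = new HashMap<>();
        baskets.put("alice", new HashSet<>(Arrays.asList("milk", "bread")));
        baskets.put("bob",   new HashSet<>(Arrays.asList("milk", "bread", "butter")));
        baskets.put("carol", new HashSet<>(Arrays.asList("beer")));
        System.out.println(recommend(baskets, "alice"));
    }
}
```

Mahout's recommenders follow the same shape at scale, distributing the similarity and scoring computations as MapReduce jobs.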

Conclusion Today, we introduced: Why Hadoop is needed The basic concepts of HDFS and MapReduce What sort of problems can be solved with Hadoop What other projects are included in the Hadoop ecosystem

Recap: Hadoop Ecosystem

Questions?

Thank you!