Case Study: 3 Different Hadoop Cluster Deployments
Lee Moon Soo (moon@nflabs.com)
HDFS as a Storage
Over the last 4 years, our HDFS clusters:
- Stored 1,500 TB+ of customer data safely
- Served 375,000 TB+ of data to customers over HTTP, RTSP, RTMP, MMS, and FTP
- Survived 336 disk failures and 3 name node failures
HDFS as a Storage
Pros:
- Easy to scale
- Runs on commodity hardware
- Fault tolerant
- High throughput
- Cost dropped to 20% of SAN, including maintenance
Cons:
- Not mountable
- Does not handle a large number of files well
- I/O is unpredictable (replication, MapReduce)
- Most existing delivery servers cannot access HDFS
- Couldn't store MANY small files like jpg, gif, html, txt
- Buffering on video streaming
Is HDFS Mountable?
- hdfs-fuse: under very high load we faced memory leaks and hangs; Windows systems cannot use FUSE
- WebHDFS
- NFS gateway
- CloudVFS: we built our own. It runs as a Java daemon, supports FUSE, CIFS, NFS, and FTP, includes a cache, and can serve Windows clients via CIFS
(Diagram: Application → OS → FUSE / CIFS / FTP → CloudVFS)
Many Small Files
- When a disk or a datanode fails, re-replication starts at a speed of (number of nodes * 2) blocks per dfs.replication.interval (default 3) seconds
- If you have 1M jpg files to replicate on a 10-node cluster, it'll take more than 40 hours
- Datanodes periodically scan their blocks, which does not handle many files well
- The namenode keeps metadata in its memory, and memory is limited
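The "more than 40 hours" figure above follows directly from the scheduling formula on the slide; a quick back-of-the-envelope check:

```python
# Re-replication time estimate, using the formula from the slide:
# (number of nodes * 2) blocks are scheduled per
# dfs.replication.interval (default 3) seconds.
def rereplication_hours(num_blocks, num_nodes, interval_sec=3):
    blocks_per_interval = num_nodes * 2
    intervals = num_blocks / blocks_per_interval
    return intervals * interval_sec / 3600.0

# 1M small jpg files, roughly one block each, on a 10-node cluster
hours = rereplication_hours(1_000_000, 10)
print(round(hours, 1))  # ~41.7 hours, i.e. "more than 40 hours"
```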
Handling Small Files
Replace the HDFS implementation:

  <property>
    <name>fs.hdfs.impl</name>
    <value>com.nflabs.cloudvfs.hdfs.smallhdfs</value>
  </property>

The SmallHDFS driver first looks for a ._dir_.har archive:
- /dir contains file1, file2, ... file10000
- A MapReduce job scans the directory tree and creates the ._dir_.har archive
- Result: /dir/._dir_.har
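A minimal sketch of the lookup order described above. Only the archive name ._dir_.har comes from the slide; the dict-backed "filesystem" is a stand-in for HDFS, since the real driver is a Hadoop FileSystem implementation whose code is not shown:

```python
import posixpath

# Toy stand-in for HDFS: path -> contents. Illustrates lookup order only.
fs = {
    "/dir/._dir_.har": {"file1": b"a", "file2": b"b"},  # archived small files
    "/other/file1": b"c",                                # un-archived file
}

def read(path):
    """Open a small file, preferring the per-directory ._dir_.har archive."""
    parent, name = posixpath.split(path)
    archive = posixpath.join(parent, "._dir_.har")
    if archive in fs:             # driver first looks for ._dir_.har
        return fs[archive][name]  # serve the entry out of the archive
    return fs[path]               # fall back to the plain file

print(read("/dir/file1"))    # served from the archive
print(read("/other/file1"))  # served directly
```

Packing many small files into one archive per directory keeps the block count (and hence the namenode's in-memory metadata) low.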
Large Scale Log Analysis System
Delivery servers started generating logs, so we built a log analysis system.
Large Scale Log Analysis System
The first log analysis system:
- A Python script calculates simple statistics like throughput and hits/sec
- Results are sent via HTTP PUT to a web server with an RRD database / graph
- As the service grew, the web server's processing speed couldn't keep up with the log generation speed
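The slides don't show the script itself; as a hedged sketch, the hits/sec and throughput statistics it computes might look like this (the "epoch-second, bytes, URL" log format is an assumption, not taken from the deck):

```python
from collections import Counter

# Hypothetical access-log lines: "<epoch_second> <bytes_sent> <url>".
# The real log format is not shown in the slides.
lines = [
    "1000 512 /a.jpg",
    "1000 2048 /b.mp4",
    "1001 1024 /a.jpg",
]

hits_per_sec = Counter()
bytes_per_sec = Counter()  # throughput, bytes served per second
for line in lines:
    ts, nbytes, _url = line.split()
    hits_per_sec[int(ts)] += 1
    bytes_per_sec[int(ts)] += int(nbytes)

print(dict(hits_per_sec))   # {1000: 2, 1001: 1}
print(dict(bytes_per_sec))  # {1000: 2560, 1001: 1024}
```

A single-process loop like this is exactly what stops scaling as log volume grows, which motivates the MapReduce rewrite on the next slide.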
Large Scale Log Analysis System
- Logs are uploaded to HDFS via HTTP PUT from a cron job
- The Python code is converted to MapReduce Java code running on a Hadoop cluster; results still go to the RRD database / graph on a web server
- Now processing speed catches up with log generation speed
- We could add more analysis, like top URL rank and where clients come from
- But the RRD database wasn't flexible enough, and the RRD files became too big
Large Scale Log Analysis System
- Processed results are saved into HBase, and the web server renders them with the Google Chart API
- (Pipeline: server → HTTP PUT → cron job → MapReduce on a Hadoop cluster → HBase → web server → Google Chart API)
- HBase provides better flexibility and scalability
- But now writing MapReduce jobs becomes a pain
Large Scale Log Analysis System
- Hive helped a lot to quickly develop new statistics features
- (Pipeline: server → HTTP PUT → cron job → Hive / MapReduce on a Hadoop cluster → HBase → web server → Google Chart API)
- As more Hive jobs were added, controlling and scheduling jobs became complicated and problematic
Large Scale Log Analysis System
- Oozie replaces the cron job
- (Pipeline: server → HTTP PUT → Oozie → Hive / MapReduce on a Hadoop cluster → HBase → web server → Google Chart API)
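Replacing cron with Oozie typically means a coordinator app that triggers a workflow on a schedule. A minimal sketch (the app name, dates, and HDFS path are assumptions; the deck shows no Oozie configuration):

```xml
<!-- Hourly coordinator: frequency is in minutes in the 0.1 schema.
     Name and app-path below are hypothetical, for illustration only. -->
<coordinator-app name="hourly-log-stats" frequency="60"
                 start="2013-01-01T00:00Z" end="2014-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <!-- workflow.xml would chain the Hive / MapReduce statistics jobs -->
      <app-path>hdfs:///apps/log-stats/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Unlike cron, Oozie can express dependencies between the Hive and MapReduce jobs, retry failures, and wait for input data to land in HDFS.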
Large Scale Log Analysis System
- Flume replaces the HTTP log collector, writing logs directly to HDFS
- (Pipeline: server → Flume → HDFS; Oozie schedules Hive / MapReduce on a Hadoop cluster → HBase → web server → Google Chart API)
Large Scale Log Analysis System
Over the last 4 years: 1,328 TB+ of logs, 3,658,400M+ records
Hadoop for Data Scientists
What is Hadoop for a data scientist? A data scientist is a human, and humans want:
- An analytical language & environment
- Many libraries
- Interactivity
- Visualization
- Sharing
Hadoop for Data Scientists
Tools and languages: MapReduce (Java), Hive, Pig, R, Scala (Spark), ...
Recently, many open-source ML libraries have been born:
- Mahout (http://mahout.apache.org/)
- cloudera-ml (https://github.com/cloudera/ml)
- MLbase (http://mlbase.org/)
- Cascading Pattern (http://www.cascading.org/pattern/)
Demonstration
Hadoop Landscape
MLbase, Cloudera-ML, HCatalog, MRQL, Stinger, Pig, Drill, Shark, Hive, Impala, Tajo
An Open-Source Analytical Tool/Environment for Hadoop
Is there any? Zeppelin:
- Interactive data visualization
- Runtime environment (abstracts and connects different libraries / computing platforms)
- A sharing network like CRAN or CPAN
https://github.com/nflabs/zeppelin
https://groups.google.com/forum/#!forum/zeppelin-developers
Thanks!