Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future lectures Discuss potential use cases for each project
Topics HDFS MapReduce YARN Sqoop Flume NiFi Pig Hive Streaming HBase Accumulo Avro Parquet Mahout Oozie Storm ZooKeeper Spark SQL-on-Hadoop In-Memory Stores Cassandra Kafka Crunch Azkaban
HDFS Hadoop Distributed File System High-performance file system for storing data We've talked about this enough
Hadoop MapReduce High-performance, fault-tolerant data processing system We've also talked about this enough
YARN Abstract framework for distributed application development Split functionality of JobTracker into two components ResourceManager ApplicationMaster TaskTracker becomes NodeManager Containers instead of map and reduce slots Configurable amount of memory per NodeManager
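The container memory each NodeManager offers is set in its configuration; a minimal yarn-site.xml fragment (the 8192 MB value is only an example to tune per node):

```xml
<!-- yarn-site.xml: total memory (MB) this NodeManager offers to containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- example value, not a recommendation -->
</property>
```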
MapReduce 2.x on YARN MapReduce API has not changed Binary-level backwards compatible (no recompile) Application Master launches and monitors job via YARN MapReduce History Server to store history Enabled Yahoo! to scale beyond 4,000 nodes
Hadoop Ecosystem Core Technologies Hadoop Distributed File System Hadoop MapReduce Many other tools Which we will be discussing now
Apache Sqoop Apache project designed for efficient transfer between Apache Hadoop and structured data stores Used through a CLI and is extendable
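A sketch of CLI usage, with a hypothetical MySQL connection string, table name, and output path:

```
# Hypothetical example: pull the "orders" table from a MySQL database into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl \
  --table orders \
  --target-dir /data/orders
```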
Apache Flume Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data Agents are configured with simple text files and are extendable
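A minimal hypothetical agent configuration (agent, source, channel, and sink names are placeholders) that tails a log file into HDFS:

```
# agent1 tails an application log and writes events to HDFS
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1
```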
Apache NiFi A service to reliably move and manipulate files between clusters using a web front-end Uses a GUI to drop processors and connect them to build workflows
Apache Pig Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs Infrastructure compiles language to a sequence of MapReduce programs
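As a sketch of what the language looks like, word count in Pig Latin (input and output paths are placeholders):

```
-- Hypothetical word count; paths are placeholders
lines  = LOAD '/data/input' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/data/wordcount';
```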
Apache Hive Data warehouse facilitating querying and managing large datasets Compiles SQL-like queries into MapReduce programs
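A sketch of HiveQL, with a hypothetical table and columns:

```
-- Hypothetical table and query; names are placeholders
CREATE TABLE page_views (user_id STRING, url STRING, ts BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```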
Hadoop Streaming Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer Just a jar file, not a real project
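As an illustration, a self-contained Python sketch of a Streaming-style word count. Streaming itself would run the map and reduce stages as separate scripts reading stdin, with the framework sorting mapper output between them; the function names here are our own:

```python
def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) line per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_lines(lines):
    """Reducer: sum counts per word; input must be sorted by word."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

if __name__ == "__main__":
    # Simulate the shuffle/sort that Streaming performs between stages
    demo = ["the cat sat", "the dog sat"]
    for out in reduce_lines(sorted(map_lines(demo))):
        print(out)
```

In a real job, each function would live in its own script passed to the streaming jar via `-mapper` and `-reducer`.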
Which high-level API is for you? What are you comfortable with? What are you being told to use?
Apache HBase Distributed, scalable, big data store Data stored as sorted key/value pairs, with the key consisting of a row and column
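To make the data model concrete, a toy Python sketch of a sorted (row, column) → value map; this illustrates the model only and is not an HBase client API (class and method names are our own, and real HBase cells also carry timestamps):

```python
from bisect import insort

class ToyHBaseTable:
    """Toy model of an HBase table: a sorted map keyed by (row, column)."""

    def __init__(self):
        self._keys = []    # (row, column) pairs kept in sorted order
        self._cells = {}   # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._cells:
            insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        return self._cells.get((row, column))

    def scan(self, start_row, stop_row):
        # Range scans over sorted row keys are the core access pattern
        for key in self._keys:
            if start_row <= key[0] < stop_row:
                yield key, self._cells[key]
```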
Apache Accumulo Robust, scalable, high-performance key/value store for data storage and retrieval Cell-based access controls, i.e. cell-level security
Apache Avro Data serialization system for the Hadoop ecosystem
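For example, a hypothetical Avro schema (expressed in JSON) describing a record with an optional field:

```
{"type": "record",
 "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "age",  "type": ["null", "int"], "default": null}
 ]}
```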
Apache Parquet Columnar storage format for Hadoop
Apache Mahout Library of scalable machine learning algorithms implemented on top of Hadoop MapReduce
Apache Oozie Workflow scheduler system to manage Apache Hadoop jobs
Apache Storm Distributed real-time computation system Didn't have a logo until June 2014 How is this different from MapReduce?
Apache ZooKeeper Effort to develop and maintain an open-source server enabling highly reliable distributed coordination
Apache Spark Fast and general engine for large-scale data processing Write applications in Java, Scala, or Python
SQL on Hadoop Apache Drill, Cloudera Impala, Facebook's Presto, Hortonworks' Hive Stinger, Pivotal HAWQ, etc. SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store Non use cases?
Sample Architecture [diagram: Flume agents feed webserver, website sales, and call center data into HDFS; MapReduce, Pig, and Storm process it; results land in HBase and SQL stores; Oozie coordinates the workflow]
We [maybe] won't be covering these in detail later on OTHER HADOOP PROJECTS
Redis, Memcached, etc. Open-source in-memory key/value stores
Apache Cassandra NoSQL database for managing large amounts of structured, semi-structured, and unstructured data Support for clusters spanning multiple datacenters Unlike HBase and Accumulo, data is not stored on HDFS Non use cases?
Apache Crunch Java framework for writing, testing, and running MapReduce pipelines with a simple API Same code executes as a local job, as a MapReduce job, or as a streaming Spark job
Apache Kafka High-throughput distributed publish-subscribe message service
Azkaban Batch workflow job scheduler to run Hadoop jobs
Review A lot of projects available to you for your group project Think of a problem you are interested in, then choose the appropriate projects to solve it Keep in mind data ingest, storage, processing, and egress Feel free to explore and use projects other than the ones I have listed here Get permission if you plan on using it as part of your project
References All those logos are the property of their owners *.apache.org redis.io