Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future lectures Discuss potential use cases for each project
Topics HDFS MapReduce YARN Sqoop Flume NiFi Pig Hive Streaming HBase Accumulo Avro Parquet Mahout Oozie Storm ZooKeeper Spark SQL-on-Hadoop In-Memory Stores Cassandra Kafka Crunch Azkaban
HDFS Hadoop Distributed File System High-performance file system for storing data We've talked about this enough
Hadoop MapReduce High-performance, fault-tolerant data processing system We've also talked about this enough
YARN Abstract framework for distributed application development Split functionality of JobTracker into two components ResourceManager ApplicationMaster TaskTracker becomes NodeManager Containers instead of map and reduce slots Configurable amount of memory per NodeManager
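The container memory each NodeManager offers is set in its configuration; a minimal yarn-site.xml fragment (the 8192 MB value is only an example to tune per node):

```xml
<!-- yarn-site.xml: total memory (MB) this NodeManager offers to containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- example value, not a recommendation -->
</property>
```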
MapReduce 2.x on YARN MapReduce API has not changed Binary-level backwards compatible (no recompile) Application Master launches and monitors job via YARN MapReduce History Server to store history Enabled Yahoo! to scale beyond 4,000 nodes
Hadoop Ecosystem Core Technologies Hadoop Distributed File System Hadoop MapReduce Many other tools Which we will be discussing now
Apache Sqoop Apache project designed for efficient transfer between Apache Hadoop and structured data stores Used through a CLI and is extendable
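A sketch of CLI usage, with a hypothetical MySQL connection string, table name, and output path:

```
# Hypothetical example: pull the "orders" table from a MySQL database into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl \
  --table orders \
  --target-dir /data/orders
```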
Apache Flume Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data Agents are configured with simple text files and are extendable
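A minimal hypothetical agent configuration (agent, source, channel, and sink names are placeholders) that tails a log file into HDFS:

```
# agent1 tails an application log and writes events to HDFS
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1
```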
Apache NiFi A service to reliably move and manipulate files between clusters using a web front-end Uses a GUI to drop processors and connect them to build workflows
Apache Pig Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs Infrastructure compiles language to a sequence of MapReduce programs
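As a sketch of what the language looks like, word count in Pig Latin (input and output paths are placeholders):

```
-- Hypothetical word count; paths are placeholders
lines  = LOAD '/data/input' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/data/wordcount';
```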
Apache Hive Data warehouse facilitating querying and managing large datasets Compiles SQL-like queries into MapReduce programs
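A sketch of HiveQL, with a hypothetical table and columns:

```
-- Hypothetical table and query; names are placeholders
CREATE TABLE page_views (user_id STRING, url STRING, ts BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```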
Hadoop Streaming Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer Just a jar file, not a real project
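As an illustration, a self-contained Python sketch of a Streaming-style word count. Streaming itself would run the map and reduce stages as separate scripts reading stdin, with the framework sorting mapper output between them; the function names here are our own:

```python
def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) line per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_lines(lines):
    """Reducer: sum counts per word; input must be sorted by word."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

if __name__ == "__main__":
    # Simulate the shuffle/sort that Streaming performs between stages
    demo = ["the cat sat", "the dog sat"]
    for out in reduce_lines(sorted(map_lines(demo))):
        print(out)
```

In a real job, each function would live in its own script passed to the streaming jar via `-mapper` and `-reducer`.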
Which high-level API is for you? What are you comfortable with? What are you being told to use?
Apache HBase Distributed, scalable, big data store Data stored as sorted key/value pairs, with the key consisting of a row and column
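To make the data model concrete, a toy Python sketch of a sorted (row, column) → value map; this illustrates the model only and is not an HBase client API (class and method names are our own, and real HBase cells also carry timestamps):

```python
from bisect import insort

class ToyHBaseTable:
    """Toy model of an HBase table: a sorted map keyed by (row, column)."""

    def __init__(self):
        self._keys = []    # (row, column) pairs kept in sorted order
        self._cells = {}   # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._cells:
            insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        return self._cells.get((row, column))

    def scan(self, start_row, stop_row):
        # Range scans over sorted row keys are the core access pattern
        for key in self._keys:
            if start_row <= key[0] < stop_row:
                yield key, self._cells[key]
```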
Apache Accumulo Robust, scalable, high-performance key/value store for data storage and retrieval Cell-based access controls, i.e. cell-level security
Apache Avro Data serialization system for the Hadoop ecosystem
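For example, a hypothetical Avro schema (expressed in JSON) describing a record with an optional field:

```
{"type": "record",
 "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "age",  "type": ["null", "int"], "default": null}
 ]}
```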
Apache Parquet Columnar storage format for Hadoop
Apache Mahout Library of scalable machine learning algorithms implemented on top of Hadoop MapReduce
Apache Oozie Workflow scheduler system to manage Apache Hadoop jobs
Apache Storm Distributed real-time computation system Didn't have a logo until June 2014 How is this different from MapReduce?
Apache ZooKeeper Effort to develop and maintain an open-source server enabling highly reliable distributed coordination
Apache Spark Fast and general engine for large-scale data processing Write applications in Java, Scala, or Python
SQL on Hadoop Apache Drill, Cloudera Impala, Facebook's Presto, Hortonworks' Hive Stinger, Pivotal HAWQ, etc. SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store Non use cases?
Sample Architecture [diagram: Flume agents feed webserver, website sales, and call center data into HDFS; MapReduce, Pig, and Storm process it; results land in HBase and SQL stores; Oozie coordinates the workflow]
We [maybe] won't be covering these in detail later on OTHER HADOOP PROJECTS
Redis, Memcached, etc. Open-source in-memory key/value stores
Apache Cassandra NoSQL database for managing large amounts of structured, semi-structured, and unstructured data Support for clusters spanning multiple datacenters Unlike HBase and Accumulo, data is not stored on HDFS Non use cases?
Apache Crunch Java framework for writing, testing, and running MapReduce pipelines with a simple API Same code executes as a local job, as a MapReduce job, or as a streaming Spark job
Apache Kafka High-throughput distributed publish-subscribe message service
Azkaban Batch workflow job scheduler to run Hadoop jobs
Review A lot of projects available to you for your group project Think of a problem you are interested in, then choose the appropriate projects to solve it Keep in mind data ingest, storage, processing, and egress Feel free to explore and use projects other than the ones I have listed here Get permission if you plan on using it as part of your project
References All those logos are the property of their owners *.apache.org redis.io