Savanna: Hadoop on OpenStack. Sergey Lukjanov, Savanna Technical Lead, Mirantis, 2013
Agenda Savanna Overview Savanna Use Cases Roadmap & Current Status Architecture & Features Overview Hadoop vs. Virtualization
Savanna - Elastic Hadoop on OpenStack Open source native OpenStack component Supports different Hadoop distributions Solves both bare cluster provisioning use case and "analytics as a service" Managed through REST API Web UI as part of the OpenStack Dashboard Flexible templates of Hadoop configurations
Savanna - Elastic Hadoop on OpenStack Project home - https://launchpad.net/savanna (bug tracking, blueprints, answers) Code review (gerrit) - https://review.openstack.org Sources - https://github.com/stackforge/savanna Mailing list - savanna-all@lists.launchpad.net CI - https://jenkins.openstack.org and http://jenkins.savanna.mirantis.com
Savanna - Participants Contributors: large core team from Mirantis; teams from Red Hat and Hortonworks; several minor contributors; Intel joined recently. Several upcoming customers
Savanna Use Cases Administrators - centralized cluster management and monitoring Dev and QA teams - fast cluster provisioning Data Scientists/Analysts - API to run analytic jobs with infrastructure provisioning happening under the hood Making resources dedicated to the IaaS cloud available for Hadoop workloads
Administrators Use Case Central point of control over infrastructure Enables self-service capabilities, including choice of Hadoop distribution to be used Integration with vendor tooling: Ambari for Apache/Hortonworks, Cloudera Management Console, Intel Hadoop Utilization of free IaaS capacity for Hadoop tasks
Dev and QA Use Cases Fast on-demand provisioning of environments Increased agility and speed of innovation Controlled access to data from production
Analytics Use Cases Simplified task execution - complexity of provisioning and managing the cluster is hidden under the hood Access to higher-level interfaces (e.g. Pig, Hive) Bursty workloads: ad-hoc queries requiring significant resources only for a short period Utilization of free IaaS capacity for Hadoop tasks
Roadmap for Hadoop in Cloud Phase 1 Basic cluster provisioning of Apache Hadoop Phase 2 Cluster operation support and integration with tooling, advanced configuration (HDFS, Swift, etc.) Phase 3 "Analytics as a service": job execution framework, support for different scripting languages, deeper integration with OpenStack
Phase 1 - Basic Cluster Operation Cluster provisioning Deployment Engine implementation for preinstalled images Templates for Hadoop cluster configuration REST API for cluster startup and operations Web UI integrated into OpenStack Dashboard
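The slide mentions templates for Hadoop cluster configuration without showing their format. The following is only a minimal sketch of the idea, assuming a template is a flat mapping of Hadoop properties that a cluster request may override. The names BASE_TEMPLATE and render_config are hypothetical and are not Savanna's API; the two property names are standard Hadoop 1.x settings.

```python
from typing import Dict

# Hypothetical template: the property names are standard Hadoop 1.x settings,
# but the flat-dictionary template structure is an assumption for illustration.
BASE_TEMPLATE: Dict[str, str] = {
    "dfs.replication": "3",
    "mapred.tasktracker.map.tasks.maximum": "2",
}


def render_config(template: Dict[str, str], overrides: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the template with per-cluster overrides applied.

    Overrides for properties the template does not define are rejected,
    so a typo in a property name fails early instead of being ignored.
    """
    unknown = sorted(set(overrides) - set(template))
    if unknown:
        raise KeyError(f"not defined in template: {unknown}")
    merged = dict(template)
    merged.update(overrides)
    return merged


# A small test cluster that keeps a single copy of each block.
small_cluster = render_config(BASE_TEMPLATE, {"dfs.replication": "1"})
```

Rejecting unknown keys is a design choice of this sketch; a real template system might instead pass them through to Hadoop unchanged.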
Phase 2 - Advanced Configuration Hadoop cluster configuration support: Solutions for HDFS data reliability issue Configurable storage location Configurable topology of NN, TT, JT Add/remove nodes More Hadoop parameters Integration with vendor deployment/management tooling Basic monitoring support
Phase 3 - Analytics as a Service API to execute Map/Reduce jobs without exposing details of underlying infrastructure (similar to AWS EMR) User-friendly UI for ad-hoc analytics queries based on Hive or Pig
Roadmap for Hadoop in Cloud Phase 1 [Released - April 10] Basic cluster provisioning of Apache Hadoop Phase 2 [In progress - July 15] Cluster operation support and integration with tooling, advanced configuration (HDFS, Swift, etc.) Phase 3 [Planned - October 15] "Analytics as a service": job execution framework, support for different scripting languages, deeper integration with OpenStack
Further Roadmap Autoscaling HA for NameNode Deeper HDFS and Swift integration Caching of Swift data on HDFS Integration with logging and error handling HBase support
Architecture Overview [Diagram components: Keystone (Auth), Horizon (Savanna Pages), Savanna Python Client, REST API, Cluster Configuration Manager, DAL, Provisioning Plugin, Instance Interop Helper, Image Registry, Nova, Glance, Swift, Hadoop VMs]
Hadoop vs. Virtualization HDFS Reliability Data Persistence I/O Performance etc.
HDFS Reliability: the issue [Diagram: data block, two compute hosts]
HDFS Reliability: single per host [Diagram: three compute hosts, TT, Cluster A, Cluster B]
HDFS Reliability: HADOOP-8468 hypervisor-awareness for HDFS scheduler [Diagram: three compute hosts, HDFS data block]
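The reliability concern in a virtualized cluster is that several Hadoop VMs can share one physical compute host, so replicas stored on different VMs may still be lost together. The sketch below illustrates the general idea of hypervisor-aware placement. It is not the HADOOP-8468 implementation: the function place_replicas and the VM-to-host mapping are hypothetical.

```python
from typing import Dict, List


def place_replicas(vm_to_host: Dict[str, str], replication: int = 3) -> List[str]:
    """Pick up to `replication` VMs such that no two share a physical host.

    `vm_to_host` maps a VM name to the compute host it runs on. VMs are
    visited in sorted order so the result is deterministic.
    """
    chosen: List[str] = []
    used_hosts = set()
    for vm in sorted(vm_to_host):
        host = vm_to_host[vm]
        if host in used_hosts:
            continue  # another replica already lives on this hardware
        chosen.append(vm)
        used_hosts.add(host)
        if len(chosen) == replication:
            break
    return chosen


# vm-1 and vm-2 share compute-a, so only one of them may hold a replica.
vms = {"vm-1": "compute-a", "vm-2": "compute-a", "vm-3": "compute-b", "vm-4": "compute-c"}
replicas = place_replicas(vms)
```

When fewer distinct hosts exist than the requested replication factor, this sketch returns fewer targets rather than doubling up on a host.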
HDFS Reliability: HADOOP-8545 enables Swift for Hadoop [Diagram: Swift, initial input, Hadoop Job #1, Hadoop Job #2, ..., Hadoop Job #N, final output, HDFS]
Configurable topology of NN, TT, JT Master node(s): JT, NN, JT + NN Worker nodes: 10, 6, 8 (TT)
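A configurable topology implies some check that a requested layout is usable. The following is a minimal sketch of such a check, assuming a Hadoop 1.x cluster with exactly one NameNode (NN), exactly one JobTracker (JT), and at least one TaskTracker (TT). The node-group representation and the function validate_topology are hypothetical, and the worker counts are illustrative.

```python
from collections import Counter
from typing import List, Tuple

# A node group is (processes running on each node, number of such nodes).
NodeGroup = Tuple[List[str], int]


def validate_topology(groups: List[NodeGroup]) -> List[str]:
    """Return a list of problems; an empty list means the layout is usable."""
    totals = Counter()
    for processes, count in groups:
        for process in processes:
            totals[process] += count
    errors = []
    for master in ("NN", "JT"):
        if totals[master] != 1:
            errors.append(f"expected exactly 1 {master}, found {totals[master]}")
    if totals["TT"] < 1:
        errors.append("expected at least 1 TT, found 0")
    return errors


# Masters combined on one node, or split across two nodes.
combined_master = [(["JT", "NN"], 1), (["TT"], 10)]
split_masters = [(["JT"], 1), (["NN"], 1), (["TT"], 6)]
```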
HDFS Placement Options Ephemeral drive /var/lib/nova/instances/instance-xxx/disk -> /mnt/ephemeral Block storage volume Cinder Volume -> /mnt/volume Bare hard drive support /dev/sdb -> /mnt/sdb
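The three placement options can be read as a mapping from a storage kind to a mount point inside the VM. The sketch below builds a comma-separated list of data directories from the selected mount points, which is the form Hadoop 1.x accepts for dfs.data.dir. The class, the dictionary keys, and the hdfs/data subdirectory are hypothetical; the sources and mount points are the ones on the slide.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class StoragePlacement:
    source: str       # where the storage comes from, as written on the slide
    mount_point: str  # where it is mounted inside the VM


PLACEMENTS: Dict[str, StoragePlacement] = {
    "ephemeral": StoragePlacement("/var/lib/nova/instances/instance-xxx/disk", "/mnt/ephemeral"),
    "cinder": StoragePlacement("Cinder Volume", "/mnt/volume"),
    "bare": StoragePlacement("/dev/sdb", "/mnt/sdb"),
}


def hdfs_data_dirs(kinds: Iterable[str], subdir: str = "hdfs/data") -> str:
    """Build a comma-separated data directory list for the chosen placements."""
    return ",".join(f"{PLACEMENTS[kind].mount_point}/{subdir}" for kind in kinds)


data_dirs = hdfs_data_dirs(["ephemeral", "cinder"])
```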
Q&A
We are hiring!
Phase 1 deployment mechanism [Diagram: Savanna provisions VMs with pre-installed Hadoop, then configures the Hadoop cluster on the Hadoop VMs]
Tool usage scenarios Scenario I: tool manages the Hadoop cluster (Hadoop VMs) Scenario II: tool provisions and manages the Hadoop cluster (VMs)
Extensible Provisioning Savanna Plugin: get extra configs, validate input, launch/terminate cluster, add/remove nodes Image registry: register image in Savanna, add/remove tags, get image by tag Instance Interop: launch/terminate VMs, get VM status, ssh/scp to VM
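The operations listed for the plugin and the image registry can be sketched as a Python interface. This is an illustration only, not Savanna's actual code: the class names, method names, and signatures are hypothetical and simply mirror the wording on the slide.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Set


class ProvisioningPlugin(ABC):
    """Operations the slide lists for a plugin (names are hypothetical)."""

    @abstractmethod
    def get_extra_configs(self) -> Dict[str, str]: ...

    @abstractmethod
    def validate_input(self, cluster: dict) -> None: ...

    @abstractmethod
    def launch_cluster(self, cluster: dict) -> None: ...

    @abstractmethod
    def terminate_cluster(self, cluster: dict) -> None: ...

    @abstractmethod
    def add_nodes(self, cluster: dict, count: int) -> None: ...

    @abstractmethod
    def remove_nodes(self, cluster: dict, count: int) -> None: ...


class ImageRegistry:
    """In-memory stand-in for the image registry operations on the slide."""

    def __init__(self) -> None:
        self._tags: Dict[str, Set[str]] = {}

    def register_image(self, image_id: str) -> None:
        self._tags.setdefault(image_id, set())

    def add_tags(self, image_id: str, *tags: str) -> None:
        self._tags[image_id].update(tags)

    def remove_tags(self, image_id: str, *tags: str) -> None:
        self._tags[image_id].difference_update(tags)

    def get_image_by_tag(self, tag: str) -> Optional[str]:
        # Return the first matching image in sorted order, or None.
        for image_id in sorted(self._tags):
            if tag in self._tags[image_id]:
                return image_id
        return None


registry = ImageRegistry()
registry.register_image("image-1")
registry.add_tags("image-1", "hadoop", "ubuntu")
```

Keeping the plugin abstract lets each Hadoop distribution supply its own provisioning logic while the registry stays distribution-agnostic.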
Provisioning Interaction User -> Savanna: get extra parameters for the plugin, launch cluster, add/remove nodes Savanna -> Plugin: get extra parameters, validate cluster parameters, launch cluster, add/remove nodes Plugin: launch cluster, add/remove nodes
Provisioning: Launching a Cluster Plugin: get image by tag (Image Registry), launch VMs, install and configure Hadoop Instance Interop Helper: launch VMs, pass commands via ssh, scp to the Hadoop VMs
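The launch sequence on this slide can be sketched as a single function that takes its three collaborators as callables. This is only an outline of the flow under stated assumptions: launch_cluster and its parameters are hypothetical, and the lambdas below are stubs standing in for the image registry, VM launch, and ssh/scp, not real OpenStack calls.

```python
from typing import Callable, List, Optional, Tuple


def launch_cluster(
    tag: str,
    node_count: int,
    get_image_by_tag: Callable[[str], Optional[str]],
    launch_vms: Callable[[str, int], List[str]],
    run_remote: Callable[[str, str], None],
) -> List[str]:
    """Follow the slide's order: find image, launch VMs, configure Hadoop."""
    image_id = get_image_by_tag(tag)
    if image_id is None:
        raise LookupError(f"no image registered with tag {tag!r}")
    vms = launch_vms(image_id, node_count)
    for vm in vms:
        # In the slide this step is carried out over ssh/scp.
        run_remote(vm, "install and configure Hadoop")
    return vms


# Stub collaborators that only record what would have happened.
log: List[Tuple[str, str]] = []
cluster_vms = launch_cluster(
    "hadoop",
    2,
    get_image_by_tag=lambda tag: "image-1" if tag == "hadoop" else None,
    launch_vms=lambda image, n: [f"{image}-vm-{i}" for i in range(n)],
    run_remote=lambda vm, cmd: log.append((vm, cmd)),
)
```

Passing the collaborators in as callables keeps the flow testable without any running cloud.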