CONDOR CLUSTERS ON EC2
Val Hendrix, Roberto A. Vitillo
Lawrence Berkeley National Lab
ATLAS Cloud Computing R & D
INTRODUCTION
This talk presents our initial work investigating tools for managing clusters and contextualizing cluster nodes. Today we will present:
- A tool for creating virtual appliances
- Our efforts creating Condor clusters on both StarCluster and CloudCRV
- Scalr cloud management software
BOXGRINDER
- BoxGrinder creates and deploys Scientific Linux EC2 AMIs from simple plain-text configuration files
- Packages, repositories, virtual hardware and OS can be easily configured
- Supports virtualization platforms such as Xen, KVM and VMware
- To find out more, visit http://boxgrinder.org/video/
Example appliance definition:
    name: LBNL_SLC6
    summary: LBNL SLC6 basic appliance
    version: 1
    os:
      name: sl
      version: 6
    hardware:
      partitions:
        "/":
          size: 10
    appliances:
    packages:
BoxGrinder is an open-source project produced by Project:odd, which is staffed by Red Hat: http://projectodd.org/
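An appliance definition like the one above is turned into an image with the boxgrinder-build command. A minimal sketch, assuming BoxGrinder is installed and AWS credentials are configured; the file name is ours, not from the slides:

```shell
# Build the appliance locally first, then build for the EC2 platform
# and deliver it as a registered AMI.
boxgrinder-build lbnl_slc6.appl                 # local build (default platform)
boxgrinder-build lbnl_slc6.appl -p ec2 -d ami   # EC2 platform, delivered as an AMI
```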
STARCLUSTER Overview
- Open-source cluster-computing toolkit for Amazon's EC2
- Simple to configure (e.g. EBS volumes, security options)
- Manages multiple clusters (list, start, stop, reboot, terminate, login, add/remove node)
- A flexible plugin architecture provides an easy way to extend the platform
- Elastic load balancing is supported by shrinking and expanding a cluster based on an SGE queue
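The cluster-management operations above map onto StarCluster subcommands. A sketch, assuming a cluster template named "mycluster" is already defined in ~/.starcluster/config:

```shell
# Typical StarCluster lifecycle for a cluster template named "mycluster".
starcluster start mycluster       # launch the cluster on EC2
starcluster listclusters          # list running clusters and their state
starcluster sshmaster mycluster   # log in to the master node
starcluster addnode mycluster     # grow the cluster by one node
starcluster terminate mycluster   # shut the cluster down
```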
STARCLUSTER Architecture
- StarCluster acts as a wrapper for the EC2 tools
- The state of the clusters is maintained by parsing the instance names and their running state
- No dedicated support is needed on the AMIs
STARCLUSTER Condor Cluster
- StarCluster Ubuntu EC2 AMI
- Out-of-the-box configuration
- Condor plugin
STARCLUSTER Condor Cluster
- Tried Amazon Linux 64-bit at first, but StarCluster expects to ssh as the root user and Amazon Linux disallows login as root
- Switched to Amazon's Red Hat 6.2 64-bit:
  - Rolled our own EBS-backed AMI from Red Hat 6.2 with Condor
  - Extended the out-of-the-box StarCluster configuration with a single change to NODE_IMAGE_ID
- Using Amazon's Red Hat 6.2 AMI, the cluster started up and was able to run Condor jobs
- One start-up error: the portmap service did not start, but it could be started manually after logging into the master node
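The single configuration change mentioned above can be sketched as follows; the cluster template name and AMI id are placeholders, not the real values:

```shell
# Point a StarCluster template at our EBS-backed Red Hat 6.2 AMI.
# "ami-xxxxxxxx" stands in for the actual image id.
cat >> ~/.starcluster/config <<'EOF'
[cluster condorcluster]
KEYNAME = mykey
CLUSTER_SIZE = 4
NODE_IMAGE_ID = ami-xxxxxxxx
EOF

starcluster start condorcluster
```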
CLOUDCRV Architecture
- Built-in web server serving the user interface
- Asynchronous event controller to manage the lifecycle of clusters, roles and VMs
- A thin client runs on each of the VMs for communication and control purposes
CLOUDCRV Server
Created a CloudCRV server EC2 AMI that has the following cluster appliances:
- Tier 3 analysis cluster
- Local XRootD cluster
CLOUDCRV Condor Cluster
Tier 3 analysis cluster (* green means it's running!)
Cluster nodes were CentOS 5.7 EBS-backed AMIs with 30 GB EBS-mounted volumes. In the future, we will use a variant of RHEL 6.
- Started the CloudCRV server
- Entered Amazon information (credentials, security groups, machine details...)
- Started the Tier 3 analysis cluster with the following services: NFS, LDAP, Proxy, Condor, CVMFS
- Once the cluster was up and running:
  - Created an LDAP user for running analysis jobs
  - Submitted a simple Condor job
  - Successfully probed CVMFS, ran ROOT, and performed dq2-get
  - Ran D3PD analysis code on clusters of varying sizes (see next slide)
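The "simple Condor job" step above can be sketched with a minimal submit description file; the executable and file names here are placeholders, not the actual test job:

```shell
# Write a minimal Condor submit file and queue one job.
cat > hello.sub <<'EOF'
universe   = vanilla
executable = /bin/hostname
output     = hello.out
error      = hello.err
log        = hello.log
queue
EOF

condor_submit hello.sub
condor_q   # watch the job until it completes
```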
CLOUDCRV Condor Cluster
ATLAS t-tbar analysis tests
Used the same ATLAS t-tbar analysis code and dataset as Sergey Panitkin did in his PROOF tests. See details: https://twiki.cern.ch/twiki/bin/viewauth/atlasprotected/physicsanalysisworkbookrel16d3pdanalysisexample
Test configuration:
- D3PD dataset prefetched to an EBS volume on each worker node
- Analysis code run in non-PROOF mode from the EBS-backed, NFS-mounted user home
- Cluster nodes were c1.xlarge instance types consisting of 8 virtual cores of 2.5 EC2 compute units each
Test: saturated the virtual cores with t-tbar analysis jobs
- 8 t-tbar analysis jobs per node
- Split the dataset up amongst the 8 jobs
CLOUDCRV Condor Cluster
ATLAS t-tbar analysis test results:

  Worker Nodes | Virtual Cores | Avg. Event Rate per Condor Job (events/s) | Aggregate Event Rate (events/s)
             1 |             8 |                                     77.06 |                          616.48
             2 |            16 |                                     77.41 |                         1238.50
             4 |            32 |                                     87.23 |                         2791.37
             6 |            48 |                                     85.15 |                         4087.31
             8 |            64 |                                     82.65 |                         5289.71
            10 |            80 |                                     92.21 |                         7376.43
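The two rate columns are related: with one analysis job per virtual core, the per-job average is (up to rounding) the aggregate event rate divided by the core count. For the 10-node run:

```shell
# Per-job average event rate = aggregate rate / number of virtual cores.
# 10 worker nodes x 8 cores = 80 cores, aggregate 7376.43 events/s.
awk 'BEGIN { printf "%.2f\n", 7376.43 / 80 }'   # prints 92.21
```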
CLOUDCRV Local XRootD Cluster
The goals of this exercise were to:
- Confirm that multiple clusters could be defined in CloudCRV
- Determine how easy it is to implement additional clusters
(* green means it's running!)
Cluster nodes were CentOS 5.7 EBS-backed AMIs with 30 GB EBS-mounted volumes. In the future, we will use a variant of RHEL 6.
The steps accomplished:
- Created puppet modules for the XRootD redirector and data server and added them to CloudCRV
- Defined an XRootD cluster appliance in CloudCRV
- Started the cluster and confirmed it was functioning properly by copying files to and from XRootD using the xrdcp command
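The xrdcp check described above looks roughly like this; the redirector hostname and paths are hypothetical stand-ins for our cluster's actual values:

```shell
# Copy a file into the XRootD cluster via the redirector, then read it back.
xrdcp /tmp/testfile.root root://redirector.example.com//store/testfile.root
xrdcp root://redirector.example.com//store/testfile.root /tmp/testfile_copy.root
```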
SCALR
Open-source cloud management software.
- Scalr is a commercial product that allows you to launch, auto-scale and configure servers on your public or private cloud: http://scalr.net
- Their cloud management software has been released under an open-source license, and there is an active Scalr open-source community: http://groups.google.com/forum/#!forum/scalr-discuss
Some features it boasts:
- Integration with Chef, a configuration management tool by Opscode
- Customization: scaling algorithms, pre-configured images, scripting interface, API control
- Management: DNS, image snapshotting, load visualization, ssh
- Failover: backups, fault tolerance, multi-cloud deployments
We have done some initial investigation of their commercial deployment:
- Bundled our own EC2 machine images as Scalr roles
- Used their scripting interface to apply the puppet manifests we used in CloudCRV
We plan to test a Scalr open-source deployment on our own Amazon EC2 instance.
COMPARISON

CloudCRV
- Pros: doesn't require any external support tools; allows managing the clusters through a web interface; automatic load balancing can be easily implemented
- Cons: software is in alpha stage; needs to be maintained and enhanced; requires a server and thin clients

StarCluster
- Pros: no support required on the AMIs; mature project with a rich community; easy to extend through its flexible plugin system
- Cons: cluster is not aware of itself; monitoring and automatic load balancing can be more difficult to implement

Scalr
- Pros: multi-cloud deployments are possible; commercial product that is actively developed and released under an open-source license; customization allowed through its scripting engine, role importer and API control; provides scaling algorithms for its server farm roles
- Cons: the open-source release may be difficult to deploy and maintain
SUMMARY
We have shown our initial work investigating tools for managing clusters and contextualizing cluster nodes:
- Introduced BoxGrinder, a tool that can be used for creating virtual appliances
- Started a Condor cluster using our own EBS-backed Red Hat 6.2 AMI
- Using CloudCRV, configured and successfully started both Tier 3 analysis and XRootD clusters
- Presented test results from running t-tbar analysis code on EC2, using CloudCRV for cluster management and node contextualization
- Introduced Scalr, a commercially developed cloud management tool that is available as open source
We will continue our investigation and produce a final report.
REFERENCES
- BoxGrinder: http://boxgrinder.org/
- StarCluster: http://web.mit.edu/star/cluster/
- Scalr: http://scalr.net/
- CloudCRV and Virtual Cluster Appliance, Yushu Yao, LBNL: https://indico.cern.ch/getfile.py/access?contribid=26&sessionid=5&resid=0&materialid=poster&confid=92498
- Running XrootD Cluster on Amazon EC2, Sergey Panitkin, BNL: https://indico.cern.ch/getfile.py/access?contribid=1&resid=4&materialid=slides&confid=177552