iPlant Atmosphere: A Gateway to Cloud Infrastructure for the Plant Sciences

Edwin Skidmore, Sriramu Singaram, Seung-jin Kim, Nirav Merchant
University of Arizona
Sangeeta Kuchimanchi, Dan Stanzione
Texas Advanced Computing Center

ABSTRACT
The cloud platform complements traditional compute and storage infrastructures by introducing capabilities for efficiently provisioning resources in a self-service, on-demand manner. The new provisioning model promises to accelerate scientific discovery by improving access to customizable and task-specific computing resources. This paradigm is especially well suited to applications tailored to leverage cloud-style infrastructure capabilities. Adoption of the cloud model has been challenging for many domain scientists and scientific software developers because of the technical expertise required to use this infrastructure effectively. Some of the key limitations of cloud infrastructure are limited integration with institutional authentication and authorization frameworks, a lack of frameworks to enable domain-specific configurations for instances, and limited integration with scientific data repositories alongside existing computational clusters and grid deployments. Designed specifically to address some of these operational barriers to adoption by the plant sciences community, the cloud platform, aptly named Atmosphere, is an open-source, robust, configurable gateway that extends established cloud infrastructure to meet the diverse computing needs of the plant sciences. Atmosphere manages the virtual machine (VM) lifecycle while maximizing the utilization of cloud resources for scientific workflows. Atmosphere thus allows researchers developing novel analytical tools to deploy them with ease while abstracting the underlying computing infrastructure, and at the same time makes it relatively easy for users to access these tools via a web browser.
Atmosphere also provides a rich, extensible application programming interface (API) for integration and automation with other services. Since its launch, Atmosphere has seen wide adoption by the plant sciences community for a broad array of applications, ranging from image processing to next-generation sequencing (NGS) analysis, and it can serve as a template for providing similar capabilities to other domains.

Categories and Subject Descriptors
C.2.4 [Distributed Systems]: Distributed Applications - cloud computing and storage.

General Terms
Management, Experimentation, Performance

Keywords
cloud, cloud computing, cloud storage, virtualization, plant sciences, cyberinfrastructure

1. INTRODUCTION
The iPlant Collaborative was established with a very broad mandate from the National Science Foundation (NSF) to support the cyberinfrastructure (CI) needs of researchers addressing the grand challenges of plant biology. Plant biology itself is a very diverse field. The rapid expansion of the omics era (genomics, proteomics, metabolomics, transcriptomics, etc.) is transforming what has traditionally been a bench- and field-based research discipline into a computationally intense, data-driven area of science. The research problems of plant biology range from those requiring intense computation (genome assembly, genome-wide association studies) to those requiring little computation but intensive data integration (genome and functional annotations). The data sets range in scale and type from molecular phenotypes to satellite maps of species range, with most scales falling somewhere in between.
In most areas of inquiry around plants, the algorithms and workflows are still rapidly evolving, as are the types of questions being explored. The entire cyberinfrastructure foundation must therefore support the needs of a diverse group of plant science researchers, ranging from biologists to computer scientists. The cyberinfrastructure comprises high-performance computing (HPC) clusters and data storage resources to support the large-scale data and computational needs of plant science research. While the HPC cluster and storage resources are suitable for many applications, not all scientific computations require supercomputer-scale resources. Some computational workflows require a dedicated server to provide their own interfaces, which may be web-based or native operating system graphical user interfaces (GUIs) with associated local databases or compute environments. Furthermore, algorithm development and data analysis in the plant biology domain require customized software development environments with significant software version dependencies, together with simultaneous access to large-scale compute and storage infrastructures.

Many existing plant science computational tools, many of which are highly serial or data parallel, are a natural fit for cloud-based infrastructure. Traditionally, such tools have been designed for single-threaded, small-scale computations that are easily executed on a single workstation. Large-scale compute environments typically discourage the execution of these types of tools, as they do not make efficient use of cluster resources. Additionally, large shared clusters are not always amenable to customized execution environments for domain-specific tasks,
such as configuring a wide range of system library versions to support legacy versions of applications or the latest bleeding-edge versions of bioinformatics software. The deployment of algorithms and tools thus becomes a challenge in shared computing environments.

Atmosphere addresses these issues by providing preconfigured, domain-specific virtual instances of common bioinformatics tools that are integrated with other parts of iPlant's cyberinfrastructure. In this manner, Atmosphere users are able to quickly develop algorithms and deploy workflows, reducing the extensive time, resources, and overhead needed to set up analyses. Users can access the subset of data required by a specific analysis from iPlant's large-scale storage by staging the data into their virtual machine (VM). In addition, users are able to preserve the state of their VM instances, saving not only a specific workflow and analysis but an entire system state, which introduces new opportunities for algorithm development and experimental reproducibility.

A fundamental practice within biosciences research, particularly in the plant sciences community, is the sharing of data sets, analysis tools, and computational workflows. Tool developers have their own community of collaborators and often seek convenient access to computational resources in order to give that community early and reliable access to their analytical tools. This iterative step is often a tedious bottleneck for biologists, as provisioning the data and resources needed to validate the tools is a time-consuming process. While many cloud solutions focus on the Infrastructure-as-a-Service (IAAS) or Platform-as-a-Service (PAAS) model, Atmosphere reinforces the Software-as-a-Service (SAAS) approach by allowing users to collaborate, share their images, and conveniently launch applications using one-click buttons.
SAAS allows tool developers to give their collaborators secure, web-based, single-click access to precisely configured versions of an application and its dependencies, so that collaborators can provide quick assessment and feedback before the tool is published for public consumption. This capability also allows biologists to adjust the underlying computational infrastructure, modifying items such as the number of cores and the amount of RAM depending on the size of the data set being analyzed, without having to encumber the developer for additional resources.

In addition to the various cloud service models that Atmosphere supports, the iPlant cloud services platform exposes its functionality through HTTP-based application programming interfaces (APIs) provided by its middleware. The APIs enable the functionality available through the web frontend as well as mechanisms to customize notifications, resource management, and metadata management. The purpose of the Atmosphere APIs is to encourage deeper integration with other applications and services, making Atmosphere a first-class citizen of the institutional infrastructure.

2. RELATED WORK
Before the inception of Atmosphere in 2010, there were few mature open-source cloud-centric middleware or portal projects focused on the biological sciences community. Below are some of the projects that iPlant evaluated before developing a targeted cloud infrastructure for the plant science community.

Many IAAS clouds utilize virtualization software, such as Xen, VMware, or Kernel-based Virtual Machine (KVM). Virtualization software typically exists at a low level of most infrastructures and requires proficient systems administrators to provision resources for end users. The nature of virtualization and the provisioning of virtual resources often demand an intimate knowledge of the underlying physical resources and the low-level access controls. Additionally, the user interfaces for virtualization software are typically command-line tools or, at best, desktop applications requiring direct access to the systems and hardware.

Many academic and open-source cloud projects began in an attempt to fulfill an unmet need for dynamic, user-centric provisioning of virtual resources. Some began as research projects and have continued as open-source projects, such as OpenNebula. Other projects have attempted to model themselves after successful private industry services. One such project is Eucalyptus Cloud, which provides web services that are API-compatible with Amazon Elastic Compute Cloud (EC2) and a very basic web interface for managing VMs and storage resources. Other projects aim to be more of a toolkit or API for building services, such as Nimbus and OpenStack. Most of these projects cater to service providers who wish to build a cloud rather than delivering the cloud directly to end users. A major distinction between Atmosphere and these other projects is that Atmosphere attempts to close the usability gap between cloud providers and cloud users, particularly biologists and plant science researchers.

3. ARCHITECTURE
Conceptually, Atmosphere can be separated into three logical layers accompanied by a set of toolkits that reside within the virtual machines (VMs) (see Figure 1). The three layers of Atmosphere are the cloud engine, the middleware, and the web frontend. The toolkits facilitate the configuration of the virtual machines, communication with the middleware tier, and interfacing with other parts of iPlant's cyberinfrastructure.

Figure 1. High-level illustration of Atmosphere's components.
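
As a rough illustration of the three-layer separation described in the architecture section, the following Python sketch models a launch request passing from a web frontend through a middleware tier to a cloud engine. It is a minimal sketch under stated assumptions, not Atmosphere's implementation: the class and method names (CloudEngine, Middleware, WebFrontend, run_instance, launch, click_launch), the instance-ID format, and the credential map are all hypothetical and chosen only to show how each layer talks solely to the layer beneath it.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Iterator


@dataclass
class CloudEngine:
    """Hypothetical stand-in for the lowest layer, which actually runs VMs."""
    instances: Dict[str, dict] = field(default_factory=dict)
    _ids: Iterator[int] = field(default_factory=lambda: count(1))

    def run_instance(self, image_id: str, owner: str) -> str:
        # Allocate a new instance ID and record the VM as running.
        instance_id = f"i-{next(self._ids):04d}"
        self.instances[instance_id] = {
            "image": image_id, "owner": owner, "state": "running",
        }
        return instance_id

    def terminate_instance(self, instance_id: str) -> None:
        self.instances[instance_id]["state"] = "terminated"


class Middleware:
    """Hypothetical middle layer: maps portal users to cloud accounts."""

    def __init__(self, engine: CloudEngine, credential_map: Dict[str, str]):
        self.engine = engine
        # Assumed mapping: portal username -> cloud engine account.
        self.credential_map = credential_map

    def launch(self, username: str, image_id: str) -> str:
        if username not in self.credential_map:
            raise PermissionError(f"unknown user: {username}")
        return self.engine.run_instance(image_id, self.credential_map[username])

    def terminate(self, username: str, instance_id: str) -> None:
        instance = self.engine.instances.get(instance_id)
        # Only the owning cloud account may terminate an instance.
        if instance is None or instance["owner"] != self.credential_map.get(username):
            raise PermissionError("instance not found in this user's namespace")
        self.engine.terminate_instance(instance_id)


class WebFrontend:
    """Hypothetical top layer: turns a button click into a middleware call."""

    def __init__(self, middleware: Middleware):
        self.middleware = middleware

    def click_launch(self, username: str, image_id: str) -> str:
        instance_id = self.middleware.launch(username, image_id)
        return f"Instance {instance_id} launched from image {image_id}"


if __name__ == "__main__":
    engine = CloudEngine()
    frontend = WebFrontend(Middleware(engine, {"alice": "cloud-alice"}))
    print(frontend.click_launch("alice", "example-image"))
```

In this sketch the frontend never touches the cloud engine directly, so the engine could in principle be swapped without changing the frontend code.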
atmo_init: atmo_init executes as part of the boot-time process and configures a VM before it is made available to a user. Users never need to access this tool directly.

condor tools: iPlant provides access to a small compute grid based on Condor. The condor tools are essentially the Condor executables necessary to submit jobs to iPlant's grid for some types of computations.

configuration management: Traditional configuration management enables systems administrators to control and automate the configuration of system software and services. Atmosphere uses Puppet, a configuration management tool, to dynamically configure systems on virtual machines. Future versions of Atmosphere will enable users to dynamically configure their own VMs.

image_request: This command-line tool collects the necessary metadata from a VM, which is ultimately displayed within Atmosphere's graphical catalogue.

iPlant Data Store utilities: iPlant utilizes the iRODS data grid for large-scale storage. Users have access to both command-line and GUI-based clients to manage data within iPlant's Data Store.

4. SECURITY AND AUTHENTICATION
A function of the Atmosphere middleware is to mediate between the cloud engine's authentication and iPlant's central authentication services. An authentication service, called CloudAuth, provides a pluggable framework for integrating with different authentication mechanisms via modules. Currently, CloudAuth supports an internal database or LDAP. Planned authentication modules include CAS and Shibboleth.

In a typical authentication use case, users authenticate to the web frontend. Secure sockets are used for every layer of network communication. When a user authenticates using the web frontend, the user's iPlant credentials are mapped to the corresponding credentials provided by the cloud engine. The cloud credentials allow the user to provision resources within their own namespace, which lets Atmosphere leverage the cloud engine's mechanism for resource isolation.

Atmosphere web services APIs employ a simplified version of a token-based authentication system. After authentication, external services use a token to call methods on behalf of a user. Tokens have a finite lifetime, configurable by the cloud administrator.

Users authenticate to their VMs primarily using their iPlant credentials. SSH is automatically configured to allow access by the specific user and by cloud administrators. Secure VNC access is enabled using a RealVNC Enterprise Server embedded on the VM.

5. DATA
Plant science is a data-intensive science. To address their extensive data needs within the cloud, Atmosphere users have access to two types of storage. The first type is provisioned through the underlying cloud engine and is exclusive to a user's virtual machines. The cloud engine's storage is recognized by the VM as a native block device and is managed with typical system utilities for devices, such as fdisk, parted, and mount.

The second type of storage is the iPlant Data Store. Users can pull, push, or synchronize data in parallel using iRODS command-line utilities or graphical clients. Another method of managing data is to use a Filesystem in Userspace (FUSE) client, which translates filesystem calls into iRODS API calls. One important advantage of using the iPlant Data Store is that users can readily access the data across the entirety of iPlant's cyberinfrastructure, including the HPC resources and the Discovery Environment.

6. CLOUD UTILIZATION
When Atmosphere was launched as a preview to the public in January 2011, access was limited. Shortly after the initial launch, iPlant opened Atmosphere access to researchers. Current users represent 16 countries and 87 institutions. Within the United States, where 89.6% of the total users reside, 30 states are represented (see Chart 1).

Chart 1. This chart illustrates the cumulative growth in the number of users since Atmosphere's public launch in January 2011.

Atmosphere has been utilized in three workshops and one graduate-level bioinformatics course. Atmosphere is also in active use by five research laboratories, which share image data using the iPlant Data Store and analyze it using custom command-line and GUI-based applications developed in MATLAB. These tools and their dependencies were deployed as compiled binaries in a bundled VM, and this custom VM is made available through Atmosphere for the community to use with their own data sets (see Figure 4).

7. USE CASES
iPlant's cyberinfrastructure provides multiple entry points to its storage and compute resources, and Atmosphere fills specific needs unmet by the other infrastructure services. The iPlant Discovery Environment provides a structured way for users to integrate tools and perform analyses via a web portal. Web service APIs, through the Foundational and Semantic APIs, programmatically expose this functionality so that developers can integrate it with their existing science portals and tools. Direct access to complex, large-scale compute and storage is generally available to plant scientists through XSEDE providers, such as the Texas Advanced Computing Center (TACC). Given these modes
of access, there was a need for highly configurable environments for algorithm exploration, tool development, small- to medium-sized analyses, and analyses that might not be traditionally suited for HPC environments. The following are typical use cases that Atmosphere has served in the months since its public release:

Algorithm and Tool Development: Many of Atmosphere's users do not have convenient access to a UNIX environment in which to design their own tools, whether via shell access or a graphical desktop. The cloud, with its self-service, on-demand model, is an obvious fit for these users. Typically, the period of algorithm and tool development between releases is finite, after which the environment may be saved or terminated until the next stage of development begins. In some cases, algorithm and tool developers publish their virtual machine images for other community users. One example is the Phytomorph VM, developed by Nathan Miller from the University of Wisconsin, which provides machine vision tools to correlate seed morphology with seed development.

On-Demand, Standalone Analysis: As mentioned previously, some users utilize virtual machines configured by tool developers as part of their data analysis pipeline. In other cases, users have developed their own analysis pipeline and want to deploy it virtually. A common problem mentioned by some users is that their analysis pipeline exists on a desktop or laptop that lacks the reliability or performance they need to scale their analysis. In many cases, users choose Atmosphere as a facility to share their analysis pipeline with other lab members or collaborators. In these cases, the ability to provide wholly contained, reproducible computational environments is one of the most attractive features of Atmosphere.

On-Demand, Integrated Analysis: Integrated analysis refers to analysis that may be partially performed using other parts of iPlant's infrastructure. For example, some users may have part of their analysis performed using the Discovery Environment and wish to further process the data using preinstalled tools on a virtual machine in Atmosphere. In other cases, Atmosphere VMs have been used to prepare an analysis that is later targeted at HPC resources. For integrated analyses, the iPlant Data Store is used for sharing data across iPlant's various services.

Education, Workshops, and Training: Another large category of users utilizes Atmosphere for workshops and training events. Workshop organizers often need preconfigured VMs, loaded with sample data, for their participants. The iPlant staff works closely with workshop organizers to structure their environments and sample data.

Figure 4. A remote VNC connection to an Atmosphere instance showing the image analysis toolkit and iPlant Data Store windows.

8. CONCLUSION AND FUTURE WORK
The goal for Atmosphere is to provide ease of access to highly customizable computational infrastructure, functioning as a gateway that integrates and augments cloud resources with capabilities such as HPC and data grids. Atmosphere also serves as a platform that allows computational tool designers and developers to collaborate and rapidly deploy their analysis pipelines for broad use by the community. Atmosphere plays a key role in providing customized computational resources for pre- and post-analysis. For example, Atmosphere offers a custom VM containing eight popular genome visualizers, which are used to visualize a large genome assembly; the compute-intensive tasks were performed on an HPC system, where such resource-intensive GUI applications are not typically installed.
Atmosphere's ability to provide an easy web interface for preserving data, tools, and workflows with minimal effort and skill overcomes the limitations and technological barriers that prevent adoption of the cloud.

Deploying a cloud for the plant sciences has not always been smooth. One of the most salient challenges faced when the project began was selecting the best cloud technology at a time when cloud technologies were still emerging. Selecting a dominant cloud technology was more comparable to hedging a bet than to making a thorough competitive analysis. To move forward with the project, we selected the best technologies available at the time, with a design philosophy that accounted for the fact that the underlying technologies would change rapidly and most likely be replaced in the future.

Current and ongoing development of modules and features includes:

OpenStack and public cloud integration, replacing the use of the euca2ools Python library with a more generic, flexible library, such as Apache's Libcloud.
Utilization-based scheduling of resources, including backfilling instances during low-utilization periods.
Multi-cloud support: the scheduling and provisioning of resources across multiple, geographically dispersed clouds.
Tighter integration with the iPlant Discovery Environment, Grid, and HPC resources.
Automated, user-initiated VM image bundling.
Expansion of support for mobile devices. Currently, an Android application is available to view, launch, and terminate instances.
Authentication with common academic institutional authentication standards, such as Jasig Central Authentication Service (CAS) or Shibboleth, and integration with InCommon.
More intelligent metadata management and search capabilities, including integration with semantic approaches to metadata search.
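
To make the utilization-based scheduling item above more concrete, the following Python sketch shows one simple way that backfilling during low-utilization periods could work. The paper does not describe a scheduling algorithm, so everything here is an assumption made for illustration: the Request type, the select_backfill function, the 50% and 80% thresholds, and the greedy first-fit policy are hypothetical and are not Atmosphere's design.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Request:
    """A queued, low-priority instance request (hypothetical type)."""
    name: str
    cores: int


def select_backfill(
    total_cores: int,
    used_cores: int,
    queue: Sequence[Request],
    low_pct: int = 50,   # assumed: backfill only below this utilization
    high_pct: int = 80,  # assumed: never fill the cloud beyond this level
) -> List[Request]:
    """Pick queued requests to start now, using greedy first-fit.

    Backfilling is attempted only while utilization is below low_pct, and
    the selected requests never push utilization above high_pct, leaving
    headroom for interactive, on-demand launches.
    """
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    # Integer arithmetic avoids floating-point rounding in the comparisons.
    if used_cores * 100 >= total_cores * low_pct:
        return []  # the cloud is busy enough; do not backfill
    budget = total_cores * high_pct // 100 - used_cores
    started: List[Request] = []
    for request in queue:  # queue order is treated as priority order
        if request.cores <= budget:
            started.append(request)
            budget -= request.cores
    return started


if __name__ == "__main__":
    pending = [
        Request("assembly", 40),
        Request("blast", 20),
        Request("viewer", 10),
        Request("tiny", 5),
    ]
    for chosen in select_backfill(total_cores=100, used_cores=30, queue=pending):
        print(chosen.name, chosen.cores)
```

First-fit is only one possible policy; a real scheduler would also need to decide how to reclaim backfilled instances when on-demand load returns, which this sketch does not attempt.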