
MAGELLAN: Exploring Cloud Computing for DOE's Scientific Mission

Cloud computing is gaining traction in the commercial world, with companies like Amazon, Google, and Yahoo offering pay-to-play cycles to help organizations meet cyclical demands for extra computing power. But can such an approach also meet the computing and data storage demands of the nation's scientific community?

A new $32 million program funded by the American Recovery and Reinvestment Act through the U.S. Department of Energy (DOE) will examine cloud computing as a cost-effective and energy-efficient computing paradigm for mid-range science users to accelerate discoveries in a variety of disciplines, including analysis of scientific datasets in biology, climate change, and physics.

DOE is a world leader in providing high-performance computing resources for science. The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL) supports the high-end computing needs of over 3,000 DOE Office of Science researchers, while the Leadership Computing Facilities at Argonne and Oak Ridge National Laboratories serve the largest-scale computing projects across the broader science community through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. The focus of these facilities is on providing access to some of the world's most powerful supercomputing systems, which are specifically designed for high-end scientific computing.

Interestingly, some of the science demands for DOE computing resources do not require the scale of these well-balanced petascale machines. A great deal of computational science today is conducted on personal laptops or desktop computers, or on small private computing clusters set up by individual researchers or small collaborations at their home institutions. Local clusters have also been ideal for researchers who co-design complex problem-solving software infrastructures for the platforms in addition to running their simulations. Users with computational needs that fall between desktop and petascale systems are often referred to as mid-range, and they are the target users for the Magellan cloud project.

In the past, mid-range users were enticed to set up their own purpose-built clusters for developing codes, running custom software, or solving computationally inexpensive problems because hardware has been relatively cheap. However, the costs of ownership, including ever-rising energy bills, space constraints for hardware, ongoing software maintenance, security, operations, and a variety of other expenses, are forcing mid-range researchers and their funders to look for more cost-efficient alternatives. Some experts suspect that cloud computing may be a viable solution.

Cloud computing refers to a flexible model for on-demand access to a shared pool of configurable computing resources (such as networks, servers, storage, applications, services, and software) that can be easily provisioned as needed. Cloud computing centralizes the resources to gain efficiency of scale and permits scientists to scale up to solve larger science problems, while still allowing the system software to be configured as needed for individual application requirements.
To test cloud computing for scientific capability, NERSC and the Argonne Leadership Computing Facility (ALCF) will install similar mid-range computing hardware but will offer different computing environments (figure 1). The systems will create a cloud test bed that scientists can use for their computations while also testing the effectiveness of cloud computing for their particular research problems.

Figure 1. Cloud control. The Magellan management and network control racks at NERSC. To test cloud computing for scientific capability, NERSC and the Argonne Leadership Computing Facility (ALCF) installed purpose-built test beds for running scientific applications on the IBM iDataPlex cluster. (Photo: R. Kaltschmidt, LBNL)

Since the project is exploratory, it has been named Magellan in honor of the Portuguese explorer who led the first effort to sail around the globe. It is also named for the Magellanic Clouds, the two closest galaxies to our Milky Way, which are visible from the Southern Hemisphere.

What is Cloud Computing?

In the report Above the Clouds: A Berkeley View of Cloud Computing (see Further Reading), a team of luminaries from the Electrical Engineering and Computer Sciences Department at the University of California, Berkeley noted that cloud computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the datacenters that provide those services. The services themselves have long been referred to as software as a service (SaaS). The datacenter hardware and software is referred to as a cloud.

When a cloud is made available in a pay-as-you-go manner to the general public, it is a public cloud; the service being sold is utility computing. Current examples of public utility computing include Amazon Web Services (AWS), Google AppEngine, and Microsoft Azure. As a successful example, Elastic Compute Cloud (EC2) from AWS sells 1.0 GHz x86 ISA slices, or instances, for $0.10 per hour, and a new instance can be added in two to five minutes. An instance is the allocated memory and collection of processes running on the server. Meanwhile, Amazon's Simple Storage Service (S3) charges $0.12 to $0.15 per gigabyte-month, with additional bandwidth charges of $0.10 to $0.15 per gigabyte to move data into and out of AWS over the Internet.
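These published rates make back-of-the-envelope cost estimates straightforward. The short Python sketch below works through one such estimate at the prices quoted above; the workload itself (instance count, hours, and data volumes) is a hypothetical example for illustration, not a Magellan or DOE figure.

```python
# Back-of-the-envelope cloud cost estimate using the 2009-era prices
# quoted in the article. All workload numbers below are hypothetical.

EC2_PRICE_PER_INSTANCE_HOUR = 0.10   # USD, small EC2 instance
S3_PRICE_PER_GB_MONTH       = 0.15   # USD, upper end of quoted range
TRANSFER_PRICE_PER_GB       = 0.10   # USD, lower end of quoted range

# Hypothetical mid-range workload (assumptions, not Magellan figures):
instances      = 64          # concurrent instances
hours_per_run  = 12          # wall-clock hours per production run
runs_per_month = 20
stored_gb      = 2_000       # data kept in S3 for the month
moved_gb       = 500         # data moved in and out of AWS per month

compute_cost  = instances * hours_per_run * runs_per_month * EC2_PRICE_PER_INSTANCE_HOUR
storage_cost  = stored_gb * S3_PRICE_PER_GB_MONTH
transfer_cost = moved_gb * TRANSFER_PRICE_PER_GB

print(f"Compute : ${compute_cost:,.2f}/month")
print(f"Storage : ${storage_cost:,.2f}/month")
print(f"Transfer: ${transfer_cost:,.2f}/month")
print(f"Total   : ${compute_cost + storage_cost + transfer_cost:,.2f}/month")
```

At these rates the example workload comes to roughly $1,900 per month, the kind of number a mid-range group would weigh against the ownership costs described earlier.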

The advantages of SaaS to both end users and service providers are well understood. Service providers enjoy greatly simplified software installation and maintenance and centralized control over versioning; end users can access the service anytime and anywhere, share data and collaborate more easily, and keep their data stored safely in the infrastructure. Cloud computing does not change these arguments, but it does give more application providers the choice of deploying their product as SaaS without provisioning a datacenter: just as the emergence of semiconductor foundries gave chip companies the opportunity to design and sell chips without owning a fabrication plant, cloud computing allows providers to deploy SaaS and scale on demand without building or provisioning a datacenter.

Mid-Range Users on a Cloud

Realizing that not all research applications require petascale computing power, the Magellan project will explore several areas:

- Understanding which science applications and user communities are best suited for cloud computing (sidebar "Metagenomics on a Cloud?")
- Understanding the deployment and support issues required to build large science clouds. Is it cost-effective and practical to operate science clouds? How could commercial clouds be leveraged?
- How does existing cloud software meet the needs of science, and could extending or enhancing current cloud software improve its utility?
- How well does cloud computing support data-intensive scientific applications?
- What are the challenges to addressing security in a virtualized cloud environment?

Magellan Hardware

This purpose-built test bed for running scientific applications will be built on the IBM iDataPlex chassis and based on InfiniBand technology. The system will offer high density with front-access cabling and will be liquid-cooled using rear-door heat exchangers (figure 2). Total computing performance across both sites will be on the order of 100 teraflop/s.

The NERSC portion of the system will include:
- 61.5 teraflop/s peak performance
- 720 compute nodes (5,760 cores) with Intel Nehalem quad-core processors
- 21.1 TB DDR3 memory
- QDR InfiniBand fabric

Meanwhile, the Argonne system will have:
- 43 teraflop/s peak performance
- 504 compute nodes (4,032 cores) with Intel Nehalem quad-core processors
- 12 TB memory
- QDR InfiniBand fabric

Figure 2. Staying cool. By building the Magellan test bed at NERSC on IBM's iDataPlex chassis, the facility can take advantage of the machine's innovative half-depth design and liquid-cooled door, reduce cooling costs by as much as half, and reduce floor space requirements by 30%. The orange tubes in the picture will carry coolant to chill the system. (Photo: R. Kaltschmidt, LBNL)

By installing the Magellan systems (sidebar "Magellan Hardware") at two of DOE's leading computing centers, the project will leverage staff experience and expertise as users put the cloud systems through their paces. The Magellan test bed will comprise cluster hardware built on IBM's iDataPlex chassis and based on Intel's Nehalem CPUs and a QDR InfiniBand interconnect (figure 3). Total computing performance across both sites will be on the order of 100 teraflop/s. Researchers at ALCF and NERSC will look into the Eucalyptus toolkit, an open-source package that is compatible with Amazon Web Services, as a potential tool for allocating Linux virtual machine images. In addition, the teams researching Magellan's suitability will also investigate the performance of Apache's Hadoop and Google's MapReduce, two related software frameworks that deal with large distributed datasets. Currently, one of the challenges in building a private cloud is the lack of software standards. Although these frameworks are not widely supported at traditional supercomputing facilities, large distributed datasets are a common feature of many scientific codes and a natural fit for cloud computing. The team will also be experimenting with other commercial cloud offerings such as those from Amazon, Google, and Microsoft.
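To make the MapReduce programming model concrete, the sketch below shows the shape of job that frameworks such as Hadoop are built to run: a mapper and a reducer that count word occurrences across a large, distributed input. It is a generic Hadoop Streaming example in Python rather than anything specific to Magellan, and the jar path and input/output locations in the comment are placeholders.

```python
#!/usr/bin/env python
# wordcount.py -- minimal Hadoop Streaming job (illustrative sketch only).
# The same script acts as the mapper or the reducer depending on its argument.
#
# Hypothetical invocation (jar path and HDFS locations are placeholders):
#   hadoop jar /path/to/hadoop-streaming.jar \
#       -files wordcount.py \
#       -mapper  "python wordcount.py map" \
#       -reducer "python wordcount.py reduce" \
#       -input  /data/input -output /data/output
import sys


def mapper():
    # Emit "word<TAB>1" for every word on every line of standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    # Hadoop sorts mapper output by key, so identical words arrive together;
    # sum the counts for each run of identical keys.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Because the mappers run independently and data is exchanged only during the framework's sort-and-shuffle phase, jobs of this shape tolerate commodity cloud networking far better than tightly coupled MPI simulations do.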

Figure 3. Magellan systems at both NERSC and the ALCF will be built using QDR InfiniBand fabric like the one pictured here. (Photo: R. Kaltschmidt, LBNL)

Metagenomics on a Cloud?

One goal of the Magellan project is to understand which science applications and user communities are best suited for cloud computing, but some DOE researchers have already given public clouds a whirl. For example, Jared Wilkening, a software developer at Argonne National Laboratory, recently tested the feasibility of employing Amazon EC2 to run a BLAST-based metagenomics application.

Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples. By identifying and understanding bacterial species based on sequence similarity, some researchers hope to put microbial communities to work mitigating global warming and cleaning up toxic waste sites, among other tasks. BLAST is the community standard for sequence comparison: it enables researchers to compare a query sequence with a library or database of sequences and identify library sequences that resemble the query sequence above a certain threshold.

Wilkening notes that BLAST-based codes, like the one he ran on Amazon EC2, are perfect for cloud computing because they require little internal synchronization and therefore do not rely on high-performance interconnects. Nevertheless, the study concluded that Amazon is significantly more expensive than locally owned clusters, due mainly to EC2's inferior CPU hardware and the premium cost associated with on-demand access, although increased demand for compute-intensive workloads could change that.

Wilkening's paper was published at Cluster 2009, and slides are available at: http://www.cluster2009.org/9.pdf

Figure 4. Metagenomics is the study of genetic material recovered directly from environmental samples. (Image: J. Wilkening, ANL)
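What makes BLAST-style searches such a good match for on-demand cloud instances is that each batch of query sequences can be searched independently against a read-only database, with no communication between workers. The sketch below illustrates that pattern in Python; it is a hypothetical example, not Wilkening's code, and it assumes the NCBI BLAST+ blastn binary is installed and that the queries.fasta file and env_db database names exist (both are made up for illustration).

```python
# Embarrassingly parallel BLAST fan-out (illustrative sketch, not the ANL code).
# Assumes NCBI BLAST+ is installed ("blastn" on PATH) and that "env_db" is a
# pre-formatted nucleotide database; queries.fasta is a hypothetical file of
# metagenomic reads.
import subprocess
from multiprocessing import Pool

QUERY_FILE = "queries.fasta"   # hypothetical input
DATABASE   = "env_db"          # hypothetical BLAST database
CHUNK_SIZE = 1000              # sequences per independent work unit


def split_fasta(path, chunk_size):
    """Yield lists of FASTA records, chunk_size sequences at a time."""
    chunk, record = [], []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">") and record:
                chunk.append("".join(record))
                record = []
                if len(chunk) == chunk_size:
                    yield chunk
                    chunk = []
            record.append(line)
    if record:
        chunk.append("".join(record))
    if chunk:
        yield chunk


def run_blast(args):
    """Run one independent BLAST search; no coordination with other workers."""
    index, records = args
    chunk_path = f"chunk_{index}.fasta"
    with open(chunk_path, "w") as fh:
        fh.writelines(records)
    out_path = f"chunk_{index}.blast.tsv"
    subprocess.run(
        ["blastn", "-query", chunk_path, "-db", DATABASE,
         "-outfmt", "6", "-out", out_path],
        check=True,
    )
    return out_path


if __name__ == "__main__":
    chunks = enumerate(split_fasta(QUERY_FILE, CHUNK_SIZE))
    with Pool() as pool:                       # one worker per local core;
        results = pool.map(run_blast, chunks)  # on a cloud, one per instance
    print("wrote:", results)
```

Swapping the local process pool for a set of cloud instances changes where the chunks run but not the structure of the computation, which is why the lack of a fast interconnect matters so little for this class of workload.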

By making Magellan available to a wide range of DOE science users, the researchers will be able to analyze the suitability of a cloud model across the broad spectrum of the DOE science workload. They will also use performance-monitoring software to analyze what kinds of science applications are being run on the system and how well they perform on a cloud. The science users will play a key role in this evaluation, as they bring a very broad scientific workload into the equation and will help the researchers learn which features are important to the scientific community.

Data Storage and Networking

To address the challenge of analyzing the massive amounts of data produced by scientific instruments, ranging from powerful telescopes photographing the Universe to gene sequencers unraveling the genetic code of life, the Magellan test bed will also provide a storage cloud with a little over a petabyte of capacity. The NERSC Global Filesystem (NGF) will meet most of the storage needs of projects running on the NERSC portion of the Magellan system; approximately 1 PB of storage and 25 gigabits per second (Gbps) of bandwidth have been added to support the test bed. Archival storage needs will be satisfied by NERSC's High Performance Storage System (HPSS) archive, whose capacity is being increased by 15 PB. Meanwhile, the Magellan system at ALCF will have 250 TB of local disk storage on the compute nodes and an additional 25 TB of global disk storage on the GPFS system.

NERSC will make the Magellan storage available to science communities through a set of servers and software called Science Gateways, and will also experiment with flash memory technology to provide fast random-access storage for some of the more data-intensive problems. Approximately 10 TB will be deployed in NGF for a high-bandwidth, low-latency storage class and for metadata acceleration. Around 16 TB will be deployed as local SSD in one SU for data analytics, local read-only data, and local temporary storage. Approximately 2 TB will be deployed in HPSS. The ALCF will provide active storage, using Hadoop over PVFS, on approximately 100 compute/storage nodes. This active storage will add approximately 30 TF of compute power to the ALCF Magellan system, along with approximately 500 TB of local disk storage and 10 TB of local SSD.

The NERSC and ALCF facilities will be linked by a groundbreaking 100 Gbps network, developed by DOE's Energy Sciences Network (ESnet) with funding from the American Recovery and Reinvestment Act. Such high bandwidth will facilitate rapid transfer of data between geographically dispersed clouds and enable scientists to use available computing resources regardless of location.

The Magellan program will run for two years, and the initial clusters will be installed in the next few months. At NERSC, installation was slated to begin in November 2009, with early users getting access in December; the NERSC system (figures 5 and 6) was slated to go into production use in mid-January 2010. At ALCF, installation was planned to begin in January 2010, with early users gaining access in February and the system opening up for full access in March.

Figure 5. Main system console for Magellan at NERSC. (Photo: R. Kaltschmidt, LBNL)

Figure 6. Networking. When completed, the Magellan system at NERSC will be interconnected using QDR InfiniBand, 10 Gbps Ethernet, multiple 1 Gbps Ethernet, and 8 Gbps Fibre Channel SAN. (Photo: R. Kaltschmidt, LBNL)

Contributors
Horst Simon, Kathy Yelick, Jeff Broughton, Brent Draney, Jon Bashor, David Paul, and Linda Vu from NERSC at LBNL; Pete Beckman, Susan Coghlan, and Eleanor Taylor from ALCF at Argonne National Laboratory.

Further Reading
Above the Clouds: A Berkeley View of Cloud Computing
http://www.eecs.berkeley.edu/pubs/techrpts/2009/EECS-2009-28.pdf