Introduction to Arvados

A Curoverse White Paper
Contents

Arvados in a Nutshell
Why Teams Choose Arvados
The Technical Architecture
System Capabilities
Commitment to Open Source

Copyright 2014-15 Curoverse, Inc. All rights reserved. First published April 2014. We believe the information in this paper is accurate as of the publication date; however, it is subject to change without notice. The Curoverse 1.0 release described in this paper is still under development, with a planned release in 2015. We make no representations or warranties of any kind with respect to the content. Curoverse, the Curoverse logo, Arvados, and the Arvados logo are trademarks of Curoverse. All other trademarks used herein are the property of their respective owners.
Research IT leaders face significant challenges posed by the need for systems that can scale for new genomic and biomedical datasets. Along with this data tsunami, an array of changes is transforming data center architectures and service delivery models. At the same time, the expectations of researchers are shifting. They want more self-service, better computational reproducibility, and faster performance, and they still don't care about the little things, such as massive data duplication, unpredictable scaling, and the limitations of your budget. We built Arvados to address these challenges.*

* Yes, that's a bold claim. In reality, we won't solve all your problems, but we can put a big dent in them; read on to see how.
Arvados in a Nutshell

Curoverse solutions are all based on Arvados (http://arvados.org). Arvados is an open source platform for managing, processing, and sharing genomic and biomedical data. The system provides capabilities that bioinformaticians and computational biologists use to manage and analyze their data. We call it a platform because you can run pipelines and applications on top of it.

Arvados is built for big data. We designed it for sequencing data such as genomes, tumor/normal pairs, and microbiomes. People also use it for imaging, sensor, and other data. We call these data big because the files can be large (for example, 100+ GB), there can be a lot of them (for example, billions of mass spectrometer files), and the total amount of data ranges from tens of terabytes to petabytes.

You can download and run Arvados yourself, but in production it works best in a modern, hyperconverged, elastic computing environment. We provide Arvados as a SaaS solution on public cloud providers, and we support deploying Arvados clusters on-premise in your data center.

Why Teams Choose Arvados

Today, Arvados is used by bioinformaticians, computational biologists, and developers ("informaticians" for short). In the future, biologists, geneticists, pathologists, and ultimately clinicians will make discoveries and deliver precision medicine with applications that run on Arvados.

For Informaticians

Informaticians use Arvados to track, organize, and manage their data. They use it both as a workspace to develop new analyses and as a platform for scaled, distributed computational analysis with custom and common pipelines. Arvados offers informaticians four major benefits:

1. Streamlined Work With the functionality in Arvados, informaticians work more productively, accomplish more, and gain easy access to powerful computational capacity.
2. Efficient Computation Many computational pipelines combine tools that each use widely different compute resources and runtime environments. Arvados handles these by provisioning the right resource for each tool.

3. Reproducible Analyses Arvados provides a breakthrough in computational reproducibility. The combination of data and job management capabilities makes it radically easier to track, record, and reproduce complex pipelines on large datasets.

4. Efficient Collaboration An array of features in Arvados helps informaticians and researchers collaborate with each other, sharing pipelines, data, and results in ways that are fast, reliable, and secure.

For IT

Arvados isn't just for informaticians. We've built the service to help IT managers solve their problems. Arvados offers IT leaders four major benefits:

1. Addresses Researcher Needs With Arvados, IT can give researchers a flexible, scalable self-service platform that meets their requirements.

2. Delivers Operational Excellence With clear visibility into how the system is being used and strong support from the Arvados administration team, you can identify issues before they become problems and better manage user expectations.

3. Uses Infrastructure Cost-efficiently With Arvados, you can lower your total cost of ownership (TCO). Elastic computing, user-level compute management, and usage tracking let you manage compute costs. A data management system helps you manage storage costs by automatically eliminating duplication and by helping you identify datasets that can be deleted. Finally, fine-grained usage tracking makes it easy to handle your budgeting and billing.

4. Avoids Vendor Lock-in Arvados is built entirely with open source software, primarily the Arvados platform itself. That means you always have the option to stop using our service and deploy the same software yourself.
The Technical Architecture

Arvados uses a multilayer integrated stack of technologies built with proven services and open source software. All the layers work together as a complete solution, as described below.

(Figure: Curoverse platform architecture)

1. Cloud Infrastructure The Cloud Infrastructure layer includes raw storage, compute management, network services, and low-level security.

2. System Services The System Services layer includes innovative approaches to data and job management that make it simple to manage massive datasets and implement large-scale, easily reproducible distributed computations.

3. API All the services in the system are accessed through a RESTful API. The API can be used directly from any language or from a command line interface. In addition, we provide SDKs for Python, Perl, Ruby, Java, and Go.

4. System Interfaces At the Interface layer, Arvados provides a number of different ways for users and administrators to access the capabilities of the system.

5. Security Security is woven throughout the system. At the Cloud Infrastructure layer, security is implemented with a mix of physical and network security controls, as well as encryption. At the Access Control level, a flexible system governs who can access different datasets, pipelines, and other objects in the system. The Authentication layer is implemented with industry-standard federated identity protocols.
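Because the API layer is plain REST, any HTTP client can reach it even without an SDK. A minimal Python sketch follows; the host and token values are invented for illustration, and the endpoint path and bearer-token header are shown as an assumption of the typical shape, not as the definitive interface:

```python
import urllib.request

# Hypothetical values for illustration only; a real Arvados cluster
# provides its own API host and a per-user token.
API_HOST = "arvados.example.com"
API_TOKEN = "example-token"

# Build an authenticated request to list collections. (We construct it
# without sending it, since the host above is fictitious.)
req = urllib.request.Request(
    f"https://{API_HOST}/arvados/v1/collections",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
print(req.full_url)
```

A real deployment would send this request and parse the JSON response; the language SDKs wrap calls like this so users rarely build them by hand.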
System Capabilities

Working together, the layers in the system solve the major problems informaticians face as they organize and analyze data.

Cloud Infrastructure

Curoverse hosts Arvados as a SaaS service on public cloud providers, such as Amazon Web Services (AWS) and Google Cloud Platform (GCP). We also support on-premise clusters that run in your datacenter. On-premise, we deploy Arvados in state-of-the-art hyperconverged clusters that combine storage, compute, RAM, and high-performance networking. Placed in your data center, our clusters are designed to scale easily.

Most informatics is still done with a traditional high-performance computing architecture that combines network-attached storage (NAS) with a storage area network and a compute cluster. While the scientific community has stuck with this architecture, other industries working with very large datasets have transitioned to cloud computing architectures. These industries have significantly reduced their costs and gained powerful new functionality without sacrificing performance.

Arvados is designed to use a hyperconverged elastic computing architecture that leverages low-cost hardware and uses software for fault tolerance. It takes advantage of virtualization and nodes that can scale more seamlessly, allowing the system to move computation closer to data instead of moving data to compute. Arvados clusters can be integrated with existing HPC systems. This provides a new way to increase utilization and improve data management in existing infrastructure, while creating a pathway to a new, more scalable, and lower-cost architecture.

Storage Management with Keep

Keep is a data management system designed to solve the challenges of managing biomedical big data for scientific and clinical analysis.

Content Addressing

Keep identifies datasets using content addresses: globally unique cryptographic hashes generated from the bits in a dataset.
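The idea can be sketched in a few lines of Python. Keep's locators pair a hash of the data with its size; the helper below mimics that shape, with the exact format simplified for illustration:

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive an address from the bytes themselves, in the
    'hash+size' style Keep locators use (format simplified here)."""
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"

block = b"ACGT" * 8
addr = content_address(block)

# The same bytes always produce the same address, so a reader can
# verify a retrieved block by re-hashing it, and identical data
# stored twice resolves to a single address.
assert content_address(block) == addr
print(addr)
```

Because the address is derived from the content rather than assigned by a server, any client can compute it independently, which is what makes validation and de-duplication fall out of the design for free.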
With content addressing, Keep can provide a number of data management services:
6 Benefits of Keep

1. Reliable File Addresses Ensures reliable and durable data retrieval.
2. De-duplication Eliminates duplicate data storage by checking for duplication on write.
3. Origin and Use Tracks the origin of datasets and how they are used across the system.
4. Fast Throughput Manages data distribution within the underlying file system to optimize for distributed computation.
5. Flexible Metadata Enables the application of multiple metadata schemas without file duplication.
6. Portable API Provides a consistent API across cloud providers.

Data Validation

By design, the system ensures that when a dataset is retrieved it is, in fact, the dataset requested. This makes reproducible computations possible without depending on inherently impermanent file names or directory paths.

De-duplication

Content addressing automatically eliminates file duplication. If a user tries to save data that already exist in the system, Keep will not save another copy.

Flexible Organization

The most popular way to organize metadata in traditional POSIX file systems is to use the directory structure (for example, /study1/participantid/). When users want to reorganize data, they duplicate it and change the directory structure. Within Arvados, they can reorganize data and change how it's tagged without ever making duplicate copies. Content addressing is a powerful storage technology that has been used for many years in other fields.

Data Organization with Datasets

Keep gives informaticians the ability to create datasets from multiple files without physically reorganizing those files on disk. Keep defines datasets with a simple manifest that contains a structured list of the content addresses for the files in the dataset. Each manifest is content addressed, providing a cryptographically verifiable canonical reference for the dataset.
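The manifest idea can be sketched as follows. This is simplified from Keep's actual manifest text format, so the layout here is illustrative only: a dataset is just a list of file names and content addresses, and that list is itself content addressed:

```python
import hashlib

def content_address(data: bytes) -> str:
    # hash+size address, as in the content-addressing scheme above
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"

# A dataset is a manifest, not a physical copy: file names mapped to
# the addresses of their contents. Reorganizing files into a new
# dataset means writing a new manifest, never copying the data.
files = {
    "sample1.fastq": b"ACGT" * 100,
    "sample2.fastq": b"TGCA" * 100,
}
manifest = "\n".join(
    f"{name} {content_address(data)}" for name, data in sorted(files.items())
)

# Content-addressing the manifest itself yields a verifiable canonical
# reference for the whole dataset.
dataset_ref = content_address(manifest.encode())
print(dataset_ref)
```

Since the manifest stores only names and addresses, it stays tiny no matter how large the underlying files are, which is why petabyte-scale datasets can be described in megabytes.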
This approach results in both inexpensive descriptions of datasets (for instance, petabytes of data can be described in megabytes) and highly durable representations. At the same time, Keep datasets eliminate a common pattern of unnecessary file duplication as informaticians attempt to reorganize files into new datasets for different computations.

Distributed Storage

Keep uses a well-established pattern for storing large datasets and large files, first developed for the Google File System. Large files are chunked into 64 MiB blocks, and small files are packed into 64 MiB blocks. These blocks are then replicated across multiple disks on multiple nodes. As a result, Keep has a high degree of fault tolerance to disk and node failures. Also, the system makes it possible to move distributed computations near the data for faster throughput.

High Throughput

Keep is optimized for throughput on file access using several strategies for disk and network management. Keep does not maintain a name node or specialized database to reference file locations. Clients can find files algorithmically across
nodes through their content addresses, which increases system reliability by eliminating another potential point of failure.

Provenance (Origin) and Usage Tracking

Working with Crunch (see below), the platform tracks metrics on dataset creation and usage. (For example, it may track who created the dataset, how much compute it took, how long it took, how often the dataset is used, and whether it can be reliably reproduced.) These metrics help informaticians manage temporary data and redundant datasets more easily and efficiently.

Integration with Existing Storage

Keep can be tightly integrated with existing storage systems using a variety of approaches. For example, you could use an Isilon NAS as primary storage. In this scenario, Keep would index the data on the NAS, load it when it needs to be processed, and then write the output files back to the NAS.

Compute Management with Crunch

Crunch is a distributed job manager designed to ensure computational reproducibility. Crunch makes it easy for informaticians to create, schedule, provision, and track distributed computing jobs on large datasets.

10 Benefits of Crunch

1. Reproducibility Reliably reproduce complex analyses.
2. Origin and Use Tracking Record the origin and use of every dataset.
3. Fault Tolerance Automatically recover from disk and node failures.
4. Portability Easily move computations between clouds.
5. Sharing Quickly and reliably share pipeline templates between users.
6. Self-service Run jobs without assistance in cluster management.
7. Scaling Easily scale jobs to run in parallel on multiple nodes.
8. Status Reporting Access job status reports during and after job execution.
9. Optimized Re-running Save time and money by skipping jobs that don't need to be re-run.
10. Pipeline Comparisons Quickly compare multiple pipeline runs.

Creation and Invocation

Users can run almost any algorithm written in any language as a Crunch job; it's particularly well suited to distributed jobs that can run in parallel.
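A Crunch job request might look like the following JSON sketch. The field names here are invented for illustration and are not the exact Arvados schema; the point is the shape: a script pinned to an exact version, explicit inputs, and optional node constraints.

```python
import json

# Illustrative job request; field names are hypothetical, not the
# exact Arvados API schema.
job = {
    "script": "align-reads",
    "script_version": "1f3c9a2",  # pin the exact code for reproducibility
    "parameters": {
        "input_dataset": "d41d8cd98f00b204e9800998ecf8427e+0",
        "reference": "GRCh38",
    },
    # Optional worker node configuration.
    "runtime_constraints": {"min_ram_mb": 8192, "min_cores": 4},
}
print(json.dumps(job, indent=2))
```

Because the code version and inputs are recorded explicitly, the same request can be replayed later to reproduce a run, and a request identical to one already completed can be recognized and skipped.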
A user invokes a job by simply specifying the desired script version, inputs and parameters, and optionally the worker node configuration. Crunch handles everything else. Pipelines (computational workflows) can be written for Crunch using a Python script or a JSON document. We plan to add support for the Common Workflow Language in 2015.

Scheduling

Crunch schedules jobs, deciding which jobs to run and when to run them based on the rules established for prioritization.

Provisioning

Crunch sets up and configures nodes, attaches storage, and ensures the runtime environment is properly configured. Jobs are run inside Linux containers using Docker. This design provides a reliable transition from testing to scaled deployment, and enables reproduction of the complete runtime environment. More importantly, Crunch efficiently manages complex heterogeneous
pipelines where each job requires different computing resources. Many pipelines use tools that require different types of nodes and runtime environments. For example, one job may be a single-threaded Perl app with specialized libraries, the next a multi-threaded Java tool that needs more RAM, and the third a distributed process that can run in parallel on multiple nodes. For each job, Crunch dynamically provisions the correct computing resources and ensures they are properly configured at runtime.

Supervising

As jobs run, Crunch supervises their operation. It reports status, identifies problems, and automatically restarts jobs when nodes fail. In real time, users can watch the provenance graph that shows the results as each job in a pipeline is completed.

System Interfaces

Users access Arvados through several interfaces.

Virtual Private Servers (VPS)

In a typical configuration, we give each user their own virtual machine, or virtual private server, on the Arvados cluster with an operating system, informatics tools, popular pipelines, and a command line interface to the API. Users have root access in their VPS, and several configurations are available to accommodate different use cases.

Workbench

Users and administrators can use the browser-based tools in Workbench to manage data, initiate and track jobs, visualize pipeline provenance, administer security, see operating data, and complete other tasks associated with using the service.

Data Transfer Interfaces

Users can transfer data into and out of their Arvados account through several different mechanisms.

Virtual Private Clouds
- SFTP for routine, manual, or automated data ingestion
- Import/Export (sending drives directly) for large batch data ingestion

On-premise Private Clouds
- SFTP for routine, manual, or automated data ingestion
- Data ingestion from or export to NFS-mounted volumes
Third-party Applications

Arvados is a platform, and it supports deploying third-party applications that use the API to provide a wide range of functionality to users.

Security

We've woven security and compliance capabilities throughout the Arvados platform. Arvados can operate in compliance with HIPAA, SOC 2, and FISMA.

Infrastructure Security

At the infrastructure layer, we've taken several steps to create a secure environment.

Virtual Private Cloud

We leverage the security levels that AWS and GCP have achieved, including significant physical and network security capabilities (see the summary of AWS compliance or the summary of GCP compliance). In Team Accounts, we provide a single-instance VPC that isolates data and network access. User and administrator access uses a least-privileges model. Data are encrypted in transit and at rest.

On-premise Private Cloud

On-premise clouds live on your network, usually behind a firewall or in a DMZ. We can work with your team to ensure compliance with your HIPAA standards and help enforce physical and digital access controls. Arvados local clouds can be integrated with your SSO; they also leverage the Secure Shell cryptographic network protocol (SSH) for VPS access (see below).

Authentication

User authentication for Workbench and the data transfer interfaces uses OAuth 2.0. If you use another SSO standard (such as SAML or LDAP), we can work with you to support it. VPS access is authenticated with SSH, and users can manage their public keys through Workbench.

Access Control

There are several access control mechanisms, including API keys. At the data management level, users and administrators can control access to specific datasets. For example, Researcher 1 could share a 20,000-file dataset with Researcher 2. Then Researcher 1 could create a second dataset that includes 10,000 files from the first dataset and 5,000 new files from a different dataset, and
share those with Researcher 3 without duplicating data on disk, managing file-level permissions on thousands of files, or moving data between directories.

HIPAA-specific Controls

Curoverse plans to sign BAAs for HIPAA compliance. Users who want to store and use data that are covered by HIPAA will be required to go through further manual verification.

Commitment to Open Source

Arvados is open source. The platform was first developed at Harvard Medical School to handle the challenges of large-scale distributed computing with genomic and other biomedical data. The project is managed through an open source community, and we plan to form a nonprofit foundation to oversee the project. The core system is licensed under the AGPL v3 license. All of the SDKs and client libraries are licensed under the Apache 2 open source license, so you can confidently deploy proprietary applications on the platform.

Arvados is deployed in an integrated solution that also leverages a wide range of other open source software systems, such as Debian, Xen, and Docker. Curoverse on-premise clusters use standard hardware components and don't rely on custom ASICs or esoteric components that are not readily available. As a result, you never face the challenge of vendor lock-in.

Curoverse, Inc.
212 Elm St., 3rd Floor
Somerville, MA 02144
617-500-6568
http://curoverse.com
https://arvados.org