Compute Canada Technology Briefing, November 12, 2015
Introduction

Compute Canada, in partnership with regional organizations ACENET, Calcul Québec, Compute Ontario and WestGrid, leads the acceleration of research innovation by deploying state-of-the-art advanced research computing (ARC) systems, storage and software solutions. Together we provide essential digital research services and infrastructure for Canadian researchers and their collaborators in all academic and industrial sectors. Our world-class team of more than 200 experts, employed by 35 partner universities and research institutions across the country, provides direct support to research teams and industrial partners.

Advanced research computing accelerates research and discovery and helps solve today's grand scientific challenges. Using Compute Canada resources, research teams and their international partners work with industry giants in the automotive, ICT, life sciences, aerospace and manufacturing sectors to drive innovation and new products to market. Canadian researchers leverage their access to expert support and infrastructure to participate in international initiatives. Researchers using advanced research computing rate significantly higher in citations than the average from Canada's top research universities and any international discipline average.
Key Facts:

The investment of $75 million in funding from the Canada Foundation for Innovation (CFI) and provincial partners will address urgent and pressing needs and replace aging high performance computing systems across Canada.

Compute Canada and its regional partners have more than 18 years of experience in accelerating results from industrial partnerships in advanced research computing and Canada's major science investments.

Compute Canada currently manages more than 20 petabytes of storage and 2 petaflops of computing resources, and supports all of Canada's major science investments and programs.

With the implementation of this technology deployment plan, Compute Canada will manage more than 60 petabytes of storage and 13.4 petaflops of computing resources.

What this means for Canada's Research Community

These improvements will allow Compute Canada to continue to support the wide array of excellent Canadian research identified in the proposal.

The purchase of significantly more storage, deployed as part of an enhanced national storage infrastructure, will accelerate data-intensive research in Canada.

The ability to purchase a single Large Parallel machine of over 65,000 cores will provide Canada's largest compute-intensive users with a new resource that far exceeds any machine in the Compute Canada fleet today.

This investment is more than an opportunity to increase the size of storage systems and the raw number of cores. The new systems replace old technology with new technology and will be deployed with national services, coherent policies and a new operational model for the organization. This enhanced service level will allow more researchers to exploit the four planned new systems efficiently and effectively.
Overview

Compute Canada is the national resource provider for advanced research computing and big data, delivering a full range of systems and services to researchers. Funding from the Canada Foundation for Innovation (CFI), together with matching funds from provincial partners and in-kind contributions from vendors, will enable the significant technology refresh program described below.

This technology briefing document is intended to be circulated to Compute Canada stakeholders and suppliers. It provides status and planning for the technology refresh program resulting from CFI's cyberinfrastructure initiative, which will be implemented from 2015 through early [...]. It also anticipates planning for future growth.

The total value of this capital program is $75M, to be spent mainly in 2016 and 2017. This reflects a $30M capital grant from CFI, a further $30M from provincial and institutional sources, and $15M of vendor in-kind contributions [1]. By the end of 2017, many legacy systems will have been replaced by new computational systems and storage, totalling over 123,000 CPU cores and 60 PB of storage.

[1] See:
New Systems at Four National Sites

Through a formal competition among Compute Canada member institutions, four sites were selected to host the new systems and associated services. They are the University of Victoria, Simon Fraser University (SFU), the University of Waterloo, and the University of Toronto.

New Computational Systems

Planning for the new systems at the four sites has been responsive to user demand, site affinity and experience, and shifts in timing and funding. Envisioned system characteristics follow.

University of Victoria: The GP1 system will be an OpenStack cloud, with emphasis on hosting virtual machines and other cloud workloads. At least 3,000 CPU cores [2] are anticipated by early 2016, with a 40% expansion planned in [...].

Simon Fraser University: The GP2 system will mainly focus on a mix of batch-oriented parallel and serial workloads, with several different node types. It will also have a relatively small OpenStack partition that will federate with GP1 and GP3. Node types will include some large memory nodes, as well as approximately [...] GPU nodes. At least 18,000 CPU cores are anticipated for mid-2016, with a 40% expansion planned in [...].

University of Waterloo: The GP3 system will have a similar design to GP2, and it is anticipated that GP2 and GP3 together will provide features for workload portability and resiliency. Plans for GP3 include at least 19,000 CPU cores in late 2016, with approximately 64 GPU nodes. A 40% expansion is planned in [...].

University of Toronto: The LP system will be deployed by approximately mid-2017, and is anticipated to have at least 66,000 CPU cores. This will be a balanced, tightly coupled high performance computing resource, designed mainly for large parallel workloads.

National Storage Architecture

A new national storage architecture spanning the four sites will offer important benefits to users. Compute Canada will use generic storage building blocks, which apply software-defined storage techniques to deploy capacity and performance in a flexible, easily expandable, highly interoperable, and cost-effective manner. In addition to file systems for file-based storage, there will be object storage services. Object storage services will provide ease of use and built-in features including resiliency, georeplication, enhanced metadata, and combinations of public access and data isolation.

Approximately 20 petabytes (PB) of persistent storage is planned to be deployed across the four sites in early and mid-2016, with expansion to over 60PB by early [...]. An offline/nearline tier of over 20PB will provide lower-cost capacity for backups and hierarchical storage management. High performance parallel filesystems for GP2, GP3 and LP will also be deployed.

[2] CPU core count equivalents are based on Intel Haswell computational capabilities.
[3] All future plans for nodes, CPUs and other specifications are intended as conservative estimates.
Other Compute Canada member sites will be able to benefit from the national storage architecture, including those sites operating legacy resources. For example, users may need to migrate data to the new systems, or they might have use cases that will benefit from object storage or from the larger capacity and higher performance the newer systems will offer.

Delivery Timeline

The Challenge 2 Stage 1+2 technology refresh will span 2 years of staged deployment. By the end of calendar year 2017, essentially all Challenge 2 Stage 1+2 funds will have been expended. The total supply at that time is forecast to be at least 126,500 CPU cores (Haswell equivalents) and 62 petabytes of usable persistent storage. The storage figure does not include near-line or backup storage, nor high-speed parallel scratch space.

Challenge 2 Stage 1+2 Technology Planning. Compute is in Haswell-equivalent cores; storage is in usable petabytes. Timeframe is in calendar year quarters (i.e., Q1 2016 is January-March 2016), and is approximate. The core and storage targets are estimates only.

During the same two-year period, much of Compute Canada's existing equipment will be defunded and removed from the allocations process. Users will be moved to one of the new systems, and needed data will be migrated. Planning in 2014 for the site selection process identified 26 systems with 82,000 CPU cores from older generations (nearly 1PF total) for retirement by early calendar year [...]. A schedule for the remaining systems will be developed in conjunction with planning for further technology expansion, with some of the remaining systems likely to be removed from the allocations process in [...]. Much of the 15PB of allocatable storage available in 2015 will also be defunded and removed from the allocations process during the period.
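As a rough cross-check of the supply figures, the sketch below tallies the "at least" core counts quoted for the four new systems and applies the planned 40% expansion to the three GP systems. It is an illustration only: the function and constant names are hypothetical, the inputs are the lower-bound figures quoted in this briefing, and actual deployments may differ.

```python
# Lower-bound core counts quoted for each new system (Haswell-equivalent cores).
# These are the "at least" figures from the site descriptions, not final specs.
GP_INITIAL_CORES = {"GP1": 3_000, "GP2": 18_000, "GP3": 19_000}
LP_CORES = 66_000
GP_EXPANSION_PERCENT = 40   # planned expansion, applied to the GP systems only
FORECAST_TOTAL = 126_500    # "at least" total forecast for the end of 2017
RETIRING_CORES = 82_000     # older-generation cores identified for retirement


def expanded_gp_cores(initial, expansion_percent):
    """Return per-system core counts after the planned expansion.

    Integer arithmetic keeps the result exact for whole-percent expansions.
    """
    return {
        name: cores * (100 + expansion_percent) // 100
        for name, cores in initial.items()
    }


def minimum_total_cores():
    """Sum the expanded GP systems and the LP system, using quoted minimums."""
    gp_total = sum(expanded_gp_cores(GP_INITIAL_CORES, GP_EXPANSION_PERCENT).values())
    return gp_total + LP_CORES


if __name__ == "__main__":
    total = minimum_total_cores()
    print(f"Minimum total after expansion: {total:,} cores")
    print(f"Shortfall against the forecast: {FORECAST_TOTAL - total:,} cores")
    print(f"New-to-retiring core ratio: {total / RETIRING_CORES:.2f}")
```

With these inputs the tally comes to 122,000 cores, somewhat below the forecast of at least 126,500. That is what one would expect if the per-site figures are conservative lower bounds, as the briefing describes them.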
Organizational Cooperation and Planning

Planning for site selection and the ensuing technology refresh has included deep coordination among the four national host sites on all aspects of procuring, deploying, configuring, operating and supporting Compute Canada's suite of systems and services. The Compute Canada Technological Leadership Council (TLC) is responsible for developing specifications for the new systems, and will lead the procurement evaluation. The TLC includes representatives from each national site, and also includes the four regional CTOs. It is led by the national CTO.

New national teams, which will draw from Compute Canada member institutions, will run the systems and services, provide user support, and engage in cross-site coordination on major themes such as monitoring, storage, cloud services, and networking. The new systems and services will share practices for security. The teams for all national systems and services will provide defined coverage and response levels.

Procurement Processes

All four sites are working with the Compute Canada team to ensure an open and fair acquisition process. Resources will be purchased and owned by each site. Specifications will be formed, and bids evaluated, by national teams with full engagement by site procurement officers.

Flexibility in Planning

Plans described here will be modified as needed, based on discussions among the four sites, Compute Canada, and the national and provincial funding agencies. Re-scaling of expectations for system size and capabilities, if needed, will be based on experience with vendor pricing and the influence of the Canadian dollar's exchange rate. Anticipated user demand will also be assessed, including demand for new technologies or configurations. This will be done via the SPARC process described below, as well as through discussions with funding agencies and their researchers.
By late 2016, any needed revisions will be considered to planning for the expansion of the three GP systems, and for the scale, configuration and timing of the LP system. Alignment of supply and demand will be re-assessed for computation and storage. Planning will also be responsive to any new information concerning additional funding, the selection of additional hosting sites, shifts to Canada's digital research infrastructure strategy, or other factors.

Funding and Governance

SFU is the lead organization for the CFI capital program and is executing an interinstitutional agreement with the three other hosting institutions and Compute Canada. The Compute Canada membership will be involved in many broader aspects of organizational governance and planning. CFI retains oversight for capital spending for the technology refresh, as well as for operational expenses via the Major Science Initiatives (MSI) program.
Usage and Capabilities

As Canada's national platform for advanced research computing, Compute Canada serves thousands of users in essentially every scientific discipline. Compute Canada is continually engaged in renewal and expansion of its services and its audience. Beyond Canada's academic community, this includes engagement with industry and with international partners. Some of the current and expanded services within Compute Canada are described below.

Workload Portability: Users will find it easier than before to run their jobs on any of the new systems. This will be facilitated by deploying a single HPC batch system, using a common naming scheme for software, modules, and filesystem mount points, and integrating data movement mechanisms with the workload manager. For projects involving a live stream of observational data, or other time-sensitive characteristics, workload portability will help to ensure that jobs run on time, wherever appropriate HPC resources are available.

Cloud Computing: Building on Compute Canada's successful early deployment of cloud systems and services, the GP1, GP2 and GP3 systems will comprise a federated cloud including single sign-on, shared data services, a common cloud scheduler, and other features for resiliency and ease of use. Additional cloud resources within Compute Canada will be able to join the federated cloud simply by using the same authentication and configuration parameters.

Big Data: The storage architecture and cloud services will facilitate big data workloads, including data analytics. Storage will include database capabilities, and cloud services will support virtual machines with user-selected software and features.

National Operations and Support: The national teams will work together to provide a consistent and well-supported environment for computation and data. This will include all aspects of configuration and support. Users will have a single point of contact through the national helpdesk, and will also be able to benefit from the expertise of on-campus support personnel.

Resource Allocations: Compute Canada will continue to allocate compute and storage resources through a fair and open process. Workload portability and the consistency of configuration and support will give users extra flexibility, when desired, in their choice of computing resources.
National Services

Consultations have helped inform planning for systems and services in [...]. Service demands were articulated while consulting with applicants for CFI's Challenge 1 Stage 1, including these middleware services that were identified by multiple applicants:

Identification and Authorization Service: Provide common login across systems.
Software Distribution Service: Version-controlled software distribution to multiple sites.
Data Transfer Service: To move datasets among collaborators and their repositories.
Monitoring Service: Track uptime and availability of services and platforms.
Resource Publishing Service: Current information about available resources.

These services will be deployed beginning in 2016 for all new systems as part of the infrastructure investment. Additional services will be identified, developed, deployed and supported based on demand.

It is Compute Canada's intention to provide a useful and effective set of middleware services, accessible to any user or group. These will provide a high performance and well-supported baseline upon which users or groups may build their own custom applications. Compute Canada views these tools as needed software infrastructure, and is devoting some of the Challenge 2 Stage 1+2 funds to developing that infrastructure.

Compute Canada views many of the new services identified above as essential enabling tools for Research Data Management (RDM). As data volumes grow, so does demand for RDM. Compute Canada will provide a common set of middleware services for users with this need. RDM will continue to mature during the period, and will include cooperation with other digital research infrastructure providers in Canada.

Future Consultations on this Plan

In early 2016, Compute Canada will embark on a second round of SPARC consultations. SPARC2 will help to identify current and future needs, as well as to parameterize growth in user demand. As with the previous SPARC, scientists and engineers from across Canada will be invited to submit descriptions of their research goals, and of the advanced research computing capabilities and capacities required to achieve those goals.
Projections for Future Supply and Demand

Technology Impact of Challenge 2 Stage 1+2

By the end of calendar year 2017, Compute Canada will have delivered essentially all of the new computational and storage capacity facilitated by CFI's Challenge 2 Stage 1+2 award. The $75M capital investment will replace most legacy systems and associated storage.

Modernization and capacity resulting from Challenge 2 Stage 1+2. Primary disk does not include offline or nearline storage, including storage for backups. It does include a variety of disk or disk-like technologies, including object storage, block storage, storage replicas, and storage for filesystems.
During this technology refresh program, CPUs will be replaced with the latest generation, along with more memory. New nodes will be augmented by GPUs and accelerators. A typical node in service in 2015 has dual 6- or 8-core CPUs and 16-32GB of memory. A typical node to be deployed in 2016 will have dual 14- or 16-core CPUs, with 128GB of memory or greater.

Challenge 2 Stage 1+2 is an important and necessary modernization of the DRI provided by Compute Canada. Sustained investment is needed to accommodate the needs of current and future users of Canada's national platform for computation and storage.

Scenarios of Increasing Demand

Several factors affect planning for future demand for advanced research computing:

1. Demand by users who engage in computational modelling, for additional CPU resources:
   a. To increase spatial or temporal resolution;
   b. To add physics or other simulation factors that were previously too slow or computationally expensive to calculate;
   c. To test additional parameters or scenarios;
   d. For projects and users new to Compute Canada, especially in nontraditional fields.
2. Demand for additional storage resources for computational modellers:
   a. Larger input and output datasets, due to larger or more complex models;
   b. The need to keep some datasets beyond the end of a computational campaign, to assist in future modelling or to support publications.
3. Demand for portals and gateways, including from new user populations:
   a. May include needs for highly resilient services and systems;
   b. May include needs for high-end storage subsystems for database operations;
   c. Bring a user base that may be quite large, and may include the general public.
4. Demand for projects emphasizing instruments and observational data gathering and analysis:
   a. May have irreplaceable or highly valuable data, which needs to have multiple copies at multiple locations;
   b. Include Compute Canada's largest storage users, many of whom have new instruments in development;
   c. Require computational resources for post-processing, analysis, portals, visualization and/or reanalysis.
5. Demand by data-focused projects:
   a. May require isolation of data or computation from inappropriate disclosure;
   b. Includes some usage (such as personal health information) with regulatory concerns;
   c. Includes emphasis on data analytic methods that are not yet generally available on Compute Canada resources.
6. Demand from projects being directed by funding agencies to consider utilizing Compute Canada resources:
   a. Include a range of use cases that might not be in Compute Canada's current service catalog, but will be developed;
   b. Some of these projects are very large and demanding;
   c. Projects view Compute Canada as a partner, not just a resource provider.
7. Demand from industry:
   a. May require isolation of data or computation from inappropriate disclosure;
   b. Interested in the expertise of Compute Canada, perhaps more than the computational resources.
8. Demand from government:
   a. Exist within a regulatory environment that might not be part of Compute Canada's current service catalog, but will be developed;
   b. Can involve a long planning timeline.
For Challenge 2 Stage 1+2, planning emphasized modernization of the computational and storage resources. Planning has been sensitive to anticipated demand growth and changing patterns of utilization, informed via Challenge 1 and the SPARC consultations. The annual allocations process by Compute Canada is a major indicator of growth trends, since it aggregates hundreds of existing projects. Data for the 2016 allocation period are now available, and reflect the impact of Challenge 2 Stage 1+2. In 2017, further growth is anticipated, along with retirement of legacy resources.

Through the SPARC process, Compute Canada has identified the expected growth in community demand for storage (15x) and compute (7x) resources through [...]. This data has been converted into a doubling time to project future demand in equivalent core years and terabytes of storage. Demand indicators support using a doubling time of 1.8 years for computational demand, and 1.3 years for storage demand. By forecasting demand based on these doubling times, we can project a trend into the future. For this projection, we present a range where the lower bound represents no growth in the Compute Canada user base, and the upper bound represents ongoing increases in the user base, following historical trends.

Trends project a demand for 1-3 million Haswell-equivalent cores by 2020, and more than an exabyte of persistent storage. These projections may turn out to be underestimates, since some existing disciplines making extensive use of Compute Canada resources today anticipate needing over 1 million cores or 1 exabyte of data just for their own projects by [...]. It is hoped that Compute Canada will be stewards, along with members, regions and provincial partners, of the sustained capital investment that will be required to meet these demands.
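The relationship between a growth factor and a doubling time can be sketched as follows. This is a minimal illustration assuming simple exponential growth; it is not the method Compute Canada used, the function names are hypothetical, and the baseline passed to `project` in the demonstration is a made-up number rather than a Compute Canada figure. The only source values used are the growth factors (7x compute, 15x storage) and the doubling times (1.8 and 1.3 years).

```python
import math


def doubling_time(growth_factor, years):
    """Doubling time in years for growth by `growth_factor` over `years`."""
    return years * math.log(2) / math.log(growth_factor)


def implied_horizon(growth_factor, doubling_years):
    """Years needed to grow by `growth_factor` at a fixed doubling time."""
    return doubling_years * math.log2(growth_factor)


def project(baseline, doubling_years, years_ahead):
    """Demand after `years_ahead` years, assuming a fixed doubling time."""
    return baseline * 2 ** (years_ahead / doubling_years)


if __name__ == "__main__":
    # Horizon implied by the quoted growth factors and doubling times.
    print(f"Compute: 7x at 1.8-year doubling takes {implied_horizon(7, 1.8):.2f} years")
    print(f"Storage: 15x at 1.3-year doubling takes {implied_horizon(15, 1.3):.2f} years")

    # Hypothetical baseline of 100 units, projected 5 years ahead.
    print(f"Compute index after 5 years: {project(100, 1.8, 5):.0f}")
    print(f"Storage index after 5 years: {project(100, 1.3, 5):.0f}")
```

Under this assumption, both pairs of figures imply a horizon of roughly five years, so the quoted growth factors and doubling times are mutually consistent.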
Vision 2020

Compute Canada, as a leading provider of digital research infrastructure (DRI), is taking an integrated approach to data and computational infrastructure in order to benefit all sectors of society. As a result of the technology refresh and modernization supported by CFI's Challenge 2 Stage 1+2, excellent research will benefit from modern and capable resources for computationally-based and data-focused work.

Compute Canada is coordinating with government funding agencies and with other DRI providers to develop a shared vision for providing the world's most advanced, integrated and capable systems, services and support for research. Future researchers will have seamless access to DRI resources, integrated for maximum efficiency and performance, without needing to be concerned with artificial boundaries based on different geographical locations or providers.

By 2020, Compute Canada will offer a comprehensive catalog of resources to support the full data research cycle, allowing researchers and their industrial and international partners to compete at a global scale. In cooperation with Canada's other DRI providers, Compute Canada's systems and services will facilitate workflows that easily span different resources: from the lab or campus, to national computational resources, analytical facilities, publication archives, and collaborators. Local support and engagement will remain a hallmark of delivering excellent service to all users.

The pathway to this future has begun, with the modernization of Compute Canada's advanced research computing cyberinfrastructure through the CFI Challenge 2 Stage 1+2 program.
36 York Mills Road, Suite 505, Toronto, Ontario, Canada M2P 2E9