UNINETT Sigma2 AS: architecture and functionality of the future national data infrastructure




Authors: A O Jaunsen, G S Dahiya, H A Eide, E Midttun
Date: Dec 15, 2015

Summary

Uninett Sigma2 provides High Performance Computing (HPC) and data services to the national research sector. On October 30, 2015 the Sigma2 board decided that the storage infrastructure should be based on a distributed data centre (DDC) model, in which the storage resources are spread across two sites that appear and function as a single data infrastructure. Each of the two HPC systems to be procured during 2016-2018 will also be located in the same facilities as the storage resources. This comes at a cost increase, owing to the operation of two sites and an overhead on resource capacities, but provides the benefit of dual-site reliability.

The user requirements suggest that data must be connected to resources that allow users to process, analyse, share and publish it. The data must be accessible in a simple and consistent manner that accommodates not only traditional compute-intensive users, but also new, data-driven user communities. Research infrastructures and larger laboratory facilities that generate large amounts of data must be able to ingest their data via efficient protocols supported by the data infrastructure. Data must be stored safely, with a redundancy that withstands failure at any level (disk, server, rack, site), while at the same time avoiding unnecessary duplication of data, governed by project-based policy. The requirements for a given research dataset may change during its lifetime, and the infrastructure should enable seamless migration of data between the relevant storage devices based on data management policies and usage. The future national data infrastructure must be a scalable, reliable and flexible facility that can accommodate the vast majority of relevant users and communities.

In the distributed infrastructure, the physical storage resources are located at separate geographical locations (hundreds to a thousand kilometres apart) and thus comprise at least two data centres. The infrastructure relies on replicating data between the two data centres with a redundancy that allows one data centre to be lost or become unavailable without losing access to the data, since it is also stored at the other site. This solution depends on adequate network bandwidth between the data centres. Services are provided at both data centres, and each HPC system is directly connected to the corresponding data store to achieve good integration among all infrastructure components.

A major challenge for any future data infrastructure is how to tackle the growth of data. Data does not only need to be stored safely; it also needs to be connected to resources that enable services such as compute, visualization, analytics and other services. Data accessibility is therefore of key importance. Transporting data is already an increasing challenge, with data growth exceeding the increments in network capacity, and dedicated high-bandwidth network links come at a significant cost. A data-centric architecture is therefore an appealing concept that is well suited to meet the challenges of a future national e-infrastructure for scientific research.

Data Life Cycle

Data is categorised into classes that represent the relevant stages of the data life cycle. The cycle typically starts with the creation of data in laboratories, experiments or computing environments, and encompasses an active phase in which the data is processed and prepared for interpretation and analysis. Once the data has been thoroughly analysed, it is expected to be archived and possibly published.

[Illustration: the typical phases of the scientific process from a data life cycle point of view.]
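To make the life-cycle stages concrete, the following minimal Python sketch models the phases described above as a small state machine. The phase names and the allowed transitions are illustrative assumptions, not part of the Sigma2 specification.

```python
from enum import Enum, auto


class Phase(Enum):
    """Illustrative life-cycle phases, loosely following the text above."""
    CREATED = auto()    # produced in a lab, experiment or compute job
    ACTIVE = auto()     # processed and prepared for interpretation/analysis
    ARCHIVED = auto()   # analysis finished, data kept for reproducibility
    PUBLISHED = auto()  # archived data made citable (e.g. via a DOI)


# Assumed (hypothetical) transitions between phases.
ALLOWED = {
    Phase.CREATED: {Phase.ACTIVE},
    Phase.ACTIVE: {Phase.ACTIVE, Phase.ARCHIVED},
    Phase.ARCHIVED: {Phase.PUBLISHED},
    Phase.PUBLISHED: set(),
}


def advance(current: Phase, target: Phase) -> Phase:
    """Move a dataset to the next phase, rejecting transitions the model disallows."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```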

Data classification

Currently, research data are dispersed over four HPC systems and one data infrastructure in Norway. The goal is to consolidate these data in one infrastructure. On the HPC systems, three different types of data stores are found: i) runtime data produced during computations (/work), ii) users' own data on the system (/home) and iii) project data stored on dedicated disks (/project). The runtime data storage (i) has strict performance and integration requirements and will therefore be procured as part of the HPC system. The remaining data stores (ii+iii) will be hosted on the new data infrastructure, including non-HPC /project data.

The following table describes the various data classes and suggests what type of data may be found in each class.

Class: raw (/project)
Description: Data that is not reproducible and in its pristine state is considered raw. Such data is typically the result of recorded measurements by an instrument or experiment, and typically requires further processing to become meaningful. Raw data is static by definition and may be required in the future to reproduce previous results or to verify or dismiss claimed errors. This class of data is therefore valuable, and best practice often recommends that such data be secured for the future.

Class: hot (/home, /project, /work)
Description: Data that is in active use and accessed frequently is dubbed hot. Hot data is typically accessed and processed, and can serve as input to new calculations on an HPC system or to an analytics service, for instance. It is therefore necessary to keep this data on a storage technology with high read and write performance that can, at the same time, cope with multiple users and processes. The performance is determined by the connectivity between the storage system and the compute resources (such as HPC). The filesystem performance of the storage resource may be improved by using fast storage media (SSDs) for caching data.

Class: cold (/project, /home)
Description: Data that is still relevant, but accessed less frequently. Cold data should be accessible via various protocols (e.g. POSIX, S3, HTTP REST), but can be stored on consumer hardware (e.g. SATA drives).

Class: copy (/project/copy)
Description: Data that serves as a (backup) copy only. This is a subclass of cold data and is only accessed in the event of the (external) data becoming corrupted. Data of this type must be stored with a reference checksum value and can be stored on very cost-effective, high-density disks such as Shingled Magnetic Recording (SMR) disks.

Class: published (not mounted)
Description: Data that is archived and published by issuing a DOI. Such data is typically archived for several years, but with no curation requirements. It is expected that the data is no longer useful after a decade and can, in principle, be deleted after that time.

Class: curated (not mounted)
Description: Published data that has a (documented) need for long-term preservation and a permanent storage requirement must be curated by a data librarian and set up with a preservation plan.
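As an illustration of the reference-checksum requirement for the copy class, the sketch below computes a SHA-256 checksum when a file is placed in a copy area and verifies it on later reads. The paths and the sidecar-file convention are assumptions made for the example, not part of the specification.

```python
import hashlib
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so arbitrarily large objects can be hashed."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_copy(path: Path) -> Path:
    """Store a reference checksum next to the copy (the sidecar convention is illustrative)."""
    sidecar = path.with_suffix(path.suffix + ".sha256")
    sidecar.write_text(sha256sum(path) + "\n")
    return sidecar


def verify_copy(path: Path) -> bool:
    """Recompute the checksum and compare it with the stored reference value."""
    sidecar = path.with_suffix(path.suffix + ".sha256")
    return sha256sum(path) == sidecar.read_text().strip()


if __name__ == "__main__":
    # Hypothetical location in the copy class area.
    f = Path("/project/copy/dataset-001/measurements.dat")
    register_copy(f)
    assert verify_copy(f)
```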

Infrastructure configuration

To safeguard against loss of data in catastrophic events such as fire or flooding, it is necessary to have data redundancy between two sites. In this way one data centre site can be lost while the other site still provides access to the data and services. In the unlikely event of such a scenario, the data redundancy must be restored within the remaining site (provided sufficient resources are available) or to a third (backup) site. A national data infrastructure will therefore have a minimum of two sites for reliability and availability reasons. Below we describe a possible architecture for achieving this in a two-site scenario.

Distributed Data Centre

In the Distributed Data Centre (DDC) configuration, the two sites are geographically separated by distances of typically hundreds of kilometres. This means that two data centres are required, and the interconnect between them will rely on a Wide Area Network (WAN), either via the national research network (Forskningsnettet) or a dedicated fibre. Latency limits the performance of synchronous data replication over such distances, so it is necessary to rely on asynchronous replication between the sites. Within a site this configuration retains the performance between the compute services (e.g. HPC, data analytics and visualization services) and the stored data, but the data replication between site A and site B can/will be asynchronous (data synchronisation is not guaranteed at all times).

This configuration has the benefit that it resists catastrophic events that would take out an entire data centre, while maintaining all data intact at the remaining site. It does, however, require a higher degree of storage redundancy.
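The following minimal sketch illustrates the asynchronous replication pattern described above: changes are shipped from site A to site B in the background rather than on every write, so local I/O performance is not held back by WAN latency. The host names, paths, interval and the use of rsync over SSH are assumptions chosen for the example; a production deployment would use the replication mechanism of the chosen storage product.

```python
import subprocess
import time

# Hypothetical endpoints; the real sites and paths are defined by the procurement.
SITE_A_PATH = "/data/project/"                        # local storage pool at site A
SITE_B_TARGET = "siteB.example.org:/data/project/"    # remote pool at site B
SYNC_INTERVAL_S = 300                                 # replicate every five minutes (illustrative)


def replicate_once() -> int:
    """Ship changed files to the remote site; rsync only transfers deltas."""
    cmd = [
        "rsync",
        "--archive",   # preserve permissions, timestamps, ownership
        "--delete",    # mirror deletions so the sites converge
        "--partial",   # keep partial transfers so WAN hiccups resume cheaply
        SITE_A_PATH,
        SITE_B_TARGET,
    ]
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Asynchronous by design: writers at site A never wait for this loop.
    while True:
        rc = replicate_once()
        if rc != 0:
            print(f"replication pass failed with exit code {rc}; will retry")
        time.sleep(SYNC_INTERVAL_S)
```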

User Requirements

UNINETT Sigma2 completed a user survey in June 2015 to gather current and future requirements from scientific communities in Norway. The results show a strong dependence on the infrastructure and frequent use of the key NorStore data storage services, such as the project area, the NorStore archive and the services for sensitive data. Across these use cases, users have requested that data be kept from three (3) months up to several years.

Users are currently required to use traditional SSH/SFTP-based tools to access their data and the available resources. There is a demand for more user-friendly and flexible services, e.g. a desktop client using WebDAV/SMB or a Dropbox-like Sync'n'Share service, to interact with the data services. Currently, when users need to process data stored in the national infrastructure, they must copy the data manually to the HPC project space. This duplicates data, limits users to the storage capacity available on the HPC facility and, because of manual copying and modification, lets copies drift out of sync, which leads to a poor user experience. In the survey, users asked for data to be directly accessible on the HPC system for processing. This would utilise storage resources better by avoiding duplication and would offer a smoother user experience. Finally, there is an increasing need for storage and data management services for non-HPC users, in particular shared project areas with fine-grained access control, metadata and publishing services.

In addition to the current data services, users have expressed interest in dedicated (compute) resources, data analytics and visualisation services. Dedicated resources enable users to reserve a certain amount of compute and storage to perform specific tasks at short notice and with high priority; the compute and storage resources would be reserved for the duration of the user's allocation. The data analytics service concerns analysing big data using frameworks such as Apache Spark/Hadoop. The visualisation service allows large datasets to be visualised remotely using dedicated hardware, e.g. GPUs.

In a data-centric infrastructure, laboratories and research infrastructures generating large or steady streams of data should be able to store their data in, and make use of, the national data infrastructure. Research institutions can benefit from permanently connecting their resources to the national data infrastructure via suitable protocols such as S3, REST or Sync'n'Share (Dropbox-like). Examples of such facilities are genome sequencers and other high-data-volume research instruments.
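To illustrate the kind of instrument ingest described above, the sketch below uploads a freshly produced data file to an S3-compatible endpoint using boto3. The endpoint URL, bucket name and object layout are placeholders; the actual ingest endpoint and naming conventions would be defined by the data infrastructure.

```python
import boto3

# Hypothetical S3-compatible ingest endpoint exposed by the national data infrastructure.
s3 = boto3.client(
    "s3",
    endpoint_url="https://ingest.example.org",  # placeholder endpoint
    # Credentials are assumed to come from the environment or ~/.aws/credentials.
)

BUCKET = "project-nn1234k-raw"  # hypothetical project bucket for raw instrument data


def ingest(local_path: str, object_key: str) -> None:
    """Upload one instrument output file; boto3 switches to multipart upload for large files."""
    s3.upload_file(local_path, BUCKET, object_key)


if __name__ == "__main__":
    # e.g. a run produced by a genome sequencer
    ingest("run-2015-12-15/sample42.fastq.gz",
           "sequencer-A/run-2015-12-15/sample42.fastq.gz")
```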

A Service Oriented National e-Infrastructure

The HPC system requires a fast, low-latency scratch storage space, usually available under /work. This storage space will be part of a separate storage system and is procured with the HPC system. In addition to scratch space, an HPC system requires access to the users' /home directories and to project data storage. The /home storage is the space allocated to users for storing their code and configuration data. The project storage is the space allocated to large data sets, e.g. genome databases, reference data etc., and is usually shared among a group of users. Access to these data stores should be provided to HPC users via a parallel POSIX file system and/or HTTP REST-based protocols.

Access to data is needed not only for compute-intensive services such as HPC, but also for data-intensive services. These are services where the number of operations per byte is very small, such as scaling all entries in a file, changing the format of an image or a genome, or transcoding a video. Other data-centric services include visualisation and animation, which often require large datasets to be processed into images that are either displayed directly or sequenced into animations.

With the recent increase in the amount of unstructured and machine-generated data, many open-source frameworks have been developed to process large data sets in parallel on commodity hardware. Such processing requires access to vast quantities of data collected from different sources, such as sensor arrays, data from the internet and genomic data. These frameworks, e.g. Apache Spark/Hadoop, process such datasets in a distributed, fault-tolerant way and enable analytics at large scale.

Currently, a national sensitive data service is provided by the University of Oslo (USIT). The sensitive data service is an important part of the national services, and a significant number of users in medicine and the life sciences rely on it today. UNINETT Sigma2 AS has supported its development and currently offers resources to this service from the national storage resource pool. It may be challenging to migrate the service during the first year of operation of the new infrastructure, and it is therefore likely that the service will continue to be provided as it is today until at least 2018.

The storage infrastructure should be able to serve the needs of the various national services and satisfy their performance requirements by combining different storage media, e.g. SSDs and SATA.
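As a sketch of the distributed analytics described above, the following PySpark example summarises a hypothetical project dataset exposed through the parallel file system. The path, column names and Spark deployment details are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A Spark session; in practice the cluster configuration comes from the analytics service.
spark = SparkSession.builder.appName("project-analytics-sketch").getOrCreate()

# Hypothetical dataset in the project area, visible to the workers via the parallel file system.
df = spark.read.csv("/project/nn1234k/measurements/*.csv", header=True, inferSchema=True)

# Example analysis: record count and mean value per sample (column names are assumed).
summary = df.groupBy("sample_id").agg(
    F.count("*").alias("n_records"),
    F.avg("value").alias("mean_value"),
)
summary.show()

spark.stop()
```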

Storage Requirements for Services

Taking the user requirements for current and future services into consideration, we require the storage solution to be scalable, reliable, flexible and policy-driven. The focus of the storage system is to be an enabler of new services and to allow users to interact with the national e-infrastructure in an intuitive and flexible way. The storage system should provide a global name space, enabling users to access their data from any resource or site. The storage system should support the use of different storage technologies, balancing high read and write performance against low-cost capacity storage. Depending on the access pattern of specific data objects, the system should automatically move data between performance and capacity storage pools, e.g. moving data that is infrequently modified or accessed from the performance pool to the more cost-effective capacity pool (e.g. erasure-coded storage on SATA disks).

Data in the storage system should be accessible to all the national services. Users should be able to deposit data in the storage system using the different protocols listed in the table below. Once data has been received in the storage system, users can access it from HPC or any other service offered by the national e-infrastructure.

The storage system should be free from single points of failure (SPOFs), mainly by ensuring component and data redundancy. The system should be able to operate normally under disk, controller, server, rack or site failure, given enough free capacity. To provide redundancy in case of fire or natural disaster, the storage system needs to support geo-replication across a minimum of two sites. The replication can be performed asynchronously to avoid the performance penalty and to reduce the requirement for large network bandwidth between the two sites.
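A minimal sketch of the policy-driven tiering described above: files not accessed for a given number of days are moved from a performance pool to a capacity pool. The thresholds and paths are illustrative; real systems implement this inside the storage product, e.g. via information-lifecycle-management rules.

```python
import shutil
import time
from pathlib import Path

PERFORMANCE_POOL = Path("/pools/performance")  # e.g. SSD-backed (illustrative path)
CAPACITY_POOL = Path("/pools/capacity")        # e.g. erasure-coded SATA (illustrative path)
COLD_AFTER_DAYS = 90                           # illustrative policy threshold


def demote_cold_files() -> int:
    """Move files whose last access is older than the threshold to the capacity pool."""
    cutoff = time.time() - COLD_AFTER_DAYS * 86400
    moved = 0
    for path in PERFORMANCE_POOL.rglob("*"):
        if path.is_file() and path.stat().st_atime < cutoff:
            target = CAPACITY_POOL / path.relative_to(PERFORMANCE_POOL)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))
            moved += 1
    return moved


if __name__ == "__main__":
    print(f"demoted {demote_cold_files()} files to the capacity pool")
```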

Protocols for Interacting with Data Storage Resources

The table below lists the expected relevant protocols for the various key services.

User data deposit/access: Sync'n'Share (WebDAV or similar), POSIX, SSH/SFTP, object-based HTTP REST API (S3)
HPC data access: Parallel POSIX-compliant file system, object-based HTTP REST API (S3)
Data analytics: Parallel POSIX/Hadoop-compliant file system, object-based HTTP REST API (S3)
Dedicated resources: Block storage for virtual machines, POSIX/SMB/CIFS file system access
Visualisation: Parallel POSIX-compliant file system

Backup service

It is necessary to offer a backup service that preserves the history (changes and deletions) in the /home area. It will also be required to provide a form of backup for the /project data, through snapshots taken locally on the storage system or some other form of undelete functionality. The backup service should be cost-effective enough to be feasible for multi-petabyte storage. Vendors should state which backup software they support.
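To make the snapshot idea concrete, here is a minimal sketch of hard-link based, space-efficient snapshots of a /home area using rsync's --link-dest option. The paths and schedule are illustrative, and a real deployment would more likely use the snapshot mechanism built into the chosen storage system.

```python
import subprocess
from datetime import datetime
from pathlib import Path

SOURCE = "/home/"                      # area to protect (illustrative)
SNAPSHOT_ROOT = Path("/backup/home")   # hypothetical snapshot target on the storage system


def take_snapshot() -> Path:
    """Create a dated snapshot; unchanged files are hard-linked against the previous one."""
    SNAPSHOT_ROOT.mkdir(parents=True, exist_ok=True)
    previous = sorted(p for p in SNAPSHOT_ROOT.iterdir() if p.is_dir())
    target = SNAPSHOT_ROOT / datetime.now().strftime("%Y-%m-%dT%H%M")

    cmd = ["rsync", "--archive", "--delete"]
    if previous:
        # Hard-link files identical to the latest snapshot, so only changes consume space.
        cmd.append(f"--link-dest={previous[-1]}")
    cmd.extend([SOURCE, str(target)])

    subprocess.run(cmd, check=True)
    return target


if __name__ == "__main__":
    print(f"snapshot written to {take_snapshot()}")
```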