Data storage considerations for HTS platforms George Magklaras -- node manager http://www.no.embnet.org http://www.biotek.uio.no admin@embnet.uio.no

Overview:
- The need for data storage
- Volume dimensioning: the right size
- The right filesystem for the right job
- How to benchmark filesystems
- The network layer: FCoE as a storage barrier breaker
- Credits
- Questions

The presentation sales pitch: "I am a biologist, give me a break, this is too technical." Yes, but you will have to deal with it: instruments produce lots of data whether you are technical or not, and the number of instruments is surpassing the ability of departmental infrastructures to store the data. Your IT people will also find this information useful.

The Norwegian EMBnet platform (Table 1)

HTS device   Runs per year   Tier 1 (Gbytes/run)   Tier 2 (Gbytes/run)   Tier 3 (Gbytes/run)   Tier 4 (Gbytes/run)   Total (Tbytes)
Illumina     100             9728                  100                   300                   400                   990
454          100             200                   50                    25                    75                    27
SOLiD        100             6144                  100                   100                   200                   80

Volume dimensioning (2)
Tier 1: raw, unprocessed data as they come out of the instrument (mostly images). For most HTS devices, Tier 1 amounts to several Tbytes per run (several thousand Gigabytes), and it grows as the instrument's precision improves over time (firmware or device upgrades).
Tier 2: the initial processing stage, including base (or colour) calls, intensities and first-pass quality scores. These data are currently in the order of several tens of Gigabytes, up to a maximum of about 300 Gigabytes per run for certain types of sequencers.
Tier 3: aligned and analyzed data (alignments of all the reads to a reference, or a de novo assembly if required). This can be at least as big as the initial processing stage (Tier 2), since the reads themselves have to be preserved as part of the alignment output. At the end of each successful processing step, the Tier 1 raw data are removed.

Volume dimensioning (3)
Tier 4: the final tier contains the data that should be backed up off site, to provide disaster recovery as well as a long-term archive. It includes a mirror of Tiers 2 and 3 plus the archive requirements. It is not financially feasible or technically practical to back up Tier 1 data off site, at least not for every run, as the volume of data is huge. There is some redundancy between Tiers 2 and 3: in theory one could recover the reads by re-sorting the Tier 3 alignment output and then discard the Tier 2 data. However, this might not be feasible or desirable in every analysis scenario, so we consider it good practice to back up and archive both Tier 2 and Tier 3 data.

Volume dimensioning (4)
Tier1store = Σ [ Nhts x Gbpr + (Nhts x Gbpr)/4 ] x Nruns
  where Nhts = number of HTS devices of a given type, Gbpr = Gigabytes per run
Tier2,3store = Σ [ Nruns x Ganalysis + (Nruns x Ganalysis)/3 ]
  where Nruns = expected number of runs per year, Ganalysis = Gigabytes per run for Tiers 2 and 3 (Table 1)
Tier4store = Tier2,3store + Rperiod x Tier2,3store
  where Rperiod = number of years to keep the data

Volume dimensioning (5)
Example scenario: 2 x Illumina, 2 x 454, 1 x SOLiD, with a 3-year data retention period.
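As an illustration of how the formulas above translate into numbers, the following Python sketch applies them to this scenario. It is a minimal sketch under stated assumptions: the per-run figures are taken from Table 1 as reconstructed above, Nruns is interpreted as runs per device per year multiplied by the device count, and the results are rough yearly estimates rather than the exact figures presented on the slide.

# Dimensioning sketch for the example scenario (2 x Illumina, 2 x 454, 1 x SOLiD).
# Per-run figures follow Table 1 above and are approximations, not vendor specifications.
# Each entry: (name, Nhts, Gbpr = Tier 1 Gbytes/run, Ganalysis = Tier 2+3 Gbytes/run, runs/year per device)
platforms = [
    ("Illumina", 2, 9728, 100 + 300, 100),
    ("454",      2,  200,  50 + 25,  100),
    ("SOLiD",    1, 6144, 100 + 100, 100),
]

def tier1_store_gb(platforms):
    # Tier1store = sum over device types of (Nhts x Gbpr + (Nhts x Gbpr)/4) x Nruns
    return sum((n * gbpr + (n * gbpr) / 4) * runs for _, n, gbpr, _, runs in platforms)

def tier23_store_gb(platforms):
    # Tier2,3store = sum over device types of (Nruns x Ganalysis + (Nruns x Ganalysis)/3),
    # taking Nruns as the total yearly runs of that type (runs per device x device count).
    return sum((n * runs) * gan + ((n * runs) * gan) / 3 for _, n, _, gan, runs in platforms)

def tier4_store_gb(tier23_gb, retention_years):
    # Tier4store = Tier2,3store + Rperiod x Tier2,3store
    return tier23_gb + retention_years * tier23_gb

t1 = tier1_store_gb(platforms)
t23 = tier23_store_gb(platforms)
t4 = tier4_store_gb(t23, retention_years=3)
for label, gbytes in (("Tier 1", t1), ("Tiers 2+3", t23), ("Tier 4", t4)):
    print(f"{label}: {gbytes / 1000:.0f} Tbytes")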

Conclusion: Next Gen Sequencing is a classic example of data-intensive computing [1]. The tiers facilitate compartmentalization, because the number and range of tasks differ between them.

Filesystems galore
A filesystem is a key component of the operating system that dictates how files are stored and accessed.
Commonly used disk filesystems: ext3/4 (Linux) [2,3], NTFS (Windows) [4], HFS+ (Mac OS X) [5], ZFS (Sun) [6]*.
Shared/clustered/SAN filesystems: GFS (Red Hat) [7], Xsan (Apple) [8].
Distributed filesystems (network filesystems): NFS [9], CIFS/SMB [10].
Distributed parallel fault-tolerant filesystems: GPFS (IBM) [11], XtreemFS [12], OneFS (Isilon) [13], PanFS (Panasas) [14], Lustre [15].

Filesystem requirements: how to choose, then? Next Gen Sequencing (NGS) filesystems need to be:
- Scalable in size: ext3, with a maximum volume size of 16 TiB, would not fit the bill.
- Scalable in the number of IOPS for reads/writes/nested directory access (NTFS would not scale that well here).
- Able to allow concurrent access: raw data are accessed by hundreds/thousands of compute nodes.
- Equipped with file redundancy/replication features: have you ever measured restore times for multi-TiB volumes? What if you want to replicate part of the data set across regions/countries/continents for your colleagues?

Filesystem requirements (continued): ideally they should also offer transparent disk encryption features (sensitive sequences/hospital environments...). All these criteria point to distributed parallel fault-tolerant filesystems:
- Distributed: addresses the data replication issues.
- Parallel: raises the IOPS gauge and facilitates concurrent access.
Our choices:
- Tier 1: disk filesystem (ext4 + CTDB [16]), to facilitate cross-platform access.
- Tiers 2, 3: Lustre.
- Tier 4: Lustre + other candidates.

Filesystem benchmarks

FS       Random I/O (MB/s)   Seq. I/O (MB/s)   IOPS/seek   Recovery time
ext3     40/20               1/132             2300        2 days
ext4     76/48               90/72             4800        2 days
GPFS     114/290             187/320           6700        3 days
Lustre   178/190             380/336           5800        5 days
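The figures above come from the author's own measurements. As a rough illustration of how such numbers can be gathered, the Python sketch below times a sequential write and a batch of random reads on a file placed on the filesystem under test. It is a minimal, hypothetical example (the mount point, file size and block size are placeholders, and the page cache will flatter the read figure unless caches are dropped or direct I/O is used); dedicated tools such as IOzone or fio are the usual choice for publishable numbers.

import os
import random
import time

PATH = "/mnt/testfs/bench.dat"   # hypothetical mount point of the filesystem under test
SIZE_MB = 1024                   # total size of the test file in MiB
BLOCK = 1024 * 1024              # 1 MiB block size

# Sequential write: stream SIZE_MB blocks and fsync at the end, so the
# figure reflects data that has actually reached the storage.
buf = os.urandom(BLOCK)
start = time.time()
with open(PATH, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())
seq_write_mbs = SIZE_MB / (time.time() - start)

# Random reads: seek to random block-aligned offsets and read one block each time.
# Results are optimistic if the file still sits in the page cache.
reads = 256
start = time.time()
with open(PATH, "rb") as f:
    for _ in range(reads):
        f.seek(random.randrange(SIZE_MB) * BLOCK)
        f.read(BLOCK)
rand_read_mbs = reads / (time.time() - start)

print(f"sequential write: {seq_write_mbs:.1f} MB/s, random read: {rand_read_mbs:.1f} MB/s")
os.remove(PATH)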

The data network and storage. DAS versus NAS versus SAN:
- Directly Attached Storage (DAS): SATA/SAS; simple and effective for single/dual-host access and capacities of up to 20-30 Tbytes. The cheapest option, but limited in terms of storage capacity and simultaneous access.
- Network Attached Storage (NAS): a combination of a TCP/IP-based protocol and a filesystem (e.g. SMB/CIFS over NTFS, NFS over ext3/4). Medium price and scalability.
- Storage Area Network (SAN): block storage over a proprietary (Fibre Channel) or off-the-shelf protocol (iSCSI, AoE). Expensive, fast.

The data network and storage (2)

The data network and storage (3). Questions for your IT architect/system administrator(s):
- Can you afford a pure Fibre Channel solution today?
- How many storage interconnects do you have (GigE, FC, InfiniBand)? Would it not be nice to have a smaller number of storage interconnects (consolidation)?
- If you have already chosen iSCSI, have you considered the overhead of encapsulating block protocols over TCP/IP?

The data network and storage FCoE (4)

The data network and storage FCoE (5)

Clustered Samba Tier 1 entry

Bill of materials: Cisco Nexus 5000 series switches

Bill of materials (2): QLE8152 Dual Port 10GbE Ethernet to PCIe Converged Network Adapter (CNA). www.qlogic.com

Bill of materials (3): Dell EMC CX4-960 (8 Gbit and 4 Gbit FC / 10 Gbit and 1 Gbit iSCSI SAN array); Dell 1950 servers with 8 cores, 64 Gbytes of RAM and QLogic CNA cards (as access/front-end nodes).

Bill of materials (4): Red Hat Enterprise Linux 5.4 (has support for FCoE); Samba 3.3.x with CTDB (not the version that ships with RHEL 5.4).
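To make the clustered Samba entry point more concrete, here is a minimal configuration sketch for a two-node CTDB/Samba 3.3.x setup of the kind listed above. The node addresses, interface names, share name and paths are hypothetical placeholders, and the exact parameters should be checked against the Samba/CTDB documentation for the versions actually deployed.

/etc/ctdb/nodes (identical on every node; one private cluster IP per line):
  10.0.0.1
  10.0.0.2

/etc/ctdb/public_addresses (floating IPs that clients connect to):
  192.168.10.50/24 eth0
  192.168.10.51/24 eth0

/etc/sysconfig/ctdb (the recovery lock must live on storage visible to all nodes):
  CTDB_RECOVERY_LOCK="/mnt/tier1/.ctdb/reclock"
  CTDB_NODES=/etc/ctdb/nodes
  CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
  CTDB_MANAGES_SAMBA=yes

/etc/samba/smb.conf (the clustering directive is what ties Samba to CTDB):
  [global]
      clustering = yes
      idmap backend = tdb2
  [tier1]
      path = /mnt/tier1
      read only = no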

Comments/questions: email: georgios@biotek.uio.no

Credits