Storage Solutions for Bioinformatics


Li Yan
Director of FlexLab, Bioinformatics core technology laboratory
liyan3@genomics.cn
http://www.genomics.cn/flexlab/index.html
Science and Technology Division, BGI-Shenzhen

OUTLINE
  Background
  Hardware Infrastructure of Data Storage
  Data Management
  Data Storage Architecture In BGI
  Distributed Computing on Storage Server

Background: Fast Growing Big Data


Fast-growing big data

From small genomes to large, complex genomes (sizes in bases; M = million, G = billion):
  E. coli genome: 4.9M
  Caenorhabditis elegans genome: 100M
  Human genome: 3G
  Wheat genome: 16G
  Salamander genome: 45G

From one sample to populations:
  Human genome: 3 billion DNA subunits (A, T, C, G)
  80~100X sequencing: 600 GB of raw data for an individual study
  1000 Genomes Project: 600 TB of raw data for a population study

From first-generation sequencing to second-generation sequencing
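The slide does not say how the 600 GB and 600 TB figures were derived. The sketch below shows one back-of-the-envelope reading that reproduces them. It assumes roughly two bytes per sequenced base (a base call plus a quality character, as in FASTQ, ignoring read headers and compression) and, purely for illustration, a population study of 1,000 samples at the same depth. The function name and both assumptions are hypothetical, not taken from the talk.

```python
def raw_data_bytes(genome_bases, coverage, bytes_per_base=2):
    """Estimate raw sequencing output in bytes.

    bytes_per_base=2 is an assumption: one byte for the base call and one
    for its quality score, with headers and compression ignored.
    """
    return genome_bases * coverage * bytes_per_base


HUMAN_GENOME_BASES = 3_000_000_000  # 3G bases, as on the slide

# One individual at 100X coverage.
individual = raw_data_bytes(HUMAN_GENOME_BASES, coverage=100)

# Assumed population study: 1,000 samples sequenced at the same depth.
population = 1_000 * individual

print(f"individual: {individual / 1e9:.0f} GB")
print(f"population: {population / 1e12:.0f} TB")
```

Changing bytes_per_base or the coverage shows how sensitive storage planning is to the file format and sequencing depth.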

Long-Term Data Storage Needs
  Properly secure the data
    Plan for data redundancy, which generally means mirroring the data in two or more copies
  Available (24x7x365) for all kinds of uses
    Readily accessible and in the right format
  Fast data transfer for collaborations
    A fast network server (Aspera) instead of mailing a hard drive
  Scalable: easy to scale up
  Choose reliable file systems
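The redundancy point above (two or more mirrored copies) can be illustrated with a minimal sketch that copies a file to several replica directories and verifies each copy by checksum. This is not the tooling used at BGI. The function names, the SHA-256 choice, and the sample file are assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(source, replica_dirs):
    """Copy source into every replica directory and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for directory in replica_dirs:
        target = Path(directory) / Path(source).name
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


# Demo on a throwaway directory with a tiny, made-up FASTQ record.
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    src = root / "sample.fq"
    src.write_text("@read1\nACGT\n+\nIIII\n")
    replicas = [root / "copy_a", root / "copy_b"]
    for replica in replicas:
        replica.mkdir()
    copies = mirror(src, replicas)
    verified = all(sha256_of(copy) == sha256_of(src) for copy in copies)
```

In practice the replica directories would sit on physically separate storage; copies on the same device protect against accidental deletion but not against hardware failure.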

Hardware infrastructure of data storage

Types of storage infrastructure

Disk library
  A high-capacity storage system that holds a quantity of CD-ROM, DVD or magneto-optical (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.

Magnetic tape
  A high-capacity data storage system for storing, retrieving, reading and writing multiple magnetic tape cartridges.

Redundant array of independent disks (RAID)
  A storage technology that combines multiple disk drive components into a logical unit.

Direct-attached storage (DAS)
  A digital storage system directly attached to a server or workstation, without a storage network in between.

Network-attached storage (NAS)
  File-level computer data storage connected to a computer network, providing data access to heterogeneous clients.

Storage area network (SAN)
  A dedicated network that provides access to consolidated, block-level data storage.
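To make the RAID definition above concrete, here is a small sketch of how usable capacity depends on the way disks are combined into a logical unit. The slides do not discuss RAID levels; the levels shown (0, 1, 5, 6) and their standard capacity rules are added here as background, and the function name is hypothetical. RAID 1 is modelled as every disk mirroring the same data.

```python
def usable_capacity_tb(level, disks, disk_tb):
    """Usable capacity of an array of identical disks for common RAID levels."""
    minimum = {0: 2, 1: 2, 5: 3, 6: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} disks")
    if level == 0:
        data_disks = disks        # striping only, no redundancy
    elif level == 1:
        data_disks = 1            # all disks hold the same mirrored data
    elif level == 5:
        data_disks = disks - 1    # one disk's worth of parity
    else:
        data_disks = disks - 2    # two disks' worth of parity (RAID 6)
    return data_disks * disk_tb


for level in (0, 1, 5, 6):
    print(f"RAID {level}: {usable_capacity_tb(level, disks=8, disk_tb=4.0)} TB usable")
```

The trade-off is visible directly: the more capacity is given up to mirroring or parity, the more disk failures the array can survive.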

Type of storage: pros, cons and general use

Disk library
  Pros: Fast; high storage capacity; high data availability
  Cons: Not as easily accessible as DAS; intended for write-once, read-rarely information
  General use: Disk-to-disk backup; archiving; near-line storage

Magnetic tape
  Pros: Low cost per megabyte; portable; unlimited capacity (with multiple tapes)
  Cons: Inconvenient for fast recovery of individual files or groups of files
  General use: Archiving; limited-budget businesses; offsite storage

Redundant array of independent disks (RAID)
  Pros: Fast; high storage capacity; high data availability; reliable; security; fault tolerance
  Cons: Possible false sense of security; some recovery difficulty on some systems; high cost for optimum systems
  General use: Swap files; Internet service providers; redundant storage

Type of storage: pros, cons and general use (continued)

Direct-attached storage (DAS)
  Pros: Simple; low starting cost; easy to use
  Cons: Needs separate storage for each server; not easy to transfer data in a network; server takes the application processing load
  General use: Data and application sharing; data backup; archiving

Network-attached storage (NAS)
  Pros: Fast file access for multiple clients; ease of data sharing; high storage capacity; redundancy; ease of drive mirroring; consolidated resources
  Cons: Less convenient than SAN for moving large blocks of data
  General use: Backup; archiving; redundant storage

Storage area network (SAN)
  Pros: Excellent for moving large blocks of data; exceptional reliability; easily available; fault tolerance; scalability
  Cons: Expensive; lack of standardization; management complexity
  General use: Large databases; bandwidth-intensive applications; mission-critical applications

Software Level of Data storage

Data flow of NGS
[Figure: sequencer, raw data, complex workflow (alignment, assembly, association), annotation of features (variations/mutations, protein structure, gene expression, function networks), data store, meaningful biological data]

Data Management

Classify the data into different levels
  First level of storage: dynamic, fast, temporary
  Second level of storage: slower than the first level, but durable and safe
  Third level of storage: high-capacity medium for backups and archives

Choosing file systems
  Currently popular distributed file systems include Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.

Classify the data into different levels

First level of storage: dynamic, fast, temporary
  Intermediate results of data analysis
  Reference data

Second level of storage: slower than the first level, but durable and safe
  Sequencing raw data
  Meaningful data

Third level of storage: high-capacity medium for backups and archives
  Backups and archives of raw data and meaningful data
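The three-level classification above can be written down as a simple lookup, which is a convenient starting point for scripting data placement. The category names and the function below are hypothetical labels chosen for this sketch; only the mapping of data kinds to levels comes from the slide.

```python
# Data categories (hypothetical names) mapped to the storage level on the slide.
STORAGE_LEVEL = {
    "intermediate_result": 1,   # first level: dynamic, fast, temporary
    "reference_data": 1,
    "raw_sequencing_data": 2,   # second level: slower, but durable and safe
    "meaningful_data": 2,
    "backup": 3,                # third level: high-capacity backups and archives
    "archive": 3,
}


def storage_level(category):
    """Return the storage level (1, 2 or 3) for a data category."""
    try:
        return STORAGE_LEVEL[category]
    except KeyError:
        raise ValueError(f"unknown data category: {category!r}") from None


def group_by_level(categories):
    """Group a list of categories by the level they should be stored on."""
    grouped = {1: [], 2: [], 3: []}
    for category in categories:
        grouped[storage_level(category)].append(category)
    return grouped


plan = group_by_level(["raw_sequencing_data", "reference_data", "archive"])
```

Keeping the mapping in one table makes the policy easy to review and change without touching the code that moves data.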

Distributed file systems

Lustre
  Lustre is a large-scale, safe, reliable and highly available cluster file system, developed and maintained by Sun. It can support more than 10,000 nodes and petabyte-scale storage systems.

Hadoop (HDFS)
  Hadoop is not just a distributed file system (HDFS) for storage. It is a framework designed for running large-scale distributed applications on clusters of general-purpose computers.

OneFS
  OneFS lets a single cluster scale to more than 1.6 petabytes of capacity and up to 10 GB/s (gigabytes per second) of throughput.

Distributed file systems (continued)
  MogileFS (www.danga.com)
  FreeNAS (www.openqrm.org)
  FastDFS (code.google.com/p/fastdfs)
  OpenAFS (www.openafs.org)
  MooseFS (derf.homelinux.org)
  pNFS (www.pnfs.com)
  GoogleFS

Data compression and data security

Data compression
  Commonly used: Lempel-Ziv, BWT
  Specific to DNA sequences: Biocompress, GenCompress, CTW-LZ, GeNML, fqzcomp, sam_comp

Data security
  RAID system failure / redundancy
  File system
  Network
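The difference between general-purpose and DNA-specific compression can be illustrated with the Python standard library: zlib is a Lempel-Ziv-based compressor and bz2 is built on the Burrows-Wheeler transform (BWT). The 2-bit packing below is only a simple stand-in for the idea behind DNA-specific tools (a four-letter alphabet needs two bits per base). It is not the algorithm used by Biocompress, fqzcomp or any other tool named on the slide, and the random test sequence is made up.

```python
import bz2
import random
import zlib

BASES = "ACGT"
CODES = {base: code for code, base in enumerate(BASES)}


def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, length):
    """Inverse of pack_2bit; length is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


random.seed(0)
seq = "".join(random.choice(BASES) for _ in range(100_000))
raw = seq.encode("ascii")

sizes = {
    "raw": len(raw),
    "zlib (Lempel-Ziv based)": len(zlib.compress(raw, 9)),
    "bz2 (BWT based)": len(bz2.compress(raw, 9)),
    "2-bit packing": len(pack_2bit(seq)),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Real sequencing files also carry read names, quality scores and ambiguous bases such as N, which is why dedicated tools model each of those streams separately rather than packing bases alone.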

Data Storage Architecture In BGI

[Figure, built up over five slides: data storage architecture in BGI. Labelled components: sequencers, compute nodes, first-level storage, second-level storage, third-level storage, tape library, and "two copies"; arrows labelled Write and Read.]

Distributed Computing on Storage Server

Traditional Genome Assembly
  Costly, unscalable
  [Figure labels: NGS read file, sequence assembly, large-memory server (>500 GB), storage, users]

Distributed Genome Assembly
  Cost-effective, scalable
  Assembly runs on several storage servers (16 x IBM3630 for a human genome)

Hecate
  Constructing the de Bruijn graph
  Solving tiny repeats
  Merging bubbles
  Scaffolding
  Merging contigs
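The first step listed above, constructing a de Bruijn graph, can be sketched in a few lines. This is a textbook single-machine version, not Hecate's distributed implementation: each k-mer in a read contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name, the toy reads and the choice of k are assumptions for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix -> suffix
    # Sort successors so the result is deterministic and easy to inspect.
    return {node: sorted(successors) for node, successors in graph.items()}


# Two overlapping toy reads.
graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
for node, successors in graph.items():
    print(node, "->", ", ".join(successors))
```

Because overlapping reads share k-mers, they collapse onto the same path in the graph; a distributed assembler has to partition these nodes and edges across machines, which this sketch does not attempt.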

Gaea 2.1
  Inputs: reads, reference genome
  Preprocessing: distributed indexing for load balancing
  Locating: flexible splitting tolerates more mismatches
  Aligning: dynamic programming for robust gap alignment
  SNP calling: standard mapping quality for SNP calling
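The aligning step above mentions dynamic programming for gapped alignment. As a reminder of what that technique looks like, here is a minimal global alignment score (Needleman-Wunsch with a linear gap penalty). It is a generic textbook sketch, not Gaea's algorithm, and the scoring values are assumptions chosen for illustration.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap       # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap   # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(align_score("ACGT", "ACGT"))
print(align_score("ACGT", "AGT"))
```

Keeping only two rows of the score matrix reduces memory to the length of one sequence, which matters when many reads are aligned in parallel; recovering the alignment itself would require storing traceback information as well.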

Q&A