Computational infrastructure for NGS data analysis. José Carbonell Caballero, Pablo Escobar

Computational infrastructure for NGS. Cluster definition: a computer cluster is a group of linked computers working together closely, in many respects forming a single computer. Requirements: high performance, high availability, load balancing, scalability.

Computational infrastructure for NGS. In NGS we have to process very large amounts of data, which is not trivial in computing terms. Big (or even medium-sized) NGS projects require supercomputing infrastructures.

Computational infrastructure for NGS. These infrastructures are expensive and not trivial to use. We require: a properly air-conditioned data center.

Computational infrastructure for NGS. These infrastructures are expensive and not trivial to use. We require: a properly air-conditioned data center. This is not a supercomputer!

Computational infrastructure for NGS. These infrastructures are expensive and not trivial to use. We require: a properly air-conditioned data center. The Blue Gene/P supercomputer at Argonne National Lab: 250,000 processors.

Computational infrastructure for NGS. These infrastructures are expensive and not trivial to use. We require: a properly air-conditioned data center. Tier 1 = non-redundant capacity components (single uplink and servers). Tier 2 = Tier 1 + redundant capacity components. Tier 3 = Tier 1 + Tier 2 + dual-powered equipment and multiple uplinks. Tier 4 = Tier 1 + Tier 2 + Tier 3 + all components fully fault-tolerant, including uplinks, storage, chillers, HVAC systems and servers. Everything is dual-powered.

Computational infrastructure for NGS. Computing cluster: many computing nodes (servers), high-performance storage (hard disks), fast networks (10 Gb Ethernet, InfiniBand...).

Computational infrastructure for NGS. Skilled people in computing (sysadmins and developers). At CNAG there are currently 30 staff, more than 50% of them in informatics.

Big infrastructure: cluster. Distributed-memory cluster starting at 20 computing nodes (160 to 240 cores). AMD64 (x86_64) is the most used CPU architecture. At least 48 GB of RAM per node. Fast networks: 10 Gbit Ethernet, InfiniBand. Batch queue system (SGE, Condor, PBS, SLURM). Optional MPI and GPU environments depending on project requirements.
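
As a rough illustration of how work is actually sent to such a cluster, the sketch below builds a Slurm-style batch script and submits it with sbatch. The resource requests and the bwa command are hypothetical placeholders; SGE, PBS and Condor use analogous directives (qsub instead of sbatch).

# Minimal sketch: submitting an alignment job to a Slurm batch queue.
# Resource values and the alignment command are illustrative only.
import subprocess
import tempfile

job_script = """#!/bin/bash
#SBATCH --job-name=ngs_alignment
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=48G
#SBATCH --time=24:00:00

# Hypothetical alignment step; replace with the real pipeline command.
bwa mem -t 8 reference.fa sample_R1.fastq sample_R2.fastq > sample.sam
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(job_script)
    script_path = f.name

# Hand the script to the scheduler; the job then waits in the queue
# until a node with the requested cores and memory becomes free.
subprocess.run(["sbatch", script_path], check=True)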

Big infrastructure: storage. Distributed filesystem for high-performance storage (starting at 100 TB): Lustre, GPFS, IBRIX, parallel NFS (pNFS), GlusterFS. Plain NFS is not a good option for supercomputing. Storage is the most expensive part (around $2,000 per TB).
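
To get a feeling for why storage dominates the budget, here is a back-of-the-envelope calculation using the ~$2,000 per TB figure from this slide; that figure is a rough average, and real prices depend on vendor, redundancy and support contracts.

# Back-of-the-envelope storage budget, using the ~$2,000/TB figure above.
COST_PER_TB = 2000  # USD; rough assumption, varies with vendor and redundancy level

def storage_cost(capacity_tb: float) -> float:
    """Raw hardware cost for a parallel filesystem of the given capacity."""
    return capacity_tb * COST_PER_TB

for tb in (50, 100, 500):
    print(f"{tb:>4} TB -> ~${storage_cost(tb):,.0f}")
# Even the 100 TB starting point is already in the hundreds of thousands of dollars.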

Big infrastructure storage

Big infrastructure

Big infrastructure. Starting at 200,000, and 200,000 is just the hardware. Add the data center (computer room) and informatics staff salaries on top. Not every vendor knows about supercomputing; typical suppliers are SGI, Bull, IBM and HP.

Middle-size infrastructure. Small distributed filesystem (around 50 TB). Small cluster (around 10 nodes, 80 to 120 cores). At least a gigabit Ethernet network. Price range: 50,000 to 100,000 (hardware only), plus data center and informatics staff salaries.

Small infrastructure. At least 2 machines recommended, with 8 or 12 cores each and a minimum of 48 GB of RAM per machine. Big local disks: at least 4 TB per machine, and as many local disks as we can afford. Price range: starting at 8,000-10,000 (two machines).

Sequencing centers in Spain: Medical Genome Project. Sequencing instruments: 7 GS-FLX (Roche), 4 SOLiD 4 (Applied Biosystems). Informatics infrastructure: 300-core cluster, 0.5 petabytes of hard disk storage.

Medical Genome Project: storage racks and IBRIX filesystem front-ends.

MGP raw data generation. A SOLiD sequencer run takes 7 days and generates around 4 TB. The four SOLiD sequencers working full time can generate around 12 TB each week, and that is raw data only. After running the bioinformatics analysis, even more data is generated. Raw data volume grows very fast: new sequencer models, new reagents.
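
A rough projection of how that raw data accumulates, using only the figures quoted on this slide (the 2x factor for analysis output is an assumption, not a number from the presentation):

# Rough projection of raw-data accumulation for the MGP setup described above.
# One SOLiD run: ~7 days, ~4 TB; four instruments at full load: ~12 TB/week.
RAW_TB_PER_WEEK = 12
WEEKS_PER_YEAR = 52

raw_tb_per_year = RAW_TB_PER_WEEK * WEEKS_PER_YEAR
print(f"Raw data per year: ~{raw_tb_per_year} TB")   # ~624 TB of raw data alone

# Downstream analysis multiplies the footprint; assuming a 2x factor:
print(f"With analysis output (assumed 2x): ~{2 * raw_tb_per_year} TB")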

MGP raw data generation

Sequencing centers in Spain: CNAG. Sequencing instruments: 8 Illumina Genome Analyzer IIx, 6 Illumina HiSeq 2000, 4 Illumina cBots. Informatics infrastructure: 850-core cluster, 1.2 petabytes of hard disk storage, 10 x 10 Gb/s link to MareNostrum (Barcelona Supercomputing Center, 10,240 cores).

CNAG

Largest sequencing center in the world: Beijing Genomics Institute (BGI)

Largest sequencing center in the world: Beijing Genomics Institute (BGI). Hardware resources. Source: http://www.genomics.cn/en/platform.php?id=249

Sequencing center resources

Clusters around the world

The most used operating system is GNU/Linux. Source: http://www.top500.org/stats/list/36/osfam

Alternatives: cloud computing. Cloud computing means remote computation, pay per use, elasticity, mirrors around the world, virtualization.

Alternatives: cloud computing. Pros: flexibility; you pay for what you use; no need to maintain a data center. Cons: transferring big datasets over the internet is slow; you pay for consumed bandwidth, which is a problem with big datasets; lower performance, especially in disk reads/writes; privacy/security concerns; more expensive for big and long-term projects.
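
To make the bandwidth concern concrete, the sketch below estimates how long uploading a single run's worth of data (~4 TB, the per-run figure from the MGP slide) takes at a few nominal link speeds; the speeds are illustrative assumptions, and real throughput is usually lower.

# How long does it take to upload a sequencing run to a cloud provider?
# Dataset size from the MGP slide (~4 TB per run); link speeds are assumptions.
DATASET_TB = 4

def transfer_days(dataset_tb: float, link_mbit_s: float) -> float:
    bits = dataset_tb * 1e12 * 8           # decimal TB -> bits
    seconds = bits / (link_mbit_s * 1e6)   # nominal rate, ignoring protocol overhead
    return seconds / 86400

for mbit in (100, 1000, 10000):
    print(f"{mbit:>5} Mbit/s -> {transfer_days(DATASET_TB, mbit):.2f} days")
# ~3.7 days at 100 Mbit/s and ~9 hours at 1 Gbit/s, before paying for the bandwidth.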

Thanks