Computational infrastructure for NGS data analysis
José Carbonell Caballero, Pablo Escobar
Computational infrastructure for NGS
Cluster definition: a computer cluster is a group of linked computers that work together so closely that in many respects they form a single computer.
Requirements:
- High performance
- High availability
- Load balancing
- Scalability
Computational infrastructure for NGS
In NGS we have to process very large amounts of data, which is not trivial in computing terms. Large (or medium-sized) NGS projects require supercomputing infrastructure.
Computational infrastructure for NGS
These infrastructures are expensive and not trivial to use. We require:
- A properly equipped data center
Computational infrastructure for NGS
This is not a supercomputer!
Computational infrastructure for NGS
The Blue Gene/P supercomputer at Argonne National Lab: 250,000 processors
Computational infrastructure for NGS
Data center tiers:
- Tier 1 = Non-redundant capacity components (single uplink and servers).
- Tier 2 = Tier 1 + redundant capacity components.
- Tier 3 = Tier 1 + Tier 2 + dual-powered equipment and multiple uplinks.
- Tier 4 = Tier 1 + Tier 2 + Tier 3 + all components fully fault-tolerant, including uplinks, storage, chillers, HVAC systems, servers, etc. Everything is dual-powered.
Computational infrastructure for NGS
Computing cluster:
- Many computing nodes (servers)
- High-performance storage (hard disks)
- Fast networks (10 Gb Ethernet, InfiniBand...)
Computational infrastructure for NGS
Skilled people in computing (sysadmins and developers). At CNAG there are currently 30 staff, more than 50% of them in informatics.
Big infrastructure: cluster
- Distributed-memory cluster
- Starting at 20 computing nodes, 160 to 240 cores
- amd64 (x86_64) is the most widely used CPU architecture
- At least 48 GB RAM per node
- Fast networks: 10 Gbit Ethernet, InfiniBand
- Batch queue system (SGE, Condor, PBS, Slurm)
- Optional MPI and GPU environments, depending on project requirements
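The sizing figures on this slide imply a per-node core count and an aggregate amount of memory. The following is a minimal sketch of that arithmetic, assuming cores are spread evenly across nodes; the function names are illustrative and do not belong to any tool named on the slides.

```python
def cores_per_node(total_cores: int, nodes: int) -> float:
    """Average number of cores each node must provide."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return total_cores / nodes


def total_ram_gb(nodes: int, ram_per_node_gb: int = 48) -> int:
    """Aggregate RAM across the cluster (48 GB per node is the slide's minimum)."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return nodes * ram_per_node_gb


if __name__ == "__main__":
    nodes = 20  # smallest "big" cluster mentioned on the slide
    for total in (160, 240):
        per_node = cores_per_node(total, nodes)
        print(f"{total} cores over {nodes} nodes: {per_node:g} cores per node")
    # In a distributed-memory cluster this RAM is split across nodes,
    # so a single process cannot address all of it.
    print(f"Aggregate RAM: {total_ram_gb(nodes)} GB")
```

With 20 nodes, the 160 to 240 core range corresponds to 8 to 12 cores per node, which matches the per-machine figures given for the small infrastructure.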
Big infrastructure: storage
- Distributed filesystem for high-performance storage (starting at 100 TB): Lustre, GPFS, Ibrix, parallel NFS, GlusterFS
- NFS is not a good option for supercomputing
- Storage is the most expensive component (about $2,000 per TB)
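To see why storage dominates the budget, the per-terabyte price can be multiplied out for the capacities mentioned in these slides. This is a rough sketch only: it assumes a flat price of $2,000 per TB with no volume discount, and the helper name is hypothetical.

```python
COST_PER_TB_USD = 2000  # figure quoted on the slide; assumed flat, no discounts


def storage_cost_usd(capacity_tb: float, cost_per_tb: float = COST_PER_TB_USD) -> float:
    """Estimated hardware cost of a storage system of the given capacity."""
    if capacity_tb < 0:
        raise ValueError("capacity_tb must be non-negative")
    return capacity_tb * cost_per_tb


if __name__ == "__main__":
    # 100 TB is the entry point quoted for a big infrastructure.
    for capacity in (50, 100):
        print(f"{capacity} TB: about ${storage_cost_usd(capacity):,.0f}")
```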
Big infrastructure
- Starting at 200,000
- 200,000 is just the hardware
- Plus the data center (computer room)
- Plus informatics salaries
- Not every partner knows about supercomputing
- SGI, Bull, IBM, HP
Middle-size infrastructure
- Small distributed filesystem (around 50 TB)
- Small cluster (around 10 nodes, 80 to 120 cores)
- At least a gigabit Ethernet network
- Price range: 50,000 to 100,000 (hardware only), plus data center and informatics salaries
Small infrastructure
- At least 2 machines recommended
- 8 or 12 cores per machine
- 48 GB RAM minimum per machine
- Big local disk: at least 4 TB per machine
- As many local disks as we can afford
- Price range: starting at 8,000 to 10,000 (two machines)
Sequencing centers in Spain: Medical Genome Project
Sequencing instruments:
- 7 GS-FLX (Roche)
- 4 SOLiD 4 (Applied Biosystems)
Informatics infrastructure:
- 300-core cluster
- 0.5 petabytes of hard disk storage
Medical Genome Project
Storage racks; IBRIX filesystem front-ends
MGP raw data generation
- A SOLiD sequencer run takes 7 days and generates around 4 TB
- The four SOLiD sequencers alone, working full time, can generate around 12 TB each week
- That is 12 TB of raw data only; bioinformatics analysis generates more data
- Raw data size grows really fast: new sequencer models, new reagents
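The weekly raw data rate can be set against the available disk space to estimate how quickly storage fills up. The sketch below is an illustration under stated assumptions: the 0.5 petabyte MGP capacity is read as 500 TB (decimal units), the whole capacity is assumed usable, and `analysis_factor` is a hypothetical multiplier for the extra data produced by downstream analysis, which the slides do not quantify.

```python
def weeks_until_full(capacity_tb: float,
                     weekly_raw_tb: float,
                     analysis_factor: float = 1.0) -> float:
    """Weeks until storage is full at a constant data generation rate.

    analysis_factor scales the raw rate to include derived data;
    1.0 means raw data only (hypothetical parameter, not from the slides).
    """
    if weekly_raw_tb <= 0 or analysis_factor <= 0:
        raise ValueError("rates must be positive")
    return capacity_tb / (weekly_raw_tb * analysis_factor)


if __name__ == "__main__":
    capacity_tb = 500   # 0.5 PB, assuming decimal units
    weekly_raw_tb = 12  # four sequencers working full time
    raw_only = weeks_until_full(capacity_tb, weekly_raw_tb)
    doubled = weeks_until_full(capacity_tb, weekly_raw_tb, analysis_factor=2.0)
    print(f"Raw data only: about {raw_only:.0f} weeks")
    print(f"If analysis doubles the volume: about {doubled:.0f} weeks")
```

Under these assumptions, raw data alone would fill the disks in less than a year, which is why retention and cleanup policies matter as much as initial capacity.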
Sequencing centers in Spain: CNAG
Sequencing instruments:
- 8 Illumina Genome Analyzer IIx
- 6 Illumina HiSeq 2000
- 4 Illumina cBots
Informatics infrastructure:
- 850-core cluster
- 1.2 petabytes of hard disk storage
- 10 x 10 Gb/s link with MareNostrum (Barcelona Supercomputing Center, 10,240 cores)
Largest sequencing center in the world: Beijing Genomics Institute (BGI)
Hardware resources. Source: http://www.genomics.cn/en/platform.php?id=249
Sequencing center resources
Clusters around the world
The most used operating system is GNU/Linux
Source: http://www.top500.org/stats/list/36/osfam
Alternatives: cloud computing
- Remote computation
- Pay per use
- Elastic
- Mirrors around the world
- Virtualization
Alternatives: cloud computing
Pros:
- Flexibility
- You pay for what you use
- No need to maintain a data center
Cons:
- Transferring big datasets over the internet is slow
- You pay for consumed bandwidth, which is a problem with big datasets
- Lower performance, especially in disk read/write
- Privacy and security concerns
- More expensive for big, long-term projects
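The transfer-time concern can be made concrete with a quick estimate. This is a minimal sketch, not a benchmark: the link speeds are assumed example values, the 12 TB dataset is one week of raw data from the figures quoted earlier, and `efficiency` is a hypothetical factor for protocol overhead and link sharing.

```python
SECONDS_PER_DAY = 86_400


def transfer_time_days(size_tb: float,
                       link_mbit_s: float,
                       efficiency: float = 1.0) -> float:
    """Days needed to move size_tb terabytes over a link of link_mbit_s Mbit/s.

    Uses decimal units (1 TB = 1e12 bytes, 1 Mbit = 1e6 bits).
    efficiency is the usable fraction of the nominal bandwidth (assumed).
    """
    if link_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_mbit_s * 1e6 * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    weekly_raw_tb = 12  # one week of raw data
    for mbit_s in (100, 1000):  # assumed example internet uplinks
        days = transfer_time_days(weekly_raw_tb, mbit_s)
        print(f"{weekly_raw_tb} TB at {mbit_s} Mbit/s: about {days:.1f} days")
```

At a nominal 100 Mbit/s, uploading one week of raw data takes longer than the week it took to generate, so the backlog would grow without bound; this is the practical meaning of the first item in the list of cons.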
Thanks