How To Write A Paper On A Computer Program

Similar documents

Even in the coolest of economies, spending and

Consensus Scoring to Improve the Predictive Power of in-silico Screening for Drug Design

Grid Scheduling Dictionary of Terms and Keywords

Netherlands escience Center

Personalized Medicine and IT

COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1)

Integrating Bioinformatics, Medical Sciences and Drug Discovery

Deploying Clusters at Electricité de France. Jean-Yves Berthou

Acceleration for Personalized Medicine Big Data Applications

Access Stations and other Services at SERC. M.R. Muralidharan

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

IBM Deep Computing Visualization Offering

Informatics and Knowledge Management at the Novartis Institutes for BioMedical Research (NIBR)

Introduction to bioknoppix: Linux for the life sciences

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

White Paper. Version 1.2 May 2015 RAID Incorporated

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

Data Visualization in Cheminformatics. Simon Xi Computational Sciences CoE Pfizer Cambridge

HPC Cloud. Focus on your research. Floris Sluiter Project leader SARA

RevoScaleR Speed and Scalability

High Performance Computing in CST STUDIO SUITE

Call 2014: High throughput screening of therapeutic molecules and rare diseases

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

Scaling from 1 PC to a super computer using Mascot

Tutorial for proteome data analysis using the Perseus software platform

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

PRIMERGY server-based High Performance Computing solutions

Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca

New Tools. Diseases. for Research on. R&D for Neglected Patients. Scientific Symposium. conference programme. In collaboration with.

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

64-Bit versus 32-Bit CPUs in Scientific Computing

On-Demand Supercomputing Multiplies the Possibilities

Understanding the Benefits of IBM SPSS Statistics Server

SeSE/SNIC-UPPMAX: Scientific Visualisation Workshop 2014

Paradigm Changes Affecting the Practice of Scientific Communication in the Life Sciences

White Paper. Recording Server Virtualization

Parallels Desktop 4 for Windows and Linux Read Me

M The Nucleus M The Cytoskeleton M Cell Structure and Dynamics

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Applications des grilles aux sciences du vivant

Comparison of computational services at LRZ

IOS110. Virtualization 5/27/2014 1

Panasas High Performance Storage Powers the First Petaflop Supercomputer at Los Alamos National Laboratory

Scale-out NAS Unifies the Technical Enterprise

Web-Based Genomic Information Integration with Gene Ontology

MEGWARE HPC Cluster am LRZ eine mehr als 12-jährige Zusammenarbeit. Prof. Dieter Kranzlmüller (LRZ)

Big Data & Analytics. A boon under certain conditions. Dr. Christian Keller General Manager IBM Switzerland IBM Corporation

White Paper The Numascale Solution: Extreme BIG DATA Computing

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

A leader in the development and application of information technology to prevent and treat disease.

Storage Solutions for Bioinformatics

Introduction to High Performance Cluster Computing. Cluster Training for UCL Part 1

Data Centric Systems (DCS)

Scientific and Technical Applications as a Service in the Cloud

An Introduction to Genomics and SAS Scientific Discovery Solutions

The Fusion of Supercomputing and Big Data. Peter Ungaro President & CEO

Application Note # LCMS-81 Introducing New Proteomics Acquisiton Strategies with the compact Towards the Universal Proteomics Acquisition Method

Using GPUs in the Cloud for Scalable HPC in Engineering and Manufacturing March 26, 2014

GPUs for Scientific Computing

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Medisoft Clinical v17 System Requirements

BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS

VMware Horizon FLEX User Guide

Altix Usage and Application Programming. Welcome and Introduction

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Agenda. Capacity Planning practical view CPU Capacity Planning LPAR2RRD LPAR2RRD. Discussion. Premium features Future

wu.cloud: Insights Gained from Operating a Private Cloud System

A Distributed Render Farm System for Animation Production

CMSC 611: Advanced Computer Architecture

EVALUATING THE USE OF VIRTUAL MACHINES IN HIGH PERFORMANCE CLUSTERS

Dutch HPC Cloud: flexible HPC for high productivity in science & business

Kazan (Volga region) Federal University, Kazan, Russia Institute of Fundamental Medicine and Biology. Master s program.

The Lattice Project: A Multi-Model Grid Computing System. Center for Bioinformatics and Computational Biology University of Maryland

Making the most of academic drug target discoveries

HPC Cloud Computing Guide PENGUIN ( ) HPC

User Reports. Time on System. Session Count. Detailed Reports. Summary Reports. Individual Gantt Charts

Radar Image Processing with Clusters of Computers

Accelerating CFD using OpenFOAM with GPUs

Dr Alexander Henzing

Muse Server Sizing. 18 June Document Version Muse

Microsoft Windows Apple Mac OS X

numascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT

CLUSTER ANALYSIS WITH R

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar

BIOINFORMATICS Supporting competencies for the pharma industry

Big Data Challenges in Bioinformatics

Diabetes and Drug Development

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

How To Understand Protein-Protein Interaction And Inhibitors

Integrating Medicinal Chemistry and Computational Chemistry: The Molecular Forecaster Approach

QUADRICS IN LINUX CLUSTERS

Next Generation Sequencing: Technology, Mapping, and Analysis

SGI InfiniteStorage Solutions

QTP Computing Laboratory Strategy

KNIME Enterprise server usage and global deployment at NIBR

Chapter 5: System Software: Operating Systems and Utility Programs

How To Understand The Chemistry Of A 2D Structure

Transcription:

High-Performance Computing: the example of Merck Serono Friedrich Rippmann, Bio- and Chemoinformatics rippmann@merck.de July 1st, 2009 - Paris, France

Introducing Merck Serono Merck is the oldest pharmaceutical and chemical company in the world Establishment of the new Merck Serono division in early 2007 following the acquisition of Serono S.A. by Merck; division now headquartered in Genevea Focus on innovative, prescription drugs of chemical and biological origin Major research sites in Darmstadt, Geneva and Boston R&D Investment: 1 billion (2008) R&D FTEs: 2300 >100 clinical trials ongoing over 150 active external collaborations (former Research) The Pützer Tower and Pyramid at Darmstadt Merck headquarters Merck Serono headquarters in Geneva Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 2

HPC at Merck, Merck Serono Awareness phase Test & establ. phase Production phase 1991 1996 2001 CONVAX (in-house) Specialized HW (external) IBM & SGI clusters & e.g. Parsytec multiprocessor machines [PC Cluster (internal)] calculation speed calculation speed calculation & storage sp. limited in-house activity academic collaborations: all in-house MAXHOM, PHASE, TTN, TARGID molecular dynamics sequence annotation, sequence annotation, docking, structure docking, genome-wide prediction association studies, large-scale clustering low project impact low project impact good project impact Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 3

HPC external or internal? pro con External - up-to-date hardware - on demand (potentially lower cost) - (almost) unlimited capacity - Software installation - Data transmission - Security: what happens if service leaks (academia), or commercial service providers goes bankrupt (and your data are sold together with the machines )? Internal - safe retained option (for now) -flexible(you install & run whatever software whenever you need it) - high investment, maintenance - know-how intensive (load balancing systems, memory allocation etc.) Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 4

Application sequence annotation Various tasks (early application: GENEQUIZ (1996); today e.g. re-annotation of Affymetrics probe sets) Typical run times: days to weeks Software used: e.g. BLAST, in-house SW Issues: data volumes, storage speed Project impact: medium (must be done, but no immediate discovery) Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 5

Example re-annotation of GeneChip probes Run times of several weeks for a single annotation job ~550.000 single jobs to do ~120-300 concurrent jobs running at a time 136 CPUs in total Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 6

Application threading A given target sequence is threaded on all known protein folds, with gaps of any length at any position Typical run time: several days (evaluation 1-2 weeks) Software used: THREADER, 123D and other Issues: results often not conclusive Project impact: very low Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 7

Fold prediction via threading A given sequence......is threaded in all positions......onto all known folds BMBF Project Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 8

Application ab initio prediction A given target sequence is folded into ist likely 3D structure Typical run time: several hours Software used: DRAGON (Aszodi & Taylor) Issues: works only on small proteins Project impact: very low Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 9

Application docking Large numbers of chemical structures are docked into a predefined cavity of the protein target Typical run time: 1-2 days (evaluation 1-2 weeks) Software used: FLEXX, GOLD Issues: large numbers of files generated cause problems to file systems and their operation (e.g. listing, deletion of a few result sets takes several days) Project impact: very high (has regularly generated new guide structures for chemists) Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 10

Virtual screening by ligand docking Find in silico suitable molecules for my target take large numbers of structures calculate find those that fit Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 11

High speed-up by novel algorithm BMBF Project Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 12

Application combinatorial chemistry Possible combinatorial structures need to be generated & filtered Typical run time: days to weeks Software used: in-house Issues: full enumeration prohibitive Project impact: medium Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 13

Managing the combinatorial explosion One scaffold and 3000 reagents in 4 groups are enough to generate a library of 320 000 000 000 virtual compounds In silico screening would take years of time and a petabyte of space with conventional methods. 3 R R O 1 R O H N N N 2 R O " R1 O H N. O R2 Our Virtual Combinatorial Database retrieves all the hits in just a few seconds. Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 14

Application druggability analysis Known protein structures are checked for suitable binding pockets Typical run time: days to weeks (followed by long evaluation time) Software used: in-house Issues: suitable visualization in order to speed-up human analysis Project impact: medium Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 15

Application druggability analysis Extracted from diploma thesis by Daniela Grimme Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 16

Application protein dynamics Several applications, e.g. flexibility analysis of proteins (applied to kinases, integrins, antibodies ) Typical run time: days Software used: AMBER, NMSim/RCNMA, FIRST FRODA, YASARA, Issues: accuracy of parameterization; time scale (pico nano seconds) Project impact: medium Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 17

Structure of integrin avb3 Arnaout et al. Science 294 pp. 339 (2001) Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 18

Atelier 2 - Calcul Haute Performance et les Sciences du Vivant Forum Ter@tec 2009 - www.teratec.eu 1er juillet 2009 - Paris, France Simulation of flexibility in collaboration with Holger Gohlke Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 19

Modelling of antibody flexibility Simulation done with NMSim/RCNMA, Gohlke et al. Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 20

Application association studies genome-wide association studies: permutations (m markers processed n times, m ~ 1M, n ~ 10000) and genetic interactions (m^2/2 tests) Typical run time: Permutation-based FDR computation (10000 permutations) - about one week on 1 CPU Interaction scan (not optimized yet), about 10000 markers - about 12 weeks on 1 CPU Software used: in-house Project impact: medium Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 21

Future applications: Next Generation Sequencing unclassical HPC problem: ~8 processor are enough, but disk speed would not be sufficient to store data as they are generated (there are solutions to this, however) Data volumes are too big for transfer to analysis computers (e.g. line speed between Sanger Centre and European Bioinformatics Institute [same campus, physical distance 20m] not sufficient for transfer); internet bandwidth far too small Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 22

Future applications: Systems Biology Some trial applications, but limited acceptance in-house so far Simulation of individual pathways, cell populations etc. feasible Organ modeling (selected aspects only) becomes possible Virtual human still some distance away Issue: incomplete knowledge, lack of accurate rate constants, protein concentrations, lack of spatial resolution, lack of time resolution Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 23

Example: VERY simplified TGF-ß pathway TGF-β In WT cells RI RII RI TGF-β RII Extracellular SARA Smad2/3 SARA P Smad2/3 Cytoplasm Smad4 Nucleus P Smad2/3 Smad4 Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 24

Current HPC infrastructure (DA only) Optimized compute server and cluster for analyzing large datasets Origin 3400 64bit IRIX (UNIX of SGI) 32 CPUs (MIPS/R16000) 32 GByte Shared Memory 64bit Linux 32 CPUs (INTeL Itanium) 256 GByte Shared Memory 64bit Linux (Cluster) 94x 4 CPU-Cores (Xeon) 94x 8 GByte distinct Memory Altix 3700 Linux-Cluster For MEDIUM number of LARGE problems For LARGE number of SMALL problems High-end graphics workstations with 3D stereo support 7.5 TByte Shared Storage for biological and chemical databases and analysis results IT infrastructure optimized for high performance and high throughput computing, using a large pool of sophisticated scientific software Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 25

Future challenges Compute power less an issue, but DATA deluge (e.g. personal genomes; quote from director of German Cancer Research Center: In five years we will sequence all tumors of all our patients Storage Bandwidth locally/internet Power consumption may become limiting (1 entire nuclear power plant is needed just to serve German HPC centers) Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 26

The end and finally we have calculated that our simulations have added 0.5 degrees to Global Warming HPC saves the planet Thanks to Reinhard Schneider, Chris Sander, Alexander Reinefeld, Willie Taylor, Andras Aszodi, Holger Gohlke Jacques Barbanton, Thierry Convard Massimo de Francesco, Jerome Wojcik Christian Griebel, Anja von Heydebreck, Oliver Karch, Michael Krug, Mireille Krier, Daniela Grimme Friedrich Rippmann, Bio- and Chemoinformatics July 1st, 2009 - Paris, France 27