Heuristic Static Load-Balancing Algorithm Applied to the Fragment Molecular Orbital Method




Yuri Alexeev*, Ashutosh Mahajan*, Sven Leyffer, Graham Fletcher
Argonne National Laboratory
9700 S. Cass Avenue, Argonne, IL 60439, USA
{yuri,fletcher}@alcf.anl.gov, {mahajan,leyffer}@mcs.anl.gov

Dmitri G. Fedorov
National Institute of Advanced Industrial Science and Technology
Central 2, Umezono 1-1-1, Tsukuba 305-8568, Japan
d.g.fedorov@aist.go.jp

Abstract

In the era of petascale supercomputing, the importance of load balancing is crucial. Although dynamic load balancing is widespread, it is increasingly difficult to implement effectively with thousands of processors or more, prompting a second look at static load-balancing techniques even though the optimal allocation of tasks to processors is an NP-hard problem. We propose a heuristic static load-balancing algorithm, employing fitted benchmarking data, as an alternative to dynamic load balancing. The problem of allocating CPU cores to tasks is formulated as a mixed-integer nonlinear optimization problem, which is solved by using an optimization solver. On 163,840 cores of Blue Gene/P, we achieved a parallel efficiency of 80% for an execution of the fragment molecular orbital method applied to model protein-ligand complexes quantum-mechanically. The obtained allocation is shown to outperform dynamic load balancing by at least a factor of 2, thus motivating the use of this approach on other coarse-grained applications.

Keywords: dynamic load balancing, static load balancing, heuristic algorithm, quantum chemistry, GAMESS, fragment molecular orbitals, FMO, optimization, MINLP, protein-ligand complex

I. INTRODUCTION

Achieving an even load balance is a key issue in parallel computing, and increasingly so as we enter the petascale supercomputing era. By Amdahl's law, the scalable component of the total wall time shrinks as the number of processors increases, while the load imbalance, together with the constant sequential component, acts to retard the scalability.
Although parallelization of sequential code often requires rewriting the code, adopting an efficient load-balancing scheme can be a simple and effective way to boost scalability and performance. Dynamic load balancing (DLB) and static load balancing (SLB) are two broad classes of load-balancing algorithms.

*YA and AM contributed equally to this work.

Whereas SLB relies on previously obtained knowledge (for example, benchmarking data) or consistent task sizes, DLB dynamically assigns jobs to processors during code execution. Many variations on SLB and DLB algorithms adapted for specific applications have been reported [1-4], using different techniques such as random stealing [5, 6], simulated annealing [7], recursive bisection methods [8-10], space-filling curve partitioning [11-14], and graph partitioning [15-21]. SLB is usually simple to implement and has negligible overhead, making it suitable for fine-grained parallelism consisting of many small tasks. However, if the application involves much larger tasks of diverse sizes, as is often the case with coarse-grained parallelism, DLB may be preferred. Since many applications naturally involve widely differing task sizes, DLB algorithms have become widespread. Indeed, as the number of available processors increases (for instance, when moving from a PC cluster environment to a large modern supercomputer), many applications find it advantageous to allocate work in larger chunks in the interest of reducing overhead. In the shift from fine- to coarse-grained parallelism, DLB may seem to be the natural choice. However, the DLB schemes suitable for a PC cluster often perform poorly on many thousands of processors, prompting the search for load-balancing paradigms that can handle diverse task sizes with minimal overhead. One possibility is to adapt SLB techniques to pre-allocate tasks more effectively by drawing on a deeper understanding of the application at hand. However, the optimal static mapping of jobs to more than two processors is, in general, an NP-hard problem [22, 23].
Nevertheless, such SLB methods have been successfully applied to a large number of applications [1]. The success of applying SLB often relies on predictive models that can also depend on the accuracy of input data from a benchmarking study; both factors can be systematically improved. Furthermore, if the calculation is iterative, the lack of a dynamic means of allocating tasks can be accounted for in SLB schemes by redistributing work between iterations.

SC12, November 10-16, 2012, Salt Lake City, Utah, USA. 978-1-4673-0806-9/12/$31.00 ©2012 IEEE

In this paper we examine parallel load-balancing schemes applied to a quantum chemistry method, the fragment molecular orbital (FMO) method implemented in the quantum chemistry code GAMESS [24, 25], on the Blue Gene/P [26] supercomputer at Argonne National Laboratory. While FMO has been shown before to achieve superior scalability for fine-grained systems such as water clusters [27], we aim to improve the scalability and efficiency of coarse-grained systems, such as proteins. We analyze why FMO's current DLB scheme is not optimal and propose an SLB alternative. A key feature of our SLB method is the formulation of a mixed-integer nonlinear optimization (MINLP) problem to model the allocation of processing cores to tasks. The MINLP approach provides great flexibility in modeling the allocation problem realistically. Using nonlinear functions, we can capture complex relationships between running time and the number of processors. At the same time, we can impose integer restrictions on certain variables (e.g., the number of processors). The solution to the MINLP can then be directly used for load balancing in the GAMESS application. To solve the MINLP arising in our procedure, we use MINOTAUR [28], a freely available MINLP toolkit. It offers several algorithms for solving general MINLPs and can be easily called from different interfaces. Our MINLP formulation requires a few parameters to accurately model the performance, obtained by collecting benchmarking data about the application and solving a fitting problem. We describe these methods in Section IV. Our experiments demonstrate that both the fitting problem and the MINLP problem can be solved quickly on a single core, and the resulting allocations lead to significant savings in the run time of the GAMESS application. The DLB and SLB comparison is done on the receptor-ligand system Aurora-A kinase and inhibitor, shown in Fig. 1 (A). We demonstrate the performance of our method on a large protein system (see Fig. 1 (C)) using all 40 racks (163,840 cores) on Argonne's Blue Gene/P.

II. FRAGMENT MOLECULAR ORBITAL METHOD

Ab initio quantum chemistry methods are, in principle, applicable to any molecular system, though the computational cost increases steeply with the system size. Even the simplest restricted Hartree-Fock (RHF) method scales approximately cubically with the system size. There are ongoing efforts to reduce the scaling of quantum-mechanical (QM) methods [29, 30] and to parallelize them efficiently [31-38]. See, for example, the linearly scaling method developed by Challacombe and Schwegler [39], and the adaptive multiresolution method developed by Harrison et al. [40]. Alternatively, fragment-based methods [41, 42] (which divide the system into fragments) can dramatically reduce the computational cost, increase the stability of calculations, and provide additional information on the properties of fragments and their interactions. Algorithmically, fragmentation results in the division of one large calculation into many small and nearly independent subtasks or loosely coupled ensemble calculations. As a result, fragmentation methods are efficient for performing quantum mechanical calculations on supercomputers. One of the fragment-based methods is the FMO method [43], which has been interfaced with many QM methods and successfully applied to chemical systems such as proteins, DNA, silicon nanowires, and ionic liquids [44]. FMO has been implemented in GAMESS [45] and parallelized with the generalized distributed data interface (GDDI) [46-48]. In FMO, each fragment electronic state is computed in the potential exerted by all the others. Starting from an initial guess, fragment calculations that update the embedding potential are iterated until self-consistency is achieved. Subsequently, fragment pair calculations are performed in the embedding potential. The fragmentation, which is usually chemically motivated for rapid convergence, fixes the parallel domain decomposition at the outset. The basic FMO equation has the form

  E = sum_{i=1}^{F} E_i + sum_{(i,j): i=1,...,F; j<i} ( E_ij - E_i - E_j ),   (1)

where F is the number of fragments and E_i, E_ij are the energies of fragment (monomer) i and fragment pair (dimer) ij, respectively. These energies are assembled according to Eq. (1) to give the total energy and other properties of the system. GDDI is a two-level parallelization scheme, which can be thought of as coarse-grained parallelism since all CPU cores are divided into a few groups. At the higher intergroup level, the load balancing is accomplished by assigning fragments or fragment pairs to GDDI groups. At the lower intragroup level, the load balancing is accomplished by assigning some integral workload to individual CPU cores within a group. Various implementations of GDDI exist, of which the main ones are (1) UNIX socket-based, whereby each CPU core runs a GAMESS process and communicates over TCP/IP via sockets, and (2) MPI-based, where MPI communicators are created for groups. This two-level parallelization has been successful in obtaining up to about 90% of perfect scalability (i.e., 90-fold speedup on a 100-fold increase in the number of cores) [48] on PC clusters with 128 CPUs connected by a low-end network (Fast Ethernet). FMO/GDDI has subsequently been used on larger computer systems such as the AIST supercluster [49]. More recently, FMO/GDDI has been successfully run for large water clusters on 131,072 CPU cores on Argonne's Blue Gene/P [27]. Our current MPI-based implementation on Blue Gene/P comprises compute- and data-server process pairs, so that half of all CPU cores are used for QM calculations, while the other half handle communications and distributed memory processing. We report wall-clock timings for GAMESS runs on the total number of CPU cores. In this paper we apply FMO to two protein-ligand systems. All benchmarking and tuning of both the DLB and HSLB schemes have been done on Aurora-A kinase with the inhibitor phthalazinone, shown in Fig. 1 (A). Aurora kinases
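Equation (1) is simple to assemble once the monomer and dimer energies are available. The following sketch (plain Python with a dictionary-based layout chosen for clarity; this is an illustration, not GAMESS code) accumulates the total energy from monomer energies and pair corrections:

```python
def fmo_total_energy(monomer_E, dimer_E):
    """Assemble the two-body FMO total energy of Eq. (1):
    E = sum_i E_i + sum_{i>j} (E_ij - E_i - E_j).

    monomer_E: {fragment index i: monomer energy E_i}
    dimer_E:   {(i, j) with i > j: dimer energy E_ij}
    """
    # First term: sum of monomer energies.
    energy = sum(monomer_E.values())
    # Second term: pair correction contributed by each dimer.
    for (i, j), e_ij in dimer_E.items():
        energy += e_ij - monomer_E[i] - monomer_E[j]
    return energy
```

In FMO the monomer energies come from the self-consistent monomer step and the dimer energies from the subsequent dimer step; here they would be supplied as plain numbers.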

are essential for cell proliferation and a major target in designing new anti-cancer drugs. The system is of moderate size: 155 fragments (154 amino acids and 1 ligand), with the total number of atoms equal to 2,604, computed at the RHF level of theory with the 6-31G* basis set. The production run was done for ovine COX-1 complexed with ibuprofen, shown in Fig. 1 (C). The system consists of 17,767 atoms divided into 1,093 fragments. For this work, we used the distributed memory storage of fragment densities [27]. Various tasks, including the fragmentation of proteins, structure checking, the generation of GAMESS input for FMO calculations, and the visualization of results, were performed by the FMOtools suite of Python programs [50].

Figure 1: (A) A schematic view of the structure of Aurora-A kinase complexed with the inhibitor phthalazinone, in cyan (PDB code: 3P9J). (B) Ball-and-stick representation of the inhibitor. The system consists of 2,604 atoms divided into 155 fragments. (C) A schematic view of the structure of prostaglandin H(2) synthase-1 (COX-1) in a complex with ibuprofen, in cyan (PDB code: 1EQG). (D) Ball-and-stick representation of ibuprofen. The system consists of 17,767 atoms divided into 1,093 fragments.

III. LOAD BALANCING IN FRAGMENT MOLECULAR ORBITAL METHOD

In the FMO method, a system is first subdivided into fragments. The proteins considered in this paper are divided naturally into amino acid residue fragments (at the Cα atoms) using the FMOgen tool in the FMOtools package [50]. We assumed their standard protonation states at pH 7. The key issue for load balancing is that amino acid residues vary in size from the smallest with 7 atoms (glycine) to the largest with 24 atoms (tryptophan). The accuracy of FMO [44] is determined by the fragmentation, and hence fragments should not be very small. In other words, the number and the size of fragments are determined by the underlying chemistry and should not be modified merely to improve the efficiency of parallelization. The number of fragments and their sizes are therefore considered fixed for the purposes of parallelizing the calculations.

The size of a fragment greatly affects quantum chemistry calculations because the calculation cost tends to scale as a high power of the system size. For example, RHF scales as N^3, and coupled-cluster with perturbative triples (CCSD(T)) scales as N^7. The electronic state of some fragments can be frozen [51]. Still more variation in task sizes can arise from having different levels of theory and basis sets for different regions of the system, as in the multilayer FMO method [52]. As the methodology of FMO becomes increasingly sophisticated, the time to solution and the scalability of individual fragment calculations become harder to model. An example of the variation in the scalability of fragment calculations as a function of size is shown in Fig. 2.

Figure 2: Scalability of FMO fragments in the Aurora-A kinase and inhibitor system on Blue Gene/P. The smallest fragment (a Gly amino acid residue) and the largest fragment (the inhibitor) are plotted as separate series. The data points were fitted, and the performance models for each fragment are shown. The cores represent the computational processes in GDDI. The scale of the x-axis is logarithmic.

A primary factor in the cost of quantum chemical calculations is the number of basis functions (which is roughly proportional to the number of atoms). For FMO specifically (assuming the simple RHF level), important factors also include (1) the number of self-consistent field (SCF) iterations needed to achieve convergence; (2) the fragment packing density, namely, the number of fragments close to a given fragment, which strongly affects the computational time for the embedding potential; and (3) the fragment packing density as it affects the dimer calculations, on which it has a large impact owing to the use of electrostatic approximations (described elsewhere [44]). Furthermore, factors (1)-(3) strongly interact. For instance, the fragment packing density affects the SCF convergence, which, in turn, also depends on the charge, the spin state, and the initial guess of the electron density. In addition, the scaling and parallel efficiency of the code are a complex function of these factors; for instance, the relative fraction of the number of sequential steps, such as matrix diagonalizations, is strongly affected by the choice of SCF convergence method, as well as by the number of SCF iterations (because the embedding potential is computed once before SCF). All these factors

make the modeling of the functional dependence of the timing upon the fragment size a formidable task.

Once a system is split into fragments, FMO calculations can be performed by using the algorithm shown in Fig. 3. A detailed discussion of the algorithm is given elsewhere [48]. Here, we describe it briefly. At the coarse-grained DLB level, fragments are assigned to groups of CPU cores. In conjunction with MPI, GDDI can generate processor groups as shown in Fig. 3, line 3 (MPI_COMM_SPLIT function). Currently, the default option in GAMESS creates processor subgroups of uniform size. Each group performs single-point fragment calculations, assigned dynamically (see Fig. 3, line 7). Throughout this paper the theory is RHF. The output of such an RHF calculation is the fragment density in the Coulomb field of all fragments (Fig. 3, line 10). Since the new density changes the field, the process must be repeated until self-consistency is achieved. This process involves the exchange of fragment densities among the groups by putting the generated densities in a DDI global array (DDI_put, Fig. 3, line 12). The fragment densities are accessed via DDI_get inside SCF(i) and SCF(i,j) in order to compute the embedding potential (Fig. 3, lines 10 and 23, respectively). The iterative process is sometimes referred to as the self-consistent charge (SCC) or monomer SCF step, corresponding to the first term of the energy expansion in Eq. (1) with RHF theory. In the final step (Fig. 3, lines 17-26), fragment monomer densities are used to construct dimers from all pairs of monomers, constituting a second round of larger RHF calculations. However, the dimer step is not iterated to self-consistency with respect to the embedding potential.
// Initialize variables
1:  number_of_fragments=input();
2:  number_of_groups=number_of_fragments/3;
3:  DDI_group=DDI_group_create(number_of_groups, DDI_world);
// Monomer loop
4:  do {
5:    for (i=1; i<number_of_fragments; i++) {
6:      DDI_scope(DDI_world);
7:      mytask=dynamic_load_balancing(DDI_world);
8:      if (mytask==i) {
9:        DDI_scope(DDI_group);
10:       fragment_density(i)=SCF(i);
11:       DDI_scope(DDI_world);
12:       DDI_put(fragment_density[i]);
13:     }
14:   }
15:   DDI_sync(DDI_world);
16: } while (fragment_density[i]!=converged);
// Dimer loop
17: for (i=1; i<number_of_fragments; i++) {
18:   for (j=1; j<i; j++) {
19:     DDI_scope(DDI_world);
20:     mytask=dynamic_load_balancing(DDI_world);
21:     if (mytask==(i,j)) {
22:       DDI_scope(DDI_group);
23:       two_fragment_density(i,j)=SCF(i,j);
24:     }
25:   }
26: }

Figure 3: Pseudo-code of FMO calculations for dynamic load balancing.

For FMO, three types of load balancing have been attempted prior to this work, and we suggest an efficient modification of one of them, static load balancing [48]. The alternative to SLB is DLB; in addition, there is semi-dynamic load balancing (SDLB) [49]. In DLB, an effective means to improve efficiency is the large-jobs-first strategy [48]. This strategy considerably reduces the synchronization lag at the end of calculations because the smallest tasks are done last. DLB, in our experience, performs satisfactorily when the ratio of the total number of cores to the number of fragments is not very high (roughly 16 in our case, but it may vary considerably). Using this ratio and recalling that the number of fragments is fixed, DLB may be applied to a protein with 400 residues with good results on up to roughly 6,400 CPU cores. Adding more cores may result in a deterioration of the performance. The parallelization efficiency may drop because the calculations of small fragments cannot be efficiently parallelized on the large number of cores allocated under the equal partitioning scheme of DLB.
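The large-jobs-first idea can be emulated statically with a standard LPT-style greedy heuristic: sort the tasks by decreasing estimated cost and always hand the next task to the currently least-loaded group. The sketch below is illustrative only; the task-cost inputs and group bookkeeping are assumptions for exposition, not the GDDI implementation:

```python
import heapq

def large_jobs_first(task_costs, num_groups):
    """Greedy large-jobs-first (LPT-style) assignment: process tasks in
    order of decreasing estimated cost, always giving the next task to
    the least-loaded group.  Returns a list of task-index lists, one
    per group."""
    # Min-heap of (accumulated load, group index).
    heap = [(0.0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_groups)]
    # Largest estimated cost first, so small tasks fill the gaps last.
    order = sorted(range(len(task_costs)), key=lambda t: -task_costs[t])
    for t in order:
        load, g = heapq.heappop(heap)
        assignment[g].append(t)
        heapq.heappush(heap, (load + task_costs[t], g))
    return assignment
```

With costs [10, 4, 3, 3] and two groups, the heuristic places the large task alone and packs the three small tasks together, balancing both groups at a load of 10.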
An improvement over this DLB problem has been achieved with SDLB, in which a handful of the largest fragment calculations are performed using SLB, while the rest are done with DLB (after the CPU cores participating in SLB finish, they also join in the DLB calculations). However, such a strategy is useful mainly in cases where there are only a few large fragments and the total number of CPU cores is not high; otherwise the problems mentioned above cannot be avoided. An efficient solution is given by the heuristic static load-balancing (HSLB) method proposed in Section IV. The main idea behind HSLB is to customize the GDDI group sizes to the fragment sizes. Since we solve an optimization problem heuristically, it can easily adapt to handle different numbers of CPUs and fragments. The number of processor groups used in FMO calculations can vary from one to the number of fragments. Fig. 4 depicts the impact of the group count on the scalability of FMO for a single SCC iteration of a system with 155 fragments. In the case of a single GDDI group, each fragment calculation is executed on all CPU cores. Clearly, all but the largest fragments utilize the large processor count inefficiently, and the overall calculation has low scalability. On the other hand, the 155-group calculation, in which there is a group for every fragment, exhibits improved scalability. The current default choice assumes three fragments per group, yielding 52 groups in this system. The difference in scalability and wall-clock time for different group counts is explained in Figs. 5 and 6. While the synchronization time shown is averaged over all GDDI groups, the efficiency is computed for each fragment separately and then averaged over all fragments.
Thus, the efficiency W_i of fragment i, as a function of the number of CPU cores n, is computed as

  W_i(n) = (T_{n_0,i} / T_{n,i}) / (n / n_0),   (2)

where n_0 is the reference value of the number of CPU cores (n_0 = 2, and T_{n_0,i} was obtained by extrapolation), n is the actual number of CPU cores, and T_{n_0,i} and T_{n,i} are the wall-clock times to compute the energy of fragment i in FMO on n_0 and n CPU cores, respectively.
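Eq. (2) translates directly into code: the observed speedup relative to the reference core count is divided by the ideal speedup. A small helper (the function and argument names are illustrative):

```python
def parallel_efficiency(t_ref, t_n, n_ref, n):
    """Parallel efficiency of a fragment per Eq. (2):
    W(n) = (T_ref / T_n) / (n / n_ref),
    i.e., the observed speedup relative to n_ref cores divided by the
    ideal speedup n / n_ref.  A value of 1.0 means perfect scaling."""
    return (t_ref / t_n) / (n / n_ref)
```

For example, a fragment taking 100 s on 2 cores and 30 s on 8 cores has a speedup of 10/3 against an ideal speedup of 4, giving an efficiency of 5/6.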

The data in Figs. 5 and 6 can be used to explain why the optimum group count with DLB is between 1 and 155. For example, the synchronization time tends to increase with the group count, starting at zero seconds in the case of a single group. However, computational efficiency also tends to increase with the group count, as smaller groups incur lower parallel overheads. Therefore, an optimal group count can be obtained only by finding the right balance between the time spent in synchronization and that gained by parallelism. In addition, we must ensure that the variance in the time taken by different fragments is minimized. These times in turn depend on hardware characteristics: the number of cores, the CPU type, and the network type of the system.

Figure 6: Parallel efficiency averaged over fragments during the first FMO SCC iteration for different load-balancing schemes on Blue Gene/P. The dataset is for the Aurora-A kinase and inhibitor system.

Figure 4: Wall-clock time to finish a single FMO SCC iteration with different load-balancing schemes. The dataset is for the Aurora-A kinase and inhibitor system. The calculations are done at the RHF-D level of theory and the 6-31G* basis set on Blue Gene/P. The scale of the y-axis is logarithmic.

Figure 5: Average synchronization time among fragments accumulated during the first FMO SCC iteration. For DLB with one group, the synchronization time is equal to 0 seconds, but because of the log scale it is shown as 1 second. The scale of the y-axis is logarithmic.

IV. HEURISTIC STATIC LOAD-BALANCING ALGORITHM

Our heuristic static load-balancing method consists of four steps. First, we collect benchmarking data related to the compute time of fragments. Second, we solve for the optimal parameters by a least-squares method based on our chosen scalability model. Third, we solve an integer optimization problem in order to obtain an optimal allocation of cores. Fourth, we allocate the optimal number of cores obtained from the optimization to run FMO in static load-balancing mode.
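The shape of the four-step procedure can be sketched in miniature. The toy stand-in below replaces the integer optimization of step 3 with a simple proportional rule (more estimated work earns more cores, with largest-remainder rounding); it illustrates the flow of the procedure only, not the MINLP formulation solved by MINOTAUR, and all names are hypothetical:

```python
def allocate_cores(est_work, total_cores):
    """Toy core allocation standing in for step 3: each task gets at
    least one core, and the remaining cores are split in proportion to
    estimated work, using largest-remainder rounding so the totals add
    up exactly."""
    assert total_cores >= len(est_work), "need at least one core per task"
    total_work = float(sum(est_work))
    spare = total_cores - len(est_work)
    # Ideal fractional share of the spare cores for each task.
    shares = [spare * w / total_work for w in est_work]
    alloc = [1 + int(s) for s in shares]
    # Hand the leftover cores to the tasks with the largest fractional share.
    leftover = total_cores - sum(alloc)
    by_frac = sorted(range(len(est_work)),
                     key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_frac[:leftover]:
        alloc[i] += 1
    return alloc
```

For instance, three fragments with estimated work 100, 50, and 50 on 8 cores receive 4, 2, and 2 cores, respectively.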
With a suitable model for the compute time, one can apply this four-step procedure to any other coarse-grained application. Before describing each of these steps for our application, we list in Table I the notation used to denote variables and parameters in our models.

Table I. List of variables and parameters used in the models described in Section IV.

Symbol            Description
R_+               Set of positive real numbers.
F                 Total number of tasks (fragments) among which we want to allocate available cores.
N                 Total number of cores available for allocation.
n_i               Number of cores allocated for processing task i.
T_i(n_i)          Performance function that models the time taken to process task i using n_i cores.
T_i^scal(n_i)     Scalable component of the function T_i(n_i).
T_i^serial(n_i)   Serial component of the function T_i(n_i).

nonln T n ) Component of the functon T ( n ) other than D ( a, scal seral T ( n ) and T ( n ). Total number of data ponts avalable for creatng the performance functon model for fragment., b, c d Parameters assocated wth the performance functon, T ( n ), of task-. Wall-clock tme obtaned from solvng the allocaton problem. j j run of fragment, j 1,..., D, n the benchmarkng stage. j run of fragment j, j 1,..., D, n the benchmarkng stage. y Observed wall-clock tme n the th n Number of cores allocated n the th A. Performance Model Choosng an approprate performance model s one of the most mportant steps n desgnng a successful SLB algorthm. Over the years many performance models have been developed [53]. Many of parallel performance models begn by dentfyng sequental and parallel components of the executon tme n accordance wth Amdahl s law. They try to capture the salent features of the calculaton n terms of the key parameters of the problem. For the FMO applcaton consdered here, the key feature s the coarsegraned parallelsm, whch can be captured by selectng mathematcal models for the run tme of each fragment ndependently. In ths work, we use the nonlnear model scal nonln seral a c T ( n ) T ( n ) T ( n ) T b n d, 1,..., F, n (3) where T ) represents the wall-clock tme to compute the ( n th fragment as a functon of n the number of processor cores allocated to process t. The three components of T ( n ) are descrbed next. scal The quantty T ( n ) represents the component of the wall-clock tme wth perfect (or lnear) scalablty. It s a monotoncally decreasng functon that asymptotcally seral approaches zero. The quantty T ( n ), on the other hand, represents the tme spent n the nonparallelzed component of the applcaton. It s ndependent of the number of cores n and ncludes any purely seral part of code. From the mathematcal pont of vew t s a constant that defnes the mnmum value of T ) nonln ( T ( n. nonln ( n (gnorng T n ) to domnate ) ). 
As n ncreases, ( n seral T s expected The quantty T ) represents the component of the scal ( wall-clock tme that s not descrbed by ether T n ) or seral ( T n ). It represents the tme spent n code that s only partally parallelzed or depends on n n a way more complcated than the other two components. An example of a partally parallel component of our applcaton s the dagonalzaton of the Fock matrx n the self-consstent nonln feld (SCF) method. Generally, T ( n ) may nclude tme spent n actvtes such as ntalzaton, communcaton, and nonln synchronzaton. Our choce of the form of T ( n ) gves our model the ablty to account for all these components wthout constranng t to be an ncreasng or decreasng functon. The sgn of the parameters b and c determnes the shape of the functon, and consequently every fragment nonln may have a dfferent shape of T n ). ( n The functonal form of T ) seems to make sense both mathematcally and from the vewpont of Amdahl s law. From the mathematcal perspectve, one component of T ( n ) decreases, whle another ncreases wth n. The functon may ncrease or decrease for dfferent values of n dependng on the domnatng component for that number of cores. Two real examples of T ( n ) are llustrated n Fg. 2, where the probed range of the number of cores s not large enough to observe a complex behavor and T ( n ) s a smoothly decreasng functon. From the perspectve of Amdahl s law, n the absence of the complcatng nonln scal component T n ), T n ) accounts for largest ( contrbuton when contrbuton to for large n. B. Fttng Data ( ( n s small, whle seral T s the largest We estmate the parameters a, b, c, and d used n Eq. (3) by fttng the values of wall-clock tme of each fragment over the frst SCC teraton for dfferent CPU core groupngs. In other words, we perform calculatons of each fragment n the embeddng potental, varyng the number of cores per GDDI group. The tmngs are collected as a functon of the number of cores per group, and we ft the coeffcents. 
In the future we plan to examine the possibilities of using several SCC iterations for the fitting.

For the i-th fragment, we obtain the best fit by solving the least squares problem

  min_{a,b,c,d} sum_{j=1}^{D} ( y_j - a/n_j - b*n_j^(-c) - d )^2  (4)
  subject to a, b, c, d >= 0,

where y_j is the observed value of the time taken in solving fragment i when n_j cores are allocated to it. D is the number of different GDDI group sizes tried in the fitting procedure (in this paper, D varied from 3 to 7, depending on the system). The objective function of the optimization problem (4) is in general not convex, and there may be several locally optimal solutions of the problem. Since nonlinear optimization algorithms are iterative, selecting a different starting point may lead the solver to a different local solution. We experimented with different starting solutions and observed that even though the parameter values may differ, the solution value of problem (4) did not vary significantly. More important was the observation that the differences in parameter values did not translate into significant differences in the optimal allocation of cores that we calculate in the next step.

We have constrained the variables in our fitting problem Eq. (4) to be nonnegative even though doing so is not necessary mathematically. It makes sense for parameters a, b, and d to be positive because they represent values of time. It is less obvious what the constraint for c should be. In general, T_i^nonlin(n_i) can be increasing or decreasing, but we prefer a positive c because our application is highly scalable: the total time does not increase even when the number of cores used in production runs is much larger than that in trial runs for gathering data. Thus, a positive value of c ensures that our model has a better fit even when we extrapolate it to a large number of cores. Examples of the fitted a, b, c, and d can be seen for the smallest and the largest fragments in Fig. 2.

Since the values y_j are gathered from actual runs on the system, it is important to judiciously choose trial values of n_j in the data-gathering stage. There is an obvious trade-off between the time taken to obtain y_j and the quality of the model. Since the solution procedure in GAMESS is iterative and the nature of the work is similar for all iterations, we can model the functions using time observations for a single iteration only. It helps us save time without sacrificing accuracy. To obtain good estimates of a, b, c, and d, we recommend sampling n_j from a large range of core counts: from a few to thousands for each fragment. In order to avoid over-fitting, the number of samples should be at least greater than four for each fragment. We used eight samples in our experiments. The number of samples should obviously increase with the level of noise in the application and the number of parameters to be estimated. In general, one should judiciously pick samples based on a priori knowledge of the tasks.
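As an illustration only, the fitting step can be sketched in Python. This is not the solver used in this work; `model` and `fit_fragment` are hypothetical helpers that exploit the fact that, once c is fixed, the model of Eq. (3) is linear in a, b, and d, so a grid search over c combined with ordinary least squares suffices:

```python
import numpy as np

def model(n, a, b, c, d):
    """Performance model of the form T(n) = a/n + b*n**(-c) + d."""
    return a / n + b * n ** (-c) + d

def fit_fragment(n_samples, y_samples):
    """Fit nonnegative a, b, c, d to observed times, in the spirit of the
    least squares problem (4): grid-search c; for each fixed c the model
    is linear in (a, b, d), so ordinary least squares can be used."""
    best = None
    for c in np.linspace(0.0, 3.0, 301):
        A = np.column_stack([1.0 / n_samples,
                             n_samples ** (-c),
                             np.ones_like(n_samples)])
        coef, *_ = np.linalg.lstsq(A, y_samples, rcond=None)
        if np.any(coef < 0):          # enforce the nonnegativity of (4)
            continue
        resid = float(np.sum((A @ coef - y_samples) ** 2))
        if best is None or resid < best[0]:
            best = (resid, coef[0], coef[1], c, coef[2])
    return best[1:]                   # (a, b, c, d)
```

With noiseless synthetic timings generated from the model this recovers the generating parameters; for real benchmark data the paper instead solves (4) with a general nonlinear solver.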
Lacking such knowledge, we began by dividing the available cores equally among all groups. This approach proved satisfactory for systems with similar-sized fragments (along with a consistent theory and basis set). In cases when, for example, a ligand is much larger than the largest amino acid, a more sophisticated allocation for sampling may be needed. We also note that the recorded times do not include FMO initialization and intergroup synchronization time, but they do include all intragroup computation and communication, including synchronization.

Our procedure of first collecting data and then rerunning the full application from scratch can be improved. We use our simple procedure to demonstrate the effectiveness of using an optimal allocation of cores. Our procedure can be modified with little effort to reuse more information from the data collection stage for the solving stage.

C. Formulating the Optimization Problem

Once we have identified an appropriate performance model and obtained values of all parameters from the previous steps, we can formulate an optimization problem to find the optimal allocation of cores. The decision variables that we seek to optimize are the numbers of processors, n_i, to be allocated to each fragment i in {1, ..., F}. The choice of objective that we seek to minimize or maximize depends on the preference of the user. To minimize the total wall-clock time of the application, the following min-max function can be used:

  min_n max_{i=1,...,F} T_i(n_i).  (5)

Alternatively, the objective function is just the sum of times used by each task,

  min_n sum_{i=1}^{F} T_i(n_i).  (6)

One can also seek to maximize the minimum time used by a task. Like the min-max criterion, the max-min criterion also seeks to obtain a fair distribution of cores by taking away allocations from the fastest tasks. It is written as

  max_n min_{i=1,...,F} T_i(n_i).  (7)

The physical restrictions of the system can be modeled by adding constraints to the optimization problem; for example, the number of cores used in calculations cannot exceed the total number of available cores, N:

  sum_{i=1}^{F} n_i <= N.  (8)

We can also have constraints based on the user's preferences; e.g., the user may wish to minimize the wall-clock time with an additional constraint that the total core time must be below a threshold T:

  sum_{i=1}^{F} T_i(n_i) <= T.  (9)

Some constraints may be needed to make the model amenable to the solver. In particular, most solvers require the derivatives of the objective and constraints to be continuous. The min-max objective function should therefore be replaced by an objective of minimizing a new variable, say η, and additional constraints must be introduced to ensure η is no less than each T_i(n_i). The full model is

  min_{n,η} η
  subject to sum_{i=1}^{F} n_i <= N,
             η >= a_i/n_i + b_i*n_i^(-c_i) + d_i,  i = 1, ..., F,
             n_i >= 0 and integer,  i = 1, ..., F.  (10)
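The min-max model (10) is solved in this work with a general MINLP solver. As an illustrative alternative, when every fitted T_i(n_i) is strictly decreasing in n_i (the common case observed in Fig. 2), the same allocation can be found by bisection on the makespan η. The sketch below rests on that monotonicity assumption; `minmax_allocate` is a hypothetical helper, not part of the HSLB code, and the parameter values in the usage are illustrative:

```python
def minmax_allocate(params, N):
    """Bisect on the makespan eta: a target eta is feasible iff the smallest
    integer n_i with T_i(n_i) <= eta, summed over fragments, is at most N.
    Assumes each T_i(n) = a/n + b*n**(-c) + d is decreasing in n."""
    def T(p, n):
        a, b, c, d = p
        return a / n + b * n ** (-c) + d

    def min_cores(p, eta):
        if T(p, N) > eta:
            return None                # eta unreachable even with all N cores
        lo, hi = 1, N
        while lo < hi:                 # smallest n with T(p, n) <= eta
            mid = (lo + hi) // 2
            if T(p, mid) <= eta:
                hi = mid
            else:
                lo = mid + 1
        return lo

    lo = max(p[3] for p in params)     # eta can never drop below any d_i
    hi = max(T(p, 1) for p in params)  # one core each is always feasible
    for _ in range(100):               # bisection on eta
        eta = 0.5 * (lo + hi)
        need = [min_cores(p, eta) for p in params]
        if any(m is None for m in need) or sum(need) > N:
            lo = eta
        else:
            hi = eta
    alloc = [min_cores(p, hi) for p in params]
    while sum(alloc) < N:              # give leftover cores to the slowest task
        i = max(range(len(alloc)), key=lambda k: T(params[k], alloc[k]))
        alloc[i] += 1
    return alloc
```

For the general model, in which T_i may also increase with n_i, this shortcut no longer applies, which is why the paper relies on branch-and-bound MINLP methods.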

We considered the three objective functions described above, together with constraint (8), in our models. We observed in our experiments that the min-max function (5) outperforms the other objectives, which makes sense from the viewpoint of minimizing the overall wall-clock time. Minimizing the total time, at the other extreme, may lead to a solution where one fragment is solved in an exceptionally large time (Fig. 7), thus keeping the other processors waiting.

Figure 7: Allocation of different solvers: (A) minimizing total time, (B) maximizing the minimum group time, and (C) minimizing the maximum group time. The height of each column represents the time to compute one fragment, and the width of each column represents how many cores were assigned. The dataset is for the complex of Aurora-A kinase and its inhibitor, which was collected on 1024 cores of Blue Gene/P for FMO at the RHF/6-31G* level.

D. Solving the MINLP Model

MINLP problems, of which the optimization problem Eq. (10) is a special case, are NP-hard in general. Certain specific classes of MINLP, such as single-constraint resource-constrained problems with nonincreasing objective functions, can be solved in polynomial time [54], but they require customized algorithms. Hence we consider algorithms for general MINLPs only. The algorithms to solve general MINLPs are usually based on the branch-and-bound method [55]. These methods are guaranteed to provide an optimal solution or show that none exists. In addition to the number of variables and constraints, the time required to solve these problems depends on the type of functions used in the objective and constraints. For instance, if all the nonlinear functions are convex, then a local solution of the continuous relaxation is also its global solution. Several specialized algorithms exploit this fact and other useful properties of convex functions [55-60]. On the other hand, if any function is not convex, then the continuous relaxation does not give a bound on the objective value. In this case, one needs to further relax the continuous problem by introducing new variables and modifying the constraints [61, 62].

We wrote our optimization problem in the AMPL [63] modeling language. AMPL enables users to write optimization models using simple mathematical notation. It also provides derivatives of nonlinear functions automatically, and it can be used with several different solvers. To solve the problem, we used the open-source solver toolkit MINOTAUR [28]. MINOTAUR offers different solvers based on the algorithms mentioned above and also offers advanced routines to reformulate MINLPs. It provides libraries that can be called from other C++ and FORTRAN codes and hence can be used directly without requiring AMPL. For solving our problem, we use the LP/NLP [56] solver implemented in MINOTAUR. Since the coefficients a, b, c are positive, the nonlinear functions are convex, and this algorithm finds a global solution of the problem. We briefly describe this algorithm next.

The LP/NLP algorithm is initialized by first creating a linear relaxation of the MINLP. Suppose we have a nonlinear constraint of the form f(x) <= 0, where f is a continuously differentiable convex function. A linear relaxation of the constraint is obtained by the linearization around any point x^k,

  f(x^k) + ∇f(x^k)^T (x - x^k) <= 0.  (11)

In general, the greater the number of linearization constraints obtained from distinct points, the closer the relaxation is to the original problem. However, a large number of constraints can slow the solver. In order to mitigate this problem, linearization constraints derived from only a single point are added initially. This point is obtained by solving the continuous nonlinear programming (NLP) problem. We later add linearization constraints for only those nonlinear constraints that are violated significantly by the solution. After the initial linear programming (LP) relaxation is created, it is added to a list of unsolved sub-problems. The value of the incumbent solution of the MINLP is initialized to infinity. In each step of the algorithm, we remove a sub-problem from the list and solve the linear relaxation using an LP solver.
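The linearization of Eq. (11) can be made concrete with a small sketch (the parameter values below are illustrative, not taken from the paper). Because each constraint function is convex, every tangent cut underestimates the function everywhere, so an LP assembled from such cuts is a valid relaxation of the nonlinear problem:

```python
def f(n, a=120.0, b=30.0, c=0.5, d=2.0):
    """A fitted per-fragment time T(n); convex for n > 0 when a, b, c >= 0."""
    return a / n + b * n ** (-c) + d

def df(n, a=120.0, b=30.0, c=0.5, d=2.0):
    """Derivative of f, used to build the linearization of Eq. (11)."""
    return -a / n ** 2 - b * c * n ** (-c - 1)

def linearization(nk):
    """Return the affine underestimator g(n) = f(nk) + f'(nk) * (n - nk)."""
    fk, gk = f(nk), df(nk)
    return lambda n: fk + gk * (n - nk)

# Cuts generated at a few distinct points: each touches f at its own point
# and never overestimates f elsewhere, by convexity.
cuts = [linearization(nk) for nk in (2.0, 16.0, 128.0)]
```

Adding more cuts at points where the current LP solution violates the nonlinear constraint tightens the relaxation, which is exactly the refinement step of the LP/NLP algorithm described above.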
If the solution value is greater than the incumbent, we discard this sub-problem because it does not contain any solution better than the incumbent. If the solution x̂ of the LP problem has fractional values, we create two new sub-problems by branching: we choose an integer variable x for which x̂ is fractional; in one sub-problem we add the constraint x <= floor(x̂), and in the other we add the constraint x >= ceil(x̂). These two sub-problems are added to the list of unsolved sub-problems. If x̂ satisfies the integer constraints, we check whether it satisfies all the nonlinear constraints as well. If it is feasible, then we have an incumbent solution. Otherwise, we add more linearization constraints around x̂ of the form shown in Eq. (11), and continue. The algorithm terminates when the list is empty. In MINOTAUR, the LP problems are solved by using the CLP solver [64], and the NLP problems use filterSQP [65]. In the worst case, the algorithm may require solving an exponential (in the number of integer variables) number of LP and NLP problems, but in practice it takes much fewer. For example, the MINLP for 4096 cores took < 180 seconds on one core to solve, and made 12863 calls to the LP solver and 2 calls to the NLP solver. For 16384 cores, these numbers were 165 seconds, 9883, and 2, respectively.

E. Summary of HSLB Algorithm

Before presenting the results of our experiments, we summarize the four-step HSLB algorithm and discuss some ways of further improving it.

(1) Gather Data: Perform a single SCC iteration for the given molecular system (protein-ligand complex) with FMO by executing GAMESS D times using different total numbers of cores, with suitable choices for D. Collect the running times y_j for each fragment.

(2) Fit: Next, solve F different least squares problems (4) to determine the coefficients a_i, b_i, c_i, and d_i in Eq. (3) for each fragment.

(3) Solve: Determine the best allocation by solving the MINLP (10), and obtain the optimal group size n_i for each fragment.

(4) Execute: Execute the complete FMO run with GAMESS, using the group sizes determined in step (3).

This algorithm, being of a general nature, can be improved in several ways for a given application. The data-gathering step (1) can be avoided altogether if reliable benchmarks are already available, for example, from previous experiments. Steps (2) and (3) can be performed by calling a MINLP solver directly from the application, thus avoiding the use of AMPL.
The least-squares problem can be solved with a MINLP solver by just calling its nonlinear solver once. After it is solved, the MINLP solver can then solve the MINLP of step (3). More improvements are possible if the HSLB procedure is called more than once to reallocate the cores after a few iterations of the complete run. The running times of all iterations can be stored, a better fit obtained, and the MINLP re-solved to obtain a better allocation based on the new data.

In this work, we applied HSLB only to the monomer step in FMO, which is iterative and typically requires running each monomer calculation 20-25 times. We used DLB for the dimer step, which involves computing each dimer once; thus the benefits of an optimized allocation in HSLB do not merit its application given the need to do preliminary data gathering. However, in the future it is conceivable to construct a good guess for an optimum node allocation in dimers based on the monomer data, which would accelerate the dimer step as well. The load imbalance in dimers is also less severe than in monomers, because the number of dimers for which quantum-mechanical calculations are performed is typically 3-4 times the number of fragments F (dimers that are spatially well separated are computed with a very fast electrostatic approximation) [44].

V. RESULTS AND DISCUSSION

The performance of HSLB is compared to that of DLB with different numbers of groups for the system of Aurora kinase and the inhibitor phthalazinone (see Fig. 1 (A) and (B)). This system has 155 fragments. Fig. 4 shows that the HSLB scheme outperforms the DLB schemes by at least a factor of two in the wall-clock time. We also found that some DLB schemes have scalability similar to HSLB.

We also make other observations about the performance of HSLB. Fig. 5 shows that HSLB has the lowest synchronization time even on thousands of processors. Since the synchronization time becomes important when a large number of CPU cores are used, HSLB should be preferred for such systems. The HSLB algorithm also shows excellent efficiency, greater than 90% on large numbers of cores, as shown in Fig. 6.

As the number of cores increases, we anticipated that the scalability and efficiency of HSLB might deteriorate. To quantify this deterioration, we tested the performance of HSLB for larger processor counts using a larger problem: COX-1 complexed with ibuprofen (see Fig. 1 (C) and (D)), with a total of 1093 fragments and 17,767 atoms. Fig. 8 shows that the COX-1 calculation achieves 80% efficiency averaged over all fragments for the SCC iterations in FMO on 163,840 cores at the RHF/6-31G* level of theory. The single-point energy calculation takes only ~54 minutes (6+ years on a single core). The results obtained at this computational level strongly suggest that significantly higher processor counts can be efficiently utilized for larger problems.

Figure 8: Ideal and observed scalability curves based on wall-clock time for the first FMO SCC iteration on Blue Gene/P for COX-1 complexed with ibuprofen. All calculations are done in a dual mode that restricts processes to 2 MPI tasks per node. Efficiency averaged over all the fragments is shown for each run.

While HSLB outperforms DLB, it still exhibits a small decline in scalability and efficiency at high processor counts for both the Aurora kinase and COX-1 calculations. This decline may be due to sequential steps in the fragment SCF and to fluctuations in the synchronization time caused by runtime operating-system tasks, shared network issues, hardware failures, deficiencies of the performance model or benchmarking data, and so forth. It is commonly understood that for these reasons synchronization becomes more problematic as the number of processors increases. Although these fluctuations do not appear in Fig. 5 (only averaged values are shown), the use of a low level of theory (RHF/6-31G*) here has helped uncover the limitations of the HSLB approach by raising the significance of the synchronization time (for density functional theory (DFT) one can expect a better parallel efficiency because of the high scaling of the DFT-specific grid integration). From the data, the operating limits of HSLB on Blue Gene/P would appear to be anywhere from three cores per task up to the point where random computational noise (>100 thousand cores) hampers the ability to predict the time to solution for tasks.

We have identified directions for further improving our load-balancing approach. We observed that the iteration time in our application is not constant but tends to decrease, because successive SCC iterations typically require fewer micro-iterations to converge the density. Moreover, this behavior is not uniform over different fragments because they converge at different rates. We propose to apply HSLB adaptively. We can fit scalability curves, obtain the nonlinear equations, and solve for the optimal allocation for all SCC iterations, as described in Section IV. To this end, we have interfaced MINOTAUR directly with GAMESS on Blue Gene/P. This enables us to optimize directly without making system calls to execute the AMPL model. We have not included results for adaptive HSLB here because our goal is to present the fundamental HSLB concept.
That said, adaptive HSLB offers a promising direction for future development because it combines the efficiency of HSLB with the adaptability of DLB.

VI. CONCLUSIONS

The method development in this paper is an evolution of the parallelization of a complex quantum-mechanical program, GAMESS [24, 25]: over dozens of CPU cores in DDI introduced in 2000 [46], extended to hundreds with GDDI in 2004 [48] as demonstrated on a powerful supercomputer of that time (in 2005 [49]). The manual variation of the group size in GDDI to optimize its performance, used in a Supercomputing-2005 paper [49], inspired the present work, which we have conducted based on advanced mathematical methods guaranteeing the best allocation for a given number of cores and a molecular system.

Although for fine-grained systems (water clusters) the previously developed load balancing has performed well up to about 130,000 CPU cores [27], coarse-grained systems (proteins) cannot be treated with high efficiency on modern petascale computers in the same way. We have shown that the present HSLB approach is twice as fast as the previous DLB method and achieves a parallel efficiency of about 80% on petascale core counts (hundreds of thousands of cores). Thus, from the user perspective, HSLB enables FMO to handle automatically very large problems with diverse fragment sizes.

Many interesting cases fall into the latter category. For example, in the study of photosynthesis, the reaction center [49, 66] features the chlorophyll special pair, which is large and difficult to fragment for chemical reasons (significant electron delocalization across the planar system). Another common situation is found in drug design, where the drug molecules often have 50-100 atoms with extended conjugation. Such large fragments typically coexist with many small ones, such as explicit water molecules having only three atoms per fragment.
Where DLB based on uniform group sizes would be unable to utilize many cores effectively for such systems, by fitting the GDDI group sizes to the fragments HSLB can efficiently utilize CPU core counts in the 100,000 range with negligible overhead. In this sense, HSLB is similar in spirit to the use of preliminary benchmarks in previous work to guess the optimum group sizes [49].

Our current era of petascale computing already has an eye on the coming exascale era, and the development of software capable of efficiently utilizing many thousands or millions of CPU cores is a topic of great interest. FMO accelerated by HSLB on petascale and exascale computers can become a powerful tool for drug and material design [44], realizing the high potential held by quantum-mechanical methods on massively parallel computers.

The present coarse-grained optimization algorithm is not limited to FMO. Many coarse-grained applications can benefit from the present approach. For instance, many other fragment-based methods can be similarly parallelized. As the number of cores increases, the issues of minimizing the synchronization time while retaining high efficiency will put load-balancing schemes to a highly stressful test. We believe that for coarse-grained applications our HSLB algorithm is a promising and general approach.

ACKNOWLEDGMENT

We thank Dr. R. Loy and the ALCF team members for discussions and help related to the paper. We thank Dr. M. Mazanetz from Evotec for providing the PDB structure of the Aurora-A kinase system used in our calculations. DGF thanks the Next Generation Super Computing Project, Nanoscience Program (MEXT, Japan) and the Computational Materials Science Initiative (CMSI, Japan) for financial support, and Prof. K. Kitaura for fruitful discussions. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ("Argonne") under Contract No. DE-AC02-06CH11357 with the U.S. Department of Energy. The U.S. Government retains for itself, and others acting on its behalf, a paid-up, nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. This work was also supported by the U.S. Department of Energy through grant DE-FG02-05ER25694.

REFERENCES

[1] C. Xu and F. C. M. Lau, Load Balancing in Parallel Computers: Theory and Practice. Norwell, MA: Kluwer Academic Publishers, 1997.
[2] K. D. Devine, E. G. Boman, R. T. Heaphy, B. A. Hendrickson, J. D. Teresco, J. Faik, J. E. Flaherty, and L. G. Gervasio, "New challenges in dynamic load balancing," Applied Numerical Mathematics, vol. 52, pp. 133-152, 2005.
[3] M. H. Willebeek-LeMair and A. P. Reeves, "Strategies for dynamic load balancing on highly parallel computers," IEEE Transactions on Parallel and Distributed Systems, vol. 4, pp. 979-993, 1993.
[4] Y. Bejerano, S. J. Han, and L. E. Li, "Fairness and load balancing in wireless LANs using association control," in Proceedings of the 10th Annual International Conference on Mobile Computing and Networking, New York, NY, 2004, pp. 315-329.
[5] B. Y. Zhang, Z. Y. Mo, G. W. Yang, and W. M. Zheng, "An efficient dynamic load-balancing algorithm in a large-scale cluster," Distributed and Parallel Computing, pp. 174-183, 2005.
[6] M. J. Zaki, W. Li, and S. Parthasarathy, "Customized dynamic load balancing for a network of workstations," in Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing, Syracuse, NY, 1996, pp. 282-291.
[7] R. D. Williams, "Performance of dynamic load balancing algorithms for unstructured mesh calculations," Concurrency: Practice and Experience, vol. 3, pp. 457-481, 1991.
[8] M. J. Berger and S. H. Bokhari, "A partitioning strategy for nonuniform problems on multiprocessors," IEEE Transactions on Computers, vol. 100, pp. 570-580, 1987.
[9] H. D. Simon, "Partitioning of unstructured problems for parallel processing," Computing Systems in Engineering, vol. 2, pp. 135-148, 1991.
[10] V. E. Taylor and B. Nour-Omid, "A study of the factorization fill-in for a parallel implementation of the finite element method," International Journal for Numerical Methods in Engineering, vol. 37, pp. 3809-3823, 1994.
[11] M. S. Warren and J. K. Salmon, "A parallel hashed oct-tree n-body algorithm," in Proceedings of the ACM/IEEE Supercomputing 1993 Conference, Portland, 1993, pp. 12-21.
[12] J. R. Pilkington and S. B. Baden, "Partitioning with spacefilling curves," CSE Technical Report CS94-349, Dept. of Computer Science and Engineering, University of California, San Diego, CA, 1994.
[13] A. Patra and J. T. Oden, "Problem decomposition for adaptive hp finite element methods," Computing Systems in Engineering, vol. 6, pp. 97-109, 1995.
[14] J. E. Flaherty, R. M. Loy, M. S. Shephard, B. K. Szymanski, J. D. Teresco, and L. H. Ziantz, "Adaptive local refinement with octree load balancing for the parallel solution of three-dimensional conservation laws," Journal of Parallel and Distributed Computing, vol. 47, pp. 139-152, 1997.
[15] A. Pothen, H. D. Simon, and K. P. Liou, "Partitioning sparse matrices with eigenvectors of graphs," SIAM Journal on Matrix Analysis and Applications, vol. 11, pp. 430-452, 1990.
[16] E. Leiss and H. Reddy, "Distributed load balancing: design and performance analysis," WM Keck Research Computation Laboratory, vol. 5, pp. 205-270, 1989.
[17] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM Journal on Scientific Computing, vol. 20, p. 359, 1999.
[18] Y. F. Hu and R. J. Blake, "An optimal dynamic load balancing algorithm," Technical Report DL-P-95-011, Daresbury Laboratory, Warrington, WA4 4AD, UK, 1995.
[19] B. Hendrickson and R. Leland, "A multilevel algorithm for partitioning graphs," in Proceedings of the ACM Supercomputing 1995 Conference, New York, 1995, pp. 28-42.
[20] G. Cybenko, "Dynamic load balancing for distributed memory multiprocessors," Journal of Parallel and Distributed Computing, vol. 7, pp. 279-301, 1989.
[21] T. Bui and C. Jones, "A heuristic for reducing fill in sparse matrix factorization," in SIAM Conference on Parallel Processing for Scientific Computing, Philadelphia, PA, 1993, pp. 445-452.
[22] S. H. Bokhari, "On the mapping problem," IEEE Transactions on Computers, vol. 100, pp. 207-214, 1981.
[23] S. H. Bokhari, Assignment Problems in Parallel and Distributed Computing, vol. 32. New York, NY: Springer-Verlag, 1987.
[24] M. W. Schmidt, K. K. Baldridge, J. A. Boatz, S. T. Elbert, M. S. Gordon, J. H. Jensen, S. Koseki, N. Matsunaga, K. A. Nguyen, S. Su, T. L. Windus, M. Dupuis, and J. A. Montgomery, "General atomic and molecular electronic structure system," Journal of Computational Chemistry, vol. 14, pp. 1347-1363, 1993.
[25] M. S. Gordon and M. W. Schmidt, "Advances in electronic structure theory: GAMESS a decade later," in Theory and Applications of Computational Chemistry: The First Forty Years, C. Dykstra, G. Frenking, K. Kim, and G. Scuseria, Eds. Elsevier Science, 2005, pp. 1167-1189.
[26] Argonne National Laboratory: Argonne Leadership Computing Facility. Available: http://www.alcf.anl.gov/
[27] G. D. Fletcher, D. G. Fedorov, S. R. Pruitt, T. L. Windus, and M. S. Gordon, "Large-scale MP2 calculations on the Blue Gene architecture using the Fragment Molecular Orbital method," Journal of Chemical Theory and Computation, vol. 8, pp. 75-79, 2012.
[28] A. Mahajan, S. Leyffer, J. Linderoth, J. Luedtke, and T. Munson. MINOTAUR wiki. Available: http://www.mcs.anl.gov/minotaur (January 16, 2012)

[29] R. Zalesny, M. G. Papadopoulos, P. G. Mezey, and J. Leszczynski, Linear-Scaling Techniques in Computational Chemistry and Physics. New York, NY: Springer, 2011.
[30] J. R. Reimers, Computational Methods for Large Systems: Electronic Structure Approaches for Biotechnology and Nanotechnology. Singapore: Wiley, 2011.
[31] E. Apra, R. J. Harrison, W. Shelton, V. Tipparaju, and A. Vázquez-Mayagoitia, "Computational chemistry at the petascale: Are we there yet?," in Journal of Physics: Conference Series, 2009, p. 012027.
[32] Y. Hasegawa, J. I. Iwata, M. Tsuji, D. Takahashi, A. Oshiyama, K. Minami, T. Boku, F. Shoji, A. Uno, and M. Kurokawa, "First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer," in Proceedings of the ACM/IEEE Supercomputing 2011 Conference, Seattle, 2011, pp. 1-11.
[33] E. Apra, A. P. Rendell, R. J. Harrison, V. Tipparaju, W. A. de Jong, and S. S. Xantheas, "Liquid water: obtaining the right answer for the right reasons," in Proceedings of the ACM/IEEE Supercomputing 2009 Conference, Portland, 2009, p. 66.
[34] K. Kowalski, S. Krishnamoorthy, R. M. Olson, V. Tipparaju, and E. Aprà, "Scalable implementations of accurate excited-state coupled cluster theories: Application of high-level methods to porphyrin-based systems," in Proceedings of the ACM/IEEE Supercomputing 2011 Conference, Seattle, 2011, pp. 1-10.
[35] Y. Alexeev, R. A. Kendall, and M. S. Gordon, "The distributed data SCF," Computer Physics Communications, vol. 143, pp. 69-82, 2002.
[36] Y. Alexeev, M. W. Schmidt, T. L. Windus, and M. S. Gordon, "A parallel distributed data CPHF algorithm for analytic Hessians," Journal of Computational Chemistry, vol. 28, pp. 1685-1694, 2007.
[37] M. Krishnan, Y. Alexeev, T. L. Windus, and J. Nieplocha, "Multilevel parallelism in computational chemistry using Common Component Architecture and Global Arrays," in Proceedings of the ACM/IEEE Supercomputing 2005 Conference, Seattle, 2005, pp. 23-23.
[38] G. Fletcher, "A parallel multi-configuration self-consistent field algorithm," Molecular Physics, vol. 105, pp. 2971-2976, 2007.
[39] M. Challacombe and E. Schwegler, "Linear scaling computation of the Fock matrix," Journal of Chemical Physics, vol. 106, pp. 5526-5536, 1997.
[40] R. J. Harrison, G. I. Fann, T. Yanai, Z. Gan, and G. Beylkin, "Multiresolution quantum chemistry: Basic theory and initial applications," Journal of Chemical Physics, vol. 121, p. 11587, 2004.
[41] M. S. Gordon, S. R. Pruitt, D. G. Fedorov, and L. V. Slipchenko, "Fragmentation methods: a route to accurate calculations on large systems," Chemical Reviews, vol. 112, pp. 632-672, 2012.
[42] S. Hirata, M. Valiev, M. Dupuis, S. S. Xantheas, S. Sugiki, and H. Sekino, "Fast electron correlation methods for molecular clusters in the ground and excited states," Molecular Physics, vol. 103, pp. 2255-2265, 2005.
[43] K. Kitaura, E. Ikeo, T. Asada, T. Nakano, and M. Uebayasi, "Fragment molecular orbital method: an approximate computational method for large molecules," Chemical Physics Letters, vol. 313, pp. 701-706, 1999.
[44] D. G. Fedorov, T. Nagata, and K. Kitaura, "Exploring chemistry with the Fragment Molecular Orbital method," Physical Chemistry Chemical Physics, vol. 14, pp. 7562-7577, 2012.
[45] D. G. Fedorov and K. Kitaura, "The importance of three-body terms in the fragment molecular orbital method," Journal of Chemical Physics, vol. 120, pp. 6832-6840, 2004.
[46] G. D. Fletcher, M. W. Schmidt, B. M. Bode, and M. S. Gordon, "The distributed data interface in GAMESS," Computer Physics Communications, vol. 128, pp. 190-200, 2000.
[47] J. L. Bentz, R. M. Olson, M. S. Gordon, M. W. Schmidt, and R. A. Kendall, "Coupled cluster algorithms for networks of shared memory parallel processors," Computer Physics Communications, vol. 176, pp. 589-600, 2007.
[48] D. G. Fedorov, R. M. Olson, K. Kitaura, M. S. Gordon, and S. Koseki, "A new hierarchical parallelization scheme: Generalized distributed data interface (GDDI), and an application to the fragment molecular orbital method (FMO)," Journal of Computational Chemistry, vol. 25, pp. 872-880, 2004.
[49] T. Ikegami, T. Ishida, D. G. Fedorov, K. Kitaura, Y. Inadomi, H. Umeda, M. Yokokawa, and S. Sekiguchi, "Full electron calculation beyond 20,000 atoms: Ground electronic state of photosynthetic proteins," in Proceedings of the ACM/IEEE Supercomputing 2005 Conference, Seattle, 2005, pp. 10-10.
[50] Y. Alexeev. FMO portal: Web interface for FMOtools. Available: http://www.fmo-portal.info (January 16, 2012)
[51] D. G. Fedorov, Y. Alexeev, and K. Kitaura, "Geometry optimization of the active site of a large system with the fragment molecular orbital method," Journal of Physical Chemistry Letters, vol. 2, pp. 282-288, 2011.
[52] D. G. Fedorov, T. Ishida, and K. Kitaura, "Multilayer formulation of the fragment molecular orbital method (FMO)," The Journal of Physical Chemistry A, vol. 109, pp. 2638-2646, 2005.
[53] C. L. Janssen and I. M. B. Nielsen, Parallel Computing in Quantum Chemistry. CRC Press, 2008.
[54] T. Ibaraki and N. Katoh, Resource Allocation Problems: Algorithmic Approaches. Cambridge, MA: The MIT Press, 1988.

[55] R. J. Dakin, "A tree-search algorithm for mixed integer programming problems," The Computer Journal, vol. 8, pp. 250-255, 1965.
[56] I. Quesada and I. E. Grossmann, "An LP/NLP based branch and bound algorithm for convex MINLP optimization problems," Computers & Chemical Engineering, vol. 16, pp. 937-947, 1992.
[57] M. A. Duran and I. E. Grossmann, "An outer-approximation algorithm for a class of mixed-integer nonlinear programs," Mathematical Programming, vol. 36, pp. 307-339, 1986.
[58] R. Fletcher and S. Leyffer, "Solving mixed integer nonlinear programs by outer approximation," Mathematical Programming, vol. 66, pp. 327-349, 1994.
[59] T. Westerlund and F. Pettersson, "An extended cutting plane method for solving convex MINLP problems," Computers & Chemical Engineering, vol. 19, pp. 131-136, 1995.
[60] A. Mahajan, S. Leyffer, and C. Kirches, "Solving mixed-integer nonlinear programs by QP-diving," Argonne National Laboratory, ANL/MCS-P2071-0312, 2012.
[61] R. Horst and T. Hoang, Global Optimization: Deterministic Approaches. Berlin: Springer-Verlag, 1996.
[62] M. Tawarmalani and N. V. Sahinidis, Convexification and Global Optimization in Continuous and Mixed-Integer Nonlinear Programming: Theory, Algorithms, Software, and Applications, vol. 65. Dordrecht: Kluwer Academic Publishers, 2002.
[63] R. Fourer, D. M. Gay, and B. Kernighan, AMPL: A Modeling Language for Mathematical Programming, 2nd Edition. Independence, KY: Cengage Learning, 2002.
[64] J. J. Forrest. Clp project. Available: http://projects.coin-or.org/clp (January 16, 2012)
[65] R. Fletcher and S. Leyffer, "Nonlinear programming without a penalty function," Mathematical Programming, vol. 91, pp. 239-269, 2002.
[66] T. Ikegami, T. Ishida, D. G. Fedorov, K. Kitaura, Y. Inadomi, H. Umeda, M. Yokokawa, and S. Sekiguchi, "Fragment molecular orbital study of the electronic excitations in the photosynthetic reaction center of Blastochloris viridis," Journal of Computational Chemistry, vol. 31, pp. 447-454, 2010.