Gold (Genetic Optimization for Ligand Docking)
G. Jones et al. 1996
LMU Institut für Informatik, LFE Bioinformatik, Cheminformatics, Structure based methods, J. Apostolakis
Genetic algorithms
- Inspired by evolution
- General principle: (figure not reproduced)
Gold GA
- Gold uses a genetic algorithm for optimization
- Steady state principle (single operations, no generations)
- No duplicates
- Roulette wheel selection of operators and parents
- Gray coding of binary features
- Approximate coding of conformation
The Gold chromosomes
- Each chromosome consists of two binary plus two integer strings
- The binary strings code the torsions of the ligand and the protein
  - In the protein, the single bonds to terminal H-bond donors are rotatable
- The integer strings code for the translation and orientation of the ligand, in terms of the H-bonds that are formed
  - If the Nth integer in the FIRST integer string has the value P, then the Nth H-donor in the ligand forms an H-bond with the Pth acceptor of the protein
  - If the Nth integer in the SECOND integer string has the value P, then the Nth H-acceptor in the ligand forms an H-bond with the Pth donor of the protein
- The actual position of the ligand is obtained with a least squares fit
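
As a rough illustration of the encoding described above, the following sketch decodes a Gray-coded binary string into torsion angles and reads an integer string as H-bond pairings. The 8-bit width per torsion, the angle range, the use of 0 for "no pairing", and all names are assumptions made for this example; they are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List

BITS_PER_TORSION = 8  # assumption: one byte per rotatable bond


def gray_to_binary(g: int) -> int:
    """Convert a reflected Gray code value to the ordinary binary integer."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def decode_torsions(bits: List[int]) -> List[float]:
    """Decode a Gray-coded bit string into torsion angles in [-180, 180) degrees."""
    step = 360.0 / (1 << BITS_PER_TORSION)
    angles = []
    for i in range(0, len(bits), BITS_PER_TORSION):
        gray = int("".join(str(b) for b in bits[i:i + BITS_PER_TORSION]), 2)
        angles.append(-180.0 + gray_to_binary(gray) * step)
    return angles


def decode_pairs(mapping: List[int]) -> Dict[int, int]:
    """Nth entry with value P pairs ligand group N with protein group P (1-based).

    Assumption: the value 0 means that the ligand group is left unpaired.
    """
    return {n: p for n, p in enumerate(mapping, start=1) if p != 0}


@dataclass
class Chromosome:
    ligand_torsion_bits: List[int]    # binary string 1
    protein_torsion_bits: List[int]   # binary string 2
    donor_to_acceptor: List[int]      # integer string 1: ligand donors -> protein acceptors
    acceptor_to_donor: List[int]      # integer string 2: ligand acceptors -> protein donors
```

With Gray coding, neighbouring angle values differ in a single bit, so a one-bit mutation can produce a small change in a torsion.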
The H-Bonds
Gold
1. A set of reproduction operators (crossover, mutation, etc.) is chosen. Each operator is assigned a weight.
2. An initial population is randomly created and the fitness of its members is determined.
3. An operator is chosen using roulette wheel selection, based on the operator weights (10 for crossover, 40 for mutation).
4. The parents are chosen with roulette wheel selection, based on fitness.
5. Offspring are obtained and their fitness is evaluated.
6. If not already present in the population, the children replace the least fit members of the population.
7. Stop after 100000 operations; otherwise go to step 3.
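
The steps above can be sketched as a small steady-state GA. This is a toy version under stated assumptions: the genome is a plain bit string, fitness values are non-negative and used directly as roulette wheel weights, and the function names are invented for the example. Only the operator weights (10 and 40) come from the slide.

```python
import random
from typing import List, Sequence, Tuple

Genome = Tuple[int, ...]


def roulette(weights: Sequence[float], rng: random.Random) -> int:
    """Return an index with probability proportional to its non-negative weight."""
    pick = rng.uniform(0.0, sum(weights))
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if pick <= running:
            return i
    return len(weights) - 1  # guard against floating-point round-off


def crossover(a: Genome, b: Genome, rng: random.Random) -> List[Genome]:
    """One-point crossover producing two children."""
    cut = rng.randrange(1, len(a))
    return [a[:cut] + b[cut:], b[:cut] + a[cut:]]


def mutate(a: Genome, rng: random.Random) -> List[Genome]:
    """Flip one randomly chosen bit."""
    i = rng.randrange(len(a))
    return [a[:i] + (1 - a[i],) + a[i + 1:]]


def steady_state_ga(fitness, n_bits=16, pop_size=20, n_ops=2000, seed=0):
    rng = random.Random(seed)
    op_weights = {"crossover": 10.0, "mutation": 40.0}  # step 1, weights from the slide
    pop = set()
    while len(pop) < pop_size:                           # step 2, without duplicates
        pop.add(tuple(rng.randint(0, 1) for _ in range(n_bits)))
    scored = {g: fitness(g) for g in pop}
    names = list(op_weights)
    for _ in range(n_ops):                               # step 7: fixed operation budget
        op = names[roulette(list(op_weights.values()), rng)]   # step 3
        members = list(scored)
        fits = [scored[g] for g in members]
        if op == "crossover":                            # step 4: fitness-based parents
            pa = members[roulette(fits, rng)]
            pb = members[roulette(fits, rng)]
            children = crossover(pa, pb, rng)
        else:
            children = mutate(members[roulette(fits, rng)], rng)
        for child in children:                           # steps 5 and 6
            if child in scored:
                continue                                 # no duplicates
            worst = min(scored, key=scored.get)
            del scored[worst]
            scored[child] = fitness(child)
    return max(scored, key=scored.get)
```

For example, `steady_state_ga(lambda g: sum(g) + 1)` searches for a bit string with as many ones as possible. There are no generations: each operation changes at most two members of the population.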
The energy function
- H-bonds
- VdW between protein and ligand (12-6 potential)
- Intra-ligand VdW
- The energy function of Gold is one of its strengths
Efficiency
- Depends strongly on the parameters (initial population, number of runs)
- The developers report very good results already with runs that take ~1 min per complex
Some results
A related approach
- Autodock initially used a simulated annealing / Monte Carlo (SA/MC) approach
- The main advantage of SA is the combination of global optimization (high temperature) with local optimization (lower temperature)
- For flexible molecules with more than 8 flexible dihedrals, SA turns out to be far too slow
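
To make the role of the temperature concrete, here is a minimal, generic SA sketch with Metropolis moves on a one-dimensional toy energy. It is not Autodock's implementation; the geometric cooling schedule, the Gaussian step, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(energy, x0, t_start=10.0, t_end=1e-3, n_steps=5000,
                        step=1.0, seed=0):
    """Minimise a 1-D energy with Metropolis moves and geometric cooling."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    cooling = (t_end / t_start) ** (1.0 / (n_steps - 1))
    t = t_start
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step)
        e_new = energy(x_new)
        # Metropolis criterion: always accept downhill moves; accept uphill
        # moves with probability exp(-dE / T), which shrinks as T decreases.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
        t *= cooling
    return best_x, best_e
```

At high temperature most uphill moves are accepted, so the walk explores globally; at low temperature the same loop behaves like a local minimizer.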
LGA
- A Lamarckian GA (LGA), or GA with local search (GALS), has been implemented
- The idea is to adapt each individual to its environment by performing a local search (LS, minimization)
- Optimization takes place directly on the chromosomes
- The effect of the minimization is passed on to the offspring
- Force field type of energy function
- G. M. Morris et al. 1998: comparison of SA, GA, LGA
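
The Lamarckian idea can be sketched in a few lines: each individual is locally minimized and the minimized genes overwrite the original chromosome, so offspring inherit the improvement. The random hill climbing below is only a stand-in for whatever local search the real program uses, and all names and parameters are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Genes = List[float]


def local_search(genes: Genes, energy: Callable[[Genes], float],
                 rng: random.Random, n_trials: int = 50,
                 step: float = 0.2) -> Tuple[Genes, float]:
    """Crude random hill climbing; returns the improved genes and their energy."""
    best, best_e = list(genes), energy(genes)
    for _ in range(n_trials):
        trial = [g + rng.gauss(0.0, step) for g in best]
        e = energy(trial)
        if e < best_e:
            best, best_e = trial, e
    return best, best_e


def lamarckian_update(population: List[Genes],
                      energy: Callable[[Genes], float],
                      rng: random.Random) -> List[Genes]:
    """Minimise every individual and overwrite its genes with the result."""
    updated = []
    for genes in population:
        improved, _ = local_search(genes, energy, rng)
        updated.append(improved)  # Lamarckian: the acquired improvement is inherited
    return updated
```

A non-Lamarckian variant would use the minimized energy only as the fitness value and keep the original genes in the population.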
LGA
SA/GA/LGA comparison (figure panels: SA, GA, LGA)
Conclusion GA
- GAs are very robust (default parameters used all along) and efficient (depending on the settings)
- They clearly outperform SA for docking problems (not in our hands)
- A significant part of the trick seems to be the combination with at least a crude type of local optimization
- Hydrogen bonds are crucial for docking
- How do GAs compare with systematic approaches?
GlamDock
Old GlamDock
- Gold-like interaction point matching search space
- Steady-state genetic algorithm search
- A ChemScore-like empirical function
New GlamDock
- Replaced the GA with a simpler MC/SA search + conformational stack
  - Simpler configuration
  - More efficient search
- Smooth, continuously differentiable ChemScore-based scoring
  - Allows a gradient-based minimization in torsion space
  - More effective identification of local minima
GlamDock (MCM)
Comparison between 8 different docking tools
- Bissantz et al. J. Med. Chem. 2000, 43, 4759-4767
- Kellenberger et al. PROTEINS: Structure, Function, and Bioinformatics 57:225-242 (2004)
8 Docking tools against each other
- Dock (negative image of binding site)
- FlexX (incremental construction)
- Fred (naive)
- Glide (systematic, funnel)
- Gold (GA)
- Slide (flexible protein side chains)
- Surflex (Det. GA)
- QXP (Monte Carlo)
- (Why not ICM?)
Sampling accuracy
Ranking accuracy
CPU time
GlamDock
Conclusion of comparison study
- Gold, Glide, Surflex, FlexX: best structure prediction (50-55%)
- Gold, Glide, Surflex, FlexX: best screening properties (50-55%)
Previous results
- Poor prediction of absolute free energies
- Reasonable results for virtual screening
- Docking and esp. virtual screening depend mainly on the scoring function
- Consensus scoring improves results significantly
Conclusion of flexible ligand docking
- Flexible redocking is doable
- Best methods: GAs and incremental construction (and MCM)
- Main problem is the evaluation of the structures (score)
- Possibly scoring functions have been fitted too strongly to redocking of known ligands
Flexible receptor
Flexible receptor
- Side chain flexibility
- Backbone flexibility
- Hinge bending
- Domain flexibility
- Even small differences can be important!
- Induced fit
- Protein mutants
- Homology modelling
Substate view of protein dynamics
Induced fit
- Folding free energy lies between 10 and 15 kcal/mol for many proteins
- Less favorable substates may be stabilized by certain ligands
- Most of the time the differences are not very large, yet significant
Side chain flexibility of proteins upon ligand binding
- Najmanovich et al. Proteins 39:261-268 (2000)
Number of flexible side chains per binding site
Amino acid type dependence
AA dependence related to N tor
Backbone / Side chain flexibility
Conclusions
- Relatively few side chains move on average (≤ 3 for 85% of cases)
- Polar side chains move most
- Side chain flexibility does not correlate with backbone flexibility
Flexible receptor docking
Methods
- Simulation: MC/MD, SA
- Fuzzy
- Discrete
  - Ensembles of structures
  - Rotamer libraries
FlexE
- H. Claussen et al. J. Mol. Biol. 2001, 308, 377-395
Protein flexibility
- Main idea: describe the protein structure variations with a set of protein structures representing the flexibility, mutations, or alternative models of a protein
- The variability considered by FlexE is defined by the differences within the given input structures
United protein description
- Data structure that manages the protein structure variations
- Contains an ensemble of up to 30 possible conformations of the protein
- Most of them are low-energy conformations of the same protein
United protein description - construction
- Superposition
- Clustering
United protein description - clustering
- The superimposed structures are combined by clustering each part separately
- Complete-linkage hierarchical clustering
- The clustered instances can be recombined to form new valid protein structures
Notation
- Component: all the atoms that belong to the same amino acid or a mutation of that amino acid; contains a backbone part and a side chain part
- Part: set of instances
- Instance: one of the alternative conformations
Incompatibility
- Two instances of the united protein description are incompatible if they cannot be realized simultaneously
  - Logical: two instances are alternatives to each other
  - Geometric: two logically compatible instances overlap
  - Structural: two instances of the same chain are unconnected
Incompatibility graph
Incompatibility graph
- The incompatibility is represented internally as a graph, with the instances as nodes and an edge connecting each pair of incompatible nodes
- Valid protein structures correspond to independent sets in the graph
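
A small sketch of this representation: the graph is stored as an adjacency dictionary, and a selection of instances is checked for independence. The instance names and the example edges are made up for illustration and do not come from FlexE.

```python
from itertools import combinations
from typing import Dict, Iterable, Set

# Hypothetical instances, named "<component><alternative>".
incompatible: Dict[str, Set[str]] = {
    "A1": {"A2"},         # logical: A1 and A2 are alternatives of the same part
    "A2": {"A1", "B1"},   # geometric: A2 overlaps with B1
    "B1": {"A2", "B2"},
    "B2": {"B1"},
}


def is_independent(selection: Iterable[str], graph: Dict[str, Set[str]]) -> bool:
    """True if no two selected instances are joined by an incompatibility edge."""
    return all(v not in graph[u] for u, v in combinations(list(selection), 2))
```

Here `{"A1", "B1"}` and `{"A2", "B2"}` are independent sets, while `{"A2", "B1"}` is not, because the two instances overlap.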
Selection of instances
- The ligand is placed fragment by fragment into the active site by the incremental construction algorithm
- After each construction step, all possible interactions are determined
- The scoring function is applied to each instance
- The independent set (IS) with the highest score is chosen
Independent set
- The IS can be assembled from the IS of the connected components (independent components!)
- Apply a modified version of the Bron-Kerbosch algorithm on the complementary graph (the compatibility graph), where independent sets correspond to cliques
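
The decomposition into connected components can be sketched as follows. For brevity, each component is solved by brute force instead of the modified Bron-Kerbosch search mentioned above, and scores are assumed to be positive per-instance values that simply add up; both are simplifications for the example, and all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Connected components of an undirected graph given as adjacency sets."""
    seen: Set[str] = set()
    comps: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        comp: Set[str] = set()
        stack = [start]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(graph[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def best_independent_set(nodes: Set[str], graph: Dict[str, Set[str]],
                         score: Dict[str, float]) -> Set[str]:
    """Highest-scoring independent set of one small component (brute force)."""
    ordered = sorted(nodes)
    best: Set[str] = set()
    best_score = 0.0
    for k in range(1, len(ordered) + 1):
        for subset in combinations(ordered, k):
            if all(b not in graph[a] for a, b in combinations(subset, 2)):
                s = sum(score[v] for v in subset)
                if s > best_score:
                    best, best_score = set(subset), s
    return best


def select_instances(graph: Dict[str, Set[str]],
                     score: Dict[str, float]) -> Set[str]:
    """Solve every component separately and take the union of the results."""
    chosen: Set[str] = set()
    for comp in connected_components(graph):
        chosen |= best_independent_set(comp, graph, score)
    return chosen
```

Because no edge connects two different components, the union of per-component independent sets is again an independent set, which is why the components can be treated separately.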
Enumerating all cliques (Bron, Kerbosch, 1973)
- Clique: maximal complete subgraph (maximal: cannot be extended)
- Two versions of the algorithm
  - Both are backtracking algorithms
  - The two algorithms are quite similar
  - The first goes through the cliques in an ordered fashion
  - The second optimizes the order of the search and visits larger cliques at the beginning
- Version I is mainly relevant for illustration purposes
Version I
- Three sets are important for the algorithms:
  - COMPSUB: the current set; it is extended or reduced by one point by travelling along the edges of the backtracking tree
  - CANDIDATES: the set of all points that will in due time serve as an extension to COMPSUB
  - NOT: the set of all points that have already served as an extension of the present configuration of COMPSUB
Version I
Recursive extension operator:
    Extend(COMPSUB, CANDIDATES, NOT, G)
        if CANDIDATES == ∅                  // cannot grow
            if NOT == ∅ print COMPSUB       // maximality
            return                          // backtrack
        end if
        for c ∈ CANDIDATES
            put c in COMPSUB
            update CANDIDATES and NOT       // remove all points not connected to the selected candidate (also for NOT)
            Extend(COMPSUB, CANDIDATES, NOT, G)
            remove c from COMPSUB and put it into NOT
        end for
        return
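
A runnable Python sketch of the recursive extension operator is given below. The graph is assumed to be a dictionary mapping each vertex to the set of its neighbours; the function name and the result format (a list of sets) are choices made for this example.

```python
from typing import Dict, List, Set


def bron_kerbosch_v1(graph: Dict[int, Set[int]]) -> List[Set[int]]:
    """Enumerate all maximal cliques of an undirected graph (no pivoting)."""
    cliques: List[Set[int]] = []

    def extend(compsub: Set[int], candidates: Set[int], not_set: Set[int]) -> None:
        if not candidates:                 # cannot grow
            if not not_set:                # maximality
                cliques.append(set(compsub))
            return                         # backtrack
        for c in sorted(candidates):
            compsub.add(c)
            # Keep only the points connected to c, in CANDIDATES and in NOT.
            extend(compsub, candidates & graph[c], not_set & graph[c])
            compsub.remove(c)
            candidates = candidates - {c}  # c has been used as an extension
            not_set = not_set | {c}

    extend(set(), set(graph), set())
    return cliques
```

For the graph `{1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}` the function reports the two cliques `{1, 2, 3}` and `{3, 4}`.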
Some remarks
- The lists NOT and CANDIDATES can be concatenated into a single local array: positions 1..ne hold NOT, positions ne+1..ce hold CANDIDATES
- For the indices ne, ce we have:
  - ne ≤ ce
  - ne = ce: CANDIDATES = ∅
  - ne = 0: NOT = ∅
  - ce = 0: NOT = CANDIDATES = ∅ → clique found
- If ne+1 is the current candidate, then all we need to do at the end of extend is ne = ne+1
- Both CANDIDATES and NOT must be empty when a clique is found
- If ∃ c ∈ NOT s.t. ∀ d ∈ CANDIDATES: (c,d) ∈ E, then c will never be removed from NOT → no cliques on this subtree
Version II
- Is simply a clever way of choosing the next candidate:
  - Pick the vertex c in NOT with the most edges to CANDIDATES
  - Use as next candidate a vertex that is not connected to c
- With every iteration we are at least one step closer to cutting the subtree
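
The candidate selection rule can be sketched with the commonly used pivoting formulation. One deliberate difference from the slide: the pivot is taken from NOT and CANDIDATES together, so that the rule also works while NOT is still empty. The graph format and all names are assumptions of this example.

```python
from typing import Dict, List, Set


def bron_kerbosch_pivot(graph: Dict[int, Set[int]]) -> List[Set[int]]:
    """Enumerate all maximal cliques, choosing candidates relative to a pivot."""
    cliques: List[Set[int]] = []

    def extend(compsub: Set[int], candidates: Set[int], not_set: Set[int]) -> None:
        if not candidates and not not_set:
            cliques.append(set(compsub))
            return
        # Pivot: the vertex with the most edges to CANDIDATES.
        pivot = max(candidates | not_set,
                    key=lambda u: len(candidates & graph[u]))
        # Only candidates NOT connected to the pivot need to be tried.
        for c in sorted(candidates - graph[pivot]):
            extend(compsub | {c}, candidates & graph[c], not_set & graph[c])
            candidates = candidates - {c}
            not_set = not_set | {c}

    extend(set(), set(graph), set())
    return cliques
```

If the pivot lies in NOT and is connected to every remaining candidate, the loop body never runs and the whole subtree is cut, which is exactly the situation described in the remarks on Version I.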
Evaluation
- FlexE was evaluated with ten protein structure ensembles containing 105 crystal structures from the PDB
- The structures within an ensemble have a highly similar backbone trace
- Different conformations for several side chains
Evaluation Cont.
- FlexE finds a ligand position with RMSD below 2 Å in 67% of the cases
- Average CPU time for the incremental construction algorithm is 5.5 minutes
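
For reference, the RMSD criterion can be computed as in the sketch below. It assumes that both poses list the same atoms in the same order and are already in the same coordinate frame (no superposition is performed); the function name and the tuple-based coordinate format are choices made for this example.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    sq = sum((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
             for (x1, y1, z1), (x2, y2, z2) in zip(a, b))
    return math.sqrt(sq / len(a))
```

A docked pose would count as a success under the criterion above if `rmsd(docked, crystal) < 2.0`, with coordinates given in Å.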
Conclusion
- The ensemble approach is able to cope with several side chain conformations and even movements of loops
- Very efficient
- Motions of larger backbone segments or even domain movements are not covered by this approach
- Main problems:
  - Protein structures (where do they come from?)
  - Internal protein energy