On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data:

1 On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data: Case Study in Distributed Structural Biology using MapReduce. Boyu Zhang (1), Trilce Estrada (2), Pietro Cicotti (3), Michela Taufer (1). (1) University of Delaware, (2) University of New Mexico, (3) San Diego Supercomputer Center

2 The docking process in drug design. High-throughput screening in drug design: docking of ligand conformations into a protein. P-L docking algorithm (flowchart labels): model protein-ligand onto 3D grid; alter ligand configuration; dock ligand into protein; MD simulated annealing; energy minimization; trial; calculate scores; evaluate candidate solutions.

4 Docking simulation = large conformation dataset. How can we select well-docked (near-native) conformations among the sampled conformations in the large dataset?

5 It is not just about the lowest energy. [Plot: number of conformations vs. energy] Docked complexes are traditionally scored based on energy. A ligand scoring minimum energy does NOT always have a near-native structure.

7 Searching for the most popular pose. Compare the geometry of ligands, searching for the most popular pose. [Plots: axes labeled number of conformations, RMSD, and energy]

8 Comparing ligand geometries. Compare the geometry of molecules, searching for dense spaces of similar poses. Compute RMSDs, each between two 3D poses.
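
The pairwise comparison on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `rmsd` and the representation of a pose as a list of (x, y, z) tuples are assumptions, and the sketch assumes both poses list the same atoms in the same order and already share the protein's coordinate frame, so no superposition step is applied.

```python
import math


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses with matching atom order.

    Each pose is a list of (x, y, z) tuples (an assumed representation).
    """
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    # Sum of squared Euclidean distances between corresponding atoms.
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))
```

Comparing every pair of n poses this way takes n(n-1)/2 RMSD evaluations, which is why all-pairs comparison becomes costly for large conformation datasets.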

9 Comparing ligand geometries. [Diagram: distributed data generation and storage on nodes; data movement to centralized data analysis] When dealing with big, distributed datasets of conformations, clustering kills performance: it must deal with uncertainties (e.g., the number of clusters) and with scalability (e.g., I/O and storage limits).

10 Capturing relevant properties. Extract the geometrical shape (property) of the docked ligand in the docking pocket of the protein [diagram: conformation to geometry]. Perform space reduction from data (atom coordinates of the ligand conformations) to extracted property (3D point). Expect conformations with similar geometry to be mapped into close points.

11 Capturing relevant properties. Encode ligand conformations into single 3D points. Best-fit linear regression line of the 2D points. Line slopes become coordinates of the 3D point encoding the geometry.
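
One way to read this encoding is sketched below. The slide does not say how the 2D points are obtained, so the sketch assumes they are the projections of the ligand atoms onto the xy, yz, and zx planes; the function names `slope` and `encode_conformation` are hypothetical, and returning 0.0 for a degenerate fit is a choice made only to keep the example total.

```python
def slope(us, vs):
    """Least-squares slope of v against u; 0.0 if u has no spread."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    var_u = sum((u - mean_u) ** 2 for u in us)
    if var_u == 0.0:
        return 0.0  # assumption: treat a degenerate (vertical) fit as slope 0
    cov = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    return cov / var_u


def encode_conformation(atoms):
    """Map a list of (x, y, z) atom coordinates to one 3D point.

    Assumed scheme: fit a regression line in each of the xy, yz, and zx
    projections and use the three slopes as the point's coordinates.
    """
    xs = [a[0] for a in atoms]
    ys = [a[1] for a in atoms]
    zs = [a[2] for a in atoms]
    return (slope(xs, ys), slope(ys, zs), slope(zs, xs))
```

Under this reading, each conformation shrinks from 3 x (number of atoms) coordinates to three numbers, and conformations with similar orientation in the pocket yield nearby points.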

12 From clustering problem to density search

13 Counting property aggregates. Deal with property-encoding points rather than raw data. Transform the analysis problem from a clustering or classification problem into a density search problem. Build an octree by assigning an octkey to each point representing a ligand conformation, based on its position in 3D space.
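
A minimal sketch of octkey assignment follows. The slide does not define the key format, so this example assumes a string with one digit (0 to 7) per octree level, obtained by repeatedly halving an axis-aligned bounding box; the function name `octkey`, the bit layout of each digit, and the bounding-box parameters are all illustrative assumptions.

```python
def octkey(point, lower, upper, depth):
    """Return a string of `depth` digits (0-7), one octant index per level.

    `lower` and `upper` are opposite corners of the bounding box that
    contains all points (assumed to be known in advance).
    """
    lo = list(lower)
    hi = list(upper)
    digits = []
    for _ in range(depth):
        digit = 0
        for axis in range(3):
            mid = (lo[axis] + hi[axis]) / 2.0
            if point[axis] >= mid:
                digit |= 1 << axis  # bit 0 = x, bit 1 = y, bit 2 = z (assumed layout)
                lo[axis] = mid      # descend into the upper half on this axis
            else:
                hi[axis] = mid      # descend into the lower half on this axis
        digits.append(str(digit))
    return "".join(digits)
```

With keys of this form, two points that share a key prefix of length k fall in the same octant at level k, so counting points per octant reduces to counting key prefixes.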

14 Octree-based encoding and search. Binary search through the octree hierarchy to find the deepest, most dense octant.
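
The search can be sketched as a binary search over octree depth. The slide gives no density criterion, so the `min_points` threshold, the function names, and the representation of octkeys as strings of octal digits (one digit per level) are assumptions. The binary search is valid because the largest octant population can only shrink as the level deepens: every child octant holds a subset of its parent's points.

```python
from collections import Counter


def densest_octant(octkeys, level):
    """Return (prefix, count) of the most populated octant at `level`."""
    counts = Counter(key[:level] for key in octkeys)
    return counts.most_common(1)[0]


def deepest_dense_octant(octkeys, max_depth, min_points):
    """Find the deepest level whose densest octant holds >= min_points points.

    Assumes `octkeys` is non-empty and every key has at least `max_depth` digits.
    """
    lo, hi = 0, max_depth
    best = ("", len(octkeys))  # level 0: the root octant holds every point
    while lo < hi:
        mid = (lo + hi + 1) // 2
        prefix, count = densest_octant(octkeys, mid)
        if count >= min_points:
            best = (prefix, count)  # dense enough: try deeper levels
            lo = mid
        else:
            hi = mid - 1            # too sparse: back off to shallower levels
    return best
```

Each probe of a level is one counting pass over the keys, so the search needs about log2(max_depth) passes instead of one per level.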

15 Search for dense spaces

16 Search for dense spaces. [Figure labels: octree nodes; reengineered ligand conformations]

17 Search for dense spaces. [Figure labels: deepest, most dense octant found by our algorithm; near-native ligand structures]

18 Implementation in MapReduce. Ligand conformations are distributed across multiple nodes. Define map and reduce functions; different variants are possible. From global to local → move properties. From local to global → move densities.

19 From global to local. [Diagram labels, nodes 1 to 3: encode properties; map properties; exchange properties; shuffle properties; store locally; count densities (SPAs); count locally]

20 From local to global. [Diagram labels, nodes 1 to 3: encode properties; map properties; count densities (SPAs); store locally; exchange densities (SPAs); shuffle densities; count globally]
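
The difference between the two variants can be illustrated with a small in-process simulation. This is a sketch only: it does not use any real MapReduce framework, the function names are hypothetical, each node is modeled as a list of octkey strings, and `moved` is an illustrative count of records that would cross the shuffle, not a measurement.

```python
from collections import Counter, defaultdict


def global_to_local(nodes, level):
    """Shuffle the properties (octkeys) themselves, then count after the shuffle."""
    shuffled = defaultdict(list)
    moved = 0
    for node_keys in nodes:
        for key in node_keys:                  # map: emit (octant, octkey) per point
            shuffled[key[:level]].append(key)
            moved += 1                         # one property record per point is shuffled
    counts = {octant: len(keys) for octant, keys in shuffled.items()}  # reduce: count
    return counts, moved


def local_to_global(nodes, level):
    """Count densities on each node, then shuffle only the per-octant counts."""
    totals = Counter()
    moved = 0
    for node_keys in nodes:
        local = Counter(key[:level] for key in node_keys)  # map: local density count
        moved += len(local)                    # one (octant, count) pair per local octant
        totals.update(local)                   # reduce: sum the partial densities
    return dict(totals), moved
```

Both functions return the same densities. In the second, the shuffled volume is bounded by the number of occupied octants per node rather than by the number of points, which is the intuition behind moving densities instead of properties.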

21 Logical distribution of data. Dataset: million protein-ligand records. Four cases: strong convergence towards one ligand conformation; strong convergence towards two ligand conformations; weak convergence towards one ligand conformation; no convergence.

22 Physical distribution of data. Distributed datasets are generated in semi- or fully-decentralized systems.
UNIFORM: property-encoding points that belong to the same subspace in the logical distribution are stored in the same physical storage.
ROUND-ROBIN: points that belong to the same subspace in the logical distribution are stored in separate physical storages in a round-robin manner.
RANDOM: points are randomly stored in the physical storages of all the system nodes.
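
The three placement policies can be mimicked with a small helper, for example to build test inputs. Everything here is an assumption made for illustration: the function name `distribute`, the use of an octkey prefix of length `level` (at least 1) as the "subspace", the modulo mapping from subspace to node in the uniform case, and the seeded random generator.

```python
import random
from collections import defaultdict


def distribute(octkeys, n_nodes, policy, level=1, seed=0):
    """Place octkeys on `n_nodes` lists according to a placement policy."""
    nodes = [[] for _ in range(n_nodes)]
    rng = random.Random(seed)
    next_slot = defaultdict(int)  # per-subspace round-robin cursor
    for key in octkeys:
        subspace = key[:level]
        if policy == "uniform":
            # Same subspace always lands on the same node.
            target = int(subspace, 8) % n_nodes
        elif policy == "round-robin":
            # Consecutive points of one subspace go to consecutive nodes.
            target = next_slot[subspace] % n_nodes
            next_slot[subspace] += 1
        elif policy == "random":
            target = rng.randrange(n_nodes)
        else:
            raise ValueError("unknown policy: " + policy)
        nodes[target].append(key)
    return nodes
```

For instance, `distribute(keys, 4, "round-robin")` spreads each subspace evenly over four nodes, the case in which every node sees a fragment of every dense region.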

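23 [Figure: Map, Shuffle, Overhead, and Reduce components for the GL (global-to-local) and LG (local-to-global) variants; recoverable panel label: Round-robin]

24 [Figure: Map, Shuffle, Overhead, and Reduce components for the GL and LG variants; recoverable panel labels: Round-robin, Random]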

25 Accuracy: Self-docking. Self-docking: 23, 2, and 2 ligands dock into HIV, trypsin, and p38alpha respectively. Search across 56 datasets of, poses each. [Bar chart: percentage (%) per target (HIV protease, trypsin, p38alpha), octree-based vs. energy; bar labels partially lost: /23, /2, 3/2, 8/23, 5/2, /2]

26 Lessons learned. We can avoid data movement when analyzing big scientific data distributed across multiple nodes. Our approach performs a single pass over the data to extract relevant properties, here the geometry of a ligand conformation in a large dataset of million conformations. Only properties or property densities are exchanged among nodes. When exchanging property densities, our approach delivers scalable performance and is NOT sensitive to the scientific content.

27 Acknowledgments. Collaborators: volunteers; Roger Armen (TJU). Request for information: Michela Taufer, Global Computing. Sponsors:

On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data: A Case Study in Distributed Structural Biology using MapReduce Boyu Zhang, Trilce Estrada, Pietro Cicotti,