RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling

Greg Landrum glandrum@users.sourceforge.net

Availability

SourceForge Page: http://rdkit.sourceforge.net
Source code:
  Browse: http://svn.sourceforge.net/viewcvs.cgi/rdkit/trunk/
  Download using subversion:
    svn co https://svn.sourceforge.net/svnroot/rdkit/trunk rdkit
Binaries: available for Win32
Licensing: BSD license (except for GUI components, which are GPL)

Components

End User
  - Command-line scripts
  - Python interface
  - Database extensions
Developer
  - Python programming library
  - C++ programming library

General Molecular Functionality

Input/Output: SMILES/SMARTS, mol, SDF, TDT
Cheminformatics:
  - Substructure searching
  - Canonical SMILES
  - Chirality support
  - Chemical transformations
  - Chemical reactions
  - Molecular serialization (e.g. mol <-> text)
2D depiction, including constrained depiction and mimicking 3D coords
2D->3D conversion/conformational analysis via distance geometry
UFF implementation for cleaning up structures
Fingerprinting (Daylight-like, circular, atom pairs, topological torsions, MACCS keys, etc.)
Similarity/diversity picking (including fuzzy similarity)
2D pharmacophores
Gasteiger-Marsili charges
Hierarchical subgraph/fragment analysis
Hierarchical RECAP implementation
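To make "similarity/diversity picking" concrete, here is a minimal pure-Python sketch of one common approach: Tanimoto similarity on fingerprints plus greedy MaxMin selection. The slide does not say which picking algorithm RDKit uses, so MaxMin is an assumption here, and the names `tanimoto` and `maxmin_pick` and the toy fingerprints are hypothetical; none of this is RDKit API.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def maxmin_pick(fingerprints, n_pick, first=0):
    """Greedy MaxMin selection (assumed algorithm, for illustration only).

    Repeatedly adds the candidate whose closest already-picked neighbour
    is farthest away, using distance = 1 - Tanimoto.
    """
    picked = [first]
    # min_dist[i] is the distance from item i to its nearest picked item
    min_dist = [1.0 - tanimoto(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        candidates = (i for i in range(len(fingerprints)) if i not in picked)
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fp, fingerprints[nxt]))
    return picked


# Hypothetical toy fingerprints (sets of on-bit indices), not real molecules.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {1, 2, 10, 11}]
picks = maxmin_pick(fps, 3)
print(picks)
```

The near-duplicate of the starting fingerprint (index 1) is picked last, which is the point of diversity selection.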

General Molecular Functionality, cntd

Feature maps and feature-map vectors
Shape-based similarity
Molecule-molecule alignment
Shape-based alignment (subshape alignment)*
Very fast 3D pharmacophore searching
Integration with PyMOL for 3D visualization
Database cartridge (PostgreSQL, sqlite coming)*

* functional implementations, but not really recommended for use

General QSAR Functionality

Molecular descriptor library:
  - Topological (κ3, Balaban J, etc.)
  - Electrotopological state (EState)
  - clogP, MR (Wildman and Crippen approach)
  - MOE-like VSA descriptors
  - Feature-map vectors
Machine Learning:
  - Clustering (hierarchical)
  - Information theory (Shannon entropy)
  - Decision trees, naïve Bayes*, kNN*
  - Bagging, random forests
  - Infrastructure:
    - data splitting
    - shuffling (y scrambling)
    - out-of-bag classification
    - serializable models and descriptor calculators
    - enrichment plots, screening, etc.

* functional implementations, but not really recommended for use
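The "information theory (Shannon entropy)" entry ties into decision trees, since entropy is a standard way to score candidate splits. The sketch below is a plain-Python illustration of that idea and not RDKit's implementation; the function names and the toy activity labels are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, groups):
    """Entropy of `labels` minus the size-weighted entropy of `groups`.

    `groups` is the partition of `labels` produced by a candidate split.
    """
    total = len(labels)
    remainder = sum(len(g) / total * shannon_entropy(g) for g in groups if g)
    return shannon_entropy(labels) - remainder


# Hypothetical binary activity labels: four actives, four inactives.
activities = [1, 1, 1, 0, 0, 0, 0, 1]
perfect_split = [[1, 1, 1, 1], [0, 0, 0, 0]]
useless_split = [[1, 1, 0, 0], [1, 1, 0, 0]]

print(shannon_entropy(activities))
print(information_gain(activities, perfect_split))
print(information_gain(activities, useless_split))
```

A split that separates the classes completely recovers the full entropy of the parent set, while a split that leaves each group as mixed as the parent gains nothing.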

Command Line Tools

ML/BuildComposite.py: build models
ML/ScreenComposite.py: screen models
ML/EnrichPlot.py: generate enrichment plot data
ML/AnalyzeComposite.py: analyze models (descriptor levels)
Chem/Fingerprints/FingerprintMols.py: generate 2D fingerprints
Chem/BuildFragmentCatalog.py: CASE-type analysis with a hierarchical catalog
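For readers unfamiliar with enrichment plots, the following sketch shows the kind of data such a plot is built from: rank compounds by model score, then track the fraction of actives recovered as more of the ranked list is screened. It is a generic illustration, not the logic of ML/EnrichPlot.py; the function names and the scored toy data are hypothetical.

```python
def enrichment_curve(scored):
    """Return (fraction screened, fraction of actives found) points.

    `scored` is a list of (score, is_active) pairs; higher scores rank first.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_total = len(ranked)
    n_actives = sum(1 for _, active in ranked if active)
    curve = [(0.0, 0.0)]
    found = 0
    for i, (_, active) in enumerate(ranked, start=1):
        if active:
            found += 1
        curve.append((i / n_total, found / n_actives if n_actives else 0.0))
    return curve


def enrichment_factor(scored, fraction):
    """Hit rate in the top `fraction` of the ranking divided by the overall hit rate."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(fraction * len(ranked))))
    n_actives = sum(1 for _, active in ranked if active)
    if n_actives == 0:
        return 0.0
    hits_top = sum(1 for _, active in ranked[:n_top] if active)
    return (hits_top / n_top) / (n_actives / len(ranked))


# Hypothetical screening results: ten compounds, two of them active.
results = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, False),
           (0.4, False), (0.3, False), (0.2, False), (0.1, False), (0.0, False)]
curve = enrichment_curve(results)
print(curve[3])                        # after screening the top three compounds
print(enrichment_factor(results, 0.2))
```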

Database Utilities

    DbUtils.ExcelToDatabase('::GPCRs','bzr_raw85',keyCol='name')
    DbUtils.DatabaseToExcel('::GPCRs','bzr_raw85')
    DbUtils.TextFileToDatabase('::GPCRs','bzr_raw85',
                               file('input.txt'),delim=',',keyCol='name')
    DbUtils.DatabaseToText('::GPCRs','bzr_raw85')
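The round trip between delimited text and a database table can be sketched with the Python standard library alone. This is not the DbUtils implementation: it targets an in-memory SQLite database, the helper names `text_to_table` and `table_to_text` are hypothetical, and the activity values in the sample text are made up for illustration.

```python
import csv
import io
import sqlite3


def text_to_table(conn, table, text_stream, delim=",", key_col=None):
    """Load delimited text (first row = column names) into a new table."""
    reader = csv.reader(text_stream, delimiter=delim)
    header = next(reader)
    columns = ", ".join(
        f'"{name}" TEXT' + (" PRIMARY KEY" if name == key_col else "")
        for name in header
    )
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


def table_to_text(conn, table, delim=","):
    """Dump a table back to delimited text, header row first."""
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid')
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delim, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return out.getvalue()


# Hypothetical input; the activity numbers are illustrative only.
raw = "name,activity\nAlprazolam,7.1\nBromazepam,6.8\n"
conn = sqlite3.connect(":memory:")
text_to_table(conn, "bzr_raw85", io.StringIO(raw), key_col="name")
dumped = table_to_text(conn, "bzr_raw85")
print(dumped)
```

Declaring the key column as a primary key makes the database reject duplicate compound names at load time.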

Reading/Writing Molecules

MolFromSmiles, MolFromSmarts, MolFromMolBlock, MolFromMolFile

    mol = Chem.MolFromSmiles('CCCOCC')

SmilesMolSupplier, SDMolSupplier, TDTMolSupplier, SupplierFromFilename

    mols = [x for x in Chem.SDMolSupplier('input.sdf')]

MolToSmiles, MolToSmarts, MolToMolBlock

    print >>outf,Chem.MolToMolBlock(mol)

SmilesWriter, SDWriter, TDTWriter

    w = Chem.SDWriter('out.sdf')
    for m in mols:
        w.write(m)

Working with molecules: Substructure matching

    >>> m = Chem.MolFromSmiles('O=CCC=O')
    >>> p = Chem.MolFromSmarts('C=O')
    >>> m.HasSubstructMatch(p)
    True
    >>> m.GetSubstructMatch(p)
    (1, 0)
    >>> m.GetSubstructMatches(p)
    ((1, 0), (3, 4))

    >>> m = Chem.MolFromSmiles('C1CCC1C(=O)O')
    >>> p = Chem.MolFromSmarts('C=O')
    >>> m2 = Chem.DeleteSubstructs(m,p)
    >>> Chem.MolToSmiles(m2)
    'O.C1CCC1'

    >>> def findcore(m):
    ...     smi = Chem.MolToSmiles(m)
    ...     patt = Chem.MolFromSmarts('[D1]')
    ...     while 1:
    ...         m2 = Chem.DeleteSubstructs(m,patt)
    ...         nsmi = Chem.MolToSmiles(m2)
    ...         m = m2
    ...         if nsmi==smi:
    ...             break
    ...         else:
    ...             smi = nsmi
    ...     return m
    ...
    >>> m = Chem.MolFromSmiles('CCC1CCC1')
    >>> c = findcore(m)
    >>> Chem.MolToSmiles(c)
    'C1CCC1'
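The findcore example works by repeatedly deleting every atom with exactly one neighbour (the `[D1]` SMARTS) until nothing changes. The same idea can be shown on a bare connectivity graph without any chemistry toolkit. This is a sketch of the pruning logic only; the adjacency-dict representation and the name `find_core` are hypothetical.

```python
def find_core(adjacency):
    """Strip terminal atoms round by round until none are left.

    `adjacency` maps each atom index to the set of its neighbours.
    All degree-1 atoms found in a round are removed together, mirroring
    one DeleteSubstructs call with the [D1] pattern.
    """
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    while True:
        leaves = [atom for atom, nbrs in graph.items() if len(nbrs) == 1]
        if not leaves:
            return graph
        for leaf in leaves:
            for nbr in graph.pop(leaf):
                if nbr in graph:
                    graph[nbr].discard(leaf)


# Connectivity of ethylcyclobutane, CCC1CCC1: atoms 0-1 form the ethyl
# chain, atoms 2-5 form the four-membered ring.
ethylcyclobutane = {
    0: {1},
    1: {0, 2},
    2: {1, 3, 5},
    3: {2, 4},
    4: {3, 5},
    5: {4, 2},
}
core = find_core(ethylcyclobutane)
print(sorted(core))
```

Two rounds are needed here: removing atom 0 turns atom 1 into a new terminal atom, which is why the original function loops until the SMILES stops changing.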

Working with molecules: properties

    >>> suppl = Chem.SDMolSupplier('divscreen.400.sdf')
    >>> m = suppl.next()
    >>> list(m.GetPropNames())
    ['Formula', 'MolWeight', 'Mol_ID', 'Smiles', 'cdk2_ic50', 'cdk2_inhib',
     'cdk_act_bin_1', 'mol_name', 'scaffold', 'sourcepool']
    >>> m.GetProp('scaffold')
    'Scaffold_00'
    >>> m.GetProp('missing')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    KeyError: 'missing'
    >>> m.HasProp('missing')
    0
    >>> m.SetProp('testing','value')
    >>> m.GetProp('testing')
    'value'
    >>> m.SetProp('calcd','45',computed=True)
    >>> list(m.GetPropNames())
    ['Formula', 'MolWeight', 'Mol_ID', 'Smiles', 'cdk2_ic50', 'cdk2_inhib',
     'cdk_act_bin_1', 'mol_name', 'scaffold', 'sourcepool', 'testing']

Generating Depictions

    >>> from rdkit import Chem
    >>> from rdkit.Chem import AllChem
    >>> m = Chem.MolFromSmiles('C1CCC1')
    >>> m.GetNumConformers()
    0
    >>> AllChem.Compute2DCoords(m)
    0
    >>> m.GetNumConformers()
    1
    >>> print Chem.MolToMolBlock(m)

      4  4  0  0  0  0  0  0  0  0999 V2000
        1.0607    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -0.0000   -1.0607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -1.0607    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        0.0000    1.0607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
      1  2  1  0
      2  3  1  0
      3  4  1  0
      4  1  1  0
    M  END
    >>>
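The 1.0607 values in the mol block are what one gets by placing four atoms on a regular polygon with a bond length of 1.5: the circumradius is 1.5 / (2 sin(π/4)) ≈ 1.0607. The sketch below reproduces that arithmetic. The 1.5 bond length is inferred from the printed coordinates rather than stated on the slide, and this is not a description of how Compute2DCoords works internally; `regular_ring_coords` is a hypothetical helper.

```python
import math


def regular_ring_coords(n_atoms, bond_length=1.5):
    """2D coordinates for atoms on a regular polygon with the given edge length.

    Atoms are placed clockwise starting on the positive x axis, which matches
    the atom order in the mol block for cyclobutane.
    """
    radius = bond_length / (2.0 * math.sin(math.pi / n_atoms))
    coords = []
    for i in range(n_atoms):
        angle = -2.0 * math.pi * i / n_atoms
        coords.append((radius * math.cos(angle), radius * math.sin(angle)))
    return coords


ring = regular_ring_coords(4)
for x, y in ring:
    # Same column layout as the x/y/z fields of a V2000 atom line.
    print(f"{x:10.4f}{y:10.4f}{0.0:10.4f} C")
```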

Generating 3D Coordinates

    >>> from rdkit import Chem
    >>> from rdkit.Chem import AllChem
    >>> m = Chem.MolFromSmiles('C1CCC1')
    >>> AllChem.EmbedMolecule(m)
    0
    >>> m.GetNumConformers()
    1
    >>> AllChem.UFFOptimizeMolecule(m)
    0
    >>> m.SetProp('_Name','testmol')
    >>> print Chem.MolToMolBlock(m)
    testmol

      4  4  0  0  0  0  0  0  0  0999 V2000
       -0.8040    0.5715   -0.2537 C   0  0  0  0  0  0  0  0  0  0  0  0
       -0.3727   -0.9165   -0.2471 C   0  0  0  0  0  0  0  0  0  0  0  0
        0.7942   -0.5376    0.6386 C   0  0  0  0  0  0  0  0  0  0  0  0
        0.3825    0.8826    0.6323 C   0  0  0  0  0  0  0  0  0  0  0  0
      1  2  1  0
      2  3  1  0
      3  4  1  0
      4  1  1  0
    M  END
    >>> list(AllChem.EmbedMultipleConfs(m,10))
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> m.GetNumConformers()
    10

Low-budget conformational analysis

    >>> from rdkit import Chem
    >>> from rdkit.Chem import AllChem
    >>> from rdkit.Chem import PyMol
    >>> v = PyMol.MolViewer()
    >>> m = Chem.MolFromSmiles('OC(=O)C1CCC(CC(=O)O)CC1')
    >>> AllChem.EmbedMultipleConfs(m,10)
    <rdBase._vectint object at 0x00A2D2B8>
    >>> for i in range(m.GetNumConformers()):
    ...     AllChem.UFFOptimizeMolecule(m,confId=i)
    ...     v.ShowMol(m,confId=i,name='conf-%d'%i,showOnly=False)
    ...
    >>> w = Chem.SDWriter('foo.sdf')
    >>> for i in range(m.GetNumConformers()):
    ...     w.write(m,confId=i)
    ...
    >>> w.flush()

Molecular Miscellany

    >>> from rdkit import Chem
    >>> from rdkit.Chem import Crippen
    >>> Crippen.MolLogP(Chem.MolFromSmiles('c1ccccn1'))
    1.0815999999999999
    >>> Crippen.MolMR(Chem.MolFromSmiles('c1ccccn1'))
    24.236999999999991

    >>> AllChem.ShapeTanimotoDist(m1,m2)

    >>> Chem.Kekulize(m)

    >>> m = Chem.MolFromSmiles('F[C@H]([Cl])Br')
    >>> Chem.AssignAtomChiralCodes(m)
    >>> m.GetAtomWithIdx(1).GetProp('_CIPCode')
    'R'

    >>> m = Chem.MolFromSmiles(r'F\C=C/Cl')
    >>> Chem.AssignBondStereoCodes(m)
    >>> m.GetBondWithIdx(1).GetStereo()
    Chem.rdchem.BondStereo.STEREOZ

Database CLI tools

    # building a database:
    % python $RDBASE/Projects/DbCLI/CreateDb.py --dbDir=bzr --molFormat=sdf bzr.sdf

    # similarity searching:
    % python $RDBASE/Projects/DbCLI/SearchDb.py --dbDir=bzr --molFormat=smiles \
        --similarityType=AtomPairs --topN=5 bzr.smi
    [18:23:21] INFO: Reading query molecules and generating fingerprints
    [18:23:21] INFO: Finding Neighbors
    [18:23:21] INFO: The search took 0.1 seconds
    [18:23:21] INFO: Creating output
    Alprazolam,Alprazolam,1.000,Ro13-9868,0.918,Triazolam,0.897,Estazolam,0.871,U-35005,0.870
    Bromazepam,Bromazepam,1.000,Ro05-3072,0.801,Ro05-3061,0.801,Nordazepam,0.801,Ro05-2921,0.772
    Delorazepam,Delorazepam,1.000,Ro05-4619,0.900,Nordazepam,0.881,Lorazepam,0.855,Ro20-8065,0.840

    # substructure and property searching:
    % python $RDBASE/Projects/DbCLI/SearchDb.py --dbDir=bzr --smarts='c1ncccc1' \
        -q 'activity>6.5' --sdfOut=search1.sdf
    [18:27:25] INFO: Doing substructure query
    [18:27:25] INFO: Found 1 molecules matching the query
    [18:27:25] INFO: Creating output
    Bromazepam
    [18:27:25] INFO: Done!
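Each row of the similarity-search output lists a query followed by its five nearest neighbours and their similarity scores. A top-N search of that shape can be sketched in a few lines of plain Python. This is an illustration of the ranking step only, not the SearchDb.py code; the fingerprints are toy sets of on-bit indices, and the molecule names and helper names are hypothetical.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def top_n_neighbors(query_fp, database, n=5):
    """Return the n most similar database entries as (name, similarity) pairs."""
    scored = ((tanimoto(query_fp, fp), name) for name, fp in database.items())
    return [(name, round(sim, 3)) for sim, name in heapq.nlargest(n, scored)]


# Hypothetical database of toy fingerprints.
database = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 9},
    "mol_c": {7, 8, 9},
    "mol_d": {1, 2},
}
hits = top_n_neighbors({1, 2, 3, 4}, database, n=3)
print(",".join(f"{name},{sim:.3f}" for name, sim in hits))
```

Using `heapq.nlargest` avoids sorting the whole database when only a handful of neighbours is wanted.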

RDKit Developer's Overview Greg Landrum glandrum@users.sourceforge.net

Installation

- Download and install boost libraries (www.boost.org)
- Download distribution from http://rdkit.sourceforge.net (or get the latest version using subversion)
- Download and install other python dependencies (see $RDBASE/Docs/SoftwareRequirements.txt)
- Follow build instructions in $RDBASE/INSTALL

Documentation

C++: Generated using doxygen:

    cd $RDBASE/Code
    doxygen doxygen.config

Python: Currently generated using epydoc:

    cd $RDBASE/rdkit
    epydoc --config epydoc.config

Organization

$RDBASE/
  External/: essential 3rd party components that are hard to get or have been modified
  Code/: C++ library, wrapper, and testing code
  rdkit/: Python library, GUI, and scripts
  Docs/: Manually and automatically generated documentation
  Data/: Testing and reference data
  Scripts/: a couple of useful utility scripts
  Web/: some simple CGI applications
  bin/: installation dir for DLLs/shared libraries

C++ "Guiding Ideas"

- If boost already has the wheel, don't re-invent it.
- Try to keep classes lightweight
- Maintain const-correctness
- Test, test, test
- Include tests with code
- Keep namespaces explicit (i.e. no using namespace std;)
- Keep most everything in namespace RDKit
- Test, test, test

Sample test_list.py: Test Harness

Wrapping code using BPL: Intro

- boost::python is not an automatic wrapper generator: wrappers must be written by hand.
- very flexible
- handling of "primitive types"
- tight python type integration
- allows subclassing of C++ extension classes

Wrapping code using BPL: defining a module

DataStructs/Wrap/DataStructs.cpp

    #include <boost/python.hpp>
    #include <RDBoost/Wrap.h>
    #include "DataStructs.h"
    namespace python = boost::python;

    void wrap_sbv();
    void wrap_ebv();

    BOOST_PYTHON_MODULE(cDataStructs)
    {
      python::scope().attr("__doc__") =
        "Module containing an assortment of functionality for basic data structures.\n"
        "\n"
        "At the moment the data structures defined are:\n"
        ;
      python::register_exception_translator<IndexErrorException>(&translate_index_error);
      python::register_exception_translator<ValueErrorException>(&translate_value_error);

      wrap_sbv();
      wrap_ebv();
    }

Wrapping code using BPL: defining a class

DataStructs/Wrap/wrap_SparseBV.cpp

    struct SBV_wrapper {
      static void wrap(){
        python::class_<SparseBitVect>("SparseBitVect",
                                      "Class documentation",
                                      python::init<unsigned int>())
          .def(python::init<std::string>())
          .def("SetBit",(bool (SBV::*)(unsigned int))&SBV::SetBit,
               "Turns a particular bit on. Returns the original state of the bit.\n")
          .def("GetNumBits",&SBV::GetNumBits,
               "Returns the number of bits in the vector (the vector's size).\n")
          ;
      }
    };

    void wrap_sbv() {
      SBV_wrapper::wrap();
    }

Wrapping code using BPL: making it "pythonic"

DataStructs/Wrap/wrap_SparseBV.cpp

    struct SBV_wrapper {
      static void wrap(){
        python::class_<SparseBitVect>("SparseBitVect",
                                      "Class documentation",
                                      python::init<unsigned int>())
          .def("__len__",&SBV::GetNumBits)
          .def("__getitem__",
               (const int (*)(const SBV&,unsigned int))get_VectItem)
          .def("__setitem__",
               (const int (*)(SBV&,unsigned int,int))set_VectItem)
          .def(python::self & python::self)
          .def(python::self | python::self)
          .def(python::self ^ python::self)
          .def(~python::self)
          ;
      }
    };
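Seen from the Python side, the special methods and operators registered above are what let a wrapped bit vector work with `len()`, indexing, and `&`, `|`, `^`, `~`. The pure-Python class below is a hypothetical stand-in that implements the same protocol, to show what each `.def(...)` line buys the user. `ToyBitVect` and its `on_bits` method are illustrative names and not part of RDKit.

```python
class ToyBitVect:
    """Hypothetical pure-Python analogue of a wrapped sparse bit vector."""

    def __init__(self, n_bits, on_bits=()):
        self._n_bits = n_bits
        self._on = set()
        for idx in on_bits:
            self[idx] = 1

    def _check(self, idx):
        # Out-of-range access raises IndexError, the exception a C++
        # index error would typically be translated into.
        if not 0 <= idx < self._n_bits:
            raise IndexError("bit index out of range")

    def __len__(self):                    # len(bv)
        return self._n_bits

    def __getitem__(self, idx):           # bv[i]
        self._check(idx)
        return 1 if idx in self._on else 0

    def __setitem__(self, idx, value):    # bv[i] = 0 or 1
        self._check(idx)
        if value:
            self._on.add(idx)
        else:
            self._on.discard(idx)

    def _combine(self, other, set_op):
        if len(other) != len(self):
            raise ValueError("bit vectors must have the same length")
        return ToyBitVect(self._n_bits, set_op(self._on, other._on))

    def __and__(self, other):             # bv1 & bv2
        return self._combine(other, set.__and__)

    def __or__(self, other):              # bv1 | bv2
        return self._combine(other, set.__or__)

    def __xor__(self, other):             # bv1 ^ bv2
        return self._combine(other, set.__xor__)

    def __invert__(self):                 # ~bv
        return ToyBitVect(self._n_bits, set(range(self._n_bits)) - self._on)

    def on_bits(self):
        return tuple(sorted(self._on))


a = ToyBitVect(8, [1, 2, 3])
b = ToyBitVect(8, [3, 4])
print(len(a), a[1], (a & b).on_bits(), (a | b).on_bits(), (~a).on_bits())
```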

Wrapping code using BPL: supporting pickling

DataStructs/Wrap/wrap_SparseBV.cpp

    // allows BitVects to be pickled
    struct sbv_pickle_suite : python::pickle_suite
    {
      static python::tuple
      getinitargs(const SparseBitVect& self)
      {
        return python::make_tuple(self.ToString());
      };
    };

    struct SBV_wrapper {
      static void wrap(){
        python::class_<SparseBitVect>("SparseBitVect",
                                      "Class documentation",
                                      python::init<unsigned int>())
          .def_pickle(sbv_pickle_suite())
          ;
      }
    };
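The getinitargs approach pickles an object as "the arguments needed to construct an equal object", here a single string. The nearest pure-Python equivalent is `__reduce__`, sketched below on a hypothetical class whose constructor accepts either a size or a serialized string, like the two `init<>` overloads of the wrapped class. The class, its `"size:bit,bit"` string format, and `clone_via_pickle` are illustrative assumptions, not RDKit's serialization.

```python
import pickle


class ToyBitVect:
    """Hypothetical bit vector that can be rebuilt from its string form."""

    def __init__(self, arg):
        if isinstance(arg, str):
            # Assumed text format for this sketch: "size:bit,bit,..."
            size, _, bits = arg.partition(":")
            self.n_bits = int(size)
            self.on = {int(b) for b in bits.split(",") if b}
        else:
            self.n_bits = int(arg)
            self.on = set()

    def to_string(self):
        return f"{self.n_bits}:" + ",".join(str(b) for b in sorted(self.on))

    def __reduce__(self):
        # Tell pickle to rebuild the object by calling the constructor with
        # the serialized string, the same idea as getinitargs above.
        return (type(self), (self.to_string(),))

    def __eq__(self, other):
        return (isinstance(other, ToyBitVect)
                and (self.n_bits, self.on) == (other.n_bits, other.on))


def clone_via_pickle(obj):
    """Round-trip an object through pickle (the class must be importable)."""
    return pickle.loads(pickle.dumps(obj))


bv = ToyBitVect(16)
bv.on.update({2, 5})

# What pickle does with the __reduce__ result, spelled out by hand:
factory, init_args = bv.__reduce__()
rebuilt = factory(*init_args)
print(init_args, rebuilt == bv)
```

When the class lives in a regular module or script, `clone_via_pickle(bv)` performs the same reconstruction through the pickle machinery itself.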

Wrapping code using BPL: returning complex types

DataStructs/Wrap/wrap_SparseBV.cpp

    struct SBV_wrapper {
      static void wrap(){
        python::class_<SparseBitVect>("SparseBitVect",
                                      "Class documentation",
                                      python::init<unsigned int>())
          .def("GetOnBits",
               (IntVect (*)(const SBV&))GetOnBits,
               "Returns a tuple containing IDs of the on bits.\n")
          ;
      }
    };

IntVect is a std::vector<int>

Converters provided in rdBase.dll for:
  std::vector<int>, std::vector<unsigned>, std::vector<string>, std::vector<double>,
  std::vector< std::vector<int> >, std::vector< std::vector<unsigned> >,
  std::vector< std::vector<double> >,
  std::list<int>, std::list< std::vector<int> >

Others can be created using:

    RegisterVectorConverter<RDKit::Atom*>();
    RegisterListConverter<RDKit::Atom*>();

Wrapping code using BPL: accepting Python types

DataStructs/Wrap/wrap_SparseBV.cpp

    void SetBitsFromList(SparseBitVect *bv, python::object onBitList) {
      PySequenceHolder<int> bitL(onBitList);
      for (unsigned int i = 0; i < bitL.size(); i++) {
        bv->SetBit(bitL[i]);
      }
    }

    struct SBV_wrapper {
      static void wrap(){
        python::class_<SparseBitVect>("SparseBitVect",
                                      "Class documentation",
                                      python::init<unsigned int>())
          .def("SetBitsFromList",SetBitsFromList,
               "Turns on a set of bits. The argument should be a tuple or list of bit ids.\n")
          ;
      }
    };

Wrapping code using BPL: keyword arguments

GraphMol/Wrap/rdmolfiles.cpp

    docString="Returns a Mol block for a molecule\n\
      ARGUMENTS:\n\
    \n\
        - mol: the molecule\n\
        - includeStereo: (optional) toggles inclusion of stereochemical\n\
                         information in the output\n\
        - confId: (optional) selects which conformation to output (-1 = default)\n\
    \n\
      RETURNS:\n\
    \n\
        a string\n\
    \n";
    python::def("MolToMolBlock",RDKit::MolToMolBlock,
                (python::arg("mol"),python::arg("includeStereo")=false,
                 python::arg("confId")=-1),
                docString.c_str());

This also demonstrates defining a function and taking a complex (wrapped) type as an argument.

Wrapping code using BPL: returning pointers

GraphMol/Wrap/rdmolfiles.cpp

    docString="Construct a molecule from a SMILES string.\n\n\
      ARGUMENTS:\n\
        - SMILES: the smiles string\n\
        - sanitize: (optional) toggles sanitization of the molecule.\n\
          Defaults to 1.\n\
      RETURNS:\n\
        a Mol object, None on failure.\n\
    \n";
    python::def("MolFromSmiles",RDKit::MolFromSmiles,
                (python::arg("smiles"),
                 python::arg("sanitize")=true),
                docString.c_str(),
                python::return_value_policy<python::manage_new_object>());

GraphMol/Wrap/Mol.cpp

    .def("GetAtomWithIdx",(ROMol::GRAPH_NODE_TYPE (ROMol::*)(unsigned int))
         &ROMol::getAtomWithIdx,
         python::return_value_policy<python::reference_existing_object>(),
         "Returns a particular Atom.\n\n"
         "  ARGUMENTS:\n"
         "    - idx: which Atom to return\n\n"
         "  NOTE: atom indices start at 0\n")