Exploiting the Amazon.com People Who Bought Also Bought Algorithm in Reagent Selection. Christian Tyrchan, Niklas Falk and Jonas Boström
Setting the Scene: The current trend is to treat drug discovery projects as processes. Creativity might be hampered, and is there little room left for serendipity? We need new ways of working, and we want creative users who do not feel stuck in processes. Making novel compounds is at the heart of drug design. Thus, the aim of the current work is to enhance discovery by surfacing reagents from deep in the catalog that our chemists wouldn't find on their own, using a novel approach in which similarity is based on users, not structures.
Internet Success Stories: New Technologies, New Sciences. Finite State Machines; Item-to-Item Collaborative Filtering (new approaches to improve searches).
Recommendation systems are best known for their use on e-commerce web sites, where they attempt to present items that are likely to be of interest to the user. The idea of recommending items at checkout is nothing new.
The Harry Potter Shopping Cart: The idea of recommending items at checkout is nothing new, but Amazon.com saw the opportunity to personalize impulse buys.
Recommendation Systems: Typically, a recommender system compares the user's profile to some reference characteristics and seeks to predict the 'rating' that a user would give to an item they had not yet considered. It should help a customer find and discover new, relevant, and interesting items. There are two main categories, based on how the recommendations are made. Content-based recommendations (the information item): the user is recommended items similar to the ones the user preferred in the past. Collaborative recommendations (the social environment): the user is recommended items that people with similar taste liked in the past.
Content-based and Collaborative Systems: With content-based recommendations, only the movies that have a high degree of similarity to the user's preferences would be recommended. Collaborative recommendations start by finding a set of customers whose purchased items overlap the user's purchased items. The algorithm aggregates items from these similar customers, eliminates items the user has already purchased, and recommends the remaining items to the user. The focus is on finding similar users, and each user is represented as an N-dimensional vector of items.
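The collaborative procedure described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name, the toy data, and the choice to weight each candidate item by the size of the purchase overlap are all assumptions made for the example.

```python
from collections import Counter

def recommend_user_based(purchases, user, top_n=3):
    """Traditional collaborative filtering on a dict of user -> set of items."""
    own = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(own & items)
        if overlap == 0:
            continue  # no shared purchases, so not a similar customer
        # Aggregate the similar customer's items, skipping ones already bought.
        # Weighting by overlap size is an assumption made for this sketch.
        for item in items - own:
            scores[item] += overlap
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical toy data.
purchases = {
    "alice": {"book_a", "book_b"},
    "bob": {"book_a", "book_b", "book_c"},
    "carol": {"book_b", "book_d"},
    "dave": {"book_e"},
}
recommended = recommend_user_based(purchases, "alice")
```

Every request loops over all customers, which is the online cost that makes this variant hard to scale.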
Recommendations needed to work... from sparse data (often just a few purchases); fast, with high-quality results in real time; at scale, with massive numbers (huge amounts of data); and with an immediate response to new information (customer data is volatile). None of the existing methods were good enough. Traditional collaborative filtering does little or no offline computation, and its online computation scales with the number of customers and catalog items, so the algorithm is impractical on large data sets. Content-based recommendations offer nothing new (unless randomized).
Item-to-Item Collaborative Filtering: Item-to-item collaborative filtering matches each of the user's purchased items to similar items, then combines those similar items into a recommendation list. To determine the most-similar match for a given item, the algorithm builds a similar-items table by finding items that customers tend to purchase together. Amazon.com's item-to-item approach computes the cosine between binary vectors representing the purchases in a user-item matrix. Given two vectors of attributes (A and B), the cosine similarity, cos(θ), is expressed using a dot product and magnitudes as:
\[ \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \]
Recommendations are based on the items which are most similar to the query item. Greg Linden et al. "Amazon.com Recommendations: Item-to-Item Collaborative Filtering", IEEE Internet Computing, 2003, 7, 76-80.
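As a rough sketch of the similar-items table, the following Python computes the cosine between binary purchase vectors. For binary vectors the dot product is the number of shared buyers, and each magnitude is the square root of the number of buyers. The function names and toy data are hypothetical. The all-pairs loop is a simplification chosen for brevity, and taking the maximum similarity when combining items is only one possible choice.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def build_similar_items(purchases):
    """Return item -> list of (other_item, cosine), most similar first."""
    buyers = defaultdict(set)  # item -> set of users who bought it
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)
    table = defaultdict(list)
    for a, b in combinations(sorted(buyers), 2):
        common = len(buyers[a] & buyers[b])
        if common == 0:
            continue  # never bought together, cosine is zero
        # Binary vectors: dot product = shared buyers, norm = sqrt(#buyers).
        cos = common / sqrt(len(buyers[a]) * len(buyers[b]))
        table[a].append((b, cos))
        table[b].append((a, cos))
    for neighbours in table.values():
        neighbours.sort(key=lambda pair: (-pair[1], pair[0]))
    return dict(table)

def recommend_item_based(purchases, table, user, top_n=3):
    """Match each purchased item to its similar items and combine them."""
    own = purchases[user]
    scores = defaultdict(float)
    for item in own:
        for other, cos in table.get(item, []):
            if other not in own:
                # Combining by maximum similarity is an assumption of this sketch.
                scores[other] = max(scores[other], cos)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical toy data.
purchases = {"u1": {"x", "y"}, "u2": {"x", "y", "z"}, "u3": {"y", "z"}}
table = build_similar_items(purchases)
suggestions = recommend_item_based(purchases, table, "u1")
```

The table can be built offline, so serving a recommendation only requires a lookup per purchased item.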
Since it works for Amazon.com, why not try it... to help medicinal chemists select reagents from chemical databases, and to enhance discovery by surfacing reagents from deep in the catalog that our chemists wouldn't find on their own?
Exploiting the Amazon.com People Who Bought Also Bought Algorithm in Reagent Selection: Not only suggesting new reagents, but also solving problems? For example, suggesting possible bioisosteres. [Scheme: reductive amination; the final product may be genotoxic.] Design idea to avoid Ames positives. The Ames test is one measure of genetic toxicity. Aromatic amines are often unwanted fragments in drug design (genotoxic). Regulatory view: if carcinogenic in animals, it will be a carcinogen in man.
Strategy: (1) Collect a data set of chemical reagents. (2) Get check-out information. (3) Generate a similarity matrix using cosine similarities. (4) Import the matrix into an Oracle database. (5) Display recommendations (ISIS/db query): the items (reagents) which are most similar to the query item (reagent).
Reagent Data Set: Extract the reagents in the stockroom (CIMS) checked out in the last 5 years: 42 304 reagents. Filter: amount != 0. Tweak 1: canonical SMILES were generated, counter salts were removed (and reagents merged), and unique compound IDs were assigned: 12 513 unique. Grouping (tweak 2): assign reagents into 10 functional classes by SMARTS mapping. 10 229 reagents could be mapped onto the 10 functional classes. 194 unique chemists. [Figure: times checked out (y-axis, 0 to 100) per reagent (x-axis), with the annotation "Check-out only once".]
Tweak 1, counter-ions: Ca. 5000 entries include a counter-ion. Different salts should give the same results. For example, the reagent 3,3,3-trifluoropropylamine exists both without and with the hydrochloride salt (3,3,3-trifluoropropylamine hydrochloride). The salts are removed, and the data are merged for the vectors.
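The merging step might look roughly like the sketch below. It is a deliberate simplification: it assumes the SMILES strings are already canonical, treats any '.'-separated component found in a small hypothetical counter-ion list as a salt, and uses made-up chemist names. A real pipeline would rely on a cheminformatics toolkit for both canonicalization and salt stripping.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny list of counter-ion SMILES fragments.
COUNTER_IONS = {"Cl", "Br", "I", "[Na+]", "[K+]", "[Cl-]", "[Br-]"}

def strip_counter_ions(smiles):
    """Drop '.'-separated components that are listed as counter-ions."""
    parts = [p for p in smiles.split(".") if p not in COUNTER_IONS]
    # If everything was a counter-ion, keep the original string unchanged.
    return ".".join(parts) if parts else smiles

def merge_checkouts(records):
    """records: iterable of (user, smiles). Returns parent SMILES -> set of users."""
    merged = defaultdict(set)
    for user, smiles in records:
        merged[strip_counter_ions(smiles)].add(user)
    return dict(merged)

records = [
    ("chemist_1", "NCCC(F)(F)F"),     # 3,3,3-trifluoropropylamine
    ("chemist_2", "NCCC(F)(F)F.Cl"),  # its hydrochloride salt
]
merged = merge_checkouts(records)
```

After merging, both check-outs contribute to the same reagent vector, so the free base and its salt yield identical recommendations.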
Tweak 2, functional classes: A search for amines should only recommend other amines. [Scheme: reductive amination.]
Class  Reagents  Freq  Functional groups
1      3982      8902  primary and secondary amines
2      4349      4772  acids, acid halides, anhydrides, carbamates, carbonates, esters
3      2426      2515  aromatic halides
4      1047      2002  alkyl halides
5      281       281   sulphonyl chlorides
6      2150      3023  alcohols
7      1073      1623  aldehydes, ketones
8      287       287   boronic acids, trifluoroborates
9      184       184   isocyanates, isothiocyanates
10     81        81    alpha halide ketones
(dual functionalities counted twice)
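One way to enforce the "amines only recommend amines" rule is to filter each reagent's neighbour list by functional class, as in the sketch below. The function name, the reagent IDs, the class assignments, and the similarity scores are all hypothetical. A reagent may carry more than one class, in line with the note that dual functionalities are counted twice.

```python
def recommend_within_class(query, similar, reagent_classes, top_n=5):
    """Keep only neighbours sharing at least one functional class with the query."""
    query_classes = reagent_classes.get(query, set())
    hits = [
        (other, score)
        for other, score in similar.get(query, [])
        if reagent_classes.get(other, set()) & query_classes
    ]
    # Highest similarity first; reagent ID breaks ties deterministically.
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in hits[:top_n]]

# Hypothetical class assignments (class numbers follow the table above).
reagent_classes = {
    "C001": {1},     # amine
    "C002": {7},     # aldehyde
    "C003": {1, 6},  # amino alcohol: dual functionality
    "C004": {1},     # amine
}
# Hypothetical precomputed neighbours with cosine similarities.
similar = {"C001": [("C002", 0.9), ("C003", 0.7), ("C004", 0.4)]}
same_class_hits = recommend_within_class("C001", similar, reagent_classes)
```

Here the aldehyde C002 is dropped despite its high similarity, because it shares no functional class with the amine query.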
Similarities: The data are binary: a user checked out a reagent (1), or not (0). Example user-by-item matrix, used for the cosine between C001 and C003:
User              C001  C002  C003
Anthony Nicholls  1     0     1
Andrew Grant      0     1     1
Morten Langgard   0     1     0
(1 = checked out, 0 = not checked out)
[Figure: histogram of binned Amazon.com similarities*, almost all-against-all (x-axis: similarity bins from 0.1 to 1.0; y-axis: frequency).]
*Roughly 85% of the reagents belong in the zero bin.
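For the small example matrix on this slide, the cosine between C001 and C003 can be worked out by hand. The column vectors list the three users in the order given (Nicholls, Grant, Langgard). The symbol c for a column vector is notation introduced here for the example.

```latex
\[
\mathbf{c}_{\mathrm{C001}} = (1, 0, 0), \qquad
\mathbf{c}_{\mathrm{C003}} = (1, 1, 0)
\]
\[
\cos\theta
  = \frac{\mathbf{c}_{\mathrm{C001}} \cdot \mathbf{c}_{\mathrm{C003}}}
         {\lVert \mathbf{c}_{\mathrm{C001}} \rVert \, \lVert \mathbf{c}_{\mathrm{C003}} \rVert}
  = \frac{1}{\sqrt{1}\,\sqrt{2}}
  \approx 0.71
\]
```

Only one user checked out both reagents, so the dot product is 1. C003 was checked out by two users, which gives it the magnitude of the square root of 2.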
Architecture: Oracle and MDL ISIS/Base (not a web-based system). User-by-item matrix with users as rows and items as columns. Overnight updates are possible.
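The storage and lookup side could be sketched as follows. SQLite from the Python standard library is used here purely as a stand-in for the Oracle database named on the slide. The table name, column names, and stored values are hypothetical, and the actual ISIS/Base query is not shown in the source.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the Oracle instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE similarity (query_id TEXT, hit_id TEXT, cosine REAL, "
    "PRIMARY KEY (query_id, hit_id))"
)
# Hypothetical precomputed cosine similarities between reagent pairs.
rows = [
    ("C001", "C003", 0.71),
    ("C001", "C002", 0.00),
    ("C002", "C003", 0.50),
]
# Store both directions so a lookup only needs to filter on query_id.
conn.executemany(
    "INSERT INTO similarity VALUES (?, ?, ?)",
    rows + [(b, a, s) for a, b, s in rows],
)

def top_hits(conn, query_id, limit=10):
    """Return the most similar reagents for one query reagent."""
    cur = conn.execute(
        "SELECT hit_id, cosine FROM similarity "
        "WHERE query_id = ? AND cosine > 0 "
        "ORDER BY cosine DESC, hit_id LIMIT ?",
        (query_id, limit),
    )
    return cur.fetchall()
```

Because the similarities are precomputed, an overnight update amounts to rebuilding this one table, and a recommendation is a single indexed query.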
Results: What does the frontend look like? Yet another similarity measure? A dream come true? Possible ways forward. Other info revealed.
Frontend, and That Little Bit Extra: [Screenshots: original CIMS vs. CIMS-Recommend, with labels for available amount and location.]
Amazon.com vs Other Similarities: Lingos and three fingerprints (ECFP6, FPFP6, MDL Public Keys) are calculated. The top-X hits from each method are compared to the top-X Amazon hits.
Overlap (%)
MaxHits*  ECFP6  FPFP6  Lingo  MDL Public Keys
10        12.3   12.7   3.8    13.4
20        21.3   21.9   4.6    23.1
Illustration of the compared hit lists:
Amazon hits        FP/Lingos hits
HitNo  MolName     HitNo  MolName
1      C0455       1      C0135
2      C0020       2      C0700
3      C0134       3      C0932
4      C0001       4      C0134
Max    C0955       Max    C0251
Results show that Amazon recommendations are, more or less, orthogonal to other searching techniques.
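The overlap figures can be understood through a small sketch like the one below. The slide does not state exactly how the overlap was computed. This example assumes it is the share of the top-X Amazon hits that also appear among the top-X hits of the other method. The function name is hypothetical, and the two lists reuse the first four IDs from the illustration on the slide.

```python
def overlap_percent(hits_a, hits_b, top_x):
    """Percentage of the top-X entries of list A also found in the top-X of list B."""
    a, b = set(hits_a[:top_x]), set(hits_b[:top_x])
    return 100.0 * len(a & b) / top_x

# First four hits from the illustration on the slide.
amazon_hits = ["C0455", "C0020", "C0134", "C0001"]
fingerprint_hits = ["C0135", "C0700", "C0932", "C0134"]

overlap_top4 = overlap_percent(amazon_hits, fingerprint_hits, 4)
overlap_top2 = overlap_percent(amazon_hits, fingerprint_hits, 2)
```

A low percentage means the two rankings surface mostly different reagents, which is the sense in which the slide calls the methods orthogonal.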
Amazon.com vs Other Similarities: Top 10 structures selected by the Amazon-like selection and by the ECFP4 fingerprint method for two queries. [Figure: structure panels "Amazon Top 10" and "ECFP4 Top 10".]
Exploiting Recommendation Systems in Reagent Selection: Design idea to avoid Ames positives. [Scheme: reductive amination.] Search the database for aniline, and get "Chemists who requested aniline also requested:" [structures], all Ames negatives. The advantage of such a feature is the inherent knowledge transfer. In the dream scenario such a reagent suggestion could solve an existing problem.
Medicinal Chemistry Poll: Pre-defined sets? Too diverse recommendations? "Already better! Since I get everything in one go."
Most Frequently Checked-Out Reagents: Other information is easily accessible; just ask the right question. [Figures: "Top5 amines" and "Top5 aldehydes"; x-axis: reagent; y-axis: no. of check-outs.]
Summary: Recommendation systems are useful alternatives to search algorithms since they help users to discover items they might not have found by themselves. We presented a novel, dynamic similarity measure: personalized information was used to produce reagent recommendations, using Amazon.com's item-to-item collaborative filtering technique. Low threshold for trying: the first prototype was finished within 1-2 weeks (as all infrastructure was in place). Maintenance: data can readily be updated nightly or weekly. In the dream scenario such a [reagent] suggestion could solve an existing problem; we are not there just yet (too little data, need more info). Our recommendations are, more or less, orthogonal to other similarity measures. Positive comments in a small MedChem poll. In the end, what we want is happy, satisfied customers!
Acknowledgments: Jens Sadowski for presenting!
Abstract. Amazon.com's "People who bought [this book] also bought [these books]" is a popular feature on numerous web sites nowadays. The use of such a recommender system can be exploited in many areas, also in drug design. In the current work a system to recommend reagents has been developed, using the item-to-item collaborative filtering technique. The goal is to enhance discovery by surfacing reagents from deep in our corporate reagent database: reagents that medicinal chemists might not have found on their own. Another potential advantage of using personalized information is the inherent knowledge transfer. That is, in a dream scenario a reagent recommendation could solve an existing problem. Moreover, this novel similarity measure differs from other similarity measures, as it is based on user-item information and not on descriptions of molecular structures. It will be shown that the recommendations are, more or less, orthogonal to other methods.