Exploiting the Amazon.com People Who Bought Also Bought Algorithm in Reagent Selection. Christian Tyrchan, Niklas Falk and Jonas Boström



Transcription:

Exploiting the Amazon.com People Who Bought Also Bought Algorithm in Reagent Selection. Christian Tyrchan, Niklas Falk and Jonas Boström

Setting the Scene
The current trend is that drug discovery projects are treated as processes. Creativity might be hampered, leaving little room for serendipity. We need new ways of working: we want creative users who do not feel stuck in processes. Making novel compounds is at the heart of drug design. Thus, the aim of the current work is to enhance discovery by surfacing reagents from deep in the catalog that our chemists wouldn't find on their own, using a novel approach in which similarity is based on users, not structures.

Internet Success Stories
New Technologies, New Sciences: Finite State Machines; Item-to-Item Collaborative Filtering (new approaches to improve searches).

Recommendation systems are best known for their use on e-commerce Web sites. A recommendation system attempts to present items that are likely to be of interest to the user. The idea of recommending items at checkout is nothing new.

The Harry Potter Shopping Cart
Amazon.com saw the opportunity to personalize impulse buys.


Recommendation Systems
Typically, a recommender system compares the user's profile to some reference characteristics and seeks to predict the 'rating' that a user would give to an item they had not yet considered. It should help a customer find and discover new, relevant, and interesting items.
Two main categories (based on how the recommendations are made):
Content-based recommendations (the information item): the user will be recommended items similar to the ones the user preferred in the past.
Collaborative recommendations (the social environment): the user will be recommended items that people with similar taste liked in the past.

Content-based and Collaborative Systems
Content-based recommendations: only the movies that have a high degree of similarity to the user's preferences would be recommended.
Collaborative recommendations start by finding a set of customers whose purchased items overlap the user's purchased items. The algorithm aggregates items from these similar customers, eliminates items the user has already purchased, and recommends the remaining items to the user. The focus is on finding similar users, and a user is represented as an N-dimensional vector of items.
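The collaborative procedure described above (find overlapping customers, aggregate their items, drop what the user already owns) can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular site: the customer names, the basket contents, the `k` nearest-neighbour cut-off and the function names are all assumptions made for the example.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sets of purchased items (binary vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend_user_based(target, purchases, k=2):
    """Score unseen items by the summed similarity of the k most similar customers."""
    own = purchases[target]
    # Rank the other customers by basket overlap with the target customer.
    neighbours = sorted(
        ((cosine(own, items), user) for user, items in purchases.items() if user != target),
        reverse=True,
    )[:k]
    scores = {}
    for sim, user in neighbours:
        if sim == 0.0:
            continue  # no overlap, so this customer carries no signal
        for item in purchases[user] - own:  # eliminate items already purchased
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical purchase histories.
purchases = {
    "alice": {"book_a", "book_b"},
    "bob": {"book_a", "book_b", "book_c"},
    "carol": {"book_b", "book_d"},
    "dave": {"book_e"},
}
print(recommend_user_based("alice", purchases))
```

Note that every request compares the target customer against all other customers, which is the online cost that makes this approach hard to scale.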

Recommendations needed to work...
From sparse data: often just a few purchases.
Fast: high-quality recommendations in real time.
At scale: the system needed to handle massive numbers and huge amounts of data.
Immediately: the algorithm must respond at once to new information, because customer data is volatile.
None of the existing methods was good enough. Traditional collaborative filtering does little or no offline computation, and its online computation scales with the number of customers and catalog items, so the algorithm is impractical on large data sets. Content-based recommendations offer nothing new (unless randomized).

Item-to-Item Collaborative Filtering
Item-to-item collaborative filtering matches each of the user's purchased items to similar items, then combines those similar items into a recommendation list. To determine the most-similar match for a given item, the algorithm builds a similar-items table by finding items that customers tend to purchase together. Amazon.com's item-to-item approach computes the cosine between binary vectors representing the purchases in a user-item matrix. Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitudes as:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Recommendations are based on the items that are most similar to the query item.
Greg Linden et al., "Amazon.com Recommendations: Item-to-Item Collaborative Filtering", IEEE Internet Computing, 2003, 7, 76-80.
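As a rough sketch of the idea, the similar-items table can be built offline from purchase baskets and then used to combine the neighbours of everything in a basket. The basket data, the function names and the simple summed-score ranking are assumptions made for illustration; the cited paper does not prescribe this exact code.

```python
from collections import defaultdict
from math import sqrt

def similar_items_table(baskets):
    """Build {item: {other_item: cosine}} from a {customer: set_of_items} mapping."""
    buyers = defaultdict(set)  # item -> customers who bought it (one binary column)
    for customer, items in baskets.items():
        for item in items:
            buyers[item].add(customer)
    table = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            if a == b:
                continue
            shared = len(buyers[a] & buyers[b])  # dot product of two binary vectors
            if shared:
                table[a][b] = shared / sqrt(len(buyers[a]) * len(buyers[b]))
    return table

def recommend(basket, table, top_n=3):
    """Combine the similar items of everything in the basket into one ranked list."""
    scores = defaultdict(float)
    for item in basket:
        for other, sim in table.get(item, {}).items():
            if other not in basket:
                scores[other] += sim
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

# Hypothetical baskets.
baskets = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "D"},
}
table = similar_items_table(baskets)
print(recommend({"A"}, table))
```

The expensive pairwise step lives in `similar_items_table`, which can run offline; `recommend` only performs table lookups, which is what keeps the online part fast.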

Since it works for Amazon.com, why not try it...
...to help medicinal chemists select reagents from chemical databases, and to enhance discovery by surfacing reagents from deep in the catalog that our chemists wouldn't find on their own.

Exploiting the Amazon.com People Who Bought Also Bought Algorithm in Reagent Selection
Not only suggesting new reagents, but also solving problems? For example, suggesting possible bioisosteres.
[Reaction scheme: reductive amination; the final product may be genotoxic.]
Design idea: avoid Ames positives.
The Ames test is one measure of genetic toxicity. Aromatic amines are often unwanted fragments in drug design (genotoxic). Regulatory view: if carcinogenic in animals, it will be a carcinogen in man.

Strategy
1. Collect a data set of chemical reagents.
2. Get check-out information.
3. Generate a similarity matrix using cosine similarities.
4. Import the matrix into an Oracle database.
5. Display recommendations: an ISIS/db query returns the items (reagents) that are most similar to the query item (reagent).
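The last two steps of the strategy (store the similarity matrix, then query it for the nearest reagents) might look roughly like the sketch below. SQLite from the Python standard library stands in for the Oracle database here, and the table name, column names and similarity values are hypothetical.

```python
import sqlite3

# Hypothetical similarity pairs; in the workflow they come from the cosine step.
pairs = [
    ("C001", "C003", 0.71),
    ("C002", "C003", 0.50),
    ("C001", "C004", 0.25),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for the Oracle database
conn.execute("CREATE TABLE similarity (item_a TEXT, item_b TEXT, cosine REAL)")
# Store both directions so a lookup by either reagent is a single indexed query.
conn.executemany(
    "INSERT INTO similarity VALUES (?, ?, ?)",
    pairs + [(b, a, s) for a, b, s in pairs],
)
conn.execute("CREATE INDEX idx_item_a ON similarity (item_a)")

def recommendations(query_item, limit=10):
    """Return (reagent, cosine) rows most similar to query_item, best first."""
    rows = conn.execute(
        "SELECT item_b, cosine FROM similarity "
        "WHERE item_a = ? ORDER BY cosine DESC, item_b LIMIT ?",
        (query_item, limit),
    )
    return rows.fetchall()

print(recommendations("C003"))
```

Keeping only the non-zero pairs in a long, narrow table is one plausible design choice when most similarities are zero; the source does not say how the matrix was actually laid out.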

Reagent Data Set
Extract: reagents in the stockroom (CIMS) checked out in the last 5 years: 42 304 reagents.
Filter: amount != 0; canonical SMILES generated; counter-ion salts removed and reagents merged (tweak 1); unique compound IDs assigned: 12 513 unique reagents.
Grouping: reagents assigned to 10 functional classes by SMARTS mapping (tweak 2). 10 229 reagents could be mapped onto the 10 functional classes.
194 unique chemists.
[Figure: number of times checked out per reagent; annotation: checked out only once.]

Tweak 1: counter-ions
Ca. 5000 entries include a counter-ion. Different salts should give the same results. For example, the reagent below exists with and without the hydrochloride salt:
[Structures: 3,3,3-TRIFLUOROPROPYLAMINE and 3,3,3-TRIFLUOROPROPYLAMINE HYDROCHLORIDE.]
The salts are removed, and the data are merged for the vectors.
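A toy version of this merge is sketched below. It assumes the SMILES strings are already canonical and uses a deliberately crude rule (keep the longest dot-separated fragment) as the salt stripper; a production pipeline would rely on a cheminformatics toolkit instead. The chemist names and function names are hypothetical.

```python
from collections import defaultdict

def parent_smiles(smiles):
    """Crude salt stripping: keep the longest dot-separated SMILES fragment."""
    return max(smiles.split("."), key=len)

def merge_checkouts(records):
    """Merge (chemist, smiles) check-out records onto the salt-free parent."""
    merged = defaultdict(set)
    for chemist, smiles in records:
        merged[parent_smiles(smiles)].add(chemist)
    return dict(merged)

records = [
    ("chemist_1", "NCCC(F)(F)F"),     # 3,3,3-trifluoropropylamine
    ("chemist_2", "NCCC(F)(F)F.Cl"),  # the hydrochloride salt
    ("chemist_3", "Cl.NCCC(F)(F)F"),  # same salt, fragments in the other order
]
print(merge_checkouts(records))
```

After the merge all three check-outs contribute to a single reagent vector, so the free amine and its hydrochloride yield the same recommendations.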

Tweak 2: functional classes
A search for amines should only recommend other amines.
[Reaction scheme: reductive amination.]

Class  Reagents  Freq  Functional groups
1      3982      8902  primary and secondary amines
2      4349      4772  acids, acid halides, anhydrides, carbamates, carbonates, esters
3      2426      2515  aromatic halides
4      1047      2002  alkyl halides
5      281       281   sulphonyl chlorides
6      2150      3023  alcohols
7      1073      1623  aldehydes, ketones
8      287       287   boronic acids, trifluoroborates
9      184       184   isocyanates, isothiocyanates
10     81        81    alpha halide ketones

(Dual functionalities are counted twice.)
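One way to enforce the "amines only recommend amines" rule is to filter the similar-items list by shared functional class, as in the sketch below. The reagent IDs, class assignments and similarity scores are hypothetical, and the class assignment itself (done by SMARTS mapping in the original work) is taken as given.

```python
def recommend_in_class(query, similar, classes, top_n=5):
    """Rank reagents similar to query, keeping only those that share a class with it."""
    wanted = classes.get(query, set())
    hits = [
        (score, reagent)
        for reagent, score in similar.get(query, {}).items()
        if classes.get(reagent, set()) & wanted
    ]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [reagent for _, reagent in hits[:top_n]]

# Hypothetical reagent ids, class assignments and similarity scores.
classes = {
    "C001": {1},     # class 1: primary and secondary amines
    "C002": {1, 6},  # amino alcohol: dual functionality, so two classes
    "C003": {7},     # class 7: aldehydes, ketones
    "C004": {1},
}
similar = {"C001": {"C002": 0.4, "C003": 0.9, "C004": 0.6}}
print(recommend_in_class("C001", similar, classes))
```

Here the aldehyde C003 is dropped despite having the highest score, while the dual-function reagent C002 is kept because one of its classes matches the query.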

Similarities
The data are binary: a user either checked out a reagent (1) or did not (0). Items are columns and users are rows.

User              C001  C002  C003
Anthony Nicholls  1     0     1
Andrew Grant      0     1     1
Morten Langgard   0     1     0

1 = checked out, 0 = not checked out.

The cosine between C001 and C003 is:

$$\cos(\theta) = \frac{1}{\sqrt{1}\,\sqrt{2}} \approx 0.71$$

[Figure: histogram of binned Amazon.com similarities (frequency against similarity, bins from 0 to 1.0), computed almost all-against-all. Roughly 85% of the reagents belong in the zero bin.]
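The three-user example can be checked by hand or with a short script. The sketch below hard-codes the table from the slide and computes the cosine for every pair of reagent columns; the helper names are assumptions.

```python
from itertools import combinations
from math import sqrt

# Rows are users, columns are reagents; 1 = checked out, 0 = not checked out.
items = ["C001", "C002", "C003"]
matrix = [
    [1, 0, 1],  # Anthony Nicholls
    [0, 1, 1],  # Andrew Grant
    [0, 1, 0],  # Morten Langgard
]

def column(name):
    """Return the binary check-out vector (one entry per user) for a reagent."""
    j = items.index(name)
    return [row[j] for row in matrix]

def cosine(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

sims = {(a, b): cosine(column(a), column(b)) for a, b in combinations(items, 2)}
for pair, value in sims.items():
    print(pair, round(value, 3))
```

C001 and C002 share no user, so their similarity is zero; with real check-out data most pairs look like this, which is consistent with the large zero bin in the histogram.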

Architecture
Oracle and MDL ISIS/Base; not a web-based system. The user-by-item matrix has users as rows and items as columns. Overnight updates are possible.

Results
What does the frontend look like? Yet another similarity measure? A dream come true? Possible ways forward. Other info revealed.

Frontend, and That Little Bit Extra
[Screenshots: original CIMS and CIMS-Recommend. Extras shown: available amount, location.]

Amazon.com vs Other Similarities
Lingos and 3 fingerprints are calculated (ECFP6, FPFP6, MDL Public Keys). The top-X hits of each method are compared to the top-X Amazon hits.

Overlap (%)
MaxHits*  ECFP6  FPFP6  Lingo  MDL Public Keys
10        12.3   12.7   3.8    13.4
20        21.3   21.9   4.6    23.1

Example hit lists (HitNo, MolName):
Amazon:     1 C0455, 2 C0020, 3 C0134, 4 C0001, ..., Max C0955
FP/Lingos:  1 C0135, 2 C0700, 3 C0932, 4 C0134, ..., Max C0251

Results show that Amazon recommendations are, more or less, orthogonal to other searching techniques.
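The slide does not spell out how the overlap percentage was computed. One plausible reading, sketched below, is the share of the top-X hits that two ranked lists have in common; that definition and the function name are assumptions. The example uses only the first four hits of each list shown on the slide.

```python
def overlap_percent(hits_a, hits_b, top_x):
    """Percentage of the top_x hits of list A that also appear in the top_x of list B."""
    shared = set(hits_a[:top_x]) & set(hits_b[:top_x])
    return 100.0 * len(shared) / top_x

# The first four hits of the two example lists from the slide.
amazon_hits = ["C0455", "C0020", "C0134", "C0001"]
fingerprint_hits = ["C0135", "C0700", "C0932", "C0134"]
print(overlap_percent(amazon_hits, fingerprint_hits, top_x=4))
```

Only C0134 appears in both lists, so this toy comparison gives 25%; the figures in the table are presumably averages over many query reagents.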

Amazon.com vs Other Similarities
[Figure: top 10 structures selected by the Amazon-like selection and by the ECFP4 fingerprint method for two queries. Panels: Amazon Top 10; ECFP4 Top 10.]

Exploiting Recommendation Systems in Reagent Selection
Design idea: avoid Ames positives.
[Reaction scheme: reductive amination.]
Search the database for aniline, and get "Chemists who requested aniline also requested: ..."
[Structures: the recommended reagents, all Ames negatives.]
The advantage of such a feature is the inherent knowledge-transfer. In the dream scenario such a reagent suggestion could solve an existing problem.

Medicinal Chemistry Poll
"Pre-defined sets?"
"Too diverse recommendations?"
"Already better! Since I get everything in one go."

Most Frequently Checked-Out Reagents
Other information is easily accessible: just ask the right question.
[Figure: number of check-outs per reagent for amines, with the top 5 amines shown as structures.]
[Figure: number of check-outs per reagent for aldehydes, with the top 5 aldehydes shown as structures.]
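"Asking the right question" of the check-out data can be as simple as a frequency count per functional class. The sketch below uses a hypothetical check-out log and hypothetical reagent IDs; only the idea of ranking reagents by number of check-outs comes from the slide.

```python
from collections import Counter

# Hypothetical check-out log of (chemist, reagent id, functional class) rows.
log = [
    ("chemist_1", "C001", "amine"),
    ("chemist_2", "C001", "amine"),
    ("chemist_3", "C002", "amine"),
    ("chemist_1", "C003", "aldehyde"),
    ("chemist_2", "C003", "aldehyde"),
    ("chemist_3", "C003", "aldehyde"),
]

def top_checked_out(log, functional_class, n=5):
    """Return the n most frequently checked-out reagents of one functional class."""
    counts = Counter(reagent for _, reagent, cls in log if cls == functional_class)
    return counts.most_common(n)

print(top_checked_out(log, "amine"))
```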

Summary
Recommendation systems are useful alternatives to search algorithms since they help users discover items they might not have found by themselves.
We presented a novel, dynamic similarity measure: personalized information was used to produce reagent recommendations, using Amazon.com's item-to-item collaborative filtering technique.
Low threshold for trying: the first prototype was finished within 1-2 weeks (as all infrastructure was in place).
Maintenance: the data can readily be updated nightly or weekly.
"In the dream scenario such a [reagent] suggestion could solve an existing problem." We are not there just yet (too little data; more information is needed).
Our recommendations are, more or less, orthogonal to other similarity measures.
Positive comments in a small MedChem poll.
In the end, what we want is happy, satisfied customers!

Acknowledgments
Jens Sadowski, for presenting!

Exploiting the Amazon.com People Who Bought Also Bought Algorithm in Reagent Selection
Abstract. Amazon.com's "People who bought [this book] also bought [these books]" is a popular feature on numerous web sites nowadays. Such a recommender system can be exploited in many areas, including drug design. In the current work a system to recommend reagents has been developed, using the item-to-item collaborative filtering technique. The goal is to enhance discovery by surfacing reagents from deep in our corporate reagent database: reagents that medicinal chemists might not have found on their own. Another potential advantage of using personalized information is the inherent knowledge-transfer. That is, in a dream scenario a reagent recommendation could solve an existing problem. Moreover, this novel similarity measure differs from other similarity measures, as it is based on user-item information and not on descriptions of molecular structures. It will be shown that the recommendations are, more or less, orthogonal to other methods.