Nathan Brown. The Application of Consensus Modelling and Genetic Algorithms to Interpretable Discriminant Analysis.

1 Nathan Brown. The Application of Consensus Modelling and Genetic Algorithms to Interpretable Discriminant Analysis
Workshop: Chemoinformatics in Europe: Research and Teaching, 30th May 2006

2 Discriminant Analysis Using a GA
- Predictive vs. Diagnostic Modelling
- Discriminant Analysis with a Genetic Algorithm
- Consensus and Splice Modelling
- Experimental Studies
  - MDDR: 1130 renin and 636 COX inhibitors [1]
  - Oral drugs: 1082 FDA-approved drugs [2]
1. Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. J. Chem. Inf. Comput. Sci. 2004, 44.
2. Vieth, M.; Siegel, M. G.; Higgs, R. E.; Watson, I. A.; Robertson, D. H.; Savin, K. A.; Durst, G. L.; Hipskind, P. A. Characteristic Physical Properties and Structural Fragments of Marketed Oral Drugs. J. Med. Chem. 2004, 47.
6 June 2006

3 Predictive versus Diagnostic Models
- Highly predictive models tend to obfuscate what is important for the property being modelled
- Highly interpretable models tend to have less predictive power
- However, both objectives are very important
- We want highly predictive models that can also guide our decision-making processes
* Adapted from a diagram by Richard Lewis

4 Discriminant Analysis
- Supervised learning
  - The dependent variable is known for the dataset and is used, with the independent variables, in training the model
- Optimize separation of classes
  - Evolve weights for binned descriptors
  - Score solutions according to their ability to separate objects
- Discover descriptor ranges that are important for discrimination, which can then be applied to make informed decisions
1. Gillet, V. J.; Willett, P.; Bradshaw, J. Identification of Biological Activity Profiles Using Substructural Analysis and Genetic Algorithms. J. Chem. Inf. Comput. Sci. 1998, 38.
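The "evolve weights, score solutions" loop on this slide can be sketched as a small genetic algorithm. This is only an illustration under stated assumptions: the maximum weight `W = 9`, the truncation selection, one-point crossover and point mutation operators, the toy data, and the fitness (difference in mean score between the two classes) are all hypothetical choices, not the operators or fitness functions used in the talk.

```python
import random

W = 9  # hypothetical maximum bin weight


def score(chromosome, gene_indices):
    """Sum the weights of the genes (bins) that a molecule occupies."""
    return sum(chromosome[g] for g in gene_indices)


def fitness(chromosome, actives, inactives):
    """Assumed fitness: mean score of actives minus mean score of inactives."""
    def mean(molecules):
        return sum(score(chromosome, m) for m in molecules) / len(molecules)
    return mean(actives) - mean(inactives)


def evolve(actives, inactives, length, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, W) for _ in range(length)]
                  for _ in range(pop_size)]

    def fit(chromosome):
        return fitness(chromosome, actives, inactives)

    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fit, reverse=True)
        history.append(fit(population[0]))
        parents = population[: pop_size // 2]  # truncation selection, keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)      # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(length)] = rng.randint(0, W)  # point mutation
            children.append(child)
        population = parents + children
    return max(population, key=fit), history


# Toy data: two descriptors with two bins each, so four genes in total.
# Each molecule is the list of gene indices it occupies (one per descriptor).
actives = [[0, 2], [0, 2], [0, 3]]
inactives = [[1, 3], [1, 3], [0, 3]]
best, history = evolve(actives, inactives, length=4)
```

Because the better half of the population is carried over unchanged, the best fitness in `history` can never decrease from one generation to the next.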

5 Chromosome Encoding
- Selection of $N$ descriptors
- Each descriptor $i$ is partitioned into $B_i$ bins
- Each bin can take any value in the range $\{0, \dots, W\}$
- Chromosome length is then $\sum_{i=1}^{N} B_i$
- Example with descriptors MW, ClogP and PSA: $N = 3$, $B = \{4, 7, 5\}$
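A minimal sketch of this encoding, assuming a molecule is scored by summing the weights of the bins its descriptor values fall into. The bin boundaries, the assignment of the bin counts 4, 7 and 5 to MW, ClogP and PSA, and the function names are hypothetical; only the bin counts and descriptor names come from the slide.

```python
from bisect import bisect_right

# Hypothetical bin boundaries; k boundaries give k + 1 bins.
BIN_EDGES = {
    "MW": [200, 350, 500],          # 4 bins
    "ClogP": [-1, 0, 1, 2, 3, 5],   # 7 bins
    "PSA": [40, 80, 120, 160],      # 5 bins
}


def chromosome_length(bin_edges):
    """Total number of genes: the sum of the bin counts over all descriptors."""
    return sum(len(edges) + 1 for edges in bin_edges.values())


def gene_offsets(bin_edges):
    """Index of the first gene belonging to each descriptor."""
    offsets, start = {}, 0
    for name, edges in bin_edges.items():
        offsets[name] = start
        start += len(edges) + 1
    return offsets


def score(chromosome, molecule, bin_edges=BIN_EDGES):
    """Sum the weights of the bins that the molecule's descriptor values occupy."""
    offsets = gene_offsets(bin_edges)
    return sum(
        chromosome[offsets[name] + bisect_right(edges, molecule[name])]
        for name, edges in bin_edges.items()
    )


# A chromosome whose weights equal the gene index makes the lookup easy to follow.
chromosome = list(range(chromosome_length(BIN_EDGES)))   # 16 genes
molecule = {"MW": 320, "ClogP": 2.5, "PSA": 90}
print(score(chromosome, molecule))  # genes 1, 8 and 13 are selected, so prints 22
```

With $B = \{4, 7, 5\}$ the chromosome has 4 + 7 + 5 = 16 genes, and each molecule touches exactly one gene per descriptor.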

6 Descriptor Selection
- Calculate physicochemical descriptors
- Cluster descriptors (not objects)
- Select descriptors that are:
  - more orthogonal, and
  - more interpretable for the medicinal chemist
- Some dataset dependency
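One way to picture "cluster descriptors (not objects)" is to group descriptor columns by how strongly they correlate, then pick one interpretable descriptor per group. The sketch below uses a simple leader-style grouping on absolute Pearson correlation; this is a stand-in chosen for brevity, not the clustering method used in the talk, and the threshold, descriptor names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length columns (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cluster_descriptors(columns, threshold=0.8):
    """Group descriptors: each joins the first cluster whose leader it correlates with."""
    clusters = []  # each cluster is a list of names; the first name is the leader
    for name, values in columns.items():
        for cluster in clusters:
            if abs(pearson(values, columns[cluster[0]])) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical descriptor table: one column per descriptor, one row per molecule.
columns = {
    "MW": [300, 350, 400, 450, 500],
    "HeavyAtoms": [21, 25, 28, 32, 35],
    "ClogP": [3.1, 1.2, 4.0, 0.5, 2.2],
}
clusters = cluster_descriptors(columns)
print(clusters)  # [['MW', 'HeavyAtoms'], ['ClogP']]
```

Choosing one descriptor from each cluster yields a set that is closer to orthogonal; which member to keep is the interpretability judgement the slide leaves to the medicinal chemist.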

7 Fitness Functions 1
- Initial Enhancement (IE)
  - Emphasises enrichment in the top NACT% of recalled molecules, i.e. mean rank of all actives recalled after NACT
- Global Enhancement (GE)
  - Emphasises enrichment of all actives in recalled molecules, i.e. mean rank of all actives
- Maximum Difference Enhancement (MDE)
  - Emphasises maximum difference in scores between the two classes

8 Fitness Functions 2
- The existing fitness function used a combination of evaluations:
  - Number of actives in the top N%
  - Average rank of actives over the entire ranking
- Maximised Difference Enhancement (MDE)
  - The difference between the average ranks of the two classes being discriminated
- MDE tends to produce models in which the separation between the two classes is maximised globally, while rewarding rankings that place more interesting molecules in the initial part of the ranking
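The rank-based quantities named on these two slides can be sketched as follows. The function names, the sign convention for MDE (mean inactive rank minus mean active rank, so that larger is better), the cutoff rounding and the toy scores are assumptions made for illustration; the slides do not give exact formulas.

```python
def ranked_labels(scores, labels):
    """Class labels ordered by descending score (rank 1 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def actives_in_top(ranked, percent):
    """Number of actives (label 1) in the top `percent` of the ranking."""
    cutoff = max(1, round(len(ranked) * percent / 100))
    return sum(ranked[:cutoff])


def mean_rank(ranked, label):
    """Average 1-based rank of all molecules carrying `label`."""
    ranks = [rank for rank, lab in enumerate(ranked, start=1) if lab == label]
    return sum(ranks) / len(ranks)


def mde(ranked):
    """Assumed MDE: mean rank of inactives minus mean rank of actives."""
    return mean_rank(ranked, 0) - mean_rank(ranked, 1)


# Toy ranking: 1 = active, 0 = inactive.
scores = [9.0, 7.5, 7.0, 4.0, 3.5, 1.0]
labels = [1, 1, 0, 1, 0, 0]
ranked = ranked_labels(scores, labels)

print(actives_in_top(ranked, 50))  # 2 actives among the top 3 molecules
print(mean_rank(ranked, 1))        # actives sit at ranks 1, 2 and 4
print(mde(ranked))                 # positive when actives rank above inactives
```

A perfect model would place all actives first, which minimises the mean rank of actives and maximises this version of MDE at the same time.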

9 Consensus Models
- Aim to reduce the stochastic effects of using a single chromosome
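The slide does not say how the consensus is formed. One plausible reading, sketched here purely as an assumption, is to average the bin weights of the best chromosomes from several independent GA runs; for an additive scoring scheme this is equivalent to averaging the scores of the individual models. The function name and weights are hypothetical.

```python
def consensus_chromosome(chromosomes):
    """Element-wise mean of several equal-length chromosomes."""
    length = len(chromosomes[0])
    if any(len(c) != length for c in chromosomes):
        raise ValueError("all chromosomes must have the same length")
    return [sum(genes) / len(chromosomes) for genes in zip(*chromosomes)]


# Hypothetical best chromosomes from three independent GA runs.
runs = [
    [9, 0, 2, 7],
    [7, 1, 0, 9],
    [8, 2, 1, 8],
]
print(consensus_chromosome(runs))  # [8.0, 1.0, 1.0, 8.0]
```

Weights that are high in every run stay high in the consensus, while weights that fluctuate between runs are pulled towards the middle, which is the smoothing of stochastic effects that the slide aims for.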

10 Splice Models
- Essentially a manual recombination operator, used to assemble a better model based on feedback and intuition
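A manual recombination of this kind could look like the sketch below, in which the modeller chooses, descriptor by descriptor, which existing model supplies the bin weights. The segment layout, the `splice` function, and the weights are hypothetical; the model names M1 and M5 are borrowed from the result charts only as labels.

```python
def splice(models, segments, choice):
    """Build a chromosome by taking each descriptor's genes from a chosen model.

    models:   model name -> chromosome (list of bin weights)
    segments: descriptor name -> (start, stop) gene range for that descriptor
    choice:   descriptor name -> name of the model to copy that range from
    """
    length = max(stop for _, stop in segments.values())
    spliced = [None] * length
    for descriptor, (start, stop) in segments.items():
        spliced[start:stop] = models[choice[descriptor]][start:stop]
    return spliced


# Two descriptors: MW uses genes 0-1 and ClogP uses genes 2-4.
segments = {"MW": (0, 2), "ClogP": (2, 5)}
models = {
    "M1": [9, 0, 1, 1, 1],
    "M5": [2, 2, 0, 5, 9],
}
# The modeller judges M1 better for MW and M5 better for ClogP.
spliced = splice(models, segments, {"MW": "M1", "ClogP": "M5"})
print(spliced)  # [9, 0, 0, 5, 9]
```

Splicing at descriptor boundaries keeps each descriptor's weight profile intact, so the result stays as interpretable as the models it was built from.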

11 Renin Consensus Discrimination Model
[Figure: Global Enhancement (axis from 0 to 1) by model: M1TR, M1TS1, M1TS2, CM1TS1, CM1TS2, M5TR, M5TS1, M5TS2, CM5TS1, CM5TS2, M6TR, M6TS1, M6TS2, CM6TS1, CM6TS2. Series: Single Model Result, Consensus Model Result.]

13 Renin Splice Discrimination Model
[Figure: Global Enhancement by model: M1TS1, M1TS2, CM1TS1, CM1TS2, M5TS1, M5TS2, CM5TS1, CM5TS2, M6TS1, M6TS2, CM6TS1, CM6TS2, SM6TR, SM6TS1, SM6TS2. Series: Single Model Result, Consensus Model Result, Splice Model Result.]

15 COX Splice Discrimination Model
[Figure: Global Enhancement by model: M1TR, M1TS1, M1TS2, CM1TS1, CM1TS2, M2TR, M2TS1, M2TS2, CM2TS1, CM2TS2. Series: Single Model Result, Consensus Model Result.]

17 Comparative Study: Oral vs. Non-Oral Drugs
- Oral vs. non-oral drugs dataset [1]
- GA model compared with models generated with:
  - Naïve Bayes Classifier (NBC)
  - Support Vector Machines (SVM)
- Investigating:
  - Consistency of results
  - Interpretation of models
1. Vieth, M.; Siegel, M. G.; Higgs, R. E.; Watson, I. A.; Robertson, D. H.; Savin, K. A.; Durst, G. L.; Hipskind, P. A. Characteristic Physical Properties and Structural Fragments of Marketed Oral Drugs. J. Med. Chem. 2004, 47.

18 Oral Drug Discrimination Model
[Figure: Global Enhancement of the GA, NBC and SVM models on the Training Set, Test Set 1 and Test Set 2.]

19 Model Interpretability
- Model weights indicate:
  - Important descriptors
  - Important ranges
- Used to guide decision-making processes:
  - Similarity searching
  - Filtering rules
- Rules are focused on the domain of interest

20 Conclusions
- Consensus and splice models provide consistently improved results
- GA models provide interpretability greater than or similar to that of the other methods applied here
- Models are transparent as to which descriptors, and which of their ranges, are of greatest importance in discriminating
- Indications that the GA and NBC methods could be applied in combination
  - Investigation of complementarity
1. Ganguly, M.; Brown, N.; Schuffenhauer, A.; Ertl, P.; Gillet, V. J.; Greenidge, P. A. Introducing the Consensus Modeling Concept in Genetic Algorithms: Application to Interpretable Discrimination Analysis. Submitted to J. Chem. Inf. Model.

21 Areas the Student Covered
- Cluster analysis
- Druglikeness
- Discriminant analysis
- Variable selection
- Genetic algorithms
- Statistical learning methods
- Java programming
- Method development

22 What does the student gain?
- Coding and adapting software
- Tackling everyday challenges of research
- Performing research in industry
- Application-context drug research
- Empowered to pursue their own research

23 What do the mentors gain?
- Freedom to pursue an avenue of interest
- Developing skills in student mentoring
- A new viewpoint with new ideas
- Assisting in training the next generation of scientists

24 Acknowledgements
- University of Sheffield: Milan Ganguly, Val Gillet, Peter Willett
- UCSF: Jérôme Hert
- Cheminformatics: Peter Ertl, Stephen Jelfs
- Computer-Aided Drug Discovery: Paulette Greenidge, Richard Lewis, Nikolaus Stiefl
- Molecular & Library Informatics: Kamal Azzaoui, Edgar Jacoby, Ansgar Schuffenhauer


Machine learning for algo trading Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with

More information

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16 Course Director: Dr. Barry Grant (DCM&B, [email protected]) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems

More information

203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

More information

Machine Learning. 01 - Introduction

Machine Learning. 01 - Introduction Machine Learning 01 - Introduction Machine learning course One lecture (Wednesday, 9:30, 346) and one exercise (Monday, 17:15, 203). Oral exam, 20 minutes, 5 credit points. Some basic mathematical knowledge

More information

Predictive Analytics Certificate Program

Predictive Analytics Certificate Program Information Technologies Programs Predictive Analytics Certificate Program Accelerate Your Career Offered in partnership with: University of California, Irvine Extension s professional certificate and

More information

Original article: A SIMPLE CLICK BY CLICK PROTOCOL TO PERFORM DOCKING: AUTODOCK 4.2 MADE EASY FOR NON-BIOINFORMATICIANS

Original article: A SIMPLE CLICK BY CLICK PROTOCOL TO PERFORM DOCKING: AUTODOCK 4.2 MADE EASY FOR NON-BIOINFORMATICIANS Original article: A SIMPLE CLICK BY CLICK PROTOCOL TO PERFORM DOCKING: AUTODOCK 4.2 MADE EASY FOR NON-BIOINFORMATICIANS Syed Mohd. Danish Rizvi 1, Shazi Shakil* 2, Mohd. Haneef 2 1 Department of Biosciences,

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi [email protected] Satakshi Rana [email protected] Aswin Rajkumar [email protected] Sai Kaushik Ponnekanti [email protected] Vinit Parakh

More information

Electronic Health Records: An introduction to openehr and archetypes

Electronic Health Records: An introduction to openehr and archetypes Electronic Health Records: An introduction to openehr and archetypes Dr. Sebastian Garde CCR Workshop Munich 29 th April 2008 Expectations Timely information and reports for ALL professions with a minimum

More information

The INFUSIS Project Data and Text Mining for In Silico Modeling

The INFUSIS Project Data and Text Mining for In Silico Modeling The INFUSIS Project Data and Text Mining for In Silico Modeling Henrik Boström 1,2, Ulf Norinder 3, Ulf Johansson 4, Cecilia Sönströd 4, Tuve Löfström 4, Elzbieta Dura 5, Ola Engkvist 6, Sorel Muresan

More information

Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources

Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources Ping Zhang IBM T. J. Watson Research Center Pankaj Agarwal GlaxoSmithKline Zoran Obradovic Temple University Terms and

More information

Artificial Intelligence and Machine Learning Models

Artificial Intelligence and Machine Learning Models Using Artificial Intelligence and Machine Learning Techniques. Some Preliminary Ideas. Presentation to CWiPP 1/8/2013 ICOSS Mark Tomlinson Artificial Intelligence Models Very experimental, but timely?

More information