Learning from Diversity


Learning from Diversity: Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines. Rob Patro and Carl Kingsford, Center for Bioinformatics and Computational Biology, University of Maryland. Nov. 16, 2010.

Overview
- Challenge: epitope-antibody recognition
- Solution: an ensemble of support vector machines, trained with a probabilistic extension
- Variety of feature classes: physicochemical properties, string kernels, structure
- Performance of the individual methods and of the ensemble

Problem Overview
The Challenge: predict whether a peptide contains an antibody binding site. [Illustration: http://visualscience.ru, 2010.] Binding with linear epitopes gives a simpler sequence-affinity relation.
The Details:
- Measure binding affinity aff(p_i) ∈ [0, 65536]
- C+ = { p_i : aff(p_i) ∈ [10000, 65536] }, 6,841 binders
- C− = { p_i : aff(p_i) ∈ [0, 1000] }, 20,437 non-binders
- Learn a function f : P → [0, 1] to predict binding:
  f(p_i) ≥ 0.5 ⇒ p_i ∈ C+, and f(p_i) < 0.5 ⇒ p_i ∈ C−

System Overview
- Individual classifiers f_0, f_1, ..., f_M trained on various features
- Candidate classifier families: decision trees, boosted/bagged/random forests, naive Bayes, logistic regression, maximum entropy classification, (balanced) Winnow classifiers, etc. Here: support vector machines (SVMs)
- Classifier scores are aggregated on the interval [0, 1]: below 0.5 is C− (unlikely binder), above 0.5 is C+ (likely binder)
- This produces a prediction for the binding class

Probabilistic SVMs
- Ideally we want a confidence in each prediction (Platt 1999)
- For each prediction, we obtain a posterior probability on [0, 1] (below 0.5: C−; above 0.5: C+)
- Allows ranking of predictions by posterior
- Aids in classifier combination
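
A minimal sketch of Platt-style calibration: fit a sigmoid mapping raw SVM decision scores to posteriors. This is an illustration only, using plain gradient descent in place of Platt's Newton-based fit, and `platt_scale`, `posterior`, and the sample scores are hypothetical names, not the authors' code.

```python
import math

def platt_scale(scores, labels, lr=0.01, iters=2000):
    """Fit P(y=1 | s) = 1 / (1 + exp(A*s + B)) to decision
    scores by gradient descent on the negative log-likelihood
    (a simplified stand-in for Platt's procedure)."""
    A, B = 0.0, 0.0
    for _ in range(iters):
        gA = gB = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            # gradient of the negative log-likelihood w.r.t. A and B
            gA += (y - p) * s
            gB += (y - p)
        A -= lr * gA / len(scores)
        B -= lr * gB / len(scores)
    return A, B

def posterior(score, A, B):
    """Posterior probability of the positive class for one score."""
    return 1.0 / (1.0 + math.exp(A * score + B))
```

With separable toy scores (positives high, negatives low), the fitted sigmoid assigns posteriors above 0.5 to positive-scoring peptides and below 0.5 to negative-scoring ones, which is exactly what the ensemble needs for ranking and combination.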

Combining Predictions
- Probabilistic SVMs trained on various features, using various kernels, with various parameters
- Combined by weighted voting:
  P(C+ | x) = (1/Z) Σ_j w_j P_j(C+ | x)
  where w_j is the weight of the j-th classifier, based on its cross-validation performance; Z is a normalizing constant; and P_j(C+ | x) is the posterior of the positive class label for the j-th classifier
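
The weighted vote over per-classifier posteriors can be sketched in a few lines; `ensemble_posterior` is an illustrative name, and using the sum of weights as the normalizing constant is an assumption.

```python
def ensemble_posterior(posteriors, weights):
    """Weighted-vote combination of per-classifier posteriors
    P_j(C+ | x), with weights w_j (e.g. derived from each
    classifier's cross-validation performance) and normalizing
    constant Z = sum of the weights."""
    z = sum(weights)
    return sum(w * p for w, p in zip(weights, posteriors)) / z

def predict_class(posteriors, weights, threshold=0.5):
    """Final binding-class call from the combined posterior."""
    return ensemble_posterior(posteriors, weights) >= threshold
```

For example, three classifiers with posteriors 0.9, 0.6, 0.2 and weights 1, 1, 2 combine to (0.9 + 0.6 + 0.4) / 4 = 0.475, i.e. a (narrow) non-binder call.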

Choosing Features
- To train SVMs, we translate each peptide p_i into a feature vector x_i
- Good features are essential: they should be discriminative, lead to class separability, and be efficient to compute
- Real features capture partial information, separate subsets of the data, and are often complementary
- So we consider many features with useful predictive power

Which Features? (example peptide: ILAMRSHYPF)
- Sequence features: k-spectrum kernel, mismatch kernel, substitution kernel, string subsequence kernel, sparse spatial sample kernel
- Physicochemical/biochemical features: BLOSUM encoding, AAIndex encoding, local composition
- Structure features: peptide/structure shape complementarity

Sequence Features (String Kernels)
- String kernels assign a similarity to a pair of strings
- K-spectrum kernel: consider all K k-mers that occur in the training set, and encode each peptide as a vector v ∈ R^K with
  v_j = 1 if p contains the j-th k-mer, and 0 otherwise;
  alternatively, v_j can encode the frequency of the j-th k-mer in the peptide
- Other string kernels: mismatch kernel, substitution kernel, restricted gappy kernel, string subsequence kernel, sparse spatial sample (SSS) kernel
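
A small sketch of the k-spectrum encoding just described, with both the binary and the frequency variant; `k_spectrum` and `spectrum_kernel` are illustrative names, and the toy two-letter alphabet below stands in for the 20 amino acids.

```python
def k_spectrum(peptide, kmers, counts=False):
    """Encode a peptide against a fixed k-mer vocabulary:
    v_j = presence (or count) of the j-th k-mer in the peptide."""
    k = len(kmers[0])
    seen = {}
    for i in range(len(peptide) - k + 1):
        kmer = peptide[i:i + k]
        seen[kmer] = seen.get(kmer, 0) + 1
    if counts:
        return [seen.get(m, 0) for m in kmers]
    return [1 if m in seen else 0 for m in kmers]

def spectrum_kernel(p, q, kmers):
    """K(p, q) = dot product of the two count spectra."""
    vp = k_spectrum(p, kmers, counts=True)
    vq = k_spectrum(q, kmers, counts=True)
    return sum(a * b for a, b in zip(vp, vq))
```

In practice the vocabulary would be all k-mers observed in the training set, so the vectors can be long but are very sparse.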

Compositional Features
- Consider physicochemical properties of each peptide sequence: hydropathy, antigenicity, structure preference, etc.
- Average the property over the entire peptide, mapping each peptide to a scalar v ∈ R
- [Table: amino acid hydropathy values, e.g. A 1.8, R −4.5, N −3.5, D −3.5, C 2.5, ..., V 4.2; averaging over an example peptide gives 0.44]
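
The global compositional feature reduces to a table lookup and an average. The values below are the Kyte-Doolittle hydropathy scale, which matches the per-residue values shown on the slide; `mean_property` is an illustrative name.

```python
# Kyte-Doolittle hydropathy values per amino acid
HYDROPATHY = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def mean_property(peptide, table=HYDROPATHY):
    """Global compositional feature: average a per-residue
    property over the whole peptide, giving a single scalar."""
    return sum(table[a] for a in peptide) / len(peptide)
```

Any per-residue property table (antigenicity, structure preference, ...) can be swapped in for the hydropathy scale.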

Local Compositional Features
- Physicochemical features can be useful, but they are global, while the epitope is only a subset of the peptide
- Instead, consider a sliding window of a given length w
- Move the window along the peptide from left to right, averaging the property values over each window
- Concatenate the window averages to represent the peptide
- [Table: amino acid hydropathy values; example window averages 3.36, 2.5, −0.26, ..., 0.3]
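
The sliding-window variant of the compositional feature is a one-liner over any per-residue property table; `windowed_property` is an illustrative name and the two-letter toy table in the test is not a real property scale.

```python
def windowed_property(peptide, table, w):
    """Local compositional feature: average a per-residue property
    over each length-w window, sliding left to right, and return
    the concatenated window averages."""
    return [sum(table[a] for a in peptide[i:i + w]) / w
            for i in range(len(peptide) - w + 1)]
```

Unlike the global average, the resulting vector preserves where along the peptide a property peaks, which matters when the epitope is a short sub-segment.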

Orthogonal Encoding
- Orthonormal representation proposed by Qian (1988)
- Map each amino acid a_j ∈ p_i to a 20-long bit-vector v_j with a single 1 in the position corresponding to that amino acid
- x_i = v_0 v_1 ... v_{k−1} (concatenation) for a peptide of length k
- [Figure: rows of the orthogonal encoding matrix for C, E, H, R]
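
The orthogonal (one-hot) encoding can be sketched as follows; `orthogonal_encode` is an illustrative name, and the alphabetical residue ordering is an assumption.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def orthogonal_encode(peptide):
    """Concatenate one 20-long indicator vector per residue,
    giving a 20*k bit-vector for a peptide of length k."""
    vec = []
    for a in peptide:
        v = [0] * len(ALPHABET)
        v[ALPHABET.index(a)] = 1  # single 1 marks the residue
        vec.extend(v)
    return vec
```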

Property Encoding
- Orthogonality is not actually important in our application
- Replace the indicator vector with something more informative, e.g. a row from a BLOSUM or PAM matrix
- [Figure: BLOSUM62 rows for C, E, H, R]
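
Swapping the indicator vector for a matrix row changes only the lookup; `property_encode` is an illustrative name, and the tiny two-column rows in the test are placeholders for real 20-long BLOSUM62 rows.

```python
def property_encode(peptide, rows):
    """Like orthogonal encoding, but each residue contributes its
    row from a substitution or property matrix (e.g. BLOSUM62)
    instead of an indicator vector."""
    vec = []
    for a in peptide:
        vec.extend(rows[a])  # one matrix row per residue
    return vec
```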

AAIndex Encoding
- The Amino Acid Index (AAIndex) (Kawashima 2008) compiles a growing list of different physicochemical and biochemical properties of amino acids... 544 to date!
- Is it possible to make use of all this information?
- Use a non-linear factor matrix of the AAIndex (Nanni 2010)
- [Figure: property matrix rows for C, E, H, R]

Structural Features
- Consider how well IgG and a peptide fit together (poor vs. good shape complementarity)
- Approximate the native peptide conformation: choose the most common sidechain positions, then relax the energy
- Use the experimentally measured IgG conformation
- Compute the 2000 "best" dockings using ZDock (Chen, Li, and Wang 2003)
- The feature vector is given by a histogram of the docking scores
- [Figure: frequency-vs-score histograms for poor and good complementarity]
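
Turning a set of docking scores into a fixed-length histogram feature can be sketched as below; `score_histogram` and the bin range are illustrative assumptions, not the authors' binning.

```python
def score_histogram(scores, lo, hi, bins):
    """Structural feature: bin a list of docking scores (e.g. the
    2000 best ZDock scores) into a fixed-length count vector."""
    width = (hi - lo) / bins
    hist = [0] * bins
    for s in scores:
        # clamp the top edge so s == hi falls in the last bin
        i = min(int((s - lo) / width), bins - 1)
        hist[i] += 1
    return hist
```

A fixed-length histogram lets docking runs with different score distributions feed the same SVM feature space.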

Results (individual features vs. ensemble, and vs. other entries)

Feature                      AUROC   AUPR    ΔAUROC  ΔAUPR
k-spectrum                   0.85    0.70    -0.043  -0.072
Sparse Spatial Sample        0.87    0.73    -0.023  -0.042
Nonlinear Fisher Mat.*       0.86    0.69    -0.024  -0.082
Statistical Analysis Mat.*   0.85    0.67    -0.025  -0.102
BLOSUM Encoding              0.86    0.70    -0.024  -0.072
Local Composition            0.88    0.74    -0.013  -0.032
Structure                    0.74    0.53    -0.153  -0.242
Ensemble                     0.893   0.772
2nd Place                    0.892   0.766   -0.001  -0.006
3rd Place                    0.864   0.691   -0.029  -0.081
4th Place                    0.855   0.689   -0.038  -0.083

ΔAUROC/ΔAUPR are relative to the ensemble. * using various physicochemical features

Performance Curves: ROC [figure]

Performance Curves: Precision/Recall [figure]

Conclusions
- Many good features exist, and they capture some non-overlapping information
- Ensemble solutions, used properly, are effective
- Structure features are hard to compute; there is much room for improvement here
- Simple features should not be discounted: the local composition feature was the best single classifier, and we didn't encounter anyone using it in the literature!

Thanks Funding: NIH grant 1R21AI085376 and NSF grant 0849899 to C.K. For many interesting and useful conversations: Geet Duggal, Darya Filippova, Justin Malin, Guillaume Marçais, Saket Navlakha, Emre Sefer