Learning from Diversity


1 Learning from Diversity: Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines
Rob Patro and Carl Kingsford
Center for Bioinformatics and Computational Biology, University of Maryland
Nov. 16, 2010

2 Overview
Challenge: epitope-antibody recognition.
Solution: an ensemble of support vector machines, trained with a probabilistic extension.
Variety of feature classes: physicochemical properties, string kernels, structure.
Performance of individual methods and of the ensemble.

3 Problem Overview
The Challenge / The Details
[Figure: binding site and candidate peptide]
Measure binding affinity: $\mathrm{aff}(p_i) \in [0, 65536]$.
$C^+ = \{p_i \mid \mathrm{aff}(p_i) \in [10000, 65536]\}$: 6,841 binders.
$C^- = \{p_i \mid \mathrm{aff}(p_i) \in [0, 1000]\}$: 20,437 non-binders.
Learn a function to predict binding.
Binding with linear epitopes: a simpler sequence-affinity relation.
$f : P \to [0, 1]$, with $f(p_i) \ge 0.5 \Rightarrow p_i \in C^+$ and $f(p_i) < 0.5 \Rightarrow p_i \in C^-$.
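The class definitions and the decision rule on this slide can be written down directly. The following is a minimal sketch, not the authors' code; the function names `label_peptide` and `predict_class` are hypothetical, and peptides with an affinity strictly between 1000 and 10000 are assumed to belong to neither class.

```python
from typing import Optional

BINDER_MIN = 10000      # lower affinity bound for C+ (from the slide)
NON_BINDER_MAX = 1000   # upper affinity bound for C- (from the slide)


def label_peptide(affinity: float) -> Optional[int]:
    """Return 1 for a binder (C+), 0 for a non-binder (C-), None otherwise."""
    if affinity >= BINDER_MIN:
        return 1
    if affinity <= NON_BINDER_MAX:
        return 0
    # Assumption: intermediate affinities fall in neither class.
    return None


def predict_class(score: float) -> int:
    """Apply the 0.5 threshold to a predicted score f(p) in [0, 1]."""
    return 1 if score >= 0.5 else 0
```

Here `label_peptide` builds the training labels from measured affinities, and `predict_class` turns a score from the learned function into a class label.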

4 System Overview
Individual classifiers $f_0, f_1, \ldots, f_M$ are trained on various features.
Decision Trees, Boosted / Bagged / Random Forests, Naive Bayes, Logistic Regression, Maximum Entropy Classification, (Balanced) Winnow Classifiers, etc.
Support Vector Machines (SVM).
The scores of the classifiers are aggregated.
The aggregate produces a prediction for the binding class: $C^-$ (unlikely binder) or $C^+$ (likely binder).

5 Probabilistic SVMs
Ideally we want a confidence in each prediction (Platt 1999).
For each prediction, we obtain a posterior probability of $C^-$ or $C^+$.
This allows ranking of predictions by posterior.
It also aids in classifier combination.
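Platt's method maps a raw SVM decision value to a posterior through a fitted sigmoid, P(C+ | d) = 1 / (1 + exp(A d + B)). The sketch below only applies such a sigmoid and ranks predictions by the resulting posterior. The parameter values used in any call are hypothetical; in practice A and B are fitted on held-out decision values, and the slides do not say how that was done here.

```python
import math


def platt_probability(decision_value: float, a: float, b: float) -> float:
    """Map an SVM decision value to P(C+ | d) with a Platt-style sigmoid."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def rank_by_posterior(decision_values, a, b):
    """Return example indices ordered from most to least likely binder."""
    scored = [(platt_probability(d, a, b), i)
              for i, d in enumerate(decision_values)]
    scored.sort(reverse=True)
    return [i for _, i in scored]
```

With a negative `a`, larger decision values give larger posteriors, so `rank_by_posterior` puts the most confident positive predictions first.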

6 Combining Predictions
Probabilistic SVMs are trained on various features, using various kernels, with various parameters.
They are combined by weighted voting, using three terms:
the weight of the j-th classifier, based on cross-validation performance;
a normalizing constant;
the posterior of the positive class label for the j-th classifier.
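The voting formula itself did not survive on this slide, so the following is one plausible reading rather than the authors' exact rule: the ensemble score is the weighted sum of the per-classifier posteriors, divided by a normalizing constant. Taking that constant to be the sum of the weights is an assumption, as is the function name `weighted_vote`.

```python
def weighted_vote(posteriors, weights):
    """Combine per-classifier posteriors P_j(C+ | x) into one score in [0, 1].

    posteriors: posterior of the positive class from each classifier.
    weights:    non-negative weight per classifier, e.g. derived from
                cross-validation performance.
    """
    if len(posteriors) != len(weights):
        raise ValueError("need one weight per classifier")
    z = sum(weights)  # assumed normalizing constant
    if z <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, posteriors)) / z
```

Because every posterior lies in [0, 1] and the weights are normalized, the combined score also lies in [0, 1] and can be thresholded at 0.5.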

7 Choosing Features
To train SVMs we translate each peptide $p_i$ into a feature vector $x_i$. Good features are essential.
Good features should: be discriminative; lead to class separability; be efficient to compute.
Real features: capture partial information; separate data subsets; are often complementary.
Consider many useful features to gain predictive power.

8 Which Features?
Example peptide: ILAMRSHYPF
Sequence Features: k-spectrum kernel, mismatch kernel, substitution kernel, string subsequence kernel, sparse spatial sample kernel.
Physico / Biochemical Features: BLOSUM encoding, AAIndex encoding, local composition.
Structure Features: peptide/structure shape complementarity.

9 Sequence Features (String Kernels)
String kernels assign a similarity to a pair of strings.
K-spectrum kernel: consider all ($K$) k-mers that occur in the training set, and encode each peptide as a vector $v \in \mathbb{R}^K$ with
$v_j = 1$ if $p$ contains the $j$-th k-mer, and $v_j = 0$ otherwise;
or $v_j$ can encode the frequency of the $j$-th k-mer in the peptide.
Other string kernels: Mismatch kernel, Substitution kernel, Restricted gappy kernel, String subsequence kernel, Sparse Spatial Sample (SSS) kernel.
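The k-spectrum encoding described on this slide can be sketched in a few lines. The helper names below are hypothetical, and taking the kernel value as the dot product of two spectrum vectors is the usual definition rather than something the slide states.

```python
from collections import Counter


def kmers(peptide: str, k: int):
    """All overlapping k-mers of a peptide, left to right."""
    return [peptide[i:i + k] for i in range(len(peptide) - k + 1)]


def build_vocabulary(peptides, k: int):
    """Sorted list of the K distinct k-mers seen in the training set."""
    return sorted({m for p in peptides for m in kmers(p, k)})


def spectrum_vector(peptide: str, vocabulary, k: int, binary: bool = True):
    """Encode a peptide as presence/absence or frequency of each k-mer."""
    counts = Counter(kmers(peptide, k))
    if binary:
        return [1 if counts[m] > 0 else 0 for m in vocabulary]
    return [counts[m] for m in vocabulary]


def spectrum_kernel(u, v) -> int:
    """Similarity of two peptides: dot product of their spectrum vectors."""
    return sum(a * b for a, b in zip(u, v))
```

A k-mer that appears in a test peptide but not in the training vocabulary is simply ignored by this encoding.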

10 Compositional Features
Consider physicochemical properties of each peptide sequence: hydropathy, antigenicity, structure preference, etc.
Average the property over the entire peptide.
Map each peptide to a scalar $v \in \mathbb{R}$.
[Figure: amino acid hydropathy table; the per-residue values are averaged to a single scalar, 0.44 in the example]
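As an illustration of the global average, the sketch below uses the Kyte-Doolittle hydropathy scale. This is an assumption: the slide's table lost its values, so the scale the authors used, and the example value it produced, may differ.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the slide's own
# table did not survive).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def average_property(peptide: str, scale) -> float:
    """Map a peptide to one scalar: the mean property value of its residues."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    return sum(scale[aa] for aa in peptide) / len(peptide)
```

Any other per-residue scale (antigenicity, structure preference) can be passed as `scale` in the same way.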

11 Local Compositional Features
Physicochemical features can be useful but are global, and the epitope is only a subset of the peptide.
[Figure: amino acid hydropathy table]
Consider a sliding window of a given length $w$.
Move the window along the peptide from left to right.
Average the values over the window.
Concatenate the output to represent the peptide.
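The sliding-window steps above can be sketched as follows. The function name and the three-letter toy scale in the usage note are hypothetical; a real run would use a full 20-residue property scale.

```python
def local_composition(peptide: str, scale, w: int):
    """Average a per-residue property over each window of length w.

    Returns one value per window position, left to right; the list is
    the feature vector for the peptide.
    """
    if w < 1 or w > len(peptide):
        raise ValueError("window length must be between 1 and len(peptide)")
    values = [scale[aa] for aa in peptide]
    return [sum(values[i:i + w]) / w for i in range(len(values) - w + 1)]
```

For example, with the toy scale `{"A": 1.0, "G": 0.0, "K": -1.0}`, calling `local_composition("AGKA", scale, 2)` averages the windows AG, GK and KA. A peptide of length n yields n - w + 1 values, so peptides of equal length give feature vectors of equal length.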

12 Orthogonal Encoding
Orthonormal representation proposed by Qian (1988).
[Figure: orthogonal encoding matrix, with example residues C, E, H, R]
Map each amino acid $a_j \in p_i$ to a 20-long bit-vector $v_j$.
$x_i = v_0 v_1 \ldots v_{k-1}$ for a peptide of length $k$.
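A minimal sketch of the orthogonal (one-hot) encoding follows. The residue ordering in `AMINO_ACIDS` is an arbitrary assumption; any fixed ordering of the 20 amino acids gives an equivalent representation.

```python
# Assumed fixed ordering of the 20 standard amino acids.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def one_hot(aa: str):
    """20-long bit-vector with a single 1 at the residue's position."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(aa)] = 1
    return vec


def orthogonal_encoding(peptide: str):
    """Concatenate the bit-vectors of all residues: length 20 * k."""
    return [bit for aa in peptide for bit in one_hot(aa)]
```

A peptide of length k therefore maps to a vector of 20k bits containing exactly k ones.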

13 Property Encoding
Orthogonality is not actually important in our application.
Replace the indicator vector by something more informative, e.g. a row from a BLOSUM or PAM matrix.
[Figure: BLOSUM rows for example residues C, E, H, R]

14 AAIndex Encoding
The Amino Acid Index (AAIndex) (Kawashima 2008) compiles a growing list of different physicochemical and biochemical properties of amino acids.
[Figure: property matrix for example residues C, E, H, R, with K properties]
Is it possible to make use of all this information?
Use the non-linear factor matrix of AAIndex (Nanni 2010).

15 Structural Features
Consider how well IgG and a peptide fit together.
Approximate the native peptide conformation: choose the most common sidechain positions, then relax the energy.
Use the experimentally measured IgG conformation.
Compute the 2000 "best" dockings using ZDock (Chen, Li, and Weng 2003).
The feature vector is given by the histogram of docking scores.
[Figure: two histograms of frequency vs. docking score, one for poor shape complementarity and one for good shape complementarity]
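Turning a list of docking scores into a fixed-length feature vector amounts to binning them. The sketch below assumes the scores are already available as plain numbers (it does not run ZDock), and the score range, number of bins and normalization are all hypothetical choices not given on the slide.

```python
def score_histogram(scores, lo: float, hi: float, bins: int,
                    normalize: bool = True):
    """Bin docking scores into `bins` equal-width bins over [lo, hi].

    Scores outside the range are clamped into the first or last bin.
    With normalize=True the result is the fraction of scores per bin.
    """
    if bins < 1 or hi <= lo:
        raise ValueError("need bins >= 1 and hi > lo")
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in scores:
        idx = int((s - lo) / width)
        idx = min(max(idx, 0), bins - 1)  # clamp out-of-range scores
        counts[idx] += 1
    if normalize and scores:
        total = len(scores)
        return [c / total for c in counts]
    return counts
```

Using the same `lo`, `hi` and `bins` for every peptide keeps the feature vectors comparable across peptides.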

16 Results
[Table: AUROC and AUPR for each feature set, and vs. ensemble; the numeric values did not survive extraction]
Rows: k-spectrum; Sparse Spatial Sample; Nonlinear Fisher Mat; Statistical Analysis Mat; BLOSUM Encoding; Local Composition; Structure; ensemble.
Also listed: nd Place, rd Place, th Place, using various physicochemical features.
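For reference, AUROC is the probability that a randomly chosen binder receives a higher score than a randomly chosen non-binder, with ties counted as one half. The sketch below computes it directly from that definition; it is illustrative only and says nothing about how the numbers in the table were produced.

```python
def auroc(labels, scores) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for binders (C+), 0 for non-binders (C-).
    scores: predicted score per example, higher meaning more likely C+.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of examples, so for a data set of this size a rank-based computation would be the practical choice.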

17 Performance Curves ROC

18 Performance Curves P/R

19 Conclusions
Many good features exist, and they capture some non-overlapping information.
Ensemble solutions, used properly, are effective.
Structure features are hard to compute; there is much room for improvement here.
Simple features should not be discounted.
The local composition feature was the best single classifier. We didn't encounter anyone using this in the literature!

20 Thanks Funding: NIH grant 1R21AI and NSF grant to C.K. For many interesting and useful conversations: Geet Duggal, Darya Filippova, Justin Malin, Guillaume Marçais, Saket Navlakha, Emre Sefer
