2009 Ninth IEEE International Conference on Data Mining

Two Heads Better Than One: Metric+Active Learning and Its Applications for IT Service Classification

Fei Wang 1, Jimeng Sun 2, Tao Li 1, Nikos Anerousis 2
1 School of Computing and Information Sciences, Florida International University, Miami, FL 33199
2 IBM T. J. Watson Research Center, Service Department, Yorktown Heights, NY 10598

Abstract
Large IT service providers track service requests and their execution through problem/change tickets. It is important to classify the tickets based on the problem/change description in order to understand service quality and to optimize service processes. However, two challenges exist in solving this classification problem: 1) ticket descriptions from different classes have highly diverse characteristics, which invalidates most standard distance metrics; 2) it is very expensive to obtain high-quality labeled data. To address these challenges, we develop two seemingly independent methods, 1) Discriminative Neighborhood Metric Learning (DNML) and 2) Active Learning with Median Selection (ALMS), both of which are, however, based on the same core technique: iterated representative selection. A case study on a real IT service classification application is presented to demonstrate the effectiveness and efficiency of our proposed methods.

I. INTRODUCTION

In IT outsourcing environments, thousands of problem and change requests are generated every day on a diverse set of issues related to all kinds of software and hardware. Those requests need to be resolved correctly and quickly in order to meet service level agreements (SLAs). In this environment, when devices fail or applications need to be upgraded, the person recovering from the failure or applying the patch is likely to be sitting thousands of miles from the affected component. The reliance on outsourced technology support has fundamentally shifted the dependencies between participating organizations.

In such a complex environment, all service requests are handled and tracked through tickets by various problem and change management systems. A ticket is opened with a certain symptom description and routed to the appropriate Service Matter Experts (SMEs) for resolution. The solution is documented when the ticket is closed. The job of SMEs is to resolve tickets quickly and correctly in order to meet SLAs. Another important job role is that of Quality Analysts (QAs), whose responsibility is to analyze recent tickets in order to identify opportunities for service-quality improvement. For example, frequent password resets on one system may be due to an inconsistent password period on that system; instead of resetting the password every time, the password period should be adjusted properly. Likewise, a patch or fix for a particular server should sometimes also be applied to all servers with the same configuration, instead of creating multiple tickets with exactly the same work order for each of those systems. Identifying such optimization opportunities requires QAs to have a thorough understanding of the current ticket distribution. In other words, it is important to classify those tickets accurately and in a timely manner, based on their description and resolution.

In this paper, we address the ticket classification problem, which lies at the core of IT service quality improvement and carries significant practical impact as well as great challenges:
1) Standard distance metrics do not apply, due to the diverse characteristics of the raw features.
More specifically, ticket descriptions and solutions are highly diverse and noisy: different SMEs can describe the same problem quite differently, the descriptions are typically short, and, depending on the type of problem, the description can vary significantly.
2) There are almost no high-quality labeled data. Tickets are handled by SMEs, who often do not have the incentive or the ability to classify the tickets accurately, due to heavy workload and incomplete information. On the other hand, QAs, who have the right ability and incentive, do not have the cycles to manually label all the tickets.

To address these two challenges, we propose a novel hybrid approach that leverages both active learning and metric learning. The contributions of this paper are the following:
1) We propose Discriminative Neighborhood Metric Learning (DNML), which learns a domain-specific distance metric using the overall data distribution and a small set of labeled data.
2) We propose Active Learning with Median Selection (ALMS), which progressively selects the representative data points that need to be labeled and is naturally a multi-class algorithm.
3) We combine the metric learning and active learning steps into a unified process over the data. Moreover, our algorithm can automatically detect the number of classes contained in the data set.
4) We demonstrate the effectiveness and efficiency of DNML and ALMS on several data sets in comparison with several existing methods.
Figure 1. The basic algorithm flowchart, where the red blobs represent the labeled points and the gray blobs correspond to unlabeled points.

The rest of the paper is organized as follows: Section II presents the methods for metric and active learning. Section III demonstrates a practical case study of the proposed methods on the IT ticket classification application. Section IV reviews related work. Finally, Section V concludes.

II. METRIC+ACTIVE LEARNING

In this section we introduce our algorithm in detail. First we give an overview of the algorithm.

A. Algorithm Overview

The basic procedure of our algorithm is to iterate the following steps:
1) Learn a distance metric from the labeled data set, and then classify the unlabeled data points using the nearest neighbor classifier with the learned metric. We call the data points assigned to the same class a cluster.
2) Select the median from each cluster. For each cluster $X_i$, we partition it into a labeled set $X_i^L$ (whose labels are given initially) and an unlabeled set $X_i^U$ (whose labels are predicted by the nearest neighbor classifier). The median of $X_i$ is then defined as
$$m_i = \arg\min_{x \in X_i^U} \|x - c_i\|^2 \qquad (1)$$
where $c_i$ is the mean of $X_i$.
3) Add the selected points to the labeled set.
Fig.1 gives a graphical view of the basic algorithm flowchart; a small illustrative sketch of the median selection step is given below.
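To make the median selection of Eq.(1) concrete, the following is a minimal sketch in Python/NumPy (the function and variable names are ours, not from the paper): the cluster mean is computed over all points of the cluster, while the argmin is restricted to the unlabeled portion, so only points whose labels are still unknown are ever queried.

import numpy as np

def select_medians(X, clusters, labeled_mask):
    """For each cluster (given as an array of point indices), return the
    unlabeled point closest to the cluster mean, i.e. the median m_i of Eq.(1)."""
    medians = []
    for idx in clusters:
        idx = np.asarray(idx)
        c_i = X[idx].mean(axis=0)            # mean of the whole cluster X_i = X_i^L U X_i^U
        unlabeled = idx[~labeled_mask[idx]]  # restrict the argmin to X_i^U
        if unlabeled.size == 0:              # nothing left to query in this cluster
            continue
        dists = np.sum((X[unlabeled] - c_i) ** 2, axis=1)
        medians.append(int(unlabeled[np.argmin(dists)]))
    return medians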
B. Discriminative Neighborhood Metric Learning

As is well known, a good distance metric plays a central role in many data mining and machine learning algorithms (e.g., the nearest neighbor classifier and the k-means algorithm). The Euclidean distance usually cannot satisfy our requirements because it treats all feature dimensions homogeneously. Therefore many researchers propose to learn a Mahalanobis distance, which measures the distance between data points $x_i$ and $x_j$ by
$$d_m(x_i, x_j) = \sqrt{(x_i - x_j)^\top C (x_i - x_j)} \qquad (2)$$
where $C \in \mathbb{R}^{d \times d}$ is a positive semi-definite covariance matrix used to incorporate the correlations between different feature dimensions. In this paper, we consider learning a low-rank covariance matrix $C$ such that, under the learned distance metric, the within-class compactness and the between-class scatterness are maximized. Different from linear discriminant analysis [1], which seeks a discriminant subspace in a global sense, our algorithm aims to learn a distance metric with enhanced local discriminability. To define such local discriminability, we first introduce two types of neighborhoods [8]:

Definition 1 (Homogeneous Neighborhood): The homogeneous neighborhood of $x_i$, denoted $\mathcal{N}_i^o$, is the set of the $|\mathcal{N}_i^o|$ nearest data points of $x_i$ with the same label, where $|\mathcal{N}_i^o|$ is the size of $\mathcal{N}_i^o$.

Definition 2 (Heterogeneous Neighborhood): The heterogeneous neighborhood of $x_i$, denoted $\mathcal{N}_i^e$, is the set of the $|\mathcal{N}_i^e|$ nearest data points of $x_i$ with different labels, where $|\mathcal{N}_i^e|$ is the size of $\mathcal{N}_i^e$.

Based on the above two definitions, we can define the local compactness of point $x_i$ as
$$C_i = \sum_{j: x_j \in \mathcal{N}_i^o} d_m^2(x_i, x_j) \qquad (3)$$
and the local scatterness of point $x_i$ as
$$S_i = \sum_{k: x_k \in \mathcal{N}_i^e} d_m^2(x_i, x_k) \qquad (4)$$
Then the local discriminability of the data set $X$ with respect to the distance metric $d_m$ can be defined as
$$J = \frac{\sum_i C_i}{\sum_i S_i} = \frac{\sum_i \sum_{j: x_j \in \mathcal{N}_i^o} (x_i - x_j)^\top C (x_i - x_j)}{\sum_i \sum_{k: x_k \in \mathcal{N}_i^e} (x_i - x_k)^\top C (x_i - x_k)} \qquad (5)$$
The goal of our algorithm is to minimize $J$, which is equivalent to simultaneously minimizing the local compactness and maximizing the local scatterness. Fig.2 provides an intuitive graphical illustration of the idea behind our algorithm.

However, minimizing $J$ in Eq.(5) to obtain an optimal $C$ is not an easy task, since there are $d(d+1)/2$ variables to solve for, given that $C$ is symmetric. Recall that we require $C$ to be a low-rank, positive semi-definite matrix; then, by incomplete Cholesky factorization, we can decompose $C$ as
$$C = WW^\top \qquad (6)$$
where $W \in \mathbb{R}^{d \times r}$ and $r$ is the rank of $C$. In this way, we only need to solve for $W$ instead of the entire $C$.
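Concretely, with $C = WW^\top$ the squared Mahalanobis distance of Eq.(2) reduces to an ordinary squared Euclidean distance in the $r$-dimensional space spanned by the columns of $W$, so learning the metric amounts to learning a $d \times r$ linear projection. A minimal sketch (our own illustrative code, not the authors' implementation):

import numpy as np

def mahalanobis_sq(x_i, x_j, W):
    """Squared distance (x_i - x_j)^T C (x_i - x_j) with C = W W^T,
    computed as ||W^T (x_i - x_j)||^2 so that only the d x r factor W is needed."""
    z = W.T @ (x_i - x_j)
    return float(z @ z)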
Figure 2. The homogeneous and heterogeneous neighborhoods of $x_i$ under (a) the regular Euclidean distance and (b) the learned Mahalanobis distance metric. The dark green blob in the middle of the circle is $x_i$, and the green and blue blobs correspond to the points in $\mathcal{N}_i^o$ and $\mathcal{N}_i^e$. The goal of DNML is to learn a distance metric that pulls the points in $\mathcal{N}_i^o$ towards $x_i$ while pushing the points in $\mathcal{N}_i^e$ away from $x_i$.

Combining Eq.(6) and Eq.(5), we can derive the following optimization problem
$$\min_W \frac{\mathrm{tr}(W^\top M_C W)}{\mathrm{tr}(W^\top M_S W)} \qquad (7)$$
where
$$M_C = \sum_i \sum_{j: x_j \in \mathcal{N}_i^o} (x_i - x_j)(x_i - x_j)^\top \qquad (8)$$
$$M_S = \sum_i \sum_{k: x_k \in \mathcal{N}_i^e} (x_i - x_k)(x_i - x_k)^\top \qquad (9)$$
are the compactness and scatterness matrices, respectively. Problem (7) is therefore a trace quotient minimization problem, and we can make use of the decomposed Newton's method [3], summarized in Table I.

Table I
DECOMPOSED NEWTON'S PROCEDURE FOR SOLVING THE TRACE RATIO PROBLEM
Input: matrices $M_C$ and $M_S$, precision $\varepsilon$, dimension $d$
Output: trace ratio value $\lambda^*$ and matrix $W$
Procedure:
1. Initialize $\lambda_0 = 0$, $t = 0$
2. Perform an eigenvalue decomposition of $M_C - \lambda_t M_S$
3. Let $(\beta_k(\lambda), w_k(\lambda))$ be the $k$-th eigenvalue-eigenvector pair obtained in step 2, and define the first-order Taylor expansion $\hat\beta_k(\lambda) = \beta_k(\lambda_t) + \beta_k'(\lambda_t)(\lambda - \lambda_t)$, where $\beta_k'(\lambda_t) = -w_k(\lambda_t)^\top M_S w_k(\lambda_t)$
4. Define $\hat f_d(\lambda)$ as the sum of the smallest $d$ values $\hat\beta_k(\lambda)$, solve $\hat f_d(\lambda) = 0$, and set the root to be $\lambda_{t+1}$
5. If $|\lambda_{t+1} - \lambda_t| < \varepsilon$, go to step 6; otherwise set $t = t + 1$ and go to step 2
6. Output $\lambda^* = \lambda_t$, and $W$ as the eigenvectors corresponding to the smallest $d$ eigenvalues of $M_C - \lambda^* M_S$
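To connect Eqs.(8)-(9) with the procedure of Table I, the sketch below builds the compactness and scatterness matrices from labeled data and runs a simplified decomposed Newton iteration for the trace quotient problem (7). It is an illustration under our own assumptions (brute-force Euclidean neighborhood search, dense matrices, no numerical safeguards), not the authors' implementation; note that solving the linearized equation in step 4 of Table I reduces in closed form to re-evaluating the trace ratio at the current W.

import numpy as np

def build_scatter_matrices(X, y, n_o=3, n_e=3):
    """M_C and M_S of Eqs.(8)-(9), with neighborhoods found by Euclidean distance."""
    n, d = X.shape
    M_C = np.zeros((d, d))
    M_S = np.zeros((d, d))
    for i in range(n):
        diff = X - X[i]
        dist = np.einsum('ij,ij->i', diff, diff)
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        other = np.where(y != y[i])[0]
        homo = same[np.argsort(dist[same])[:n_o]]      # homogeneous neighborhood N_i^o
        hete = other[np.argsort(dist[other])[:n_e]]    # heterogeneous neighborhood N_i^e
        for j in homo:
            M_C += np.outer(X[i] - X[j], X[i] - X[j])
        for k in hete:
            M_S += np.outer(X[i] - X[k], X[i] - X[k])
    return M_C, M_S

def trace_ratio_newton(M_C, M_S, r, eps=1e-6, max_iter=100):
    """Decomposed Newton iteration for min_W tr(W^T M_C W) / tr(W^T M_S W)."""
    lam = 0.0                                          # step 1: lambda_0 = 0
    W = None
    for _ in range(max_iter):
        _, V = np.linalg.eigh(M_C - lam * M_S)         # step 2: eigen-decomposition
        W = V[:, :r]                                   # eigenvectors of the r smallest eigenvalues
        # steps 3-4: the root of the linearized objective is the current trace ratio
        lam_new = np.trace(W.T @ M_C @ W) / np.trace(W.T @ M_S @ W)
        if abs(lam_new - lam) < eps:                   # step 5: convergence check
            break
        lam = lam_new
    return lam, W                                      # step 6: lambda* and W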
C. Active Learning with Median Selection

Recently, a novel active learning method called Transductive Experimental Design (TED) [11] was proposed, which aims to select the $k$ most representative points in the data set. Despite the theoretical soundness and empirical success of TED, it still has some limitations:
1) Although the name suggests that TED is transductive, it does not make use of any label information contained in the data set. In fact, TED just uses the whole data set (labeled and unlabeled) to select the $k$ most representative points, such that the linear reconstruction loss of the whole data set from the selected points is minimized. In this sense, TED is an unsupervised method.
2) As the authors analyze in [11], TED tends to select data points with large norms (which the authors argue are hard to predict). However, such points lie on the border of the data distribution and could be outliers that would mislead the classification process.
Based on the above analysis, we propose to (1) make use of the label information and (2) select the representative points locally. Specifically, we first learn a distance metric using the DNML method introduced in the last section and then apply the nearest neighbor classifier to classify the unlabeled points. In this way, the whole data set is partitioned into several classes, and from each class we simply select the median point as defined in Eq.(1).

Figure 3. Active learning results on a toy data set with four classes. (a) the points selected by transductive experimental design; (b) the points selected by our local median selection method.

Fig.3 illustrates a toy example of the difference between TED and our local median selection method. The data set here is generated from 4 Gaussians, and we treat each Gaussian as a class. Initially we randomly label 1% of the data points and use TED to select the 4 most representative data points, shown as black triangles in Fig.3(a); these points all lie on the border of the Gaussians. Fig.3(b) shows the results of our local median selection method, where we first apply DNML to learn a proper distance metric from the labeled points, then use this metric to classify the whole data set, and finally select one median from each class. From the figure we observe that the selected points are representative of each Gaussian.

An issue worth mentioning is that our algorithm can in fact be viewed as an approximate version of a local TED, where we first partition the data set into several local regions using the learned distance metric and then select exactly one representative point in each region. As the data mean is the most representative point for a set of data in the sense of Euclidean loss, we select the median, i.e., the candidate point closest to the data mean. The whole algorithm procedure is summarized in Table II, and a compact sketch follows the table.

Table II
THE METRIC+ACTIVE LEARNING ALGORITHM
Inputs: training data, $|\mathcal{N}_i^o|$, $|\mathcal{N}_i^e|$, precision $\epsilon$, dimension $d$, number of iteration steps $T$
Outputs: the selected points and the learned $W$
Procedure:
for $t = 1 : T$
1. Construct $M_S$ and $M_C$ from the training data
2. Learn a proper distance metric
3. Count the number of classes $k$ in the training data, and apply the learned metric to classify the unlabeled data using the nearest neighbor classifier
4. Select the median in each class and add these medians to the training data pool
end
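Putting the pieces together, below is a compact sketch of the loop in Table II. It reuses the illustrative helpers sketched earlier (build_scatter_matrices, trace_ratio_newton, select_medians); y_init, oracle, and the parameter names are our own assumptions, with oracle standing in for the human annotator who labels the queried medians.

import numpy as np

def metric_active_learning(X, y_init, oracle, T, r, n_o=3, n_e=3):
    """Sketch of the Metric+Active Learning loop of Table II.
    y_init holds labels for the initially labeled points and -1 elsewhere;
    oracle(i) returns the true label of ticket i."""
    y = y_init.copy()
    W = None
    for _ in range(T):
        labeled = np.where(y >= 0)[0]
        # steps 1-2: learn the DNML metric from the currently labeled data
        M_C, M_S = build_scatter_matrices(X[labeled], y[labeled], n_o, n_e)
        _, W = trace_ratio_newton(M_C, M_S, r)
        # step 3: nearest-neighbor classification of the unlabeled data under the metric
        Z = X @ W
        pred = y.copy()
        for i in np.where(y < 0)[0]:
            d = np.sum((Z[labeled] - Z[i]) ** 2, axis=1)
            pred[i] = y[labeled[np.argmin(d)]]
        # step 4: select one median per class (Eq.(1)) and query its label
        clusters = [np.where(pred == c)[0] for c in np.unique(pred)]
        for m in select_medians(X, clusters, labeled_mask=(y >= 0)):
            y[m] = oracle(m)
    return y, W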
III. TICKET CLASSIFICATION: A CASE STUDY

In this section we present detailed experimental results from applying our proposed active learning scheme to ticket classification. First we describe the basic characteristics of the data set.

A. The Data Set

There are 4182 tickets in total, drawn from 27 classes. We use bag-of-words features, which results in a 3882-dimensional feature space. After eliminating duplicate and null tickets, 2222 tickets remain. The class distribution is shown in Fig.4(a), from which we can observe that the classes are highly imbalanced and that there are many rare classes with only a few data points. We identify a class as a rare class if and only if the number of data points it contains is less than 2. In our experiments we eliminate those rare classes, which results in a data set of size 2161 drawn from the remaining classes; the resulting class distribution is shown in Fig.4(b). Besides rare classes, we also observe that the data set is highly sparse and contains a set of rare features. The original feature distribution is shown in Fig.5(a), where we accumulate the number of times each feature appears in each class. We identify a feature as a rare feature if and only if the total number of times it appears in the data set is less than 1. After eliminating those rare features, we obtain a data set with 669 features, whose distribution is shown in Fig.5(b). Finally, we also eliminate the tickets that contain only rare features, which leaves a final data set of 213 tickets with 669 features.

Figure 4. Class distribution of the original data and of the data with no rare classes. (a) Original data; (b) distribution after eliminating the rare classes.

Figure 5. Feature distribution of the original data and of the data with rare features eliminated. (a) Original feature distribution; (b) feature distribution with no rare features.
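As a rough illustration of the preprocessing just described, the sketch below drops rare classes and rare features and then removes tickets that are left empty; the thresholds are placeholders chosen for the sketch, since the exact cut-off values are not clearly legible in the source.

import numpy as np

def filter_rare(X, y, min_class_size=20, min_feature_count=10):
    """X: ticket-by-word bag-of-words count matrix, y: ticket class labels.
    Returns the data with rare classes, rare features, and empty tickets removed."""
    classes, counts = np.unique(y, return_counts=True)
    keep_ticket = np.isin(y, classes[counts >= min_class_size])   # drop rare classes
    X, y = X[keep_ticket], y[keep_ticket]
    keep_feat = X.sum(axis=0) >= min_feature_count                # drop rare features
    X = X[:, keep_feat]
    nonempty = X.sum(axis=1) > 0                                  # drop tickets with no features left
    return X[nonempty], y[nonempty]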
B. Distance Metric Learning

In this part of the experiments, we first test the effectiveness of our DNML algorithm on the ticket data set: we use the algorithm to learn a distance metric, and then use this metric to perform nearest neighbor classification and obtain the final classification accuracy. This procedure is repeated over a number of independent runs, and we report the average classification accuracies and standard deviations in Fig.6. The sizes of the homogeneous and heterogeneous neighborhoods are set to 3 manually, and the rank of the covariance matrix $C$ is set to 4.

From the figure we observe the superiority of our metric learning method: with the learned metric, DNML clearly outperforms the original NN method, which validates that DNML learns a better distance metric.

There are two sets of parameters in our DNML method: the rank $r$ of the covariance matrix $C$, and the sizes of the homogeneous neighborhood $\mathcal{N}^o$ and the heterogeneous neighborhood $\mathcal{N}^e$ (denoted $n^o$ and $n^e$). We therefore also conducted a set of experiments to test the sensitivity of DNML with respect to these parameters. Fig.7 shows how the algorithm's performance varies with the rank of the covariance matrix $C$, where we randomly label half of the tickets as the training set and use the remaining tickets for testing; the sizes of $\mathcal{N}^o$ and $\mathcal{N}^e$ are set to 3. The results in Fig.7 are averaged over the independent runs. From the figure we can see that
the final ticket classification results are stable with respect to the choice of the rank of the covariance matrix, except when the rank is too small (i.e., 1 in our case), since in that case too much information is lost. When the rank becomes too large, some of the noise contained in the data set is retained, so the performance of our algorithm drops slightly; choices of $r$ in [2, 4] are all reasonable.

Figure 6. Classification accuracy comparison of different supervised learning methods (DNML, NN, NB, RLS, and SVM). The x-axis represents the percentage of randomly labeled tickets, and the y-axis denotes the averaged classification accuracy.

Figure 7. The sensitivity of the performance of our algorithm with respect to the rank $r$ of the covariance matrix $C$. We set $|\mathcal{N}_i^o| = |\mathcal{N}_i^e| = 3$, and half of the data set is labeled as training data.

We also test the sensitivity of our algorithm with respect to the choices of the sizes of $\mathcal{N}_i^o$ and $\mathcal{N}_i^e$; the results are shown in Fig.8, where the x-axis and y-axis correspond to the sizes of $\mathcal{N}_i^o$ and $\mathcal{N}_i^e$, and the z-axis denotes the classification accuracy averaged over the independent runs. Here we assume that the sizes of the homogeneous and heterogeneous neighborhoods are the same for all data points. For each run, we randomly label half of the tickets as training data and use the rest for testing. From Fig.8 we can see that the whole surface $z = f(x, y)$ is rather flat, which means that the performance of our algorithm is not very sensitive to the variation of $|\mathcal{N}_i^o|$ and $|\mathcal{N}_i^e|$. We can also see that when the neighborhood sizes are small, the algorithm performs better than when they are large. This is possibly because the distribution of the data set is complicated and data in different classes overlap heavily; when we enlarge the neighborhoods to include more data points, the learned distance metric may be corrupted by noisy points, which makes the final classification results less accurate.

Figure 8. The sensitivity of the performance of our algorithm with respect to the choices of $|\mathcal{N}_i^o|$ and $|\mathcal{N}_i^e|$; half of the data set is labeled as training data.

C. Integrated Active Learning and Distance Metric Learning

In our implementation, we initially label 2% of the data set and then apply the various active learning methods. For each method, we select as many points from the unlabeled set in each round as there are classes in the ticket data set. For all the approaches that use DNML, we set $|\mathcal{N}^o| = |\mathcal{N}^e| = 3$, and the rank of the covariance matrix is set to 4. Fig.9 shows the results of these algorithms summarized over the independent runs, where the x-axis represents the percentage of selected points, and the y-axis denotes the averaged classification accuracy together with the standard deviation. From the figure we can clearly see that with our DNML+LMED method the classification accuracy ascends faster than with the other methods.

IV. RELATED WORKS

In this section we briefly review previous work that is closely related to our metric+active learning method.

A. Distance Metric Learning

Distance metric learning plays a central role in many real-world applications.
According to [10], these approaches can mainly be categorized into two classes: unsupervised and supervised. Here we mainly review the supervised methods, which learn a distance metric from a data set with some supervised information.
Figure 9. Classification accuracy vs. the percentage of actively labeled tickets for DNML+LMED, LMED, DNML+Rand, and DNML+TED. The y-axis represents the classification accuracy averaged over the independent runs.

Usually the supervised information takes the form of pairwise constraints, which indicate whether a pair of data points belongs to the same class (usually referred to as must-link constraints) or to different classes (cannot-link constraints). These algorithms then aim to learn a proper distance metric under which the data points with must-link constraints are as close as possible, while the data points with cannot-link constraints are far apart from each other. Typical approaches include the side-information method [9], Relevant Component Analysis (RCA) [5], and Discriminative Component Analysis (DCA) [2]. Our Discriminative Neighborhood Metric Learning (DNML) method can also be viewed as a supervised method; however, we make use of the labeled data together with their labels, which is different from using pairwise constraints.

B. Active Learning

In many real-world problems, unlabeled data are abundant but labeled data are expensive to obtain (e.g., in text classification it is expensive and time consuming to ask users to label documents manually, whereas it is easy to obtain a large amount of unlabeled documents by crawling the web). In such a scenario the learning algorithm can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples needed to learn a concept can often be much lower than the number required in standard supervised learning. Two classical active learning algorithms are Tong and Koller's simple SVM algorithm [6] and Seung et al.'s Query By Committee (QBC) algorithm [4]. However, the simple SVM algorithm is coupled with the Support Vector Machine (SVM) classifier [7] and is only applicable to two-class problems. For the QBC algorithm, one needs to construct a committee of models that represent different regions of the version space and define some measure of disagreement among committee members, which is usually difficult in real-world applications. Recently, Yu et al. [11] proposed another active learning algorithm called Transductive Experimental Design (TED), which aims to find the most representative points that can optimally reconstruct the whole data set in terms of Euclidean reconstruction loss. Our median selection strategy is similar in spirit to TED, and we analyzed the advantages of our approach over TED in Section II-C.

V. CONCLUSIONS

We presented a novel metric+active learning method for IT service ticket classification. Our method combines the strengths of both metric learning and active learning. Experimental results on benchmark and real ticket data sets were presented to demonstrate the effectiveness of the proposed method.

ACKNOWLEDGEMENT

The work is partially supported by NSF CAREER Award IIS-0546280 and a 2008 IBM Faculty Award.

REFERENCES

[1] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, California, 1990.
[2] S. Hoi, W. Liu, M. Lyu, and W. Ma. Learning distance metrics with contextual constraints for image retrieval. In Proceedings of CVPR, 2006.
[3] Y. Jia, F. Nie, and C. Zhang. Trace ratio problem revisited. IEEE Transactions on Neural Networks, 2009.
[4] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of COLT, pages 287-294, 1992.
[5] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proceedings of ECCV, pages 776-790, 2002.
[6] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45-66, 2001.
[7] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA, 1995.
[8] F. Wang and C. Zhang. Feature extraction by maximizing the neighborhood margin. In Proceedings of CVPR, 2007.
[9] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems, volume 15, pages 505-512, 2003.
[10] L. Yang. Distance metric learning: A comprehensive survey. Technical report, Michigan State University, 2006.
[11] K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. In Proceedings of ICML, pages 1081-1088, 2006.