Listwise Approach to Learning to Rank - Theory and Algorithm
Fen Xia*, Institute of Automation, Chinese Academy of Sciences, Beijing, P. R. China.
Tie-Yan Liu, Microsoft Research Asia, Sigma Center, No. 49 Zhichun Road, Haidian District, Beijing, P. R. China.
Jue Wang, Wensheng Zhang, Institute of Automation, Chinese Academy of Sciences, Beijing, P. R. China.
Hang Li, Microsoft Research Asia, Sigma Center, No. 49 Zhichun Road, Haidian District, Beijing, P. R. China.

Abstract

This paper aims to conduct a study on the listwise approach to learning to rank. The listwise approach learns a ranking function by taking individual lists as instances and minimizing a loss function defined on the predicted list and the ground-truth list. Existing work on the approach mainly focused on the development of new algorithms; methods such as RankCosine and ListNet have been proposed and good performances have been observed. Unfortunately, the underlying theory was not sufficiently studied so far. To amend the problem, this paper proposes conducting theoretical analysis of learning to rank algorithms through investigations of the properties of the loss functions, including consistency, soundness, continuity, differentiability, convexity, and efficiency. A sufficient condition on consistency for ranking is given, which seems to be the first such result obtained in related research. The paper then conducts analysis on three loss functions: likelihood loss, cosine loss, and cross entropy loss. The latter two were used in RankCosine and ListNet. The use of the likelihood loss leads to the development of a new listwise method called ListMLE, whose loss function offers better properties, and also leads to better experimental results.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).
*The work was performed when the first author was an intern at Microsoft Research Asia.

1. Introduction

Ranking, which is to sort objects based on certain factors, is the central problem of applications such as information retrieval (IR) and information filtering. Recently machine learning technologies called learning to rank have been successfully applied to ranking, and several approaches have been proposed, including the pointwise, pairwise, and listwise approaches.

The listwise approach addresses the ranking problem in the following way. In learning, it takes ranked lists of objects (e.g., ranked lists of documents in IR) as instances and trains a ranking function through the minimization of a listwise loss function defined on the predicted list and the ground truth list. The listwise approach captures the ranking problems, particularly those in IR, in a conceptually more natural way than previous work. Several methods such as RankCosine and ListNet have been proposed. Previous experiments demonstrate that the listwise approach usually performs better than the other approaches (Cao et al., 2007) (Qin et al., 2007).

Existing work on the listwise approach mainly focused on the development of new algorithms, such as RankCosine and ListNet. However, there was no sufficient theoretical foundation laid down. Furthermore, the strength and limitation of the algorithms, and the relations between the proposed algorithms, were still
not clear. This largely prevented us from deeply understanding the approach and, more critically, from devising more advanced algorithms. In this paper, we aim to conduct an investigation on the listwise approach.

First, we give a formal definition of the listwise approach. In ranking, the input is a set of objects, the output is a permutation of the objects¹, and the model is a ranking function which maps a given input to an output. In learning, the training data is drawn i.i.d. according to an unknown but fixed joint probability distribution between input and output. Ideally we would minimize the expected 0-1 loss defined on the predicted list and the ground truth list. Practically we instead manage to minimize an empirical surrogate loss with respect to the training data.

Second, we evaluate a surrogate loss function from four aspects: (a) consistency, (b) soundness, (c) mathematical properties of continuity, differentiability, and convexity, and (d) computational efficiency in learning. We give analysis on three loss functions: likelihood loss, cosine loss, and cross entropy loss. The first one is newly proposed in this paper, and the last two were used in RankCosine and ListNet, respectively.

Third, we propose a novel method for the listwise approach, which we call ListMLE. ListMLE formalizes learning to rank as a problem of minimizing the likelihood loss function, equivalently maximizing the likelihood function of a probability model. Due to the nice properties of the loss function, ListMLE stands to be more effective than RankCosine and ListNet.

Finally, we have experimentally verified the correctness of the theoretical findings. We have also found that ListMLE can significantly outperform RankCosine and ListNet.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 gives a formal definition of the listwise approach. Section 4 conducts theoretical analysis of listwise loss functions. Section 5 introduces the ListMLE method. Experimental results are reported in Section 6, and the conclusion and future work are given in the last section.

¹ In this paper, we use permutation and ranked list interchangeably.

2. Related Work

Existing methods for learning to rank fall into three categories. The pointwise approach (Nallapati, 2004) transforms ranking into regression or classification on single objects. The pairwise approach (Herbrich et al., 1999) (Freund et al., 1998) (Burges et al., 2005) transforms ranking into classification on object pairs. The advantage of these two approaches is that existing theories and algorithms on regression or classification can be directly applied, but the problem is that they do not model the ranking problem in a straightforward fashion. The listwise approach can overcome this drawback by tackling the ranking problem directly, as explained below.

For instance, Cao et al. (2007) proposed one of the first listwise methods, called ListNet. In ListNet, the listwise loss function is defined as the cross entropy between two parameterized probability distributions of permutations; one is obtained from the predicted result and the other from the ground truth. Qin et al. (2007) proposed another method called RankCosine. In this method, the listwise loss function is defined on the basis of the cosine similarity between two score vectors, from the predicted result and the ground truth. Experimental results show that the listwise approach usually outperforms the pointwise and pairwise approaches.
In this paper, we aim to investigate the listwise approach to learning to rank, particularly from the viewpoint of loss functions. Actually, similar investigations have been conducted for classification. For instance, in classification, consistency and soundness of loss functions are well studied. Consistency forms the basis for the success of a loss function. It is known that if a loss function is consistent, then the learned classifier can achieve the optimal Bayes error rate in the large sample limit. Many well-known loss functions such as hinge loss, exponential loss, and logistic loss are all consistent (cf., (Zhang, 2004) (Bartlett et al., 2003) (Lin, 2002)). Soundness of a loss function guarantees that the loss can represent well the targeted learning problem. That is, an incorrect prediction should receive a larger penalty than a correct prediction, and the penalty should reflect the confidence of the prediction. For example, hinge loss, exponential loss, and logistic loss are sound for classification. In contrast, square loss is sound for regression but not for classification (Hastie et al., 2001).

3. Listwise Approach

We give a formal definition of the listwise approach to learning to rank.

² In a broad sense, methods directly optimizing evaluation measures, such as SVM-MAP (Yue et al., 2007) and AdaRank (Xu & Li, 2007), can also be regarded as listwise algorithms. We will, however, limit our discussions in this paper to algorithms like ListNet and RankCosine.
Let X be the input space whose elements are sets of objects to be ranked, Y be the output space whose elements are permutations of objects, and P_{XY} be an unknown but fixed joint probability distribution of X and Y. Let h : X -> Y be a ranking function, and H be the corresponding function space (i.e., h ∈ H). Let x ∈ X and y ∈ Y, and let y(i) be the index of the object which is ranked at position i. The task is to learn a ranking function that can minimize the expected loss R(h), defined as

    R(h) = \int_{X \times Y} l(h(x), y) \, dP(x, y),    (1)

where l(h(x), y) is the 0-1 loss function such that

    l(h(x), y) = \begin{cases} 1, & h(x) \neq y \\ 0, & h(x) = y. \end{cases}    (2)

That is to say, we formalize the ranking problem as a new classification problem on permutations. If the permutation of the predicted result is the same as the ground truth, then we have zero loss; otherwise we have one loss. In real ranking applications, the loss can be cost-sensitive, i.e., depending on the positions of the incorrectly ranked objects. We will leave this as future work and focus on the 0-1 loss in this paper first. Actually, in the literature of classification, people also studied the 0-1 loss first, before they eventually moved on to the cost-sensitive case.

It is easy to see that the optimal ranking function which minimizes the expected loss, R(h_B) = inf_h R(h), is given by the Bayes rule,

    h_B(x) = \arg\max_{y \in Y} P(y \mid x).    (3)

Since P_{XY} is unknown, formula (1) cannot be directly solved and thus h_B(x) cannot be easily obtained. In practice, we are given independently and identically distributed (i.i.d.) samples S = {(x^{(i)}, y^{(i)})}_{i=1}^{m} ~ P_{XY}, and we instead try to obtain a ranking function h ∈ H that minimizes the empirical loss

    R_S(h) = \frac{1}{m} \sum_{i=1}^{m} l(h(x^{(i)}), y^{(i)}).    (4)

Note that for efficiency considerations, in practice the ranking function usually works on individual objects. It assigns a score to each object (by employing a scoring function g), sorts the objects in descending order of the scores, and finally creates the ranked list. That is to say, h(x^{(i)}) is decomposable with respect to objects. It is defined as

    h(x^{(i)}) = \mathrm{sort}(g(x^{(i)}_1), \ldots, g(x^{(i)}_{n_i})),    (5)

where x^{(i)}_j ∈ x^{(i)}, n_i denotes the number of objects in x^{(i)}, g(·) denotes the scoring function, and sort(·) denotes the sorting function. As a result, (4) becomes

    R_S(g) = \frac{1}{m} \sum_{i=1}^{m} l(\mathrm{sort}(g(x^{(i)}_1), \ldots, g(x^{(i)}_{n_i})), y^{(i)}).    (6)

Due to the nature of the sorting function and the 0-1 loss function, the empirical loss in (6) is inherently non-differentiable with respect to g, which poses a challenge to its optimization. To tackle this problem, we can introduce a surrogate loss as an approximation of (6), following a common practice in machine learning:

    R^{\phi}_S(g) = \frac{1}{m} \sum_{i=1}^{m} \phi(g(x^{(i)}), y^{(i)}),    (7)

where φ is a surrogate loss function and g(x^{(i)}) = (g(x^{(i)}_1), \ldots, g(x^{(i)}_{n_i})). For convenience in notation, in the following sections we sometimes write φ_y(g) for φ(g(x), y) and use bold symbols such as g to denote vectors, since for a given x, g(x) becomes a vector.
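To make the decomposition in (5) and the empirical surrogate risk in (7) concrete, here is a minimal Python sketch of how a scoring function g, the sorting step, and a generic surrogate loss fit together. The linear scoring function and the surrogate_loss argument are illustrative assumptions, not part of the paper.

    import numpy as np

    def rank_with_scores(g, x):
        """Eq. (5): score each object with g, then sort indices by descending score."""
        scores = np.array([g(obj) for obj in x])
        return np.argsort(-scores)           # predicted permutation (object indices)

    def empirical_surrogate_risk(g, data, surrogate_loss):
        """Eq. (7): average surrogate loss phi(g(x), y) over the training sample."""
        total = 0.0
        for x, y in data:                     # y is the ground-truth permutation of indices
            scores = np.array([g(obj) for obj in x])
            total += surrogate_loss(scores, y)
        return total / len(data)

    # Illustrative linear scoring function (an assumption made only for this sketch).
    w = np.array([0.5, -0.2])
    g = lambda obj: float(np.dot(w, obj))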
4. Theoretical Analysis

4.1. Properties of Loss Functions

We analyze the listwise approach from the viewpoint of the surrogate loss function. Specifically, we look at the following properties³ of it: (a) consistency, (b) soundness, (c) continuity, differentiability, and convexity, and (d) computational efficiency in learning.

Consistency is about whether the obtained ranking function can converge to the optimal one through the minimization of the empirical surrogate loss (7), when the training sample size goes to infinity. It is a necessary condition for a surrogate loss function to be a good one for a learning algorithm (cf., Zhang (2004)).

Soundness is about whether the loss function can indeed represent loss in ranking. For example, an incorrect ranking should receive a larger penalty than a correct ranking, and the penalty should reflect the confidence of the ranking. This property is particularly important when the size of training data is small, because it can directly affect the training results.

³ In addition, convergence rate is another issue to consider. We leave it as future work.

4.2. Consistency

We conduct analysis on learning to rank algorithms from the viewpoint of consistency. As far as we know, this is the first work discussing the consistency issue for ranking.
In the large sample limit, minimizing the empirical surrogate loss (7) amounts to minimizing the following expected surrogate loss:

    R^{\phi}(g) = E_{X,Y}\{\phi_y(g(x))\} = E_X\{Q(g(x))\},    (8)

where Q(g(x)) = \sum_{y \in Y} P(y|x)\,\phi_y(g(x)). Here we assume g(x) is chosen from a vector Borel measurable function set, whose elements can take any value from Ω ⊆ R^n. When the minimization of (8) leads to the minimization of the expected 0-1 loss (1), we say the surrogate loss function is consistent. An equivalent definition can be found in Definition 2. Actually, this equivalence relationship has been discussed in related work on the consistency of classification (Zhang, 2004).

Definition 1. We define Λ_Y as the space of all possible probability distributions on the permutation space Y, i.e., Λ_Y = {p ∈ R^{|Y|} : \sum_{y \in Y} p_y = 1, p_y ≥ 0}.

Definition 2. The loss φ_y(g) is consistent on a set Ω ⊆ R^n with respect to the ranking loss (1), if the following condition holds: ∀p ∈ Λ_Y, let y* = \arg\max_{y \in Y} p_y and let Y^c_{y*} denote the space of permutations after removing y*; then

    \inf_{g \in \Omega} Q(g) < \inf_{g \in \Omega,\ \mathrm{sort}(g) \in Y^c_{y*}} Q(g).

We next give sufficient conditions for consistency in ranking.

Definition 3. A permutation probability space Λ_Y is order preserving with respect to objects i and j, if the following condition holds: ∀y ∈ Y_{i,j} = {y ∈ Y : y^{-1}(i) < y^{-1}(j)}, where y^{-1}(i) denotes the position of object i in y, denote by σ^{-1}y the permutation which exchanges the positions of objects i and j while keeping the others unchanged; then we have p_y > p_{σ^{-1}y}.

Definition 4. The loss φ_y(g) is order sensitive on a set Ω ⊆ R^n, if φ_y(g) is a non-negative differentiable function and the following two conditions hold:

1. ∀y ∈ Y, ∀i < j, denote by σy the permutation which exchanges the object at position i and that at position j while keeping the others unchanged. If g_{y(i)} < g_{y(j)}, then φ_y(g) ≥ φ_{σy}(g), and with at least one y, the strict inequality holds.

2. If g_i = g_j, then either ∀y ∈ Y_{i,j}, ∂φ_y(g)/∂g_i ≤ ∂φ_y(g)/∂g_j, or ∀y ∈ Y_{i,j}, ∂φ_y(g)/∂g_i ≥ ∂φ_y(g)/∂g_j, and with at least one y, the strict inequality holds.

Theorem 5. Let φ_y(g) be an order sensitive loss function on Ω ⊆ R^n. For n objects, if the permutation probability space is order preserving with respect to the n−1 object pairs (j_1, j_2), (j_2, j_3), ..., (j_{n−1}, j_n), then the loss φ_y(g) is consistent with respect to (1).

Due to space limitations, we only give the proof sketch. First, we can show that if the permutation probability space is order preserving with respect to the n−1 object pairs (j_1, j_2), (j_2, j_3), ..., (j_{n−1}, j_n), then the permutation with the maximum probability is y* = (j_1, j_2, ..., j_n). Second, for an order sensitive loss function and any order preserving object pair (j_1, j_2), the vector g which minimizes Q(g) in (8) should assign a larger score to j_1 than to j_2. This can be proven by examining the change of loss due to exchanging the scores of j_1 and j_2. Given these results and Definition 2, we can prove Theorem 5 by contradiction.

Theorem 5 gives sufficient conditions for a surrogate loss function to be consistent: the permutation probability space should be order preserving, and the loss function should be order sensitive. Actually, the assumption of order preserving has already been made when we use the scoring function and sorting function for ranking. The property of order preserving has also been explicitly or implicitly used in previous work, such as Cossock and Zhang (2006).
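As a concrete illustration of Definition 3, the sketch below checks, for a small hand-written distribution over the permutations of three objects, whether the order-preserving condition holds for a given object pair. The example probabilities are arbitrary values chosen only to exercise the check, not data from the paper.

    def is_order_preserving(p, i, j):
        """Check Definition 3: for every permutation y ranking object i above j,
        p_y must exceed the probability of y with i and j swapped."""
        for y in p:
            if y.index(i) < y.index(j):            # y^{-1}(i) < y^{-1}(j)
                swapped = tuple(j if o == i else i if o == j else o for o in y)
                if not p[y] > p[swapped]:
                    return False
        return True

    # Arbitrary example distribution over permutations of objects 0, 1, 2
    # (probabilities sum to 1); it happens to favour the order 0 > 1 > 2.
    p = {(0, 1, 2): 0.40, (0, 2, 1): 0.20, (1, 0, 2): 0.15,
         (1, 2, 0): 0.05, (2, 0, 1): 0.12, (2, 1, 0): 0.08}

    print(is_order_preserving(p, 0, 1))  # True: swapping 0 and 1 always lowers probability
    print(is_order_preserving(p, 2, 1))  # False for this particular distribution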
The property of order sensitivity means that, starting from a ground truth permutation, the loss will increase if we exchange the positions of two objects in it, and the speed of increase in loss is sensitive to the positions of the objects.

4.3. Case Studies

We look at the four properties of three loss functions.

4.3.1. Likelihood Loss

We introduce a new loss function for the listwise approach, which we call the likelihood loss. The likelihood loss function is defined as

    \phi(g(x), y) = -\log P(y|x; g),    (9)

where

    P(y|x; g) = \prod_{i=1}^{n} \frac{\exp(g(x_{y(i)}))}{\sum_{k=i}^{n} \exp(g(x_{y(k)}))}.

Note that we actually define a parameterized exponential probability distribution over all the permutations given the predicted result (by the ranking function), and define the loss function as the negative log likelihood of the ground truth list. The probability distribution turns out to be a Plackett-Luce model (Marden, 1995).
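As a sketch of how the likelihood loss (9) can be evaluated, the following function computes the negative log Plackett-Luce likelihood of a ground-truth permutation given a score vector. The log-sum-exp formulation is a standard numerical-stability choice, not something prescribed by the paper.

    import numpy as np

    def likelihood_loss(scores, y):
        """Eq. (9): negative log Plackett-Luce likelihood of permutation y.

        scores: array of g(x_1), ..., g(x_n); y: ground-truth permutation,
        where y[i] is the index of the object ranked at position i."""
        s = np.asarray(scores, dtype=float)[list(y)]     # scores in ground-truth order
        loss = 0.0
        for i in range(len(s)):
            tail = s[i:]
            # -log( exp(s_i) / sum_{k >= i} exp(s_k) ), computed via log-sum-exp
            loss += np.log(np.sum(np.exp(tail - tail.max()))) + tail.max() - s[i]
        return loss

    # Example: two objects, ground truth ranks object 1 above object 0.
    print(likelihood_loss([1.0, 2.0], [1, 0]))   # small loss: scores agree with y
    print(likelihood_loss([2.0, 1.0], [1, 0]))   # larger loss: scores disagree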
The likelihood loss function has the following nice properties.

First, the likelihood loss is consistent. The following proposition shows that the likelihood loss is order sensitive; therefore, according to Theorem 5, it is consistent. Due to space limitations, we omit the proof.

Proposition 6. The likelihood loss (9) is order sensitive on Ω ⊆ R^n.

Second, the likelihood loss function is sound. For simplicity, suppose that there are two objects to be ranked (a similar argument can be made when there are more objects). The two objects receive scores g_1 and g_2 from a ranking function. Figure 1(a) shows the scores and the point g = (g_1, g_2). Suppose that the first object is ranked below the second object in the ground truth. Then the upper-left area above the line g_2 = g_1 corresponds to correct ranking, and the lower-right area to incorrect ranking. According to the definition of the likelihood loss, all the points on the line g_2 = g_1 + d have the same loss. Therefore, the likelihood loss only depends on d. Figure 1(b) shows the relation between the loss and d. We can see that the loss function decreases monotonically as d increases. It penalizes negative values of d more heavily than positive ones. This will make the learning algorithm focus more on avoiding incorrect rankings. In this regard, the loss function is a good approximation of the 0-1 loss.

Figure 1. (a) Ranking scores of the predicted result; (b) loss φ vs. d for the likelihood loss.

Third, it is easy to verify that the likelihood loss is continuous, differentiable, and convex (Boyd & Vandenberghe, 2004). Furthermore, the loss can be computed efficiently, with time complexity linear in the number of objects. With the above good properties, a learning algorithm which optimizes the likelihood loss will be powerful for creating a ranking function.
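For the two-object case discussed above, eq. (9) reduces to a function of the score difference d = g_2 − g_1 alone, namely φ(d) = log(1 + exp(−d)); the short sketch below illustrates the behaviour shown in Figure 1(b), with negative d (incorrect ranking) penalized more heavily than positive d.

    import numpy as np

    # Two-object case from Figure 1: ground truth ranks object 2 above object 1,
    # so by eq. (9) the likelihood loss depends only on d = g_2 - g_1:
    #   phi(d) = -log( exp(g_2) / (exp(g_1) + exp(g_2)) ) = log(1 + exp(-d))
    phi = lambda d: np.log1p(np.exp(-d))

    for d in [-2.0, -1.0, 0.0, 1.0, 2.0]:
        print(f"d = {d:+.1f}  loss = {phi(d):.3f}")
    # The loss decreases monotonically as d grows, and negative d costs far more
    # than the corresponding positive d.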
4.3.2. Cosine Loss

The cosine loss is the loss function used in RankCosine (Qin et al., 2007), a listwise method. It is defined on the basis of the cosine similarity between the score vector of the ground truth and that of the predicted result:

    \phi(g(x), y) = \frac{1}{2}\left(1 - \frac{\psi_y(x)^{T} g(x)}{\|\psi_y(x)\| \, \|g(x)\|}\right).    (10)

The score vector of the ground truth is produced by a mapping ψ_y(·) : R^d → R, which retains the order in a permutation, i.e., ψ_y(x_{y(1)}) > ... > ψ_y(x_{y(n)}).

First, we can prove that the cosine loss is consistent, given the following proposition. Due to space limitations, we omit the proof.

Proposition 7. The cosine loss (10) is order sensitive on Ω ⊆ R^n.

Second, the cosine loss is not very sound. Let us again consider the case of ranking two objects. Figure 2(a) shows the point g = (g_1, g_2) representing the scores of the predicted result and the point ψ representing the ground truth (which depends on the mapping function ψ). We denote the angle from point g to the line g_2 = g_1 as α, and the angle from ψ to the line g_2 = g_1 as α_ψ. We investigate the relation between the loss and the angle α. Figure 2(b) shows the cosine loss as a function of α. From this figure, we can see that the cosine loss is not a monotonically decreasing function of α. When α > α_ψ, it increases quickly, which means that it can heavily penalize correct rankings. Furthermore, the mapping function, and thus α_ψ, also affects the loss function. Specifically, the curve of the loss function shifts from left to right with different values of α_ψ. Only when α_ψ = π/2 does it become a relatively satisfactory representation of loss for the learning problem.

Figure 2. (a) Ranking scores of the predicted result and the ground truth; (b) loss φ vs. angle α for the cosine loss.

Third, it is easy to see that the cosine loss is continuous and differentiable, but not convex. It can also be computed in an efficient manner, with a time complexity linear in the number of objects.
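A minimal sketch of the cosine loss (10), assuming a simple position-based mapping ψ (here a linearly decreasing score by rank, chosen only for illustration; the paper's experiments study several mappings):

    import numpy as np

    def cosine_loss(scores, y, psi=None):
        """Eq. (10): half of one minus the cosine similarity between the ground-truth
        score vector and the predicted score vector."""
        g = np.asarray(scores, dtype=float)
        n = len(g)
        if psi is None:
            # Illustrative mapping: ground-truth score n - r for the object at rank r.
            psi = np.empty(n)
            psi[list(y)] = np.arange(n, 0, -1)
        cos_sim = np.dot(psi, g) / (np.linalg.norm(psi) * np.linalg.norm(g))
        return 0.5 * (1.0 - cos_sim)

    # Example: three objects, ground truth ranks object 2 first, then 0, then 1.
    print(cosine_loss([0.3, 0.1, 0.9], y=[2, 0, 1]))   # close to 0: orders agree
    print(cosine_loss([0.9, 0.3, 0.1], y=[2, 0, 1]))   # larger loss: orders disagree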
4.3.3. Cross Entropy Loss

The cross entropy loss is the loss function used in ListNet (Cao et al., 2007), another listwise method. The cross entropy loss function is defined as

    \phi(g(x), y) = D\big(P(\pi|x; \psi_y) \,\|\, P(\pi|x; g)\big),    (11)

where

    P(\pi|x; \psi_y) = \prod_{i=1}^{n} \frac{\exp(\psi_y(x_{\pi(i)}))}{\sum_{k=i}^{n} \exp(\psi_y(x_{\pi(k)}))}, \qquad P(\pi|x; g) = \prod_{i=1}^{n} \frac{\exp(g(x_{\pi(i)}))}{\sum_{k=i}^{n} \exp(g(x_{\pi(k)}))},

and ψ is a mapping function whose definition is similar to that in RankCosine.

First, we can prove that the cross entropy loss is consistent, given the following proposition. Due to space limitations, we omit the proof.

Proposition 8. The cross entropy loss (11) is order sensitive on Ω ⊆ R^n.

Second, the cross entropy loss is not very sound. Again, we look at the case of ranking two objects. g = (g_1, g_2) denotes the ranking scores of the predicted result, and ψ denotes the ranking scores of the ground truth (depending on the mapping function). Similar to the discussion of the likelihood loss, the cross entropy loss only depends on the quantity d. Figure 3(a) illustrates the relation between g, ψ, and d. Figure 3(b) shows the cross entropy loss as a function of d. As can be seen, the loss function achieves its minimum at the point d_ψ and then increases as d increases. That means it can heavily penalize correct rankings that have high confidence. Note that the mapping function also affects the penalization: for some mapping functions, the penalization on correct rankings can be even larger than that on incorrect rankings.

Figure 3. (a) Ranking scores of the predicted result and the ground truth; (b) loss φ vs. d for the cross entropy loss.

Third, it is easy to see that the cross entropy loss is continuous and differentiable. It is also convex, because the logarithm of a sum of exponentials is convex and the set of convex functions is closed under addition (Boyd & Vandenberghe, 2004). However, it cannot be computed in an efficient manner: its time complexity is of exponential order in the number of objects.

Table 1 gives a summary of the properties of the loss functions. All three loss functions are consistent, as well as continuous and differentiable. The likelihood loss is better than the cosine loss in terms of convexity and soundness, and better than the cross entropy loss in terms of time complexity and soundness.

Table 1. Comparison between different surrogate losses.

    Loss            Consistency   Soundness   Continuity   Differentiability   Convexity   Complexity
    Likelihood      Yes           Yes         Yes          Yes                 Yes         O(n)
    Cosine          Yes           No          Yes          Yes                 No          O(n)
    Cross entropy   Yes           No          Yes          Yes                 Yes         O(n! · n)

5. ListMLE

We propose a novel listwise method referred to as ListMLE. In learning, ListMLE employs the likelihood loss as the surrogate loss function, since it is proven to have all the nice properties of a surrogate loss. On the training data, we actually maximize the sum of the log likelihood with respect to all the training queries:

    \sum_{i=1}^{m} \log P(y^{(i)}|x^{(i)}; g).    (12)

We choose Stochastic Gradient Descent (SGD) as the algorithm for conducting the minimization. As the ranking model, we choose a linear Neural Network (parameterized by ω). Algorithm 1 shows the learning algorithm based on SGD.

Algorithm 1  The ListMLE Algorithm
    Input: training data {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}
    Parameters: learning rate η, tolerance rate ε
    Initialize parameter ω
    repeat
        for i = 1 to m do
            Input (x^{(i)}, y^{(i)}) to the Neural Network and compute the gradient Δω with the current ω
            Update ω = ω − η · Δω
        end for
        Calculate the likelihood loss on the training set
    until the change of the likelihood loss is below ε
    Output: Neural Network model ω
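A compact sketch of Algorithm 1, assuming a linear scoring model g(x) = ω·x (the paper's ranking model is a linear Neural Network); the learning rate, tolerance, epoch cap, and function names are illustrative, the gradient is derived analytically from eq. (9), and the stopping check reuses the likelihood_loss helper from the earlier sketch.

    import numpy as np

    def listmle_gradient(w, X, y):
        """Gradient of the likelihood loss (9) w.r.t. w for one list,
        assuming a linear scoring function g(x_j) = w . x_j."""
        Xy = X[list(y)]                      # feature rows in ground-truth order
        s = Xy @ w                           # scores in ground-truth order
        grad_s = np.zeros_like(s)
        for i in range(len(s)):
            tail = s[i:] - s[i:].max()
            p = np.exp(tail) / np.exp(tail).sum()   # Plackett-Luce choice probabilities
            grad_s[i:] += p                  # d loss / d s_k picks up p_k for every i <= k
            grad_s[i] -= 1.0                 # minus the chosen object's own score term
        return Xy.T @ grad_s

    def train_listmle(data, dim, eta=0.01, eps=1e-5, max_epochs=1000):
        """Algorithm 1 (sketch): SGD on the likelihood loss with a linear model."""
        w = np.zeros(dim)
        prev = np.inf
        for _ in range(max_epochs):
            for X, y in data:                # one (feature matrix, permutation) pair
                w -= eta * listmle_gradient(w, X, y)
            cur = sum(likelihood_loss(X @ w, y) for X, y in data)
            if abs(prev - cur) < eps:        # stop when the training loss stabilizes
                break
            prev = cur
        return w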
6. Experimental Results

We conducted two experiments to verify the correctness of the theoretical findings. One data set is synthetic data, and the other is the LETOR benchmark data for learning to rank (Liu et al., 2007).

6.1. Experiment on Synthetic Data

We conducted an experiment using a synthetic data set, created as follows. First, we randomly sample a point according to the uniform distribution on the square area [0, 1] × [0, 1]. Then we assign to the point a score using a fixed rule of its two coordinates plus noise ε, where ε denotes a random variable normally distributed with mean zero. In total, we generate 15 points and their scores in this way, and create a permutation on the points based on their scores, which forms an instance of ranking. We repeat the process and make 100 training instances, 100 validation instances, and 100 testing instances.

We applied RankCosine, ListNet⁴, and ListMLE to the data. We tried different score mapping functions for RankCosine and ListNet, and used the five most representative ones, i.e., log(15 − r), √(15 − r), 15 − r, (15 − r)², and exp(15 − r), where r denotes the position of an object. We denote the mapping functions as log, sqrt, l, q, and exp for simplicity. The experiments were repeated 20 times with different initial values of the parameters in the Neural Network model.

Table 2 shows the means and standard deviations of the accuracies and Mean Average Precision (MAP) (Baeza-Yates & Ribeiro-Neto, 1999) of the three algorithms. The accuracy measures the proportion of correctly ranked instances, and MAP⁵ is a commonly used measure in IR. As shown in the table, ListMLE achieves the best performance among all the algorithms in terms of both accuracy and MAP, owing to the good properties of its loss function. The accuracies of RankCosine and ListNet vary according to the mapping functions. In particular, RankCosine achieves a very low accuracy when using the mapping function exp, but a much higher one when using the mapping function l. This result indicates that the performances of the cosine loss and the cross entropy loss depend on the mapping functions, while finding a suitable mapping function is not easy. Furthermore, RankCosine has a larger variance than ListMLE and ListNet. The likely explanation is that RankCosine's performance is sensitive to the initial values of the parameters due to the non-convexity of its loss function.

⁴ The top-1 version of the cross entropy loss was employed, as in the original work (Cao et al., 2007).
⁵ When calculating MAP, we treated the top-1 item as relevant and the others as irrelevant.

Table 2. The performance of the three algorithms on the synthetic data set (mean ± standard deviation).

    Algorithm         Accuracy    MAP
    ListMLE           0.9 ±       ± 0.00
    ListNet-log       ±           ± 0.00
    ListNet-sqrt      ±           ± 0.00
    ListNet-l         ±           ±
    ListNet-q         ±           ± 0.00
    ListNet-exp       0.83 ±      ±
    RankCosine-log    ±           ±
    RankCosine-sqrt   ±           ±
    RankCosine-l      ±           ± 0.00
    RankCosine-q      0.10 ±      ±
    RankCosine-exp    ±           ±
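For reference, here is a small sketch of the two evaluation measures used above, under the paper's conventions (exact-match accuracy over whole permutations, and MAP computed with only the top-1 ground-truth item treated as relevant); the function names are ours.

    import numpy as np

    def accuracy(predicted, ground_truth):
        """Proportion of instances whose predicted permutation exactly matches the ground truth."""
        return np.mean([list(p) == list(y) for p, y in zip(predicted, ground_truth)])

    def map_top1_relevant(predicted, ground_truth):
        """MAP when, per instance, only the ground truth's top-1 object is treated as relevant:
        average precision then reduces to the reciprocal rank of that object in the prediction."""
        aps = []
        for p, y in zip(predicted, ground_truth):
            relevant = y[0]                         # the single relevant object
            rank = list(p).index(relevant) + 1      # 1-based position in the predicted list
            aps.append(1.0 / rank)
        return float(np.mean(aps))

    # Tiny usage example with two ranking instances over three objects.
    pred = [[2, 0, 1], [0, 1, 2]]
    gold = [[2, 0, 1], [1, 0, 2]]
    print(accuracy(pred, gold))            # 0.5
    print(map_top1_relevant(pred, gold))   # (1/1 + 1/2) / 2 = 0.75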
6.2. Experiment on OHSUMED Data

We also conducted an experiment on OHSUMED, a benchmark data set for learning to rank provided in LETOR. There are in total 106 queries and 16,140 query-document pairs upon which relevance judgments are made. The relevance judgments are either "definitely relevant", "possibly relevant", or "not relevant". The data is in the form of feature vectors and relevance labels, with 25 features in total. We used the data split provided in LETOR to conduct five-fold cross-validation experiments. In evaluation, besides MAP, we adopted another measure commonly used in IR: Normalized Discounted Cumulative Gain (NDCG) (Jarvelin & Kekalainen, 2000).

Note that here the ground truth in the data is given as a partial ranking, while the methods need to use a total ranking (permutation) in training. To bridge the gap, for RankCosine and ListNet we adopted the methods proposed in the respective papers (Cao et al., 2007) (Qin et al., 2007). For ListMLE, we randomly selected one perfect permutation for each query from among the possible perfect permutations based on the ground truth.
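One simple way to implement the random selection of a perfect permutation described above is to sort the documents by relevance grade while breaking ties uniformly at random; the sketch below is our illustration of that step, not code from the paper.

    import random

    def random_perfect_permutation(relevance, rng=random):
        """Return one permutation (list of document indices) that is 'perfect' with
        respect to the graded relevance labels: higher grades always come first,
        and ties within a grade are broken uniformly at random."""
        indices = list(range(len(relevance)))
        rng.shuffle(indices)                                  # random tie-breaking
        return sorted(indices, key=lambda i: -relevance[i])   # stable sort keeps the shuffle within ties

    # Example: grades 2 = definitely relevant, 1 = possibly relevant, 0 = not relevant.
    print(random_perfect_permutation([0, 2, 1, 2, 0]))        # e.g. [3, 1, 2, 0, 4] or [1, 3, 2, 4, 0]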
We applied RankCosine, ListNet, and ListMLE to the data. The results reported below are those averaged over the five trials. As shown in Figure 4, ListMLE achieves the best performance among all the algorithms. In particular, on NDCG@1 it has more than 5-point gains over RankCosine, which is in second place. We also conducted t-tests on the improvements of ListMLE over the other two algorithms. The results show that the improvements are statistically significant (p-value < 0.05).

Figure 4. Ranking performance on OHSUMED data.

7. Conclusion

In this paper, we have investigated the theory and algorithms of the listwise approach to learning to rank. We have pointed out that to understand the effectiveness of a learning to rank algorithm, it is necessary to conduct theoretical analysis on its loss function. We propose investigating a loss function from the viewpoints of (a) consistency, (b) soundness, (c) continuity, differentiability, and convexity, and (d) efficiency. We have obtained some theoretical results on the consistency of ranking. We have conducted analysis on the likelihood loss, the cosine loss, and the cross entropy loss. The results indicate that the likelihood loss has better properties than the other two losses. We have then developed a new learning algorithm using the likelihood loss, called ListMLE, and demonstrated its effectiveness through experiments.

There are several directions which we can further explore. (1) We want to conduct more theoretical analysis on the properties of loss functions, for example, weaker conditions for consistency and rates of convergence. (2) We plan to study the case where a cost-sensitive loss function is used instead of the 0-1 loss function in defining the expected loss. (3) We plan to investigate other surrogate loss functions with the tools we have developed in this paper.

References

Baeza-Yates, R., & Ribeiro-Neto, B. (Eds.). (1999). Modern information retrieval. Addison Wesley.

Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2003). Convexity, classification, and risk bounds (Technical Report 638). Statistics Department, University of California, Berkeley.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. Proceedings of ICML 2005.

Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. Proceedings of the 24th International Conference on Machine Learning. Corvallis, OR.

Cossock, D., & Zhang, T. (2006). Subset ranking using regression. COLT.

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (1998). An efficient boosting algorithm for combining preferences. Proceedings of ICML.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference and prediction. Springer.

Herbrich, R., Graepel, T., & Obermayer, K. (1999). Support vector learning for ordinal regression. Proceedings of ICANN.

Jarvelin, K., & Kekalainen, J. (2000). IR evaluation methods for retrieving highly relevant documents. Proceedings of SIGIR.

Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery.

Liu, T.-Y., Qin, T., Xu, J., Xiong, W.-Y., & Li, H. (2007). LETOR: Benchmark dataset for research on learning to rank for information retrieval. Proceedings of SIGIR.

Marden, J. I. (Ed.). (1995). Analyzing and modeling rank data. London: Chapman and Hall.

Nallapati, R. (2004). Discriminative models for information retrieval. Proceedings of SIGIR.

Qin, T., Zhang, X.-D., Tsai, M.-F., Wang, D.-S., Liu, T.-Y., & Li, H. (2007).
Query-level loss functions for information retrieval. Information Processing and Management.

Xu, J., & Li, H. (2007). AdaRank: A boosting algorithm for information retrieval. Proceedings of SIGIR.

Yue, Y., Finley, T., Radlinski, F., & Joachims, T. (2007). A support vector method for optimizing average precision. Proceedings of SIGIR.

Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5.