Learning-Based Summarisation of XML Documents

Massih R. Amini, Anastasios Tombros, Nicolas Usunier, Mounia Lalmas

University Pierre and Marie Curie, 8, rue du capitaine Scott, 75015 Paris, France
Queen Mary, University of London, Department of Computer Science, London E1 4NS, United Kingdom

Abstract. Documents formatted in eXtensible Markup Language (XML) are available in collections of various document types. In this paper, we present an approach for the summarisation of XML documents. The novelty of this approach lies in that it is based on features not only from the content of documents, but also from their logical structure. We follow a machine learning, sentence extraction-based summarisation technique. To find which features are more effective for producing summaries, this approach views sentence extraction as an ordering task. We evaluated our summarisation model using the INEX and SUMMAC datasets. The results demonstrate that the inclusion of features from the logical structure of documents increases the effectiveness of the summariser, and that the learnable system is also effective and well-suited to the task of summarisation in the context of XML documents. Our approach is generic, and is therefore applicable, apart from entire documents, to elements of varying granularity within the XML tree. We view these results as a step towards the intelligent summarisation of XML documents.

1 Introduction

With the growing availability of on-line text resources, it has become necessary to provide users with systems that obtain answers to queries in a manner which is both efficient and effective. In various information retrieval (IR) tasks, single document text summarisation (SDS) systems are designed to help users to quickly find the needed information [9, 24]. For example, SDS can be coupled with conventional search engines and help users to evaluate the relevance of documents [34] for providing answers to their queries.
The original problem of summarisation requires the ability to understand and synthesise a document in order to generate its abstract. However, different attempts to produce human quality summaries have shown that this process of abstraction is highly complex, since it needs to borrow elements from fields such as linguistics, discourse understanding and language generation [23, 6]. Instead, most studies consider the task of text summarisation as the extraction of text spans (typically sentences) from the original document; scores are assigned to text units and the best-scoring spans are presented in the summary. These approaches transform the problem of abstraction into a simpler problem of ranking spans from an original text according to their relevance to be part of the document summary. This kind of summarisation is related to the task of document retrieval, where the goal is to rank documents from a text collection with respect to a given query in order to retrieve the best matches. Although such an extractive approach does not perform an in-depth analysis of the source text, it can produce summaries that have proven to be effective [9, 24, 34].

To compute sentence scores, most previous studies adopt a linear weighting model which combines statistical or linguistic features characterising each sentence in a text [2]. In many systems, the set of feature weights is tuned manually; this may not be tractable in practice, as the importance of different features can vary for different text genres [4]. Machine Learning (ML) approaches within the classification framework have been shown to be a promising way to combine sentence features automatically [7, 32, 5, 2]. In such approaches, a classifier is trained to distinguish between two classes of sentences: summary and non-summary ones. The classifier is learnt by comparing its output to a desired output reflecting global class information. This framework is limited in that it makes the assumption that all sentences from different documents are comparable with respect to this class information.

Here we explore a ML approach for SDS based on ranking. The main rationale of this approach is to learn how to best combine sentence features such that within each document, summary sentences get higher scores than non-summary ones. This ordering criterion corresponds exactly to what the learnt function is used for, i.e. ordering sentences. The statistical features that we consider in this work are partly from the state of the art, and they include cue-phrases and positional indicators [2, 9], and title-keyword similarity [9]. In addition, we propose a new contextual approach based on topic identification to extract meaningful features from sentences.

In this paper, we apply the ML approach for summarisation to XML documents.
The XML format is becoming increasingly popular [26], and this has caused a considerable interest in the content-based retrieval of XML documents, mainly through the INEX initiative [3]. In XML retrieval, document components, rather than entire documents, are retrieved. As the number of XML components is typically large (much larger than that of documents), it is essential to provide users of XML IR systems with overviews of the contents of the retrieved elements. The element summaries can then be used by searchers in an interactive environment. In traditional (i.e. non XML) interactive information retrieval, a summary is usually associated with each document; in interactive XML retrieval, a summary can be associated with each retrieved XML component. Because of the nature of XML documents, users can also browse within the XML document containing that element. One method to facilitate browsing is to display the logical structure of the document containing the retrieved elements (e.g. in a Table of Contents format). In this way, summaries can also be associated with the other elements forming the document, in addition to the retrieved elements themselves [30]. The choice of the meaningful granularity of elements to be summarised is also currently being investigated [3], as some retrieved elements may simply be too short to be summarised. The summarisation of XML documents is also beginning to draw attention from researchers [, 20, 26, 30]. In our experiments we have considered sentences for extractive summarisation, so from now on, we will refer to sentences as the basic text-units to be extracted.

A major aim of this paper is to investigate the effectiveness of an XML summarisation approach by combining structural and content features to extract sentences for summaries. More specifically, a further novel feature of our work is that we make use of the logical structure of documents to enhance sentence characterisation. In XML documents, a tree-like structure, which corresponds to the logical structure of the source document, is encoded. For example, an article can be seen as the root of the tree, and sections, subsections and paragraphs can be arranged in branches and leaves of the tree. We select a number of features from this logical structure, and learn which features are the best predictors of summary-worthy sentences.

The contributions of this work are therefore twofold: first, we propose and justify the effectiveness of a ranking algorithm, instead of the mostly used classification error criterion in ML approaches for SDS, and second, we investigate the summarisation of XML documents by taking into account features relating both to the content and the logical structure of the documents.

The ultimate aim of our approach is to generate summaries for components of XML documents at any level in the logical structure hierarchy. Since at present the evaluation of such summaries is hard (due to the lack of appropriate resources), we consider an XML article to be an XML element, and we use its content and structure to learn how we can best summarise it. Our approach is sufficiently generic to be applied to a component at any level of the logical structure of an XML document.

In the remainder of the paper, we first discuss, in Section 2, related work on ML approaches based on the classification framework and outline our ML approach for summarisation. In Section 3 we present the structural and content features that we use to represent sentences for this task. In Section 4 we outline our evaluation methodology.
In Section 5 we present the results of our evaluation using two datasets from the INitiative for the Evaluation of XML retrieval (INEX) [3] and the Computation and Language collection (cmp-lg) of TIPSTER SUMMAC [28]. Finally, in Section 6 we discuss the outcomes of this study and we also draw some pointers for the continuation of this research.

2 Trainable text summarisers

The purpose of this section is to present evidence that, for SDS, a ranking framework is better suited for the learning of a scoring function than a classification framework. To this end, we define two trainable text summarisers learnt using a classification and a ranking criterion, and show from the choice of these learning criteria why our proposition holds. In both cases, we aim to learn a scoring function $h : \mathbb{R}^n \to \mathbb{R}$ which represents the best linear combination of sentence features according to the learning criterion in use under the supervised setting.

We chose to use a simple linear combination of sentence features for two reasons. First, under the classification framework, it has been shown that simple linear classifiers like the Naive Bayes model [7], or a Support Vector Machine [5], perform as well as more complex non-linear classifiers [5]. Secondly, in order to compare fairly between the ranking and classification approaches, we fix the class of the scoring function (linear in our case) and consider two different learning criteria developed under these two frameworks. The choice of the best ranking function class for SDS is beyond the scope of the paper.

In the following, we first present the notations used in the rest of the paper and give a brief review of the classification framework for text summarisation, and then present the main motivation for using an alternative ML approach based on ordering criteria for this task.

2.1 Notations

We denote by $D$ the collection of documents in the training set and assume that each document $d$ in $D$ is composed of a set of sentences (see footnote 2), $d = (s_k)_{k \in \{1,...,|d|\}}$, where $|d|$ is the length of document $d$ in terms of the number of sentences composing $d$. Each sentence $s = (s_i)_{i \in \{1,...,n\}}$ is characterised by a set of $n$ structural and statistical features that we present in Section 3. Without loss of generality, we assume that every feature is a positive real value for any sentence.

Under the supervised setting, we suppose that a binary relevance judgment vector $y = (y_k)$, $y_k \in \{-1, 1\}$, $\forall k$, is associated to each document $d$; $y_k$ indicates whether the sentence $s_k$ in $d$ belongs, or not, to the summary.

2.2 Text summarisation as a classification task

In this section, we present the classification framework for SDS, which is the most used learning scheme for this task in the literature. We first present a classification learning criterion related to the minimisation of the misclassification error, and then present a logistic classifier that we prove to be adequate for this optimisation.
Misclassification error rate

The working principle of classification approaches to SDS is to associate class label $1$ to summary (or relevant) sentences, and class label $-1$ to non-summary (or irrelevant) ones, and to use a learning algorithm to discover for each sentence $s$ the best combination weights of its features $h(s)$, with the goal of minimising the error rate of the classifier (or its classification loss, denoted by $L_C$), that is, the expectation that a sentence is incorrectly classified by the output classifier:

$$L_C(h) = E\left([[y h(s) < 0]]\right) \qquad (1)$$

where $[[pr]]$ is equal to $1$ if predicate $pr$ holds and $0$ otherwise. The computation of this expected error rate depends on the probability distribution from which each pair (sentence, class) is supposed to be drawn identically and independently. In practice, since this distribution is unknown, the true error rate cannot be computed exactly, and it is estimated over a labeled training set by the empirical error rate $\hat{L}_C$ given by

$$\hat{L}_C(h, S) = \frac{1}{|S|} \sum_{s \in S} [[y h(s) < 0]] \qquad (2)$$

where $S$ represents the set of all sentences appearing in $D$. We notice here that sentences from different documents are comparable with respect to a global class information.

Footnote 2: Recall that in extractive summarisation, the summary of a document is made of a subset of its sentences.

A direct optimisation of the empirical error rate (equation 2) is not tractable as this function is not differentiable. Schapire and Singer [25] motivate $e^{-y h(s)}$ as a differentiable upper bound to $[[y h(s) < 0]]$. This follows because for all $x$, $e^{-x} \geq [[x < 0]]$. Figure 1 shows the graphs of these two misclassification error functions as well as the log-likelihood loss function introduced below with respect to $yh$; negative (positive) values of $yh$ imply incorrect (correct) classification. The exponential and log-likelihood criteria are differentiable upper bounds of the misclassification error rate. These functions are also convex, so standard optimisation algorithms can be used to minimise them. Friedman et al. have shown in [2] that the function $h$ minimising $E(e^{-y h(s)})$ is a logistic classifier whose output estimates $p(y = 1 \mid s)$, the posterior probability of the class relevant given a sentence $s$.

Fig. 1. Misclassification, exponential and log-likelihood loss functions with respect to $yh$.

In many ML approaches, the optimisation criterion to train a logistic classifier is the binomial log-likelihood function $E \log(1 + e^{-2 y h(s)})$. The reason is that from a statistical point of view, $e^{-y h(s)}$ is not equal to the log of any probability mass function on $\pm 1$, as is the case for $\log(1 + e^{-2 y h(s)})$. Nevertheless, Friedman et al. have shown that the optimisation of both criteria is effective and that the population minimisers of $E \log(1 + e^{-2 y h(s)})$ and $E(e^{-y h(s)})$ coincide [2].
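The three loss functions discussed above can be tabulated for a few margin values $yh$. The following is a minimal sketch, not taken from the paper: the function names and the sample margins are illustrative assumptions, and the check at the end only covers the exponential bound $e^{-x} \geq [[x < 0]]$ stated in the text.

```python
import math


def misclassification_loss(margin):
    """Indicator [[y h(s) < 0]]: 1.0 for a negative margin, else 0.0."""
    return 1.0 if margin < 0 else 0.0


def exponential_loss(margin):
    """Exponential criterion e^{-y h(s)}."""
    return math.exp(-margin)


def log_likelihood_loss(margin):
    """Binomial log-likelihood criterion log(1 + e^{-2 y h(s)})."""
    return math.log1p(math.exp(-2.0 * margin))


# Hypothetical margins y h(s); a negative value means a misclassified sentence.
margins = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
losses = [
    (m, misclassification_loss(m), exponential_loss(m), log_likelihood_loss(m))
    for m in margins
]

for m, zero_one, exp_l, log_l in losses:
    print(f"yh={m:+.1f}  0/1={zero_one:.0f}  exp={exp_l:.3f}  loglik={log_l:.3f}")

# The exponential loss dominates the 0/1 loss at every sampled margin.
assert all(exp_l >= zero_one for _, zero_one, exp_l, _ in losses)
```

Both smooth criteria decrease monotonically as the margin grows, which is what allows gradient-style or iterative scaling procedures to be applied to them.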

For the ranking case, we will adopt a similar logistic model and show that the minimisation of the exponential loss has a real advantage over the log-binomial in terms of computational complexity (see Section 2.3).

Logistic model for classification

For the classification case, we propose to learn the parameters $\Lambda = (\lambda_1, ..., \lambda_n)$ of the feature combination $h(s) = \sum_{i=1}^{n} \lambda_i s_i$ by training a logistic classifier whose output estimates $p(relevant \mid s) = \frac{1}{1 + e^{-2h(s)}}$, in order to minimise the empirical exponential bound estimated on the training set:

$$L^c_{exp}(S; \Lambda) = \frac{1}{|S|} \sum_{y \in \{-1, 1\}} \sum_{s \in S_y} e^{-y \sum_{i=1}^{n} \lambda_i s_i} \qquad (3)$$

where $S_1$ and $S_{-1}$ are respectively the sets of relevant and irrelevant sentences in the training set $S$, and $|S|$ is the number of sentences in $S$. For the minimisation of $L^c_{exp}$, we employ an iterative scaling algorithm [7]. This procedure is shown in Algorithm 1. Starting from some arbitrary set of parameters $\Lambda = (\lambda_1, ..., \lambda_n)$, the algorithm iteratively finds a new set of parameters $\Lambda + \Delta = (\lambda_1 + \delta_1, ..., \lambda_n + \delta_n)$ that yields a model of lower $L^c_{exp}$. At every iteration $t$, the update of each $\lambda_i$ in this algorithm is to take

$$\lambda_i^{(t+1)} \leftarrow \lambda_i^{(t)} + \delta_i^{(t)}$$

where each $\delta_i^{(t)}$, $i \in \{1, ..., n\}$, satisfies

$$\delta_i^{(t)} = \frac{1}{2} \log \frac{\sum_{s \in S_1} s_i \, e^{-h(s, \Lambda^{(t)})}}{\sum_{s \in S_{-1}} s_i \, e^{h(s, \Lambda^{(t)})}}$$

We derive this update rule in Appendix A. After convergence, sentences of a new document are ranked with respect to the output of the classifier, and those with the highest scores are extracted to form the summary of the document.

An advantage of Algorithm 1 is that its complexity is linear in the number of examples, times the total number of iterations ($|S| \times t$). This is interesting, since the number of sentences in the training set is generally large. In the following, we introduce our ranking framework for SDS.

2.3 Text summarisation as an ordering task

The classification framework for SDS has several drawbacks.
First, the assumption that all sentences from different documents are comparable with respect to a class information is not correct. Indeed, text summaries depend more on the content of their respective documents than on a global class information. Furthermore, due to a high number of irrelevant sentences, a classifier will typically achieve a low misclassification rate if, independently of where relevant sentences are ranked, it always assigns the class irrelevant to every sentence in the collection. Therefore, it is important to compare the relevance of each sentence with respect to each other within every document in the training set, in other words, to learn a ranking function that assigns higher scores to relevant sentences of a document than to irrelevant ones.

Algorithm 1: Classification Based Trainable Extractive Summariser
  Input: $S = S_1 \cup S_{-1}$
  Initialise:
    Normalise each sentence vector $s \in S$ such that $\sum_i s_i = 1$
    Set the value of feature weights $\Lambda^{(0)} = (\lambda_1^{(0)}, ..., \lambda_n^{(0)})$ with some arbitrary values
    $t \leftarrow 0$
  repeat
    for $i \leftarrow 1$ to $n$ do
      $\lambda_i^{(t+1)} \leftarrow \lambda_i^{(t)} + \frac{1}{2} \log \frac{\sum_{s \in S_1} s_i \, e^{-h(s, \Lambda^{(t)})}}{\sum_{s \in S_{-1}} s_i \, e^{h(s, \Lambda^{(t)})}}$
    end
    $t \leftarrow t + 1$
  until convergence of $L^c_{exp}(S; \Lambda)$
  Output: $\Lambda^F$
  Create a summary for each new document $d$ by taking the $n$ first sentences in $d$ with regard to the output of the linear combination of sentence features with $\Lambda^F$

A framework for learning a ranking function for SDS

The problem of learning a trainable summariser based on ranking can be formalised as follows. For each document $d$ in $D$ we denote by $S_1^d$ and $S_{-1}^d$ respectively the sets of relevant and irrelevant sentences appearing in $d$ with respect to its summary. The ranking function can be represented by a function $h$ that reflects the partial ordering of relevant sentences over irrelevant ones for each document in the training set. For a given document $d$, if we consider two sentences $s$ and $s'$ such that $s$ is preferred over $s'$ ($s \in S_1^d$ and $s' \in S_{-1}^d$), then $h$ ranks $s$ higher than $s'$:

$$\forall d \in D, \ \forall (s, s') \in S_1^d \times S_{-1}^d, \quad h(s) > h(s')$$

Finally, in order to learn the ranking function we need a relevance judgment describing which sentence is preferred to which one. This information is given by binary judgments provided for documents in the training set. For these documents, sentences belonging (or not) to the summary are labeled as $+1$ (or $-1$).
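The classification-based summariser of Algorithm 1 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the fixed iteration count used in place of a convergence test, and the toy feature vectors are all assumptions, and the update uses the rule $\delta_i = \frac{1}{2} \log$ of the ratio between the relevant and irrelevant weighted feature sums.

```python
import math


def normalise(sentence):
    """Scale a positive feature vector so that its components sum to 1."""
    total = sum(sentence)
    return [x / total for x in sentence]


def score(sentence, weights):
    """Linear scoring function h(s) = sum_i lambda_i * s_i."""
    return sum(w * x for w, x in zip(weights, sentence))


def exp_loss(relevant, irrelevant, weights):
    """Empirical exponential bound of equation (3)."""
    total = sum(math.exp(-score(s, weights)) for s in relevant)
    total += sum(math.exp(score(s, weights)) for s in irrelevant)
    return total / (len(relevant) + len(irrelevant))


def train_classifier(relevant, irrelevant, iterations=50):
    """Iterative scaling; a fixed iteration count stands in for convergence."""
    weights = [0.0] * len(relevant[0])
    for _ in range(iterations):
        w_rel = [math.exp(-score(s, weights)) for s in relevant]
        w_irr = [math.exp(score(s, weights)) for s in irrelevant]
        updated = []
        for i, lam in enumerate(weights):
            num = sum(w * s[i] for w, s in zip(w_rel, relevant))
            den = sum(w * s[i] for w, s in zip(w_irr, irrelevant))
            updated.append(lam + 0.5 * math.log(num / den))
        weights = updated  # all lambda_i are updated from the same Lambda^(t)
    return weights


# Hypothetical two-feature sentences (all features strictly positive).
relevant = [normalise(s) for s in [[3.0, 1.0], [4.0, 2.0]]]
irrelevant = [normalise(s) for s in [[1.0, 3.0], [1.0, 4.0], [2.0, 5.0]]]

initial_loss = exp_loss(relevant, irrelevant, [0.0, 0.0])
learnt = train_classifier(relevant, irrelevant)
final_loss = exp_loss(relevant, irrelevant, learnt)
print(initial_loss, final_loss, learnt)
```

Each pass over the data touches every sentence once per feature, which matches the linear cost in the number of examples noted for Algorithm 1.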
Following [], we can define the goal of learning a ranking function $h$ as the minimisation of the ranking loss $L_R$, defined as the average number of relevant sentences scored below irrelevant ones in every document $d$ in $D$:

$$L_R(h, D) = \frac{1}{|D|} \sum_{d \in D} \frac{1}{|S_1^d| \, |S_{-1}^d|} \sum_{s \in S_1^d} \sum_{s' \in S_{-1}^d} [[h(s) \leq h(s')]] \qquad (4)$$

Note that this formulation is similar to the misclassification error rate. The main difference is that instead of classifying sentences as relevant/irrelevant for the summary, a ranking algorithm classifies pairs of sentences. More specifically, it considers the pair of sentences $(s, s')$ from the same document, such that one of the two sentences is relevant. Learning a scoring function $h$ which gives a higher score to the relevant sentence than to the irrelevant one is then equivalent to learning a classifier which correctly classifies the pair.

The Ranking Logistic Algorithm

Here we are interested in the design of an algorithm which (a) efficiently finds a function $h$ in the family of linear ranking functions minimising equation (4), and (b) yields a function that generalises well on a given test set. In this paper we address the first problem, and provide empirical evidence for the performance of our ranking algorithm on different test sets.

There exist several ranking algorithms in the ML literature, based on the perceptron [27] or on AdaBoost, the latter called RankBoost []. For the SDS task, as the total number of sentences in the collection may be high, we need a simple and efficient ranking algorithm. Perceptron-based ranking algorithms would lead to quadratic complexity in the number of examples, whereas the RankBoost algorithm in its standard setting does not search for a linear combination of the input features. In this paper, we consider the class of linear ranking functions

$$\forall d \in D, \ \forall s \in d, \quad h(s, B) = \sum_{i=1}^{n} \beta_i s_i \qquad (5)$$

where $B = (\beta_1, ..., \beta_n)$ is the vector of weights of the ranking function that we aim to learn.
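The ranking loss of equation (4) with the linear scoring function of equation (5) can be computed directly from its definition. The sketch below is illustrative only: the data layout (a list of `(relevant, irrelevant)` pairs of feature vectors, one pair per document) and the toy values are assumptions.

```python
def linear_score(sentence, beta):
    """Linear ranking function h(s, B) = sum_i beta_i * s_i (equation 5)."""
    return sum(b * x for b, x in zip(beta, sentence))


def ranking_loss(documents, beta):
    """Average fraction of (relevant, irrelevant) pairs that are misordered.

    `documents` is a list of (relevant, irrelevant) pairs, each a list of
    feature vectors taken from one document (equation 4).
    """
    total = 0.0
    for relevant, irrelevant in documents:
        misordered = sum(
            1
            for s in relevant
            for s_prime in irrelevant
            if linear_score(s, beta) <= linear_score(s_prime, beta)
        )
        total += misordered / (len(relevant) * len(irrelevant))
    return total / len(documents)


# Two hypothetical documents described by two features per sentence.
documents = [
    ([[0.9, 0.1]], [[0.2, 0.8], [0.5, 0.5]]),
    ([[0.6, 0.4], [0.3, 0.7]], [[0.4, 0.6]]),
]
beta = [1.0, 0.0]

loss = ranking_loss(documents, beta)
print(loss)  # the second document has one misordered pair out of two
```

Pairs are only formed within a document, so sentences from different documents are never compared, unlike in the classification loss of equation (2).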
Similar to the explanation given in Section 2.2, a logistic model adapted to ranking (see footnote 3),

$$p(relevant \mid (s, s')) = \frac{1}{1 + e^{-2 \sum_{i=1}^{n} \beta_i (s_i - s'_i)}} \qquad (6)$$

is well suited for learning the parameters of the combination $B$ by minimising an exponential upper bound on the ranking loss $L_R$ (equation 4):

$$L^r_{exp}(D; B) = \frac{1}{|D|} \sum_{d \in D} \frac{1}{|S_1^d| \, |S_{-1}^d|} \sum_{(s, s') \in S_1^d \times S_{-1}^d} e^{\sum_{i=1}^{n} \beta_i (s'_i - s_i)} \qquad (7)$$

The interesting property of this exponential loss for ranking functions is that it can be computed in time linear in the number of examples, simply by rewriting equation (7) as follows:

$$L^r_{exp}(D; B) = \frac{1}{|D|} \sum_{d \in D} \frac{1}{|S_1^d| \, |S_{-1}^d|} \left( \sum_{s' \in S_{-1}^d} e^{\sum_{i=1}^{n} \beta_i s'_i} \right) \left( \sum_{s \in S_1^d} e^{-\sum_{i=1}^{n} \beta_i s_i} \right) \qquad (8)$$

Footnote 3: The choice of linear ranking functions, in our case, makes it convenient to represent a pair of sentences $(s, s')$ by the difference of their representative vectors, $(s_1 - s'_1, ..., s_n - s'_n)$, as $h(s) - h(s')$ becomes $\sum_{i=1}^{n} \beta_i (s_i - s'_i)$.
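The equivalence between the pairwise form (7) and the factorised form (8) can be checked numerically. This is a small sketch under assumed names and toy data; it evaluates both expressions on the same documents and confirms that they agree, while the factorised version never enumerates sentence pairs.

```python
import math


def score(sentence, beta):
    """Linear ranking function h(s, B)."""
    return sum(b * x for b, x in zip(beta, sentence))


def exp_loss_pairwise(documents, beta):
    """Equation (7): explicit sum over all (relevant, irrelevant) pairs."""
    total = 0.0
    for relevant, irrelevant in documents:
        pair_sum = sum(
            math.exp(score(s_prime, beta) - score(s, beta))
            for s in relevant
            for s_prime in irrelevant
        )
        total += pair_sum / (len(relevant) * len(irrelevant))
    return total / len(documents)


def exp_loss_factorised(documents, beta):
    """Equation (8): product of two sums, linear in the number of sentences."""
    total = 0.0
    for relevant, irrelevant in documents:
        irrelevant_sum = sum(math.exp(score(s, beta)) for s in irrelevant)
        relevant_sum = sum(math.exp(-score(s, beta)) for s in relevant)
        total += irrelevant_sum * relevant_sum / (len(relevant) * len(irrelevant))
    return total / len(documents)


# Hypothetical documents: (relevant sentences, irrelevant sentences).
documents = [
    ([[0.8, 0.2]], [[0.3, 0.7], [0.4, 0.6]]),
    ([[0.7, 0.3], [0.6, 0.4]], [[0.5, 0.5], [0.1, 0.9], [0.2, 0.8]]),
]
beta = [1.5, -0.5]

pairwise = exp_loss_pairwise(documents, beta)
factorised = exp_loss_factorised(documents, beta)
print(pairwise, factorised)
```

The factorisation works because the exponential of a difference is a product of exponentials, so the double sum separates into one sum over relevant sentences and one over irrelevant ones.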

For the ranking case, this property makes it convenient to optimise the exponential loss rather than the corresponding binomial log-likelihood

$$L^r_b(D; B) = \frac{1}{|D|} \sum_{d \in D} \frac{1}{|S_1^d| \, |S_{-1}^d|} \sum_{(s, s') \in S_1^d \times S_{-1}^d} \log\left(1 + e^{-2 \sum_{i=1}^{n} \beta_i (s_i - s'_i)}\right) \qquad (9)$$

Indeed, the computation of the maximum likelihood of equation (9) requires considering all the pairs of sentences, and leads to a complexity quadratic in the number of examples. Thus, although ranking algorithms consider the pairs of examples, in the special case of SDS, the proposed algorithm is of complexity linear in the number of examples through the use of the exponential loss.

For the optimisation of equation (8) we have employed the same iterative scaling procedure as in the classification case. We call our algorithm LinearRank; its pseudocode is shown in Algorithm 2 and its update rule ($B^{(t+1)} \leftarrow B^{(t)} + \Sigma^{(t)}$) is derived in Appendix B.

Algorithm 2: Ranking Based Trainable Extractive Summariser - LinearRank
  Input: $\bigcup_{d \in D} S_1^d \times S_{-1}^d$
  Initialise:
    Normalise each sentence vector $s$ such that $\sum_i s_i = 1$, i.e. $\forall i, s_i \in [0, 1]$
    Set the value of feature weights $B^{(0)} = (\beta_1^{(0)}, ..., \beta_n^{(0)})$ with some arbitrary values
    $t \leftarrow 0$
  repeat
    for $i \leftarrow 1$ to $n$ do
      $\beta_i^{(t+1)} \leftarrow \beta_i^{(t)} + \frac{1}{2} \log \frac{\sum_{d \in D} \frac{1}{|S_1^d| |S_{-1}^d|} \sum_{s \in S_1^d} \sum_{s' \in S_{-1}^d} e^{h(s', B^{(t)})} e^{-h(s, B^{(t)})} (1 - s'_i + s_i)}{\sum_{d \in D} \frac{1}{|S_1^d| |S_{-1}^d|} \sum_{s \in S_1^d} \sum_{s' \in S_{-1}^d} e^{h(s', B^{(t)})} e^{-h(s, B^{(t)})} (1 + s'_i - s_i)}$
    end
    $t \leftarrow t + 1$
  until convergence of $L^r_{exp}(D; B)$
  Output: $B^F$
  Create a summary for each new document $d$ by taking the $n$ first sentences in $d$ with regard to the linear combination of sentence features with $B^F$

The most similar work to ours is that of Freund et al. [], who proposed the RankBoost algorithm. In both cases the parameters of the combination are learnt by minimising a convex function. However, the main difference is that we propose here to learn a linear combination of the features by directly optimising equation (8), while RankBoost iteratively learns a nonlinear combination of the features by adaptively resampling the training data.
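A compact sketch of the LinearRank update may help to see how the pair sums in Algorithm 2 can be evaluated without enumerating pairs. It is an illustration under several assumptions, not the authors' code: the names and toy data are hypothetical, the weights start at zero, a fixed number of iterations replaces the convergence test, and the double sums are expanded into per-document sums over relevant and irrelevant sentences (an algebraic rearrangement made for this sketch).

```python
import math


def score(sentence, beta):
    """Linear ranking function h(s, B)."""
    return sum(b * x for b, x in zip(beta, sentence))


def exp_ranking_loss(documents, beta):
    """Exponential ranking loss in the factorised form of equation (8)."""
    total = 0.0
    for relevant, irrelevant in documents:
        rel = sum(math.exp(-score(s, beta)) for s in relevant)
        irr = sum(math.exp(score(s, beta)) for s in irrelevant)
        total += rel * irr / (len(relevant) * len(irrelevant))
    return total / len(documents)


def linear_rank(documents, iterations=30):
    """Apply the Algorithm 2 update for a fixed number of iterations."""
    n = len(documents[0][0][0])
    beta = [0.0] * n
    for _ in range(iterations):
        num, den = [0.0] * n, [0.0] * n
        for relevant, irrelevant in documents:
            w_rel = [math.exp(-score(s, beta)) for s in relevant]
            w_irr = [math.exp(score(s, beta)) for s in irrelevant]
            p, q = sum(w_rel), sum(w_irr)
            norm = len(relevant) * len(irrelevant)
            for i in range(n):
                p_i = sum(w * s[i] for w, s in zip(w_rel, relevant))
                q_i = sum(w * s[i] for w, s in zip(w_irr, irrelevant))
                # Pair sum of weight * (1 + s_i - s'_i), and of weight * (1 - s_i + s'_i).
                num[i] += (p * q + p_i * q - p * q_i) / norm
                den[i] += (p * q - p_i * q + p * q_i) / norm
        beta = [b + 0.5 * math.log(a / c) for b, a, c in zip(beta, num, den)]
    return beta


# Hypothetical normalised sentences: (relevant, irrelevant) per document.
documents = [
    ([[0.8, 0.2]], [[0.3, 0.7], [0.4, 0.6]]),
    ([[0.7, 0.3], [0.6, 0.4]], [[0.5, 0.5]]),
]

initial_loss = exp_ranking_loss(documents, [0.0, 0.0])
beta = linear_rank(documents)
final_loss = exp_ranking_loss(documents, beta)
print(beta, initial_loss, final_loss)
```

On this toy data the learnt weights favour the first feature, so every relevant sentence outscores the irrelevant sentences of its own document. The sketch does not verify that each step lowers the loss, which a fuller implementation would monitor as its stopping criterion.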

3 Summarising XML documents

In the following, we introduce the sentence features that we use as the input of the trainable summarisers defined in the previous section. Here, we take the logical structure of documents into account when producing summaries, as well as the content, and we learn an effective combination of features for summarisation. Although for evaluation purposes we use the INEX and SUMMAC collections, which contain scientific articles, our approach could apply to any documents formatted in XML where the logical structure is available. The summarisation of scientific texts through sentence extraction has been extensively studied in the past [33]. In our approach, we do not explicitly take advantage of the idiosyncratic nature of scientific articles, but we rather propose a generic approach that is, in essence, genre-independent. In the next section, we present the specific details of our approach.

3.1 Document features for summarisation

In this section we outline the features of XML documents that we employed in our summarisation model.

Structural features

Past work on SDS (e.g. [9, 7]) has implicitly tried to take the structure of certain document types into account when extracting sentences. In [7], for example, the leading and trailing paragraphs in a document are considered important, and the position of sentences within these paragraphs is also recorded, and used, as a feature for summarisation. In our work, we move to an explicit use of structural features by taking into account the logical structure of XML documents. Our aim here is to investigate more precisely from which component of a document the summary is more likely to be generated. The structural features we use in our approach are:

1. The depth of the element in which the sentence is contained (e.g. section, subsection, subsubsection, etc.).
2. The sibling number of the element in which the sentence is contained (e.g. 1st, middle, last).
3. The number of sibling elements of the element in which the sentence is contained.
4. The position in the element of the paragraph in which the sentence is contained (e.g. first, or not).

These features are generic, and can be applied to an entire document, or to components at any level of the XML tree that can be meaningfully summarised (i.e. components not too small to be summarised). These are just some of the features that can be used for modeling structural information; many of them have been considered, for example, in XML retrieval approaches (see [3]).

Content features

Terms contained in the title of a document have long been recognised as effective features for automatic summarisation [9]. Our basic content-only query (COQ) comprises terms in the title of the document (Title query), as well as the title keywords augmented by the most frequent terms in the document (up to 10 such terms) (Title-MFT query). The rationale of these approaches is that these terms should appear in sentences that are worth including in summaries. The importance of title terms for SDS can also be extended to components of finer granularity (e.g. sections, subsections, etc.), by using the title of the document to find relevant sentences within any component, or, where appropriate, by using meaningful titles of components.

Since the Title query may be very short, sentences similar to the title which do not contain title keyword terms will have a null similarity measure with the Title query. To overcome this problem we have employed query-expansion techniques such as Local Context Analysis (LCA) [37] or thesaurus expansion methods (i.e. WordNet [0]), as well as a learning-based expansion technique. These three expansion techniques are described next.

Expansion via WordNet and LCA

From the Title query, we formed two other queries, reflecting local links between the title keywords and other words in the corresponding document:

Title-LCA query: includes keywords in the title of a document and the words that occur most frequently in sentences that are most similar to the Title query according to the cosine measure.
Title-WN query: includes expanded title keywords and all their first order synonyms using WordNet.

We use the cosine measure in order to compute a preliminary score between any sentence of a document and these four queries (Title, Title-MFT, Title-LCA, Title-WN). The scoring measure doubles the cosine score of sentences containing acronyms (e.g. HMM (Hidden Markov Models), NLP (Natural Language Processing)), or cue-terms, e.g. "in this paper", "in conclusion", etc. The use of acronyms and cue phrases in summarisation has been emphasised in the past by [9, 7].

Learning-based expansion technique

We also include two queries by forming word clusters in the document collection. This is another source of information about the relevance of sentences to summaries.
It is a more contextual approach compared to the title-based queries, as it seeks to take advantage of the co-occurrence of terms within sentences all over the corpus, as opposed to the local information provided by the title-based queries.

We form different term-clusters based on the co-occurrence of words in the documents of the collection. For discovering these term-clusters, each word $w$ in the vocabulary $V$ is first characterised as a vector $w = \langle n(w, d) \rangle_{d \in D}$ representing the number of occurrences of $w$ in each document $d \in D$ [4]. Under this representation, word clustering is performed using the Naive-Bayes clustering algorithm maximising the Classification Maximum Likelihood criterion [3, 29]. We have arbitrarily fixed the number of clusters to $|V| / 100$.

From these clusters, we first expand the title query by adding words which are in the same word-clusters as the title keywords. We denote this novel query by Extended concepts with word clusters query. Second, we represent each sentence in a document, as well as the document title, in the space of word-clusters as vectors containing the number of occurrences of words in each word-cluster in that sentence, or document title. We refer to this vector representation of document titles as Projected concepts on word clusters queries. The first approach (Extended concepts with word clusters) is a query expansion technique similar to those described above using WordNet or LCA. The second approach is a projection technique, closely related to Latent Semantic Analysis [8]. Table 1 shows some word-clusters found for the SUMMAC data collection; it can be seen from this example that each cluster can be associated with a general concept.

Word-Clusters
Cluster i: transduction language grammar set word information model number words rules rule lexical
Cluster j: tag processing speech recognition morphological korean morpheme

Table 1. An example of term clusters found for the SUMMAC data collection.

3.2 Related Work

Few researchers have investigated the summarisation of information available in XML format. In [], the work focuses on retaining the structure of the source document in the summary. A textual summary of a document is created by using lexical chains. The textual summary is then combined with the overall structure of the document with the aim of preserving the structure of the original document and of superimposing the summary on that structure. In [26], the idea of generating semantic thumbnails (essentially summaries) of documents in XML format is suggested. The authors propose to utilise the ontologies embedded in XML and RDF documents in order to develop the semantic thumbnails. Litkowski [20] has used some discourse analysis of XML documents for summarisation. In some other work [6], the tree representation of XML documents is used to generate tree structural summaries; these are summaries that focus on the structural properties of trees and do not correspond to summaries in the conventional sense of the term as used in IR research.
Operations such as nesting and repetition reduction in the XML trees are used. In the above approaches, features pertaining to the logical structure of XML documents are not taken into account when producing summaries. Structural clues are used by work on summarisation of other document types, e.g. s [8], or technical documents [36]. In these summarisation approaches, known features of the structure of documents are exploited in order to produce summaries (e.g. the presence of a FAQ, or a question/answer section in technical documents).
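The four structural features listed in Section 3.1 can be read off an XML tree with a standard parser. The following sketch uses Python's `xml.etree.ElementTree`; the tag names (`article`, `sec`, `ss1`, `p`), the sample document and the exact definitions (depth counted from the root, sibling count excluding the element itself) are assumptions made for illustration and are not taken from the INEX or SUMMAC markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical article; the tag names are illustrative only.
SAMPLE = """<article>
  <sec><p>First paragraph.</p><p>Second paragraph.</p></sec>
  <sec><ss1><p>Nested paragraph.</p></ss1></sec>
</article>"""


def structural_features(root, paragraph_tag="p"):
    """Return one feature record per paragraph of the document."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}

    def depth(node):
        levels = 0
        while node in parent:
            node = parent[node]
            levels += 1
        return levels

    records = []
    for paragraph in root.iter(paragraph_tag):
        element = parent[paragraph]  # element containing the paragraph
        siblings = list(parent[element]) if element in parent else [element]
        paragraphs = [c for c in element if c.tag == paragraph_tag]
        records.append({
            "text": paragraph.text,
            "depth": depth(element),                         # feature 1
            "sibling_number": siblings.index(element) + 1,   # feature 2
            "num_siblings": len(siblings) - 1,               # feature 3
            "first_paragraph": paragraphs[0] is paragraph,   # feature 4
        })
    return records


features = structural_features(ET.fromstring(SAMPLE))
for record in features:
    print(record)
```

Every sentence of a paragraph would inherit that paragraph's record; since the summarisers assume positive, normalised feature values, these raw values would still need to be encoded and rescaled before training.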

4 Experiments

In our experiments we used two datasets: the INEX [3] and SUMMAC [28] test collections. For each dataset, we carried out evaluation experiments for testing (a) the query expansion effect, (b) the learning effect and the best learning scheme for SDS between classification and ranking, and (c) the effect of structure features. For point (b), we tested the performance of a linear scoring function learnt with a ranking and a classification criterion. The combination weights of the scoring function are learnt via the logistic model, optimising the ranking criterion (8) by the LinearRank algorithm (Algorithm 2) and the classification criterion (3) using Algorithm 1. Furthermore, in order to evaluate the effectiveness of learning a linear combination of sentence features for SDS under the ranking framework, we compared the performance of the LinearRank algorithm and the RankBoost algorithm [], which learns a non-linear combination of features. To measure the effect of structure features, we have learnt the best learning algorithm using COQ features alone, and using COQ features together with the structure features.

4.1 Datasets

We used version 1.4 of the INEX document collection. This version consists of 12,107 articles of the IEEE Computer Society's publications, from 1995 to 2002, totaling 494 megabytes. It contains over 8.2 million element nodes of varying granularity, where the average depth of a node is 6.9 (taking an article as the root of the tree). The overall structure of a typical article consists of a front matter (containing e.g. title, author, publication information and abstract), a body (consisting of e.g. sections, sub-sections, sub-sub-sections, paragraphs, tables, figures, lists, citations) and a back matter (including bibliography and author information).

The SUMMAC corpus consists of 183 articles.
Documents in this collection are scientific papers which appeared in ACL (Association for Computational Linguistics) sponsored conferences. The collection has been marked up in XML by automatically converting the LaTeX version of the papers to XML. In this dataset the markup includes tags covering information such as title, authors or inventors, etc., as well as basic structure such as abstract, body, sections, lists, etc.

We removed documents from the INEX dataset that do not possess title keywords or an abstract. From the SUMMAC dataset, we removed documents whose title contained non-informative words, such as a list of proper names. From each dataset, we also removed documents having extractive summaries (as found by Marcu's algorithm, see Section 4.2) composed of one sentence only, arguing that a single sentence is not sufficient to summarise a scientific article. In our experiments, we used in total 161 documents from the SUMMAC and 4,446 documents from the INEX collections.

We extracted the logical structure of XML documents using freely available structure parsers. Documents are tokenised by removing words in a stop list, and sentence boundaries within each document are found using the morpho-syntactic tree tagger program [35]. In Table 2, we show some statistics about the two document collections used, about the abstracts provided with the two collections, and about the extracts that were created using Marcu's algorithm, as well as the training/test splits for each dataset (in all experiments the sizes of the training and test sets are kept fixed). Both datasets have roughly the same characteristics of sentence distribution in the articles and summaries. The summary length, in number of sentences, is approximately 9 and 6 on average for the SUMMAC and INEX collections respectively.

Table 2. Data set properties.

Data set comparison
Source                                        SUMMAC       INEX
Number of docs                                161 (183)    4446 (12107)
Training/Test splits                          80/81        1000/3446
Total # of sentences in the collection
Average # of sentences per doc
Maximum # of sentences per doc
Minimum # of sentences per doc                6            2
Average # of words per sentence
Size of the vocabulary
Average extract size (in # of sentences)
Maximum # of sentences per extract
Minimum # of sentences per extract            2            3
Average abstract size (in # of sentences)
Maximum # of sentences per abstract
Minimum # of sentences per abstract           4            5

4.2 Experimental Setup

We assume that for each document, summaries will only include sentences between the introduction and the conclusion of the document. A compression ratio must be specified for extractive summaries. For both datasets we followed the SUMMAC evaluation by using a 10% compression ratio [28]. To obtain sentence-based extract summaries for all articles in both datasets, for training and evaluation purposes, we need gold summaries. Having humans extract such reference summaries is not feasible for large datasets. To overcome this restriction we use in our experiments the author-supplied abstracts that are available with the original articles, and apply an algorithm proposed by Marcu [22] in order to generate extracts from the abstracts. This algorithm has shown a high degree of correlation to sentence extracts produced by humans. We therefore evaluate the effectiveness of our learning algorithm on the basis of how well it matches the automatic extracts. The learning algorithms take as input the set of features defined in Section 3.1.
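The idea of deriving a gold extract from an author-supplied abstract can be illustrated with a small greedy procedure. The sketch below is only in the spirit of that idea and is not Marcu's published algorithm [22]: it assumes a bag-of-words representation and cosine similarity, and it repeatedly drops the sentence whose removal most increases the similarity between the remaining sentences and the abstract. The function names (`bag_of_words`, `cosine`, `greedy_extract`) are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts; a deliberately crude text representation (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter-based word vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_extract(sentences, abstract):
    """Drop sentences one at a time while similarity to the abstract improves.

    Returns the indices (in document order) of the sentences that are kept.
    """
    target = bag_of_words(abstract)
    vectors = [bag_of_words(s) for s in sentences]
    kept = list(range(len(sentences)))

    def similarity(indices):
        total = Counter()
        for i in indices:
            total.update(vectors[i])
        return cosine(total, target)

    best = similarity(kept)
    while len(kept) > 1:
        # Score every possible single removal and keep the most helpful one.
        candidates = [(similarity([j for j in kept if j != i]), i) for i in kept]
        score, victim = max(candidates)
        if score <= best:
            break  # no removal improves the match with the abstract
        best = score
        kept.remove(victim)
    return kept
```

With such a procedure, the kept sentences would serve as the desired outputs (label 1) and all other sentences as negative examples when training a sentence scorer.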
Each sentence in the training set is represented as a feature vector, and the algorithms are learnt based on this input representation and the extracted summaries found by Marcu's algorithm [22], which were used as desired outputs. For all the algorithms, on each dataset, we have generated precision and recall curves to measure the query expansion and learning effects. Precision and recall are computed as follows:

\[ \text{Precision} = \frac{\#\ \text{of sentences in the extract and also in the gold standard}}{\text{total}\ \#\ \text{of sentences in the extract}} \]

\[ \text{Recall} = \frac{\#\ \text{of sentences in the extract and also in the gold standard}}{\text{total}\ \#\ \text{of sentences in the gold standard}} \]

Precision and recall values are averaged over 10 random splits of the training/test sets. We have also measured the break-even point at 10% compression ratio for the 3 learning algorithms and the best COQ feature (Table 3).

[Figure 2: two precision-recall plots, "Inex dataset - COQ features" (top) and "Summac dataset - COQ features" (bottom); x-axis Recall, y-axis Precision; curves: Extended concepts with word-clusters, Projected concepts on word-clusters, Title-LCA, Title, Title-WN, Title-MFT.]

Fig. 2. Precision-Recall curves at 10% compression ratio for the COQ features on INEX (top) and SUMMAC (bottom) datasets. Each point represents the mean performance for 10 cross-validation folds. The bars show standard deviations for the estimated performance.

5 Analysis of Results

We examine the results from three viewpoints: in Section 5.1 we present the effectiveness of each of the content-only queries (COQ) alone, as well as the query expansion effect; in Section 5.2 we examine the performance of the three learning algorithms; and in Section 5.3 we look into the effectiveness of our summarisation approach for XML documents.

5.1 Query expansion effects

In Figure 2, we present the precision and recall graphs showing the effectiveness of content-only features for SDS without the learning effect (i.e. by using each content feature individually to rank the sentences). The order of effectiveness of the features seems to be consistent across the two datasets: extended concepts with word clusters are the most effective, followed by projected concepts on word clusters and title with local context analysis. Title with the most frequent terms in the document is the least effective feature in both cases. The high effectiveness obtained with word clusters (extended and projected concepts with word clusters) demonstrates that the contextual approach investigated here is effective and should be further exploited for SDS.

5.2 Learning algorithms

In Figure 3, we present the precision and recall graphs obtained through the combination of content and structure features for the two datasets when using the three learning algorithms.
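The precision and recall values plotted in these graphs follow the definitions given in Section 4.2. A minimal sketch of how they could be computed for one document is shown below, assuming sentences are identified by their index and the extract is formed by keeping the best-scoring share of sentences. The function names and the rounding convention for the sentence budget are assumptions, not details taken from the paper.

```python
def top_sentences(scores, compression_ratio=0.10):
    """Indices of the best-scoring sentences, keeping a fixed share of the document.

    The budget is rounded to the nearest whole sentence (assumed convention)
    and is never smaller than one sentence.
    """
    budget = max(1, round(compression_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:budget])


def precision_recall(extract, gold):
    """Precision and recall of an extract against a gold-standard extract.

    Both arguments are sets of sentence indices.
    """
    overlap = len(extract & gold)
    precision = overlap / len(extract) if extract else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall
```

Varying the compression ratio (or, equivalently, the number of sentences kept) and averaging the resulting pairs over documents and splits would trace out a precision-recall curve of the kind shown in the figures.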
For comparison, we display the Precision-Recall curves obtained for the best CO feature (Extended concepts) together with those obtained from the learning algorithms. A first result is that the combination of features by learning outperforms each feature alone. The results also show that the two ordering algorithms are more effective on both datasets than the logistic classifier. This finding corroborates the justification given in Section 2.3. When comparing the two ordering algorithms, we see that Algorithm 2 (LinearRank) slightly outperforms the RankBoost algorithm for low recall values. Since both ordering algorithms optimise the same criterion (equation 8), the difference in performance can be explained by the class of functions that each algorithm learns. The RankBoost algorithm outputs a non-linear combination of the features, while with the LinearRank algorithm we obtain a linear combination of these features. As the space of features is small, the non-linear RankBoost model has low bias and high variance and hence tends to overfit the data. We have noticed this effect in both test collections by comparing Precision and Recall curves for RankBoost on the test and the training sets.

Our experimental results suggest that a ranking criterion is better suited to the SDS task than a classification criterion. Moreover, a simple logistic model performs better than a non-linear algorithm and, depending on the implementation, can be significantly faster to train than RankBoost. This leads to the conclusion that such a linear model, i.e. one optimising equation (8), can be a good choice for learning a summariser, in particular when considering structural features.

[Figure 3: two precision-recall plots, "Inex dataset - Learning effect" (top) and "Summac dataset - Learning effect" (bottom); x-axis Recall, y-axis Precision; curves: Combining COQ and SF - LinearRank, Combining COQ and SF - RankBoost, Combining COQ and SF - Logistic Classifier, Combining COQ features - LinearRank, Extended Concepts.]

Fig. 3. Precision-Recall curves at 10% compression ratio for the learning effects on INEX (top) and SUMMAC (bottom) datasets.

5.3 Summarisation effectiveness

Looking at the data in Figure 3 to compare the effectiveness of the summariser with different features, one can note that the combination of content and structure features yields greater effectiveness than the use of content features alone. This result seems to hold equally for both document sets for most recall points. In terms of break-even points (Table 3), the increase in effectiveness is approximately 3% for the RankBoost and LinearRank algorithms in both data sets (see footnote 4). This provides evidence that the use of structural features improves the effectiveness of the task of SDS. Note that, since the structural features we considered here are discrete, ordering sentences with respect to the different structural components alone was not possible.
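The break-even points referred to above are the points at which precision equals recall. One common way to obtain such a point for a ranked list of sentences, which may differ from the exact procedure used for Table 3, relies on the fact that the two measures share the same numerator: they coincide when the extract contains exactly as many sentences as the gold standard. The function name `break_even_point` is hypothetical.

```python
def break_even_point(scores, gold):
    """Precision (= recall) when the extract is as long as the gold-standard extract.

    `scores` holds one score per sentence and `gold` is the set of indices of
    the gold-standard sentences.
    """
    if not gold:
        return 0.0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    extract = set(ranked[:len(gold)])
    return len(extract & gold) / len(gold)
```

Averaging this value over the test documents of each split gives a single number per method, which is convenient for comparing feature sets in a table.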
Training the learning models using only these features did not provide significant results either (we chose not to display these results as they were not informative). The fact that structural features increase the performance of the learning models when they are added to CO features is, in our opinion, due to the non-redundant information that structural features provide compared to CO features.

Table 3. Break-even points at 10% compression ratio for learning algorithms and the best COQ feature: Extended title keywords with word-clusters. Each value represents the mean performance for 10 cross-validation folds.

Break-even points (%)
Data sets   Best COQ   Classifier              RankBoost               LinearRank
                       COQ features   COQ+SF   COQ features   COQ+SF   COQ features   COQ+SF
SUMMAC
INEX

Footnote 4: The same performance increase is also obtained from the classifier.

From the set of structure features used in our experiments (Section 3.1), the depth of the sentence's component and the position within the component of the paragraph containing summary sentences (i.e. whether or not it is the first paragraph of a component) got the highest weights with both ranking algorithms. Any sentence in the first paragraph of any first sections of a document, containing relevant COQ features, thus got high scores. In our experiments, these two structural features were the most effective for SDS. It is well known that, in scientific articles, sentences in the first parts of sections such as Introduction and Conclusions are useful for summarisation purposes [9, 7]. Our results agree with this, as the increased weights for the paragraph's position in a component suggest. The features corresponding to the position of elements with respect to their siblings are less effective than depth and paragraph position, but features indicating the position of an element as the first or the last sibling have a higher impact than when the element was a middle sibling. We should also note that the feature corresponding to the number of siblings of an element was the least conclusive in all of our experiments; its utility seemed to depend highly on the dataset.

For the specific case of scientific text, from the set of structure features used, a set of features which is known to be effective was weighted higher by our summarisation method. One way to view this result is that our method correctly identified features that are known to be effective for this document genre, and therefore has the potential to perform equally well in other document genres. This, in turn, can be seen as an indication that the use of structure features could be applied to document collections of different genres. The availability of suitable document collections containing different document types will be necessary in order to test this assertion.

By looking at the data in Table 3 (and Figures 2 and 3), one can note that effectiveness when using the INEX collection is always lower than when using the SUMMAC collection.
This difference in effectiveness can be attributed to the different characteristics of the two datasets. The INEX collection contains many more documents than SUMMAC, and is also a more heterogeneous dataset. In addition, the logical structure of INEX documents is more complex than that of the SUMMAC collection. These factors are likely to cause the small difference in effectiveness between the two collections.

6 Discussion and conclusions

The results presented in the previous section are encouraging in relation to our two main motivations: a novel learning algorithm for SDS, and the inclusion of structure features, in addition to content features, for the summarisation of XML documents. In terms of the algorithms, it was shown that using the same logistic model, but choosing a ranking criterion instead of a classification one, leads to a notable performance increase. Moreover, compared to RankBoost, the LinearRank algorithm performs better and also has the potential to be implemented in a simpler manner. This property may make the latter algorithm an effective and efficient choice for the task of SDS.

In terms of the summarisation of XML documents by using content and structure features, the results demonstrate that for both datasets, the inclusion of structural features improves the effectiveness of learning algorithms for SDS. The improvements are not dramatic, but they are consistent across both datasets and across most recall points. This consistency suggests that the inclusion of features from the logical structure of XML documents is effective.
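The difference between a classification criterion and a ranking criterion for the same linear scoring function can be made concrete with a small sketch. The exact criteria used in the paper (equations 3 and 8) are not reproduced in this section, so the logistic surrogates below are assumptions chosen for illustration, and all function names are hypothetical. The classification loss judges each sentence on its own, whereas the ranking loss only looks at whether summary sentences outscore non-summary sentences of the same document.

```python
import math


def score(weights, features):
    """Linear scoring function: weighted sum of a sentence's feature values."""
    return sum(w * x for w, x in zip(weights, features))


def classification_loss(weights, sentences, labels):
    """Mean logistic loss over individual sentences (label 1 = in the summary)."""
    total = 0.0
    for x, y in zip(sentences, labels):
        margin = score(weights, x) if y == 1 else -score(weights, x)
        total += math.log1p(math.exp(-margin))
    return total / len(sentences)


def ranking_loss(weights, sentences, labels):
    """Mean logistic loss over (summary, non-summary) sentence pairs of one document."""
    positives = [x for x, y in zip(sentences, labels) if y == 1]
    negatives = [x for x, y in zip(sentences, labels) if y != 1]
    if not positives or not negatives:
        return 0.0  # no pair to order
    total = 0.0
    for p in positives:
        for n in negatives:
            # Penalise pairs in which the summary sentence does not clearly outscore the other.
            total += math.log1p(math.exp(score(weights, n) - score(weights, p)))
    return total / (len(positives) * len(negatives))
```

One consequence of the pairwise form is that adding the same constant to every sentence score leaves the ranking loss unchanged while it can change the classification loss considerably, which matches the intuition that an extractive summariser only needs the relative order of sentences within a document.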

The ultimate aim of our approach for the summarisation of XML documents is to produce summaries for components at any level of granularity (e.g. section, subsection, etc.). The content and structure features that we presented in Section 3.1 can be applied to any level of granularity. For example, the depth of an element, the sibling number of the element in which a sentence is contained, the number of siblings of the element in which the sentence is contained, and the position in the element of the paragraph in which the sentence is contained (i.e. the structure features in Section 3.1) can be applied to entire documents, sections, subsections, etc. Essentially, they can be applied to any XML element that can be meaningfully summarised, i.e. that is informative and long enough to make its summarisation meaningful [3]. In particular, the most effective content features (expanded concepts with word clusters and projected concepts on word clusters) and structure features (depth of element and position of paragraph in the element) can be applied to various granularity levels within an XML tree. The effectiveness of such an approach, however, cannot be tested until datasets with human-produced summaries, or summary extracts, at component level become available. We should also note that we focus on generic (rather than query-biased) summaries for evaluation purposes, but the proposed model can be applied to both types of summarisation.

In Section 5.3 we mentioned that the results provide us with some indication that the use of structural features can also be effective for summarising XML documents from datasets containing documents other than scientific articles. One possible direction for future research would therefore be to examine this issue in more detail, and to identify appropriate datasets of non-scientific XML data for summarisation. The list of structural features that we used in this study is short, so a larger variety of features could be investigated.
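To make the structure features named above more tangible, the following sketch derives them for every paragraph of a small XML tree using Python's standard `xml.etree.ElementTree` module. It is an illustration under stated assumptions rather than the feature extractor used in the experiments: the tag names (`article`, `sec`, `ss1`, `p`), the function name `structure_features`, and the exact conventions (1-based sibling numbers, sibling counts that include the element itself, the root treated as an only child) are all hypothetical.

```python
import xml.etree.ElementTree as ET


def structure_features(root, paragraph_tag="p"):
    """Return one feature dict per paragraph, in document order.

    For each paragraph, the features describe the element that contains it:
    its depth below the root, its position among its siblings, the number of
    siblings, and whether the paragraph is the first one in that element.
    """
    rows = []

    def walk(element, depth, sibling_number, sibling_count):
        first_seen = False
        for index, child in enumerate(element, start=1):
            if child.tag == paragraph_tag:
                rows.append({
                    "text": "".join(child.itertext()).strip(),
                    "depth": depth,
                    "sibling_number": sibling_number,
                    "sibling_count": sibling_count,
                    "first_paragraph": not first_seen,
                })
                first_seen = True
            else:
                # Descend into nested components (sections, subsections, ...).
                walk(child, depth + 1, index, len(element))

    walk(root, 0, 1, 1)
    return rows
```

Because the traversal can start from any element, the same function could be called on a section or subsection instead of the whole article, which mirrors the point that these features apply at any level of granularity.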
When moving to document collections of different types, it will be worthwhile to investigate whether useful structural features can be derived automatically, e.g. by looking at a collection's DTD.

Some further interesting issues that arise when considering summarisation at any structural level relate to the choice of the appropriate components to be summarised. For example, it may be unrealistic to provide summaries of very small components, or of components that are not informative enough. One of the main research issues in XML retrieval is to define and understand what a meaningful retrieval unit is [3]. One direction to follow would be to conduct a user study to observe what kinds of XML elements searchers would prefer to see in a summarised version after the initial retrieval. Some initial investigation can be found in [30, 3], where results indicate a positive correlation between element probability of relevance, length and user preference to see summary information. Further research in this direction is currently underway.

Looking at the results of this study as a whole, we can say that the work presented here achieved its main aim: to effectively summarise XML documents by combining content and structure features using novel machine learning approaches. Both datasets that we used contain scientific articles, which have some inherent characteristics that may simplify the task of SDS. This work has, however, a greater impact, as we believe that it can be applied to datasets containing documents of other types. The availability of XML data will continue to increase as, for example, XML is becoming the W3C standard for representing documents (e.g. in digital libraries where content