Identifying Citing Sentences in Research Papers Using Supervised Learning

Kazunari Sugiyama, Department of Computer Science, National University of Singapore, Singapore, sugiyama@comp.nus.edu.sg
Tarun Kumar, Indian Institute of Information Technology, Allahabad, India, tar.ta@gmal.com
Min-Yen Kan, Department of Computer Science, National University of Singapore, Singapore, kanmy@comp.nus.edu.sg
Ramesh C. Tripathi, Indian Institute of Information Technology, Allahabad, India, rctripathi@iiita.ac.in

Abstract

Researchers have largely focused on analyzing citation links from one scholarly work to another. The sentences that carry such citations, which we call citing sentences, are an important part of the narrative in a research article. If we can automatically identify such sentences, we can devise an editor that suggests whether a particular piece of text needs to be backed up with a citation. In this paper, we propose a method for identifying citing sentences by constructing a classifier using supervised learning. Our experiments show that simple language features such as proper nouns and the labels of the previous and next sentences are effective features for identifying citing sentences.

Keywords: information retrieval; digital library; discourse processing; citation analysis

I. INTRODUCTION

When we write research papers or articles, we often make references to previous works in our own research field. Citations serve various purposes: as evidence for claims and as an acknowledgment of others' work, among other functions. We term sentences that contain such references "citing sentences". These statements are usually followed by a pointer to the full reference located at the end of a paper in a Reference or Bibliography section. Our aim in this work is to identify whether a sentence in a paper needs a citation or not. For example, consider the following two sentences:

Sentence 1: "We want to build a system which can help in finding a person's job easily."
Sentence 2: "The HSQL project was led by the Swedish State Bureau with participants from Sweden, Denmark, Finland, and Norway."
In the above two statements, Sentence 2 needs a citation because it refers to previous work, namely, the HSQL project. On the other hand, most people would agree that Sentence 1 does not need a citation because it does not describe previous work. In this paper, we propose a method¹ for identifying sentences that require citations, such as Sentence 2 above. While citing sentences are often trivially marked with a citation marker (e.g., "[5]" or "(Brown, 1990)"), an important distinction in our problem is that we consider detecting such a sentence when such markers are not present. We show that our approach, which constructs a supervised classifier from simple features, achieves a high level of accuracy.

¹ This work is supported by a Media Development Authority (MDA) grant "Interactive Media Search", R

Compared with other work on the analysis of citation information [1, 2], to the best of our knowledge, the work presented here is the first to identify citing sentences using natural language analysis. Such a module is useful in a smart authoring environment, which can help authors by suggesting whether statements made in the paper draft need citations or not. Such a system takes a research paper as input and identifies the statements that need a citation. It can also be used at the time of writing a paper, by suggesting to authors whether the current statement needs a citation.

This paper is organized as follows: In Section II, we review related work on analyzing citations in research papers and briefly describe the two classifiers we use in our experiments. In Section III, we describe our approach to distinguishing between sentences that require citations and those that do not. In Section IV, we present the experimental results for evaluating our approach. Finally, we conclude the paper with a summary and directions for future work in Section V.

II. RELATED WORK

We first review related work on scholarly citation analysis, and then survey the fundamental background of the two state-of-the-art classifiers, maximum entropy and support vector machines, that we employ later in our experiments.

A. Citation Analysis of Research Papers

Citation information has been used for information retrieval since the early stages of this field. As far as we know, there are two types of research in the field of citation analysis of research papers: (1) citation count to evaluate the impact of scientific papers, and (2) citation context analysis.

Citation count is widely used in evaluating the importance of a paper because it is strongly correlated with academic document impact [3]. The Thomson Scientific ISI Impact Factor (ISI IF) is the representative approach using citation count [4], which factors citation count with a moving window to calculate the impact of certain publication venues. The advantages of citation count are (i) its simplicity in computation; and (ii) its proven track record in deployment in scientometric applications. However, citation count has well-known limitations: citing papers with high impact and ones with low impact are treated equally in standard citation count.

In order to overcome this problem, many recent works have employed the notion of PageRank [5] to better weight and control for the influence of papers of differing impact [6, 7, 8, 9, 10].

Citation information is statistical in nature, and many researchers have focused on this characteristic. For example, Kessler [11] proposed the notion of bibliographic coupling, where two documents are said to be coupled if they share one or more references. Small [12] proposed a complementary method, termed co-citation analysis, where the similarity between documents A and B is measured by the number of documents that cite both A and B. In addition, researchers have also focused on the potential usefulness of the text associated with citations in specific applications, such as text summarization [13, 14], thesaurus construction [15], and information retrieval [16, 17].

B. Maximum Entropy (ME)

The framework of maximum entropy [18] has already been widely used for a variety of natural language tasks such as prepositional phrase attachment [19], language modeling [20, 21], part-of-speech tagging [22], and text segmentation [23]. Maximum entropy has been shown to be an effective and competitive algorithm in these domains.

Statistical modeling constructs a model that best accounts for some training data. Specifically, for a given empirical probability distribution $\tilde{p}$, a model $p$ is built to result in a distribution as close to $\tilde{p}$ as possible. Given a set of training data, there are numerous ways to choose a model $p$ that accounts for the data. It can be shown that the probability distribution defined by Equation (1) is the one that is closest to $\tilde{p}$ in the sense of Kullback-Leibler divergence, when subjected to a set of feature constraints:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{k} \lambda_i f_i(x, y) \right), \tag{1}$$

where $p(y \mid x)$ denotes the conditional probability of predicting a class $y$ on seeing the context $x$, $f_i(x, y)$ $(i = 1, \ldots, k)$ are feature functions, and $\lambda_i$ $(i = 1, \ldots, k)$ are the weighting parameters for $f_i(x, y)$.
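As a rough numerical illustration of Equation (1), the sketch below scores one context against two classes. The feature functions, the weight values, and the helper name `maxent_probability` are assumptions made purely for this example; they are not taken from the paper or from any ME toolkit.

```python
import math

def maxent_probability(x, labels, features, weights):
    """Return {y: p(y | x)} for the log-linear model of Equation (1)."""
    # Unnormalized score exp(sum_i lambda_i * f_i(x, y)) for each class y.
    scores = {
        y: math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
        for y in labels
    }
    z = sum(scores.values())  # Z(x): makes the probabilities sum to one
    return {y: s / z for y, s in scores.items()}

# Hypothetical binary feature functions f_i(x, y) over a tokenized sentence x.
features = [
    lambda x, y: 1.0 if y == "citing" and "project" in x else 0.0,
    lambda x, y: 1.0 if y == "non-citing" and "we" in x else 0.0,
]
weights = [1.2, 0.8]  # illustrative lambda_i values, not trained

sentence = ["the", "hsql", "project", "was", "led", "by", "the", "bureau"]
probs = maxent_probability(sentence, ["citing", "non-citing"], features, weights)
print(probs)
```

Only the first feature fires for this sentence, so the "citing" class receives the larger share of the probability mass once the scores are divided by the normalizer.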
In Equation (1), $k$ is the number of features and $Z(x)$ is a normalization factor that ensures that the $p(y \mid x)$ scores sum to one and reflect true probabilities. This maximum entropy model represents evidence with binary functions known as contextual predicates in the form:

$$f_{cp, y'}(x, y) = \begin{cases} 1 & \text{if } y = y' \text{ and } cp(x) = \text{true} \\ 0 & \text{otherwise} \end{cases}$$

where $cp$ is the contextual predicate that maps a pair of outcome $y$ and context $x$ to {true, false}.

The human expert can choose arbitrary feature functions in order to reflect the characteristics of the problem domain as faithfully as possible. The ability to freely incorporate problem-specific knowledge in terms of feature functions gives ME models an advantage over other learning paradigms, which often suffer from strong feature independence assumptions (e.g., in the case of the naïve Bayes classifier).

Once a set of features is chosen by the human expert, the corresponding maximum entropy model can be constructed by adding features as constraints to the model and iteratively adjusting the weights of these features automatically to best reflect the training data. Formally, we require that:

$$E_{\tilde{p}}\langle f_i \rangle = E_{p}\langle f_i \rangle,$$

where $E_{\tilde{p}}\langle f_i \rangle = \sum_{x, y} \tilde{p}(x, y) f_i(x, y)$ is the empirical expectation and $E_{p}\langle f_i \rangle$ is the expectation with respect to the model distribution $p$. Among all the models subjected to these constraints, a unique solution exists that preserves the uncertainty in the original constraints and does not add any extra bias to the solution; this is the maximum entropy solution obtained by the training procedure. Given an exponential model with $k$ features and a set of training data (empirical distribution), we need to find the associated weight for each of the $k$ features to maximize the model's log-likelihood:

$$L(p) = \sum_{x, y} \tilde{p}(x, y) \log p(y \mid x).$$

It is important to select an optimal model subjected to the given constraints from the log-linear family. There are three well-known algorithms for estimating the parameters of ME models of the form of Equation (1): Generalized Iterative Scaling [24], Improved Iterative Scaling [25], and Limited Memory Variable Metric [26].

C. Support Vector Machine (SVM)

Support Vector Machine (SVM) [27] has many desirable qualities that make it one of the most popular algorithms. It not only has a solid theoretical foundation, but also classifies more accurately than most other algorithms in many applications such as Web page classification and bioinformatics tasks. Given a training set of instance-label pairs $(x_i, y_i)$ $(i = 1, \ldots, l)$, where $x_i \in R^n$ is a training vector and $y_i \in \{1, -1\}$ is its class label, an SVM finds a linear separating hyperplane with the maximal margin as a solution to the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0. \tag{2}$$

As the original problem may not be linearly separable, $x_i$ can be mapped into a higher dimensional space by a function $\phi$. Then, SVM finds a linear separating hyperplane with the maximal margin in this higher dimensional space. $C > 0$ is the penalty parameter of the error term. The kernel function, $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$, can take differing forms; linear, polynomial, radial basis, and sigmoid functions are often used. The SVM depends on the mapped vectors only through the kernel function, so if the kernel values can be computed directly, the explicit inner product in the higher dimensional space need not be calculated, which greatly reduces the computational cost.
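The point about avoiding the explicit product in the higher dimensional space can be checked on a small case. The sketch below is an illustration only: the degree-2 polynomial kernel, its feature map `phi`, and the `gamma` value of the radial basis kernel are choices made for this example, not settings reported in the paper.

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D vector (illustration only)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever building phi(x)."""
    return dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel exp(-gamma * ||x - z||^2); gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # inner product in the mapped space
implicit = poly2_kernel(x, z)    # same quantity from the original space
print(explicit, implicit, rbf_kernel(x, z))
```

The two polynomial values agree up to floating-point rounding, which is the property that lets an SVM work in the mapped space while only evaluating the kernel on the original vectors.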

Figure 1. Overview of our system.

III. PROPOSED METHOD

Figure 1 illustrates our proposed system, which consists of the following four parts. We detail each of these steps in turn.
(1) Constructing proper training and test data sets,
(2) Extracting the appropriate features from the data,
(3) Constructing the classifier,
(4) Classifying sentences as citing or non-citing.

(1) Constructing proper training and test data sets

For our experiments, we utilize the Association for Computational Linguistics standard research article corpus, the ACL Anthology Reference Corpus (ACL ARC), discussed in Section IV. We first remove the Reference section and stop words² from each paper. We define a sentence that contains citing information as a positive instance, and a sentence that does not as a negative instance. Our idea is to use sentences that have existing citations as training data, by removing the citation marker. If a citation marker is found via heuristic rules, we remove it. We then divide our dataset into ten equal-sized parts, to be used as cross validation folds. As we perform 10-fold cross validation, each run uses 90% of the data for training and the remaining 10% for testing, and we repeat the experimental process below 10 times to obtain our final evaluation results.

² List of 571 words obtained from ftp://ftp.cs.cornell.edu/pub/smart/english.stop

(2) Extracting the appropriate features from the data

In this module, we extract features from each sentence in order to construct the classifier later. The features we extracted are as follows:

a) Unigram - Unigram features include the words contained in the sentences. After removing the stop words from the sentences, each word appearing in the sentence serves as a binary feature, turned on for the individual sentence. Unigrams are an important class of features, as certain types of words tend to appear more frequently in citing or non-citing sentences. For example, the sentence "The classification model is trained" can be represented with the unigrams "classification", "model", and "trained" ("the" and "is" are stop words), among others.
b) Bigram - By combining two adjacent words in sentences, we create bigram features. Bigram features help in analyzing the effect of two independent words in combination. This combination can help in classification. Using the same example as above, we extract the bigrams "classification model" and "model trained" (similarly, "the" and "is" are stop words), among others.

c) Proper Nouns - These are nouns that give the names of people, locations, systems, and organizations. Based on the presence of various types of proper nouns, the respective binary features are set. This feature also plays a significant role in detecting citing sentences, as we know from experience that such sentences often refer to particular scholars, their developed systems, or institutions.

d) Previous and Next Sentence - We can also include information about the classifications of neighboring sentences, that is, their citation/non-citation status. For example, if the previous sentence is a citing sentence, the following sentence may continue to discuss the same work and would be less likely to contain an additional citation.

e) Position - The positional feature gives information about the part of the document in which the sentence appears. To implement this feature, we divide the document into six equal parts: one part for the first sixth of the document, one part for the second sixth, and so on until the final sixth. We turn on one of the six binary features based on which part of the document the sentence appears in. These features are important, as statements appearing in certain sections exhibit markedly different probabilities for citation. For example, sentences in the middle or end of a research paper are likely to discuss the authors' own work or the evaluation and are less probable citation areas, as compared to the beginning of the article, where authors often discuss and credit prior work.

f) Orthographic - This set of features checks for miscellaneous formatting characteristics, including specific orthographies used in the sentence. For example, a sentence containing numbers or single capital letters may be more indicative of citing sentences, as they may present comparative results or author initials from cited works.

(3) Constructing the classifier

Using the training data set described in (1), we construct two supervised classifiers based on two publicly available implementations of supervised learning frameworks: maximum entropy (ME)³ [18] and support vector machine (SVM)⁴ [27], as described in Section II. Both methodologies were chosen as they efficiently handle a large number of non-independent features, common to many natural language tasks.
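To make the feature classes a) to f) from step (2) concrete, the following sketch turns one tokenized sentence into a set of binary feature names. Everything specific in it is an assumption for illustration: the function name `extract_features`, the feature-name strings, the tiny stand-in stop word set, and above all the capitalization heuristic for proper nouns, since the text does not say how proper nouns were detected.

```python
import re

# Tiny stand-in for the 571-word stop list used in the paper.
STOP_WORDS = {"the", "is", "a", "of", "was", "by", "with", "from", "and"}

def extract_features(tokens, sent_index, num_sents, prev_label=None, next_label=None):
    """Return the set of binary features that are turned on for one sentence."""
    feats = set()
    content = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

    # a) Unigram and b) Bigram features over the stop-word-filtered tokens.
    feats.update("uni=" + w for w in content)
    feats.update("bi=" + a + "_" + b for a, b in zip(content, content[1:]))

    # c) Proper noun, crudely approximated by a capitalized non-initial token.
    if any(t[0].isupper() for t in tokens[1:]):
        feats.add("has_proper_noun")

    # d) Labels of the neighboring sentences, when they are known.
    if prev_label is not None:
        feats.add("prev=" + prev_label)
    if next_label is not None:
        feats.add("next=" + next_label)

    # e) Position: which sixth of the document the sentence falls in (0..5).
    feats.add("pos=%d" % min(5, sent_index * 6 // num_sents))

    # f) Orthographic: digits, or single capital letters such as author initials.
    if any(re.search(r"\d", t) for t in tokens):
        feats.add("has_number")
    if any(re.fullmatch(r"[A-Z]\.?", t) for t in tokens):
        feats.add("has_single_capital")
    return feats

tokens = "The classification model is trained".split()
print(sorted(extract_features(tokens, sent_index=0, num_sents=12)))
```

For the example sentence used in a) and b), the sketch yields the unigrams "classification", "model", and "trained" and the bigrams "classification model" and "model trained", plus the position feature for the first sixth of a 12-sentence document.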
In addition, our task is a binary classification problem: whether a sentence is a citing sentence or not. SVM often brings superior results in binary classification tasks.

(4) Classifying sentences as citing or non-citing

Given the trained classifiers in (3), we can then apply the trained models to new, unseen sentences. We employ these trained models and assess their performance using accuracy as our evaluation measure.

IV. EXPERIMENTS

A. Experimental Data

We used the ACL Anthology Reference Corpus (ACL ARC)⁵ [28]. The ACL ARC is constructed from a significant subset of the ACL Anthology⁶, which is a digital archive of conference and journal papers in natural language processing and computational linguistics. The ACL ARC consists of 10,921 articles from the February 2007 snapshot of the ACL Anthology. Using the ACL ARC, we extracted features from each of the 955,755 sentences included in the 10,921 articles. Using regular expression pattern matching to find citation markers, we identified 112,533 sentences as citing sentences (positive instances), and post-processed them to remove the citation marker. We deemed the remaining 843,242 sentences as non-citing sentences (negative instances). In our experiments, the aim is to identify whether each sentence is a citing sentence or not.

B. Experimental Results

Using each of the individual feature classes from a) to f) described in Section III, we constructed both SVM and ME classifiers. We evaluated our models using simple accuracy defined as follows:

$$\text{Accuracy} = \frac{\text{Number of correct classifications}}{\text{Total number of test cases}},$$

where "correct classifications" means that the learned model predicts the same class as the original class of the test case.

1) Classification Accuracy by Cross Validation

We first conducted experiments using 10-fold cross validation for both ME and SVM. For the SVM experiments, we used the default settings of the LIBSVM package, setting the kernel as the radial basis function, and the value of C in Equation (2) as 1.0. Table 1 shows the experimental results obtained using classifiers constructed with both ME and SVM.
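The evaluation protocol, accuracy averaged over 10 folds, can be sketched as follows. This is a minimal illustration under stated assumptions: the round-robin fold assignment, the toy label list, and the `majority_baseline` stand-in (used here only so the loop has something to evaluate) are not from the paper, which trains ME and SVM classifiers instead.

```python
def accuracy(predicted, gold):
    """(Number of correct classifications) / (Total number of test cases)."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

def ten_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    folds = [list(range(i, n_items, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def majority_baseline(train_labels):
    """Hypothetical stand-in classifier: always predict the most common training label."""
    return max(set(train_labels), key=train_labels.count)

# Toy labels, roughly matching the citing / non-citing ratio of the corpus.
labels = ["non-citing"] * 88 + ["citing"] * 12

scores = []
for train, test in ten_fold_splits(len(labels)):
    guess = majority_baseline([labels[j] for j in train])
    scores.append(accuracy([guess] * len(test), [labels[j] for j in test]))
print(sum(scores) / len(scores))
```

Each fold serves once as the 10% test portion while the other nine folds form the 90% training portion, and the reported figure is the mean of the ten per-fold accuracies.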
For comparison, we also constructed an integrated classifier that uses all six feature classes, labeled as "All" in Table 1.

Table 1. Accuracy obtained by ME and SVM.
Columns: Feature, Accuracy (ME), Accuracy (SVM).
Rows: (1) Unigram, (2) Bigram, (3) Proper Noun, (4) Previous and Next Sentence, (5) Position, (6) Orthographic, (7) All [(1) - (6)].

³ Maximum Entropy Modeling Toolkit (Version ),
⁴ LIBSVM (Version 2.89),
⁵ Version ,
⁶

Figure 2. Classification accuracy obtained by ME.
Figure 3. Classification accuracy obtained by SVM.
Figure 4. Classification accuracy obtained by SVM with different values of C.

Table 2. Optimal values of C and their accuracy.
Columns: Feature, Optimal Value of C, Accuracy.
Rows: (1) Unigram, (2) Bigram, (3) Proper Noun, (4) Previous and Next Sentence, (5) Position, (6) Orthographic, (7) All [(1) - (6)].

2) Classification Accuracy on Different Sizes of Training Data

For most learning algorithms, the size of the training data affects the classification accuracy. Therefore, we conducted experiments to empirically assess how classification performance changes when the training data is a subset of the full training data, ranging from 10% to 90%. Figures 2 and 3 (both shown at the same scale) give the experimental results obtained by ME and SVM, respectively.

3) Classification Accuracy with Different Values of C in SVM

Finally, we conducted a series of experiments in tuning the SVM performance. In the SVM framework, the value of C in Equation (2), the penalty on the error term that controls the tolerance of the SVM for misclassifications by the separating hyperplane, also affects classification accuracy. Thus, we conducted experiments to find the classification accuracy obtained with different values of C. Figure 4 shows the accuracy obtained by using SVM with different values of C.

C. Discussion

According to Table 1, in the first set of experiments on cross validation, both ME and SVM achieved an accuracy greater than [...]. We observed a small difference in accuracy (0.002 to 0.024) between these classifiers, which suggests that accuracy on this kind of task depends little on the choice of classifier. In particular, simple features such as Proper Noun and the context of Previous and Next Sentence bring the best results (0.882) among them. Interestingly, the Bigram feature is not so effective among the features we used. As bigram features are often very sparse, it may be difficult to construct an accurate classifier with so few overlapping features.
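The sparsity argument for bigrams can be illustrated with a toy overlap count: a bigram seen at test time is much less likely to have appeared in training than a unigram. The sentences below are made up for this example, and `coverage` is a hypothetical helper, not a measure used in the paper.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage(train_sents, test_sents, n):
    """Fraction of test n-gram occurrences that also appear in the training sentences."""
    seen = {g for s in train_sents for g in ngrams(s, n)}
    test_grams = [g for s in test_sents for g in ngrams(s, n)]
    return sum(g in seen for g in test_grams) / len(test_grams)

# Toy, made-up sentences with stop words already removed.
train = [
    ["classification", "model", "trained"],
    ["model", "achieves", "high", "accuracy"],
]
test = [["trained", "classification", "model", "accuracy"]]

print(coverage(train, test, 1))  # unigram overlap
print(coverage(train, test, 2))  # bigram overlap
```

In this toy case every test unigram was seen in training, but only one of the three test bigrams was, so a classifier relying on bigram features alone has far fewer active features to work with.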
By varying the size of the training data, we obtain slight variations in performance in both the SVM and ME frameworks. According to Figure 2, ME exhibits more performance variation, where the accuracy of features such as Unigram, Bigram, and All is influenced by the size of the training data. In other words, the larger the size of the training data, the more accurate the classification results are. According to Figure 3, the SVM framework shows less variation. When the data size was slightly reduced (80-90% of the original), a slight improvement in accuracy is observed. For example, in Previous and Next Sentence, while the accuracy at 10% of training data is 0.872, the accuracy at 90% of training data is [...]. Moreover, in All, while the accuracy at 10% of training data is 0.875, the accuracy at 90% of training data is [...]. In future work, we may investigate this peculiarity in more detail.

Finally, according to Figure 4, when different values of the error term parameter C in the SVM framework were used, we observed that the best accuracy is obtained when the value of C is set slightly lower than 1.0 for each feature. The optimal values of C and the accuracy obtained by each feature are shown in Table 2. Together with the first experimental results, the results show that the Proper Noun and the Previous and Next Sentence features bring the best accuracy in both ME and SVM (0.882 with C=0.9). Surprisingly, the composite classifiers that use all feature classes underperform classifiers that are trained only on these two sources of data.

V. CONCLUSION

We have described a method for identifying citing sentences by constructing a classifier using supervised learning approaches with simple features extracted from research papers. Experimental results showed that both proper nouns and the contextual classification of the previous and next sentences are effective features for training accurate models in both SVM and ME frameworks. In future work, we plan to build an editor that will help authors write a research paper by advising them whether statements in their draft need a citation.

REFERENCES

[1] S. Lawrence, C. L. Giles, and K. Bollacker: Digital Libraries and Autonomous Citation Indexing, IEEE Computer, 32(6): 67-71.
[2] I. G. Councill, C. L. Giles, and M.-Y. Kan: ParsCit: An Open-Source CRF Reference String Parsing Package, In Proc. of the 6th International Conference on Language Resources and Evaluation Conference (LREC08).
[3] F. Narin: Evaluative Bibliometrics: The Use of Publication and Citation Analysis in the Evaluation of Scientific Activity, Computer Horizons, Cherry Hill, NJ.
[4] E. Garfield: Citation Indexing: Its Theory and Application in Science, Technology, and Humanities, John Wiley and Sons, NY.
[5] L. Page, S. Brin, R. Motwani, and T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web, Stanford Digital Library Technologies Project, SIDL-WP.
[6] J. Bollen, M. A. Rodriguez, and H. Van De Sompel: Journal Status, Scientometrics, 69(3).
[7] Y. Sun and C. L. Giles: Popularity Weighted Ranking for Academic Digital Libraries, In Proc. of the 29th European Conference on Information Retrieval (ECIR 2007).
[8] M. Krapivin and M. Marchese: Focused PageRank in Scientific Papers Ranking, In Proc. of the 11th International Conference on Asian Digital Libraries (ICADL 2008), Lecture Notes in Computer Science (LNCS), Vol. 5362.
[9] N. Ma, J. Guan, and Y. Zhao: Bringing PageRank to the Citation Analysis, Information Processing and Management, 44(2).
[10] H. Sayyadi and L. Getoor: FutureRank: Ranking Scientific Articles by Predicting their Future PageRank, In Proc. of the 9th SIAM International Conference on Data Mining.
[11] M. M. Kessler: Bibliographic Coupling Between Scientific Papers, American Documentation, 14(1): 10-25.
[12] H. Small: Co-Citation in the Scientific Literature: A New Measure of the Relationship Between Two Documents, Journal of the American Society of Information Science, 24(4).
[13] S. Teufel, A. Siddharthan, and D. Tidhar: Automatic Classification of Citation Function, In Proc. of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006).
[14] V. Qazvinian and D. R. Radev: Scientific Paper Summarization Using Citation Summary Networks, In Proc. of the 22nd International Conference on Computational Linguistics (Coling2008).
[15] J. Schneider: Verification of Bibliometric Methods' Applicability for Thesaurus Construction, PhD thesis, Royal School of Library and Information Studies.
[16] A. Ritchie, S. Teufel, and S. Robertson: Using Terms from Citations for IR: Some First Results, In Proc. of the 29th European Conference on Information Retrieval (ECIR 2007).
[17] A. Ritchie, S. Robertson, and S. Teufel: Comparing Citation Contexts for Information Retrieval, In Proc. of the 17th International Conference on Information and Knowledge Management (CIKM'08).
[18] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra: A Maximum Entropy Approach to Natural Language Processing, Computational Linguistics, 22(1): 39-71.
[19] A. Ratnaparkhi, J. Reynar, and S. Roukos: A Maximum Entropy Model for Prepositional Phrase Attachment, In Proc. of the ARPA Human Language Technology Workshop.
[20] R. Rosenfeld: Adaptive Statistical Language Modeling: A Maximum Entropy Approach, PhD thesis, Carnegie Mellon University.
[21] S. F. Chen and R. Rosenfeld: A Gaussian Prior for Smoothing Maximum Entropy Models, Technical Report CMU-CS, Carnegie Mellon University.
[22] A. Ratnaparkhi: A Maximum Entropy Model for Part-Of-Speech Tagging, In Proc. of the Conference on Empirical Methods in Natural Language Processing.
[23] D. Beeferman, A. Berger, and J. Lafferty: Statistical Models For Text Segmentation, Machine Learning, 34(1-3).
[24] J. N. Darroch and D. Ratcliff: Generalized Iterative Scaling for Log-Linear Models, The Annals of Mathematical Statistics, 43(5).
[25] S. D. Pietra, V. J. Della Pietra, and J. D. Lafferty: Inducing Features of Random Fields, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4).
[26] R. Malouf: A Comparison of Algorithms for Maximum Entropy Parameter Estimation, In Proc. of the 6th Conference on Natural Language Learning (CoNLL-2002), pages 49-55.
[27] V. Vapnik: The Nature of Statistical Learning Theory, Springer, NY.
[28] S. Bird, R. Dale, B. J. Dorr, B. Gibson, M. T. Joseph, M.-Y. Kan, D. Lee, B. Powley, D. R. Radev, and Y. F. Tan: The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics, In Proc. of the 6th International Conference on Language Resources and Evaluation Conference (LREC08), 2008.