Full-Text Clustering Methods for Current Research Directions Detection*

Dmitry Devyatkin, Ilya Tikhomirov, Alexander Shvets, Oleg Grigoriev
Institute for Systems Analysis of RAS, Moscow
devyatkin@isa.ru, shvets@isa.ru, tikh@isa.ru, oleggpolvart@yandex.ru

Konstantin Popov
Engelhardt Institute of Molecular Biology of RAS, Moscow
konstantin.v.popov@gmail.com

Abstract

The paper contains a brief overview of full-text clustering methods for current research directions detection. A novel full-text clustering method is proposed. A dataset is created, and the experimental results are verified by experts with PhD degrees in the problem domain of regenerative medicine. According to the experimental results, the proposed method is well applicable to research direction detection. Finally, prospects and drawbacks of the proposed method are discussed.

* The research is supported by the Russian Foundation for Basic Research, project 14-29-05008-ofi_m.

Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL 2015), Obninsk, Russia, October 13-16, 2015.

1 Introduction

Methods for research direction detection are commonly based on various clustering methods [4]. The problem is that all these methods require tuning to a particular research area and dataset. Common clustering evaluation metrics are not applicable to direction detection: they are often based on the degree of insularity of clusters [1, 12], while the criteria for cluster building are weakly formalized in the direction detection task. Several approaches use empirical estimates determined by experts to solve this problem [6]. However, obtaining these estimates is a complicated and nondeterministic process, which requires interaction between data mining experts and experts in the analyzed research area. Moreover, the criteria for including a scientific paper in a research direction depend on the research area of this paper, so the opinion of experts in this research area should be taken into consideration. Therefore, taking the specifics of the problem into account, a semi-automatic approach that uses a small labelled part of the analyzed dataset for training the clustering methods is proposed in this paper.

Let D = {d1, d2, ..., dn} be a corpus of scientific papers, where n is the number of documents in this collection, and let C = {c1, c2, ..., cn} be a set of previously known research directions in D. Suppose D_c ⊂ D is a small, incomplete set of papers that are known to belong to directions in C. The goal is to obtain the full set of research directions C_f and the allocation of papers to this set. The approach consists in applying a quality function that estimates the difference between the results of a clustering method and the given distribution C (a sketch of one possible instantiation is given at the end of this section); methods of constrained optimization are then used for tuning the parameters of the clustering method. Unlike the well-known classifier training task, the complete set of directions C_f is a priori unknown. So it is necessary to find a clustering method which can be properly tuned in accordance with this approach.

The primary goal of this paper is to present an improved method for research direction detection which can process datasets semi-automatically. Besides, we present the results of a comparison between the proposed method and state-of-the-art clustering methods such as BIRCH [19], affinity propagation [5] and their combinations with the latent Dirichlet allocation topic model [3].
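The quality function is not given here in closed form; the following is a minimal sketch of one plausible instantiation, assuming it compares the produced clusters with the labelled subset D_c by a best-match F-measure. All function and variable names in the snippet are illustrative only.

```python
from collections import defaultdict

def quality(labels_pred, labels_true):
    """Estimate how well predicted clusters reproduce the partial labelling.

    labels_pred: dict paper_id -> cluster_id produced by a clustering method.
    labels_true: dict paper_id -> direction_id for the small labelled subset D_c.
    Returns the mean best-match F-measure over the known directions
    (an assumed instantiation of the quality function).
    """
    clusters = defaultdict(set)
    for pid, cid in labels_pred.items():
        clusters[cid].add(pid)
    directions = defaultdict(set)
    for pid, did in labels_true.items():
        directions[did].add(pid)

    scores = []
    for members in directions.values():
        best = 0.0
        for cluster in clusters.values():
            tp = len(members & cluster)
            if tp == 0:
                continue
            p = tp / len(cluster)   # precision of the cluster w.r.t. the direction
            r = tp / len(members)   # recall of the direction
            best = max(best, 2 * p * r / (p + r))
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```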
2 Related work

Let us consider some existing methods for research direction detection. There are three groups of such methods. All of them use clustering, but they differ in the applied similarity measure: the co-citation measure [4], the text measure, or the hybrid measure. The latter is usually based on an assessment of co-citations of papers, on the intelligent analysis of their texts, and on the allocation of significant papers [6].

In [9] the detection of prospective research directions is carried out using co-word analysis, i.e. paper similarity. This approach is close to generative topic-distribution clustering methods. The application of this method to the detection of regularities and trends in the area of information security is shown. The SCI database [13] is used as the data source; keywords are collected from the lists of terms provided by the authors of papers, and these keywords are normalized. To check dynamic changes in the area of information security, the authors offer a whole set of methods for marking papers.
They conclude that in the area of information security there is a constant set of directions, while at the same time new directions emerge regularly. The main disadvantage of this method is that it uses only the structured author keywords. This decreases the objectivity of the method, because author keywords often differ from the real terms of a paper. In this work we use full-text clustering approaches because of their universality: they do not need citation databases to produce proper results.

Let us discuss several clustering methods. Affinity propagation is one of the most common ones. In this method a set of papers is considered as a network of connected nodes, and the similarity of papers corresponds to the weights of the edges in this network. The method is based on a mechanism of passing messages between the nodes of the network [5]. Affinity propagation detects exemplars — papers in the input dataset that are representative of clusters. The network nodes send messages to each other until a set of exemplars and clusters gradually emerges. A potential drawback of this method is the initial selection of exemplars. One cannot provide an initial exemplar distribution for all directions, so the exemplars can be selected inaccurately. In addition, the approach can lead to inaccuracies when it is impossible to identify an exemplar which describes all the papers of a cluster properly [10]. Unlike affinity propagation, the proposed method uses term descriptors to describe clusters. These terms correspond to all papers of a cluster (due to the use of the topic importance measure for term weighting) [14], which leads to better clustering quality.

Another common clustering method is BIRCH [19]. This algorithm builds a weighted balanced tree (CF-tree) in a single pass through the data; information about sub-clusters is stored in the leaves of the tree. During clustering, each paper is either added to an existing leaf or a new leaf is created. BIRCH can also be used for stream clustering, where the number of documents to be processed is theoretically unbounded.

Specialized methods for text clustering are worth mentioning. In [14] a well-scalable full-text clustering approach for detecting research directions was proposed. Hash functions are widely used in that method in order to achieve good performance. The method has insufficient coverage of a dataset: a large number of papers do not belong to any of the clusters. In this study we improve the classification decision rule for better coverage. We also refine the extraction of cluster cores for better clustering quality.

Topic modelling methods based on generative models are often applied to full-text clustering. For example, Latent Dirichlet Allocation (LDA) [3] is applied to clustering of large collections of papers. The method is an improvement over Latent Semantic Indexing. It assumes that each object belongs to several classes whose distribution comes from the parametric family of Dirichlet distributions. A drawback of the method is its instability with respect to the input data, which makes the results hard to interpret [8, 18]. We use LDA as a preliminary step before clustering with well-known methods. This allows us to significantly reduce the dimensionality of the feature space and thus to increase the performance and quality of the common clustering methods.
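For reference, the baseline combinations used for comparison in Section 4 (LDA-reduced document vectors clustered with BIRCH or affinity propagation) can be assembled from standard scikit-learn components. The sketch below is a minimal illustration; the vectorizer settings, topic count, and cluster count in it are assumptions rather than the settings used in the experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import Birch, AffinityPropagation

def baseline_clusterings(texts, n_topics=50, n_clusters=4):
    """Cluster LDA-reduced document vectors with BIRCH and affinity propagation.

    texts: list of full paper texts; n_topics and n_clusters are assumed settings.
    Returns a dict of cluster-label arrays, one per baseline combination.
    """
    counts = CountVectorizer(stop_words="english").fit_transform(texts)
    topic_vectors = LatentDirichletAllocation(
        n_components=n_topics, random_state=0).fit_transform(counts)
    return {
        "lda+birch": Birch(n_clusters=n_clusters).fit_predict(topic_vectors),
        "lda+affinity": AffinityPropagation(random_state=0).fit_predict(topic_vectors),
    }
```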
3 Method description

3.1 Fast algorithm for similar document search

The cornerstone of the proposed full-text clustering method is the similar document search previously discussed in [15]. This algorithm uses inverted index structures over the input data. Most full-text search algorithms apply similar structures to improve search performance [17].

Let w be a term (a word or a phrase) from a paper and t the weight of the term. We use the well-known tf-idf measure for weighting terms in papers. It is the product of the term frequency function tf and the inverse document frequency function idf:

tf(w, d) = n_w / n,

where n_w is the number of times term w occurs in the paper d and n is the number of all term occurrences in d;

idf(w, D) = log( |D| / |d_w| ),

where |D| is the number of papers in the corpus D, |d_w| is the number of papers from D that contain the term w, and the logarithm base does not matter.

Define the direct index of papers as a function that returns the set of distinct terms of a paper d together with their weights:

Didx(d) = { (w1, t1), (w2, t2), ..., (wn, tn) },

where n is the number of distinct terms in the document. Define the inverted index of papers as a function that returns the set of papers in which a given term w occurs together with the weights of this term in those papers:

Idx(w) = { (d1, t1), (d2, t2), ..., (dn, tn) }.
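A minimal sketch of how the direct and inverted indexes with tf-idf weights could be materialized (dictionary-based; the actual implementation from [15] may differ):

```python
import math
from collections import Counter, defaultdict

def build_indexes(corpus):
    """Build direct (Didx) and inverted (Idx) indexes with tf-idf term weights.

    corpus: dict paper_id -> list of normalized terms of the paper.
    Returns (didx, idx): didx[d][w] and idx[w][d] both hold the tf-idf weight of w in d.
    """
    n_docs = len(corpus)
    doc_freq = Counter()                      # |d_w|: number of papers containing term w
    for terms in corpus.values():
        doc_freq.update(set(terms))

    didx, idx = {}, defaultdict(dict)
    for d, terms in corpus.items():
        counts = Counter(terms)
        total = sum(counts.values())          # n: all term occurrences in d
        didx[d] = {}
        for w, n_w in counts.items():
            weight = (n_w / total) * math.log(n_docs / doc_freq[w])
            didx[d][w] = weight
            idx[w][d] = weight
    return didx, idx
```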
The algorithm of the fast similar document search is the following (a compact sketch of these steps is given at the end of this section).

1. Get the terms of the paper d from its direct index Didx(d).
2. Retrieve the list of papers containing the terms obtained in step 1 from the inverted index Idx(w).
3. Filter out papers that are only weakly intersected by the terms of the input paper.
4. Get the term lists of the retrieved papers from Didx(d).
5. Calculate the Manhattan [2] distance between the term lists obtained in step 4. We chose this distance because it is fast to calculate and provides quality comparable to more complicated metrics such as cosine distance [16].
6. Filter out papers whose distance to the input paper exceeds a predefined threshold H.

In step 3 of this algorithm we discard weakly similar papers without calculating the distance measure, which significantly speeds up the algorithm. Steps 1-4 can be performed in parallel, which further improves the performance.

3.2 Clustering algorithm

Suppose a descriptor of a cluster is a set of terms relevant to this cluster. Let an initial core of a cluster be a subset of highly similar papers which is used for the generation of a descriptor; the descriptor is built from the full texts of these papers. Let Core(d) be a function that returns a tuple (c, m) for a given paper d, where c is the identifier of the initial core of a cluster and m is a numeric characteristic of paper d in the initial core c. The characteristic m is used to decrease the distance threshold H during the building of an initial core, which prevents merging of initial cores. If paper d is not present in any initial core, the function Core(d) returns an empty tuple.

The direct index of descriptors is a function that returns the set of distinct terms of the descriptor of cluster c together with their weights:

CDidx(c) = { (w1, t1), (w2, t2), ..., (wn, tn) },

where n is the number of distinct terms in the descriptor. The inverted index of descriptors is a function that returns the set of descriptors in which a given term w occurs together with the weights of this term in those descriptors:

CIidx(w) = { (c1, t1), (c2, t2), ..., (cn, tn) }.

We use the topic importance measure I for weighting the terms of descriptors. For a term w from the corpus D and a cluster c this measure is calculated as follows:

I(w, c) = t(w, c) · X( idf(w, D) − idf(w, D_c) ),

where D_c is the set of papers from cluster c, t(w, c) is the weight of the term w in the cluster, and X(·) is the Heaviside step function. Due to the topic importance measure, the descriptor of a cluster consists mostly of terms specific to the papers of the cluster.

Let H_a and k be clustering thresholds affecting the insularity of clusters. The proposed full-text clustering method consists of two parts. The first part can be performed independently on different nodes of a computer network and can conventionally be called detection of initial cores of clusters. Its steps are the following.

1. Index the input corpus D.
2. While D is not empty, take a paper d out of D. If D is empty, go to step 6. If Core(d) returns an empty tuple, go to step 3, otherwise go to step 4.
3. Generate a new cluster core identifier c and add the tuple (c, 1) with m = 1 to Core. Then go to step 5.
4. Get the identifier c of the current initial core and the value m of the current paper. Reduce m: m ← k · m.
5. Search for similar papers with the fast algorithm using the threshold H = H_a · m and add them to Core with the current value of m. Then go to step 2.
6. Build the descriptors CDidx(c) and CIidx(w) for each initial core c.

The second part is performed on a selected supervisor node and consists of the following steps.

1. Search for similar descriptors with the fast algorithm using CDidx and CIidx, and merge them. In practice this step is useful when the first part is executed on several computer nodes.
2. Classify all papers. For each retrieved descriptor, apply steps 2-6 of the fast similar document search algorithm, using the descriptor's CDidx as the list of query terms. To prevent fuzzy clustering, each paper is labelled with the identifier of its cluster and with the value of its similarity to the descriptor of this cluster; the classifier uses these labels to include each paper in the most similar cluster.

The advantage of the proposed method is its distributed nature, thanks to which full-text clustering of large collections of papers can be implemented. That is necessary for detecting current research directions in a wide research area. Another benefit consists in using indexes with an incremental structure, so adding new papers to the corpus D does not require full re-clustering.
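The following is a compact sketch of the fast similar document search (steps 1-6 of Section 3.1) over such indexes; the candidate cut-off used for step 3 is a simplifying assumption of this sketch.

```python
from collections import Counter

def similar_papers(d, didx, idx, h, min_shared=2):
    """Return papers whose term profiles are within Manhattan distance h of paper d.

    didx, idx: direct and inverted tf-idf indexes (see build_indexes above);
    min_shared is an assumed cut-off for step 3 (weakly intersecting candidates).
    """
    query = didx[d]
    # Steps 1-2: collect candidate papers through the inverted index.
    shared = Counter()
    for w in query:
        for other in idx.get(w, {}):
            if other != d:
                shared[other] += 1
    # Step 3: drop candidates that share too few terms with the query paper.
    candidates = [p for p, n in shared.items() if n >= min_shared]
    # Steps 4-6: Manhattan distance between term-weight profiles, threshold H.
    result = []
    for p in candidates:
        profile = didx[p]
        keys = set(query) | set(profile)
        dist = sum(abs(query.get(w, 0.0) - profile.get(w, 0.0)) for w in keys)
        if dist <= h:
            result.append((p, dist))
    return sorted(result, key=lambda x: x[1])
```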
4 Experiment

4.1 Dataset description

For the experiment, a dataset of papers from the research area of regenerative medicine was created and verified by experts with PhD degrees. The dataset contains 112 well-cited, publicly accessible papers from 2000 to 2014, distributed over 4 research directions (Table 1). We apply the widely used linguistic analysis library FreeLing [11] to retrieve normalized terms (words and noun phrases without stop words) from the texts of the papers; these terms are included in the dataset. The dataset is available on demand and can be extended and used according to the BSD license.

Table 1. Research directions in the dataset.

Descriptor                                Size   Average publication year
Brain stroke                               23    2004
Chimeric antigen receptor cell therapy     25    2009
Induced pluripotent stem cells             41    2011
Wound healing burns                        21    2007
4.2 Experiment setup

We use the semi-automatic approach for tuning the clustering methods, which avoids empirical parameter estimation; the parameters are set so that precision has priority over recall. In our experiment we use a controlled random search method [7] for the optimization of the quality function (a sketch of this tuning loop is given at the end of Section 4). Clustering parameters are calculated by maximization of the quality function with the boundaries 0 < H_a < 1 and 0.7 < k < 0.99 (for the proposed method); for the other methods the boundaries are set according to their constraints.

We apply the commonly used F-measure for the assessment of classification methods. This function is based on precision and recall. Recall R is defined as the ratio of the number of correctly classified papers to the size of the class in the training set:

R = t / (t + c),

where t is the number of correctly classified papers and c is the number of papers which the method assigned to other classes. Precision P is defined as the ratio of the number of correctly classified papers to the total number of classified papers:

P = t / (t + i),

where t is the number of correctly classified papers and i is the number of incorrectly classified papers. The F-measure is then calculated as follows:

F = ((β^2 + 1) · P · R) / (β^2 · P + R).

4.3 Experiment results

Thus, we show that the proposed method can be tuned properly using a part of the analyzed dataset. The results show that the proposed method has sufficient clustering quality; the generated terms represent the research directions from the dataset (Table 2). Table 3 shows the experiment results obtained with a cross-validation technique.

Table 2. Descriptor terms generated for the research directions.

Brain stroke: stroke, brdu test, behavioral, endothelial cell, cell brdu, psa-ncam, cortical cell, reactive, regeneration neuronal, endogenous, neurogenesis, endostatin, cortical, expansion nonhematopoietic ischemic neuron

Chimeric antigen receptor cell therapy: pbls, receptor chimer cd19-spec, immunity, antitumor, adoptive fn, malignancy b-cell, aapcs, protein fusion, infusion cell

Induced pluripotent stem cells: gene signature, photoreceptor, epithelium retinal, retinal, suppressor, tumor, cell ips, technology ips, retinal cell, ips-derived, ipscs pigmented

Wound healing burns: wound chronic, wound heal, keratinocyte, epidermal cell, fusing keratin, epidermis region

Table 3. Clustering quality of the compared methods (cross-validation).

Method                        Precision   Recall   F-measure
Affinity propagation            0.85       0.28      0.42
Birch                           0.75       0.27      0.40
LDA + Affinity propagation      0.70       0.49      0.57
LDA + Birch                     0.85       0.61      0.71
Proposed method                 0.83       0.71      0.76

Both the qualitative and the quantitative assessment show that the proposed method produces relatively good results. In contrast, the affinity propagation method returns unsatisfactory results; obviously, this method is not suitable for solving the research direction detection problem, possibly because of incorrect exemplar selection due to the missing initial distribution of exemplars. We suggest that our method outperforms the other examined methods due to applying the Manhattan distance for paper similarity estimation, to using the topic importance measure for the determination of cluster terms, and to the use of linguistic analysis for term extraction. This leads to a more precise estimation of the similarity measure between papers and, consequently, to a better quality of research direction detection.
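To make the tuning procedure of Section 4.2 concrete, here is a minimal sketch of the optimization loop, with plain uniform random sampling standing in for the controlled random search of [7] and an externally supplied quality function (for instance, the one sketched in Section 1):

```python
import random

def tune_thresholds(cluster_fn, quality_fn, n_trials=200, seed=0):
    """Pick clustering thresholds by maximizing a quality function on D_c.

    cluster_fn(h_a, k) runs the clustering method and returns its result;
    quality_fn(result) compares that result with the labelled subset.
    Plain uniform sampling stands in for the controlled random search of [7].
    """
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(n_trials):
        h_a = rng.uniform(0.0, 1.0)     # 0 < H_a < 1
        k = rng.uniform(0.7, 0.99)      # 0.7 < k < 0.99
        score = quality_fn(cluster_fn(h_a, k))
        if score > best_score:
            best_params, best_score = (h_a, k), score
    return best_params, best_score
```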
5 Conclusion and future work

In this study we investigated various clustering-based methods for the detection of current research directions. We found that it is necessary to create a method that can be tuned successfully on a small labelled part of the analyzed dataset. In this paper we presented an improved full-text clustering method for the detection of research directions. The experiment results show that the proposed method is more suitable for research direction detection than the other widely used methods. The semi-automatic approach to tuning the clustering method also demonstrates its usefulness. Besides, it was shown that full-text clustering methods work imprecisely for thematically heterogeneous research directions.

This paper is an initial study that will be extended in further research. The further development of the proposed method consists in the implementation of hybrid metrics for clustering. These metrics can use not only text similarity, but also co-citations and co-authorships of the papers in a dataset. It is also necessary to continue the work with experts to extend our experimental dataset, because hybrid clustering methods cannot process small datasets properly.

References

[1] Aletras N., Stevenson M. Evaluating topic coherence using distributional semantics // Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013), Long Papers. 2013. pp. 13-22.
[2] Black P. E. Manhattan distance // Dictionary of Algorithms and Data Structures. 2006. Vol. 18.
[3] Blei D. M., Ng A. Y., Jordan M. I. Latent Dirichlet allocation // The Journal of Machine Learning Research. 2003. Vol. 3. pp. 993-1022.
[4] Cobo M. J. et al. Science mapping software tools: Review, analysis, and cooperative study among tools // Journal of the American Society for Information Science and Technology. 2011. Vol. 62(7). pp. 1382-1402.
[5] Frey B. J., Dueck D. Clustering by passing messages between data points // Science. 2007. Vol. 315(5814). pp. 972-976.
[6] Glanzel W. Bibliometric methods for detecting and analysing emerging research topics // El profesional de la informacion. 2012. Vol. 21(2). pp. 194-201.
[7] Kaelo P., Ali M. Some variants of the controlled random search algorithm for global optimization // J. Optim. Theory Appl. 2006. Vol. 130(2). pp. 253-264.
[8] Koltcov S., Koltsova O., Nikolenko S. Latent Dirichlet allocation: stability and applications to studies of user-generated content // Proceedings of the 2014 ACM Conference on Web Science. ACM, 2014. pp. 161-165.
[9] Lee W. H. How to identify emerging research fields using scientometrics: An example in the field of information security // Scientometrics. 2008. Vol. 76(3). pp. 503-525.
[10] Leone M. et al. Clustering by soft-constraint affinity propagation: applications to gene-expression data // Bioinformatics. 2007. Vol. 23(20). pp. 2708-2715.
[11] Padró L., Stanilovsky E. FreeLing 3.0: Towards wider multilinguality // Proceedings of the Language Resources and Evaluation Conference (LREC 2012). Istanbul, 2012.
[12] Rousseeuw P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis // Journal of Computational and Applied Mathematics. 1987. Vol. 20. pp. 53-65.
[13] Science Citation Index by Thomson Reuters. http://thomsonreuters.com/science-citation-index-expanded, 2015.
[14] Shvets A. et al. Detection of current research directions based on full-text clustering // Proceedings of the Science and Information Conference. London, 2015.
[15] Sochenkov I. Relational-situational data structures, algorithms and methods for search and analytical task solving [in Russian]. PhD thesis. Institute for Systems Analysis of RAS, Moscow, 2014.
[16] Suvorov R. E., Sochenkov I. V. Method for detecting relationships between sci-tech documents based on the topic importance characteristic [in Russian]. ISA RAS, Moscow, 2013. Vol. 1. pp. 33-40.
[17] Stein B. Principles of hash-based text retrieval // Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2007. pp. 527-534.
[18] Vorontsov K., Potapenko A. Additive regularization of topic models // Machine Learning. 2014. pp. 1-21.
[19] Zhang T., Ramakrishnan R., Livny M. BIRCH: an efficient data clustering method for very large databases // ACM SIGMOD Record. ACM, 1996. Vol. 25(2). pp. 103-114.