Assessing Italian Research in Statistics: Interdisciplinary or Multidisciplinary?

Sandra De Francisci Epifani*, Maria Gabriella Grassia**, Nicole Triunfo**, Emma Zavarrone*

* Università IULM, Via Carlo Bo 1, Milano
** Dipartimento di Matematica e Statistica, Università degli Studi Federico II Napoli, Via Cintia, Napoli

Abstract

In this paper, we assess the cross-disciplinarity of the research produced by Italian Academic Statisticians (IAS) by combining text mining and bibliometric techniques. Textual and bibliometric approaches each have advantages and disadvantages, and they provide different views of the same interlinked corpus of scientific publications. In addition to the textual information in such documents, the citations among them form large networks that yield further information. We incorporate both points of view and show how to improve on existing text-based and bibliometric methods. In particular, we propose a hybrid clustering procedure based on Fisher's inverse chi-square method as the preferred method for integrating textual content and citation information. Given the clustered papers, it is possible to evaluate ISI subject categories (SCs) as descriptive labels for statistical documents and to address the interdisciplinarity of individual researchers.

Keywords: Bibliometrics, Text mining, Social network analysis, Hybrid clustering

1 Introduction

The increasing dissemination of scientific and technological publications via websites, and their availability in large-scale bibliographic databases, has opened up massive opportunities for improving the classification and bibliometric cartography of science and technology. This metascience benefits from the continuous growth of computing power and the development of new algorithms. The purpose of mapping, charting or cartography of scientific fields is to understand the structure and evolution of different areas of research and their links to other fields, based on scientific publications. Research fields can be profiled along different dimensions, i.e. in terms of prolific authors, major concepts, important publications and journals, institutions, regions and countries, etc.

Knowledge about the amount of activity in various fields and about new, emerging and converging fields is important to organizations, research institutions and nations. Quantitative information can be used to evaluate research performance, interdisciplinarity, collaboration and internationalization, and to support innovation management and science and technology policies (for example, which fields should be supported through funding?). Such policies are crucial for the competitive position of universities. We focus on cross-disciplinarity within the scientific research areas of Italian universities, using clustering algorithms and techniques from bibliometrics and text mining. The multidisciplinary context of statistics affords an excellent opportunity to examine the methods used to study interdisciplinarity and integration.

2 Background

Research that occurs at the intersection between disciplines is thought to lead to great advances in science (Porter and Rafols, 2009). Interdisciplinary research should be supported and encouraged in order to solve new statistical challenges. A cynical view of this problem is eloquently stated in Brewer (1999): "The world has problems, but universities have departments." The term "interdisciplinary" tends to be tacitly understood by researchers, without a shared definition. We adopt the definition suggested by Porter et al. (2007) and given by the National Academies (2005): interdisciplinary research requires an integration of concepts, theories, techniques and/or data from two or more bodies of specialized knowledge. Multidisciplinary research may incorporate elements of other bodies of specialized knowledge, but without the interdisciplinary synthesis (Wagner et al., 2011) that amounts to more than its single parts. The analysis of cross-disciplinarity improves on traditional indicators for assessing and quantifying interdisciplinary research (Morillo et al., 2001) (Fig. 1).

Fig. 1: Interdisciplinary and multidisciplinary

Disciplinary diversity indicators describe the heterogeneity of a bibliometric set viewed from predefined categories, i.e. using a top-down approach that locates the set on the global map of science. Network coherence indicators are constructed to measure the intensity of similarity relations within a bibliometric set, i.e. using a bottom-up approach that reveals the structural consistency of the publication network (Rafols and Meyer, 2010). Rather than exploring large-scale trends in publications with a top-down approach, a bottom-up approach needs a large amount of data representing the research track of each statistician. We propose to take measurements at the level of one or more individuals versed in statistics. An unsupervised approach is therefore well suited, since such methods can find trends in data without prior knowledge of its structure.

The distinction between the "text world" and the "graph world" refers to two different views of a collection of interlinked publications. In addition to the textual information kept in the documents, their citations form large networks that yield additional information. To group publications into clusters, we consider these two complementary approaches. In an integrated, or hybrid, analysis we show how to improve existing text-based and graph-analytic (or bibliometric) methods by deeply merging textual content with the structure of the citation graph. Both worlds arise in interlinked data collections such as the World Wide Web and bibliographic databases containing written scientific communications.
These documents contain textual information that can be mined for knowledge using text mining techniques. Moreover, each document refers to other documents that are related to it in some way. Most scientific papers cite the previous research on which they are based or which is considered relevant to the subject. These citations are collected in the bibliography of a publication. Although various reasons for citing other works are conceivable, citations usually imply endorsement or recommendation of previous work. All citations among publications, or hyperlinks among Web pages, constitute extremely large networks, of which the World Wide Web is the biggest example. Unlike the Web, where each page can have hyperlinks to any other page, a citation network or literature network is similar to a directed acyclic graph (DAG). Both citations and hyperlinks have a direction (they point from one entity to another), but citations are generally not reciprocal and no directed cycles occur in the citation graph, since a scientific paper usually cites only documents that have already been published.

Textual and graph-based approaches can both be applied to the same dataset. For example, different notions of similarity between documents or groups of documents can be captured by the different methods, and the dynamics of evolving databases can be observed. We incorporate both viewpoints and claim that, jointly, they improve on existing text-based and graph-analytic (bibliometric) methods for mapping science and statistics. Indeed, textual information can indicate similarities that are invisible to bibliometric techniques. Conversely, when only text is used, true document similarity can be overshadowed by differences in vocabulary, and spurious similarities can be introduced by textual pre-processing, by polysemous words (words with several meanings) or by words with little semantic value.

The widely used method of co-citation clustering was introduced independently by Small (1973, 1978) and Marshakova (1973). Cross-citation-based cluster analysis for science mapping is different: while the former is usually based on links connecting individual documents, the latter requires the aggregation of documents into units such as journals or subject fields, among which cross-citation links are established. The advantages of this method (for instance, the ability to analyze directed information flows) are partly undermined by possible biases. For example, bias can be caused by the use of predefined units (journals, subject categories, etc.), which already implies an initial level of structural classification. Journal cross-citation clustering has been used by Leydesdorff (2006), Leydesdorff and Rafols (2009), and Boyack, Klavans, and Börner (2005), while Moya-Anegón et al. (2007) applied subject co-citation analysis to visualize the structure of science and its dynamics. The integration of lexical similarities and citation links is also attractive in other fields such as search engine design (e.g., Google combines text and links; Brin & Page, 1998). In the early 1990s, the combination of link-based clustering with a textual approach was suggested to improve the efficiency and applicability of co-citation and co-word analysis.
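To make the graph view concrete, the following minimal Python sketch stores a small citation network as a mapping from each paper to the set of papers it cites, checks that the network is acyclic, and counts co-citations for a pair of papers. The paper identifiers and the helper names `is_acyclic` and `co_citation` are hypothetical and serve only as an illustration; they are not part of the procedure used in this study.

```python
from collections import deque

# Hypothetical toy network: each paper maps to the set of papers it cites.
cites = {
    "P4": {"P1", "P2"},
    "P3": {"P1", "P2"},
    "P2": {"P1"},
    "P1": set(),
}


def is_acyclic(cites):
    """Kahn's algorithm: a directed graph is a DAG iff all nodes can be
    removed one by one, each time taking a node with no incoming citation."""
    nodes = set(cites) | {q for refs in cites.values() for q in refs}
    indegree = {n: 0 for n in nodes}
    for refs in cites.values():
        for q in refs:
            indegree[q] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        n = queue.popleft()
        removed += 1
        for q in cites.get(n, ()):
            indegree[q] -= 1
            if indegree[q] == 0:
                queue.append(q)
    return removed == len(nodes)


def co_citation(cites, a, b):
    """Number of papers whose reference lists contain both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)


print(is_acyclic(cites))                # True
print(co_citation(cites, "P1", "P2"))   # 2 (cited together by P3 and P4)
```

If "P1" were made to cite "P4", the check would return False, which is the situation that the publication order of scientific papers normally rules out.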
A new weighted hybrid clustering framework was proposed by Liu, Yu, Janssens, Glänzel, Moreau & De Moor (2010), with a focus on combining text mining with bibliometrics in journal-set analysis. This framework integrates two different approaches: clustering ensemble and kernel-fusion clustering.

3 Aims, methods and data collection

To test the accuracy of the clustering and classification of scientific papers, we propose an answer to the following question: is Statistics interdisciplinary or multidisciplinary? In synthesis, our methodological proposal is organized as follows:

1. We combine different text mining techniques for information retrieval and map the networks of the content of papers written by single or multiple statisticians.
2. We focus on the analysis of the large networks that emerge from individual papers of statisticians (authors) citing other scientific works. These networks are analyzed with techniques from bibliometrics and social network analysis in order to construct coherent indicators for measuring the intensity of similarity relations within the bibliometric set and to cluster the papers analyzed.
3. We propose a clustering procedure based on Fisher's inverse chi-square method for integrating textual content and citation information.
4. We evaluate ISI Web of Knowledge subject categories as descriptive labels for statistical documents, and compare the clusters obtained in the third step with the ISI classification of the statisticians' papers.

We collect the papers of Italian statisticians on a monthly basis over the study period, and we take the SCs from Scopus and Web of Knowledge (WoK). To gather these data, we employed the following procedure:

1. Create a list of all papers authored by Italian statisticians (Scopus Author search).
2. Create a list of all the papers present in the references of Italian statisticians' papers (Scopus).
3. Create a list of all the papers that cite Italian statisticians' papers.
4. Manually download the HTML files, one for each paper, from WoK (with the following information: Authors, Title, Year, Source title, Volume, Abstract, Author Keywords, Index Keywords, References, Editors, ISSN, ISBN, CODEN, Language of Original Document, Document Type, SC).

The dataset will contain, for each statistician's publication, the title, the abstract text, the author keywords and the SCs, together with the publications it cites (references) and the publications that cite it (citations).

We modify the subject categories using the following method:

1. A paper with a single WoK SC that appears 10 or more times in our dataset keeps the assigned WoK SC name.
2. A paper with a single WoK SC that appears fewer than 10 times is reassigned to a broader WoK category.
3. A paper with two or more SCs of equivalent weight is assigned to a new conflated SC.
4. A paper with two or more SCs, one of which is clearly primary, has "Multidisciplinary" appended to the primary SC name.

Textual content is fully indexed and encoded in the Vector Space Model using the TF-IDF weighting scheme, and text-based similarities are computed as the cosine of the angle between the vector representations of two papers. The dimension of the term-by-document matrix is reduced by Latent Semantic Indexing (LSI) (Deerwester et al., 1988).

Citations among the selected publications are investigated under three different aspects:

a) Cross-citation (CRC): the cross-citation between two papers is defined as the number of citations between them. The direction of the citations is ignored.
b) Co-citation (COC): co-citation refers to the number of times two papers are cited together in subsequent literature. The co-citation frequency of two papers is equal to the number of papers that cite both of them.
c) Bibliographic coupling (BGC): bibliographic coupling occurs when two papers refer to a common third paper in their bibliographies. The coupling frequency corresponds to the number of papers that both of them cite.

All textual and citation data sources are converted into kernels using a linear kernel function. In particular, for the textual data, the kernel matrices are normalized so that their elements correspond to the cosine of pairwise document-by-term vectors. We combine the documents in a matrix of dissimilarities based on textual information, network structure or other bibliometric indicators, so that the integrated document distances can be passed to a learning algorithm. Both a weighted linear combination of distance matrices and Fisher's inverse chi-square method from statistical meta-analysis are applied.
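As an illustration of how Fisher's inverse chi-square method can integrate the two views, the following minimal Python sketch turns a text-based and a citation-based distance for each pair of papers into p-values and combines them with the statistic X2 = -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. It is a sketch under stated assumptions, not the implementation used in this study: the conversion of distances into p-values through their empirical distribution, the toy distances and the function names `empirical_p_values` and `fisher_combine` are all assumptions made for illustration.

```python
import math


def fisher_combine(p_values):
    """Fisher's method: return (X2, combined p-value) for k p-values.

    X2 = -2 * sum(ln p_i) is chi-square with 2k degrees of freedom under H0.
    For even degrees of freedom the survival function has the closed form
    exp(-x/2) * sum_{i<k} (x/2)**i / i!.
    """
    k = len(p_values)
    x2 = -2.0 * sum(math.log(p) for p in p_values)
    half = x2 / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return x2, math.exp(-half) * total


def empirical_p_values(distances):
    """Assumed null model: the p-value of a pair is the fraction of all
    pairs whose distance is less than or equal to that pair's distance."""
    values = list(distances.values())
    n = len(values)
    return {pair: sum(v <= d for v in values) / n
            for pair, d in distances.items()}


# Hypothetical toy distances between three papers A, B and C.
text_dist = {("A", "B"): 0.20, ("A", "C"): 0.90, ("B", "C"): 0.70}
cite_dist = {("A", "B"): 0.10, ("A", "C"): 0.80, ("B", "C"): 0.95}

p_text = empirical_p_values(text_dist)
p_cite = empirical_p_values(cite_dist)

# The combined p-value plays the role of the integrated distance.
integrated = {pair: fisher_combine([p_text[pair], p_cite[pair]])[1]
              for pair in text_dist}
closest = min(integrated, key=integrated.get)
print(closest)  # ('A', 'B')
```

In this toy example the pair (A, B) receives the smallest combined p-value, about 0.355, because it is the closest pair in both views. The resulting values could then be passed to any distance-based clustering algorithm in place of a single-source distance matrix.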
We label the clusters obtained according to their most significant terms and most representative publications. Finally, we compare the cluster structure with the ISI classification scheme. We cluster the statistical abstract data to evaluate SCs as document labels, attempting to reconcile clustering (a bottom-up approach) with predefined categories (a top-down approach). If the clusters produced by the hybrid framework do not correspond well to the SCs, we can conclude that SCs are not well suited to the classification of statistical publications, and speculate that this may also be true for other interdisciplinary fields.

4 Conclusion

Disciplinary diversity indicators are developed to describe the heterogeneity of a bibliometric set viewed from predefined categories, i.e. using a top-down approach that locates the set on the global map of science. In this pilot study on Italian statisticians, we investigated the use of a hybrid clustering technique to aid in measuring researcher interdisciplinarity. Furthermore, we assess whether Journal Subject Categories from the Web of Knowledge database are sufficient for labeling statistics documents. Clustering and textual classification allow an interdisciplinary analysis that 1) describes collaboration and the integration of knowledge and 2) draws useful conclusions for statistical researchers by uncovering the underlying structure of research tracks.

5 References

Boyack, K. W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64.
Braam, R. R., Moed, H. F., & van Raan, A. F. J. (1991). Mapping of science by combined cocitation and word analysis. 2. Dynamic aspects. Journal of the American Society for Information Science, 42(4).
Brewer, G. D. (1999). The challenges of interdisciplinarity. Policy Sciences, 32.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7).

Deerwester, S., et al. (1988). Improving information retrieval with latent semantic indexing. Proceedings of the 51st Annual Meeting of the American Society for Information Science, 25.
Gowanlock, M., & Gazan, R. (2012). Assessing researcher interdisciplinarity: A case study of the University of Hawaii NASA Astrobiology Institute. ICS research documents.
Janssens, F., Zhang, L., De Moor, B., & Glänzel, W. (2009). Hybrid clustering for validation and improvement of subject-classification schemes. Information Processing and Management, 45(6).
Leydesdorff, L. (2006). Can scientific journals be classified in terms of aggregated journal-journal citation relations using the Journal Citation Reports? Journal of the American Society for Information Science and Technology, 57(5).
Leydesdorff, L., & Rafols, I. (2009). A global map of science based on the ISI subject categories. Journal of the American Society for Information Science and Technology, 60(2).
Liu, X., Yu, S., Janssens, F., Glänzel, W., Moreau, Y., & De Moor, B. (2010). Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database. Journal of the American Society for Information Science and Technology, 61(6).
Marshakova, I. V. (2003). Journal co-citation analysis in the field of information science and library science. In P. Nowak & M. Gorny (Eds.), Language, information and communication studies. Adam Mickiewicz University, Poznan.
Morillo, F., Bordons, M., & Gómez, I. (2001). An approach to interdisciplinarity through bibliometric indicators. Scientometrics, 51.
Moya-Anegón, F., Vargas-Quesada, B., Herrero-Solana, V., Chinchilla-Rodríguez, Z., Corera-Álvarez, E., & Muñoz-Fernández, F. J. (2004). A new technique for building maps of large scientific domains based on the cocitation of classes and categories. Scientometrics, 61(1).
National Academies. (2005). Committee on Facilitating Interdisciplinary Research, of the Committee on Science, Engineering, and Public Policy. Facilitating interdisciplinary research. Washington, DC.
Porter, A., & Rafols, I. (2009). Is science becoming more interdisciplinary? Measuring and mapping six research fields over time. Scientometrics, 81.
Porter, A., Cohen, A., Roessner, J. D., & Perreault, M. (2007). Measuring researcher interdisciplinarity. Scientometrics, 72.
Rafols, I., & Meyer, M. (2010). Diversity and network coherence as indicators of interdisciplinarity: Case studies in bionanoscience. Scientometrics, 82(2).
Small, H. (1999). Visualizing science by citation mapping. Journal of the American Society for Information Science, 50.
Wagner, C. S., Roessner, J. D., Bobb, K., Klein, J. T., Boyack, K. W., Keyton, J., Rafols, I., & Börner, K. (2011). Approaches to understanding and measuring interdisciplinary scientific research (IDR): A review of the literature. Journal of Informetrics, 5(1), 14-26.
Zhang, L., Liu, X., Janssens, F., Liang, L., & Glänzel, W. Subject clustering analysis based on ISI category classification. Journal of Informetrics, 4(2).
Zitt, M., & Bassecoulard, E. (1994). Development of a method for detection and trend analysis of research fronts built by lexical or cocitation analysis. Scientometrics, 30(1).