Towards automatic discovery of co-authorship networks in the Brazilian academic areas

Transcription

1 Towards automatic discovery of co-authorship networks in the Brazilian academic areas Jesús. Mena-Chalco, Roberto M. Cesar Junior Department of Computer Science Institute of Mathematics and Statistics University of São aulo - Brazil {jmena,cesar}@vision.ime.usp.br Abstract In Brazil, individual curricula vitae of academic researchers, that are mainly composed of professional information and scientific productions, are managed into a single software platform called Lattes. Currently, the information gathered from this platform is typically used to evaluate, analyze and document the scientific productions of Brazilian research groups. Despite the fact that the Lattes curricula has semi-structured information, the analysis procedure for medium and large groups becomes a time consuming and highly error-prone task. In this paper, we describe an extension of the scriptlattes (an open-source knowledge extraction system from the Lattes platform), for analysing individuals Lattes curricula and automatically discover largescale co-authorship networks for any academic area. Given some knowledge domain (academic area), the system automatically allows to identify researchers associated with the academic area, extract every list of scientific productions of the researchers, discretized by type and publication year, and for each paper, identify the co-authors registered in the Lattes latform. The system also allows the generation of different types of networks which may be used to study the characteristics of academic areas at large scale. In particular, we explored the node s degree and AuthorRank measures for each identified researcher. Finally, we confirm through experiments that the system facilitates a simple way to generate different co-authorship networks. To the best of our knowledge, this is the first study to examine large-scale co-authorship networks for any Brazilian academic area. I. INTRODUCTION Data mining is a process of extracting large volumes of data, commonly used to identify or reveal possible relationships between instances of the treated elements [1]. The extraction of scientific data, identification of bibliometric patterns, and effective modeling and visualization of networks of interaction among co-authors are relevant topics in e-science, Bibliometrics and Scientometrics. In recent years, there has been growing interest in such topics because of the knowledge discovery process that may be implemented from the treatment of datasets available in the repositories of scientific production (e.g. datasets of bibliographic production, academic guidances and research projects). On the other hand, in Brazil, the National Council for Scientific and Technological Development (CNq) performs an important work on the integration of academic curricula databases of public and private institutions into a single platform called Lattes. The so-called Lattes Curriculum are considered a national standard of assessment or evaluation, representing a history of scientific activities academic and professional of researchers registered on the platform [2]. Lattes curricula have been designed to show individual public information for each user registered on the platform. In this context, performing a summarization of bibliographic production, or generation of co-authorship networks, for a group of registered users of medium or large size (e.g. researchers from a specific academic area), really requires a lot of manual effort and a myriad of process, susceptible to failures, which could influence the obtained results. Therefore, starting from the work 1 [3], an automatic system for generation of large-scale co-authorship networks, based on Lattes curricula belongings to researchers associated to specific academic areas, has been created. It is worth noting that this previous work (i) do not explore the automatic identification of researchers associated with an academic area, and (ii) do not extract measures to characterize co-authorship networks. It is important to note that our approach assumes the existence of a large set of Lattes curricula. Our local dataset contains over 1.3 million of curricula. The data was gathered by crawling publicly accessible information on to the Lattes platform. We believe that our work is the first study to automatically examine Brazilian co-authorship networks in such large-scale. In addition to the generation of co-authorships networks, the system allows to automatic computation of two metrics related to the node s degree and the AuthorRank measure [4] for each identified researcher. Considering our experiments, we provide an insight into the real-world co-authorship networks. We observe a significant correlation between the node measures such as the degree value and the AuthorRank measure calculated of each researcher associated with an academic area. However, we observe that the AuthorRank measure is an indicator more representative than the degree value, because the AuthorRank better revels the status of a researcher, as mentioned by X. Liu et al. [4]. Finally, note that this system can be used for exploring, identifying and validating patterns of scientific activities, thus bringing bibliometric and/or scientometric information about a group of interest [5]. The relevance of our work rest on the benefits of using the process considered for the automatic 1 The scriptlattes is available as open-source code that can be adapted and improved for specific research needs:

2 analysis of scientific production and relationship among the actors of research academic areas [6], [7]. This paper is organized as follows. The acquisition procedures used to collect the used dataset is described in Section II. The system is described in Section III. Some experiments and results are shown in Section IV. The paper is concluded with some comments on our ongoing work in Section V. II. DATA COLLECTION The data presented in this paper is from the Lattes platform on May 5th, 2011 and contains about 1.3 million curricula. We have created several procedures in order to crawl curricula, in HTML format, by accessing the public websearch interface provided by the Lattes platform. Several webquery were executed for the purpose of access to large Lattes curricula dataset. Each web-query has been referred to a specific knowledge area. In summary, it was performed eighty queries to the Lattes platform, each one associated to an knowledge domain. For every result a specific HTML parser has been created in order to automatically obtain the list of curricula. With this in mind, we used this list to automatically download the Lattes curricula. It is important to note that with this collection strategy, we do not attempt to study the entire Lattes database, but a representative quantity data, since in 2007 the CNq announced that the Lattes platform achieved one million of registered curriculum which means that it should be much larger now. III. THE SYSTEM Our proposed system is composed of four main modules, namely (i) data selection, (ii) data extraction, (iii) redundancy treatment, (iv) co-authorship networks generation, and (v) AuthorRank computation. The basic operation of the system is as follows. Firstly, a list of Lattes curricula, in HTML format, must be selected following some criteria of analysis given by the user. This requires multiple queries to the local database. The resulting Lattes curricula are processed through a deterministic HTML parser that extracts all information of interest (e.g. professional information, expertise areas, list of academic productions). Therefore, the processed productions lists, associated with each researcher, are combined into lists, organized by type and publication year. The obtained filtered lists are used to create different types of networks. This is achieved by identifying all co-authors (registered on Lattes platform) of each publication. The AuthorRank computation is then performed to obtain a measure related to the status of each researcher. Fig. 1 shows the high level view of the proposed system. A. Data selection This module allows to select the Lattes curricula, in HTML format, from the local database. Therefore, only the subset of researchers Lattes curricula is chosen in accordance to the academic (expertise) areas informed by the user. In order to facilitate the comparison between strings, the curricula were normalized to use the same characters codification. As Fig. 1. A high level view of our system to automatically discover coauthorship networks belongings to academic areas. indicated in Section II, public users of the Lattes platform do not have access to the complete Lattes database. For this reason, special attention is given to extract the information about the scientific productions, as explained in the data extraction module. B. Data extraction In this module, a deterministic HTML arser [8] is used to automatically extract all information of each researcher related to full name, bibliographic citation name, academic areas, professional address, scholarship type, and gender. Additionally, the designed parser allows to extract the list of academic productions limited by the years indicated in the time period (given by the user). Currently, it is considered three types of academic production, namely (i) bibliographical production, (ii) technical production, and (iii) artistic production. See in Table I the types of academic production considered by the system. Is worth mentioning that a computational challenge for the system is processing the data in HTML format, where the constituent parts of the academic productions (e.g. authors names, publication title, the journal name of publication, number, volume, pages, year) are presented without any indication of division/separation. Thus, the designed parser identifies, in the vast majority of cases, all the constituent parts of the academic productions. C. Redundancy treatment It is common that some academic productions are made in co-authorship with one or more researches of the selected group in the Data selection module. Thus, the same academic production may be being registered in each Lattes curriculum of the corresponding co-authors. The developed system has a redundancy treatment module that allows the detection of

3 TABLE I INFORMATION EXTRACTED FROM THE LATTES CURRICULUM. (i) Bibliographical production Articles in scientific journals Book published/organized Book chapter published Articles in newspapers/magazines Complete works published in proceedings of conferences Expanded summary published in proceedings of conferences Summary published in proceedings of conferences Articles accepted for publication resentations of work Other kinds of bibliographical production (ii) Technical production atented or registered software Not patented or registered software Technological products Techniques or process Technical works Other kinds of technical production (iii) Artistic productions Artistic/cultural production equal or similar academic productions. In this regard, the productions are used to detect co-authorship among the selected group of researchers: two or more researchers are considered co-authors if there is a common production between them. The detection of equal or similar productions is accomplished through comparisons among all extracted academic productions discretized by year and type (e.g., journal paper, book chapter), so that the productions with different publication s year are not used in any comparison, allowing a substantial reduction of processing time of this module. Due to inconsistencies in filling information in the Lattes curriculum (typos or lack of standardization of co-authors names [9]), the comparison of any two productions is made through an inexact matching procedure between the title of the academic productions. Two productions are considered equal if the similarity percentage between the titles is greater than a percentage given by the user. The similarity between two strings is based on the distance proposed by Levenshtein [10]. The Levenshtein distance is obtained through the minimum number of insertions, deletions or substitutions required to transform one string into another (Levenshtein distance equals to 0 indicate that two considered titles are exactly equal). An important feature of this module is the conjunction of data in the similar productions, so that the missing information of a production can be combined/complemented with information from the other. Currently, complementation refers only to the exclusive use of the string with larger size. With this module, the system is able to keep a record of all coauthors associated with a particular academic production. This information is important in numerical weighting of academic productions, such as the normalization of edges weights belonging to co-authorship networks. D. Co-authorship networks generation A co-authorship network describes research activities that have been carried out by a research group [6], [11]. In this module, the designed system uses a graph to represent the coauthorship among researchers based on scientific productions. Each researcher is represented by a node. An edge is created between a pair of nodes whenever a common production of the corresponding researchers is detected by the redundancy treatment module, i.e., if two researchers have co-authored an academic production, their respective nodes in the coauthorship network are linked by an edge (co-author relationship between authors). Our work is focused on creating different co-authorship networks. Below we describe briefly each of them. 1) Networks without weights: this type of network is commonly represented by an undirected graph without weights (or undirected unit-weighted graph), where the edges represent only connections for co-authored academic productions. A simple measure that can be harnessed on this type of graphs is the node degree, that represents the number of edges connected to it. See an example in Fig. 2(a). The degree of the node A1 is 3. 2) Networks with un-normalized weights: this type of network is represented by an undirected weighted graph, where the edges represent the number of academic productions developed in co-authorship between two nodes. In this type of networks the node degree can be seen as the sum of the edges weights of edges connected to it. Note that with this representation, the total number of productions made in co-authorship is overlooked. For example, in Fig. 2(b) the author A1 collaboratively has participated in three productions, however, the node degree is 4 (this value does not correspond with a common sense notion of coauthorship). 3) Networks with co-authorship frequencies: to work around the problem reported in the previous subsection, in [4] it was defined the terms of exclusivity and co-authorship frequency. Exclusivity is related to the value at which two authors have an exclusive co-authorship relation for a particular academic production. Thus, given an academic production co-authored by n authors, the exclusivity value of this production can be defined as 1/(n 1). This measure gives more weight to coauthor relationships in academic production with fewer total co-authors than articles with large numbers of co-authors. See in the top of Fig. 2 examples of exclusivity values computed for three productions. The co-authorship frequency between two co-authors is defined as the sum of all exclusivity values between them. This measure gives more weight to authors who co-publish more academic productions together. See in Fig. 2(c) examples of co-authorship frequencies. Note that with this representation,

4 Fig. 2. Example of co-authorship networks obtained from automatic identification of 3 productions written by 4 authors: A1, A2, A3 and A4. given an author, the total number of productions made in coauthorship is computed by the sum of co-authorship frequencies obtained from co-authors connected to the author. For example, the author A1 has participated in three productions ( ). 4) Networks with normalized weights: this type of network is represented by a directed weighted graph where, given an author, the salient edges are represented by the co-authorship frequencies normalized by the total number of productions made in co-authorship. This normalization ensures that the weights of an author s relationships sum to one. See in Fig. 2(d) an example of network with normalized weights. The normalized weights intuitively gives an idea of the importance of a co-author in the production carried out in collaboration with others. For example, the author A1 participates in 75% of the production made in co-authorship of A2 (i.e., A1 is 75% important for A2). However, the author A2 is important only about 50% for A1. On the other hand, A1 is 100% important for A4, but A4 is important only about 33% for A1. This behavior is typical in the relationship advisor-advisee (A1-A4). E. AuthorRank computation The weights of the aforementioned directed graph show the importance of collaboration and reciprocity, being also used in the system as a basis for the calculation of the AuthorRank measures [4] of researchers. We could understand the AuthorRank measure as a numeric value that represents the status of a researcher within the co-authorship network (the algorithm proposed by X. Liu et al. [4], called AuthorRank, it is an adaptation of the agerank algorithm [12] used in the search for relevant pages in the Google search engine). Therefore, the researchers with more collaborative status have the highest AuthorRank values. The AuthorRank (AR) of an author i, which has n coauthors, is given as follows: AR (t) (i) = (1 d) + d n ( AR (t 1) ) (j) w j,i j=1 where AR (t 1) (j) corresponds to the AuthorRank of the j coauthor of i (j = 1,..., n), w j,i corresponds to the normalized weight between author j and author i, and d is the damping factor (usually set to 0.85 as described in [13]). For this iterative process we considered the initialization of Author- Rank measures as AR (0) (k) = 1 k. In our experiments the AuthorRank measures converged when t = 100. For the example shown in Fig. 2, A1 is the author with the highest AuthorRank. It is important to note that, given the node degree values, the authors A2 and A3 could be classified at the same level. However, considering the AuthorRank measures, the ranking of author A2 is better than the ranking of author A3. IV. EXERIMENTS AND RESULTS Researchers from four domains have been considered in the present study: Computer Science, Mathematics, hysics, and Statistics. In order to standardize the analysis of automatic generation of co-authorship networks, only the journal papers published between 2000 and 2010 were considered. The obtained co-authorship networks are presented in Fig. 3. For each knowledge domain, the Data selection module (Section III) has been configured to consider only researchers with a doctoral degree (at least). In the regular representation (first column), the researchers are represented by nodes, and the co-authorships are represented by unweighted straight edges. The graph drawing was obtained through the Fruchterman- Reingold algorithm [14]. The entire graph is simulated as if it were a physical system. The aforementioned algorithm is based on the attribution of forces between the nodes and the (1)

5 Regular representation Node size is proportional to node s degree Node size is proportional to node s AuthorRank measure (a) Computer Science 6014 researchers (b) Mathematics 5008 researchers (c) hysics 6410 researchers (d) Statistics 2274 researchers Fig. 3. Co-authorship networks belonging to researchers associated to four Brazilian academic areas. The researchers are represented by nodes, the coauthorships are represented by edges. The networks were created considering the journal papers published between 2000 and 2010 by researchers with doctoral degree. Isolated nodes are not shown.

6 cumulative frequency Computer sciences Mathematics hisics Statistics All degree Fig. 4. Cumulative degree distribution of co-authorship networks of four Brazilian academic areas. edges, as if the edges were straight springs and nodes are electrically charged particles. The forces are applied iteratively to the nodes, pulling them closer together or pushing them more distant until the system enter into an equilibrium state. This algorithm was adopted to visualize the complete co-authorship networks because the vertices are evenly distributed, making edge lengths uniform and reflecting symmetry. In general, each co-authorship network apparently reflects characteristics of professional/academic practice. For instance, mathematicians usually have a few co-authors in publishing their work. On the other hand, physicists generally tend to have several co-authors in the publications. This approach may be better understood by calculating the cumulative degree distribution of each co-authorship network (see Fig. 4). In the representation of co-authorship networks of the second column of Fig. 3, each node size is proportional to its own measure of degree of connectivity, i.e., the size of the node represents the number of co-authors of each researcher, so that the more collaborative researcher will have a larger size of the node. On the other hand, in the third column of Fig. 3, the node size is proportional to its own measure of AuthorRank. Note that the nodes maintain the same relative positions for all co-authorship networks of each area. As expected, only one giant component was obtained for each academic area. As a global analysis we execute our proposed system in order to obtain a complete co-authorship network composed by the four aforementioned areas. A unique label was assigned for each academic area hd researchers have been identified in our local database researchers were associated with more than one academic area. Fig. 5 presents the co-authorship networks created considering the journal papers published between 2000 and Note that in the regular representation of the network (first column), the researchers associated with each area are slightly grouped among themselves forming components. Note also that the area of mathematics (nodes represented in green color) is transversal to the other three areas considered in our experiment. In the second and third column of Fig. 5 the node size associated to each researcher was modified in order to represent the degree value and AuthorRank measures, respectively. The higher value is represented by a node with a higher size. Note that the more connected academic area corresponds to hysics. Therefore, it is expected that the majority of nodes with large size (i.e., nodes width high degree value) belong to this area. However, a better sense of collaboration is given by the AuthorRank measures. Regarding the relationship between node measures such as degree and AuthorRank that allows to characterize the topology of real-world networks, few studies had been conducted [15]. In this sense, in our experiment, we computed the earson s correlation coefficient of (p-value < 2.2e- 16) between the degree values and the AuthorRank measures. This significant correlation indicates that the AuthorRank measure is related to degree value of each node. In fact, both measures indicate how related is a researcher in the academic community. Fig. 6 displays the graphic correlation between both measures. In order to compare the representation of co-authorship networks by both measures, in Table II the top 10 researchers are presented sorted by degree values and AuthorRank measures. Note that, considering the nodes degree value, the top 10 identified researchers are associated with the area of hysics (highly connected area). On the other hand, considering the nodes AuthorRank measure, the researchers more collaborative from other areas are shown in evidence in the top 10 researchers associated with the four considered academic areas. It is also interesting to note that both measures are apparently linked in part to the number of papers published and the type of academic scholarship conferred by the CNq (the best types of scholarships associated to the top researchers are:,, 1C and 1D). V. CONCLUSION AND ERSECTIVES In the present work we have presented a system which can improve the usefulness significantly of the scriptlattes (an open-source knowledge extraction system from the Lattes platform) in order to automatically discover co-authorship networks through data extracted from the Lattes platform. Our initiative contributes mainly in two aspects: (i) allow the automatic generation of large-scale co-authorship networks of several Brazilian academic area, and (ii) provides a good opportunity to analyze, through metrics, co-authorship networks derived from different areas associated with the eighty knowledge areas classified by CNq. All things considered, the co-authorship networks may be examined by complementary tools for social network analysis such as ajek, UCInet or R. Thereby, the obtained networks

7 Regular representation Node size is proportional to node s degree value Node size is proportional to node s AuthorRank measure Fig. 5. Co-authorship network belonging to researchers associated to four Brazilian academic areas: Computer Science (red), Mathematics (green), hysics (blue), Statistics (magenta). Researchers with two or more academic areas are presented in black color. The networks were created considering the journal papers published between 2000 and 2010 by researchers with doctoral degree. Isolated nodes are not shown. TABLE II T O 10 RESEARCHERS W. R. T. DEGREE VALUES ( LEFT ) AND AUTHOR R ANK MEASURES ( RIGHT ) OBTAINED FROM JOURNAL AERS UBLISHED BETWEEN 2000 AND 2010 BY RESEARCHERS WITH H D DEGREE. CS=C OMUTER S CIENCE, M=M ATHEMATICS, = HYSICS, S=S TATISTICS. Researcher A6201 A14631 A2763 A12704 A1839 A260 A3599 A14752 A9190 A11411 Degree AuthorRank apers Area(s) CS, Type 1C 1D 1C 1C can be easily treated with various metrics such as: distance, clustering and centrality (closeness and betweenness) [16], [17]. These measures are widely studied to: (i) characterize social networks and automatically identify co-authorship subcommunities of research groups [18], (ii) study and characterize the temporal evolution of co-authorship among researchers, i.e., analyze the cooperation dynamics of researchers across different periods of time [19], or (iii) correlate quantitative with qualitative collaboration data in order to examine the trending of research activities on specific axes [20], [21]. Finally, we emphasize that the use of automatic tools similar to the presented (e.g. DBL, CiteSeer, Google Scholar, Microsoft Academic Search, and ArnetMiner [22]) have been increasingly necessary because there is a increasing volume of academic production data that must be correctly computed [1] and effectively visualized by using computational methods [23]. ACKNOWLEDGMENT The authors would like to thank CNq, CAES, FAES and FINE for their financial support. Researcher A14631 A4600 A9220 A15939 A14752 A2763 A11411 A14054 A16431 A6201 AuthorRank Degree apers Area(s) CS, CS, S CS, M, M CS Type 1C R EFERENCES [1] S. Issue, Communications of ACM: Surviving the data deluge. New York, NY, USA: ACM, 2008, vol. 51, no. 12. [2] C. Amorin, Curriculum vitae organization: the Lattes software platform, esquisa Odontolgica Brasileira, vol. 17, no. 1, pp , [3] J.. Mena-Chalco and R. M. Cesar-Jr., scriptlattes: An open-source knowledge extraction system from the lattes platform, Journal of the Brazilian Computer Society, vol. 15, no. 4, pp , [4] X. Liu, J. Bollen, M. Nelson, and H. V. de Sompel, Co-authorship networks in the digital library research community, Informations rocessing and Management, vol. 41, no. 6, pp , [5] S. Nicholson, The basis for bibliomining: Frameworks for bringing together usage-based data mining and bibliometrics through data warehousing in digital library services, Informations rocessing and Management, vol. 42, no. 3, pp , [6] S. Klink,. Reuther, A. Weber, B. Walter, and M. Ley, Analysing social networks within bibliographical data, in 7th International Conference on Database and Expert Systems Applications, ser. Lecture Notes in Computer Science, vol. 4080, 2006, pp [7] R. Kouzes, G. Anderson, S. Elbert, I. Gorton, and D. Gracio, The changing paradigm of data-intensive computing, Computer, vol. 42, no. 1, pp , [8] M. Tomita, Current issues in parsing technology. Kluwer Academic ublishers, Boston, [9] I. S. Kang, S. H. Na, S. Lee, H. Jung,. Kim, W. K. Sung, and J. H. Lee, On co-authorship for author disambiguation, Information rocessing and Management, vol. 45, no. 1, pp , [10] G. Navarro, A guided tour to approximate string matching, ACM Computing Surveys, vol. 33, no. 1, pp , 2001.

8 Degree AuthorRank Fig. 6. Correlation between degree and AuthorRank measures. [11] M. Maia and S. Caregnato, Co-autoria como indicador de redes de colaborao cientfica, erspectivas em Cincia da Informao, vol. 13, no. 2, pp , [12] L. age, S. Brin, R. Motwani, and T. Winograd, The pagerank citation ranking: Bringing order to the web, Stanford InfoLab, Technical Report , November 1999, previous number = SIDL-W [13] S. Brin and L. age, The anatomy of a large-scale hypertextual web search engine, in Seventh International World-Wide Web Conference, [14] T. M. J. Fruchterman and E. M. Reingold, Graph drawing by forcedirected placement, Software - ractice and Experience, vol. 21, no. 11, pp , [15] A. Jamakovic and S. Uhlig, On the relationships between topological measures in real-world networks, Networks and Heterogeneous Media, vol. 3, no. 2, pp , [16] S. Wasserman and K. Faust, Social network analysis: methods and applications. Cambridge University ress, [17] J.. Scott, Social Network Analysis: A Handbook, 2nd ed. SAGE ublications, [18] M. Newman and M. Girvan, Finding and evaluating community structure in networks, hysical Review E, vol. 69, no. 2, p , [19] B. Wu, F. Zhao, S. Yang, L. Suo, and H. Tian, Characterizing the evolution of collaboration network, in roceeding of the 2nd ACM workshop on Social web search and mining, ser. SWSM 09, 2009, pp [20] L. L. Leydesdorff, Can scientific journals be classified in terms of aggregated journal-journal citation relations using the journal citation reports? Journal of the American Society for Information Science and Technology, vol. 57, no. 5, pp , [21] L. Leydesdorff, Betweenness centrality as an indicator of the interdisciplinarity of scientific journals, Journal of the American Society for Information Science and Technology, vol. 58, no. 9, pp , [22] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, Arnetminer: Extraction and mining of academic social networks, in roceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 2008, pp [23] D. A. Keim, Information visualization and visual data mining, IEEE transactions on Visualization and Computer Graphics, vol. 8, no. 1, pp. 1 8, 2002.