Parker / SPL / Agentur Focus
ERACEP: IDENTIFYING EMERGING RESEARCH AREAS IN THE ERC RESEARCH PROPOSALS
Part Three: Demonstration
Dr. Thomas Reiss, Etienne Vignola-Gagné, Piret Kukk (Fraunhofer ISI, Karlsruhe)
Prof. Dr. Wolfgang Glänzel, Dr. Bart Thijs (KU Leuven)
Introduction
Objectives:
  Demonstrate the different tools and software used to map applications to topics
  Provide a good understanding of the workflow
  Show where human input/judgement is required at several stages in the process
  Discuss advantages and shortcomings of the methodology
Introduction
Structure of presentation:
  General overview of the workflow in its different stages
  Importance of an appropriate data source
  Usage of different software and analytical tools
  Detailed demonstration of the most crucial parts of the workflow
Workflow: Part 1. Indexing documents
[Flow diagram]
  Topics: Data in database -> Combine fields -> Data in fields -> Processing text & indexing -> Index for papers
  Applications: Data in PDF -> Convert to plain text -> Data in txt files -> Processing text & indexing -> Index for applications
  Processing text & indexing is done with the Lucene text indexer.
Workflow: Part 2. Mathematical representation
[Flow diagram]
  Topics: Index for papers -> Paper by term matrix -> TFiDF weighting -> Weighted paper by term matrix
  Applications: Index for applications -> Application by term matrix -> TFiDF weighting -> Weighted application by term matrix
  Both matrices are built on the set of common terms.
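The step from the two indexes to the two term matrices can be sketched as follows. This is a minimal illustration, not the authors' Java/Matlab implementation: it assumes that the "set of common terms" is the intersection of the terms found in the paper index and in the application index, and the token lists, `common_terms` and `term_matrix` are hypothetical names and data introduced only for this example.

```python
from collections import Counter


def common_terms(paper_docs, application_docs):
    """Terms that occur in at least one paper AND at least one application.

    Assumption: "set of common terms" means the intersection of both vocabularies.
    """
    paper_terms = {term for doc in paper_docs for term in doc}
    application_terms = {term for doc in application_docs for term in doc}
    return sorted(paper_terms & application_terms)


def term_matrix(docs, vocabulary):
    """One row of raw term counts per document, columns in vocabulary order."""
    rows = []
    for tokens in docs:
        counts = Counter(tokens)  # a Counter returns 0 for absent terms
        rows.append([counts[term] for term in vocabulary])
    return rows


# Hypothetical, already stemmed token lists (not taken from the real data).
papers = [["fuel", "cell", "layer", "layer"], ["water", "layer", "model"]]
applications = [["fuel", "cell", "oxid", "cell"]]

vocabulary = common_terms(papers, applications)
paper_by_term = term_matrix(papers, vocabulary)
application_by_term = term_matrix(applications, vocabulary)
```

Restricting both matrices to the same columns is what makes a row from one matrix directly comparable with a row from the other.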
Workflow: Part 3. Mapping of applications
[Flow diagram]
  Weighted paper by term matrix (topics) + Weighted application by term matrix (applications) -> Cosine similarity -> Application by paper matrix -> Average similarity -> Application by topic matrix
Importance of appropriate data source
Data for publications: Thomson Reuters (TR) Web of Science (WoS)
  Science Citation Index Expanded (SCIE)
  Social Sciences Citation Index (SSCI)
  Arts & Humanities Citation Index (AHCI)
ECOOM uses custom data delivered by TR as plain data underlying the WoS online database.
  Annual updates become available in March (e.g., March 2013: delivery of publications indexed in WoS in 2012)
  Delivered in tagged format
  Extensive processing is needed before the data are ready for analysis
Importance of appropriate data source
UT 000287079800007
T9 0138371746
AR DOI 10.1080/17458080.2010.487227
AR PII 932980638
TI Stable dispersions of poly(ethylene glycol) methyl ether-magnetite complexe
-- s in water
AU Theppaleak, T
RO Author
LN Theppaleak
AF Thapanapong
AD 1
AA Naresuan Univ, Dept Chem, Phitsanulok, Thailand
AD 2
AA Naresuan Univ, Ctr Excellence Innovat Chem, Phitsanulok, Thailand
AU Wichai, U
RO Author
LN Wichai
AF Uthai
AD 1
AA Naresuan Univ, Dept Chem, Phitsanulok, Thailand
AD 2
AA Naresuan Univ, Ctr Excellence Innovat Chem, Phitsanulok, Thailand
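A first processing step for a record like the one above might look like the sketch below. It rests on two assumptions that the slide does not state explicitly: that each tag starts on its own line in the delivered file, and that a line beginning with `--` continues the value of the previous tag. The function name `parse_tagged_record` is hypothetical.

```python
def parse_tagged_record(lines):
    """Turn 'TAG value' lines into an ordered list of (tag, value) pairs.

    Assumption: a line starting with '--' continues the previous value
    without an inserted space (as in 'complexe' + 's in water').
    """
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        tag, _, value = line.partition(" ")
        if tag == "--" and fields:
            previous_tag, previous_value = fields[-1]
            fields[-1] = (previous_tag, previous_value + value)
        else:
            fields.append((tag, value))
    return fields


record = [
    "UT 000287079800007",
    "TI Stable dispersions of poly(ethylene glycol) methyl ether-magnetite complexe",
    "-- s in water",
    "AU Theppaleak, T",
    "AU Wichai, U",
]

fields = parse_tagged_record(record)
titles = [value for tag, value in fields if tag == "TI"]
authors = [value for tag, value in fields if tag == "AU"]
```

A list of pairs is used rather than a dictionary because tags such as AU and AD repeat within one record and their order carries meaning.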
Importance of appropriate data source
Data alternatives:
Appropriate:
  SCOPUS (in custom format, not the online version)
NOT appropriate:
  Google Scholar:
    No access to all data through an API; impossible to download abstracts, addresses, authors, references
    No clear and transparent coverage
    No clear and transparent selection criteria for documents and notably for citations
  Other online resources such as Mendeley:
    Very selective coverage, depending on the authors' decisions
Importance of appropriate data source
Data for applications: 2009 Starting Grants
Part B, Section 1: one file typically containing:
  Cover page with name of PI, Host Institution, Full Title, Short Name, Duration
  Project Summary
  Section 1: Principal Investigator, description, CV, Early Achievements
  Extended Synopsis
  References
PDF format; needs to be converted before processing
Possible improvement: separate text files or XML
Software and Analytical Tools
Simple structure:
  Programming and processing in Java
  Storage in a relational database: Oracle
  Analysis and clustering: Matlab
  Text indexing: Lucene
  Some additional tools for visualisation, e.g. Pajek
Benefits:
  Smooth integration between the different platforms
  Each has extensive parallel computing capabilities: absolutely crucial for reasonable performance of each tool
  Total control over the processing of data
Mapping Applications to Topics
Text mining with Lucene:
  High-performance, full-featured text search engine library
  Written entirely in Java
  Open source project available for free download
  Version used: 4.0
  Website: lucene.apache.org
Benefits:
  Smooth integration into Matlab and any Java platform
  Open source: complete control and understanding of the applied methodologies; extensible and adaptable to individual needs
Matching of Applications
Document indexing
All terms from a document need to be extracted and processed. Some statistics (offset, frequency, ...) are calculated for the terms and stored in the index. The settings for extraction and processing have an impact on the usability and quality of the index.
  Tokenization
  Removal of stopwords
  Stemming (Porter Stemmer)
  Lower case
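The four processing steps can be illustrated with a small, self-contained sketch. It is not the Lucene analyzer used in the project: the stopword list is a tiny hypothetical one, and `toy_stem` is a crude suffix stripper standing in for the Porter stemmer, so its output will differ from real Porter stems (for example, it leaves "cathode" and "diffusion" unchanged). Lower-casing is applied first here so that stopword matching is case-insensitive.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "to", "and", "is", "has", "been"}


def toy_stem(token):
    """Crude suffix stripping; a stand-in for, NOT an implementation of, Porter."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words like "gas" survive.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lower-case, tokenize, drop stopwords, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(token) for token in tokens if token not in STOPWORDS]


terms = analyze("Transient analysis for the cathode gas diffusion layer of PEM fuel cells")
```

Changing any of these settings (a different stopword list, a different stemmer) changes which terms end up in the index, which is why the slide stresses their impact on index quality.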
Example publication
Original combination of title, abstract and keywords
Transient analysis for the cathode gas diffusion layer of PEM fuel cells # A one-dimensional, nonisothermal, two-phase transient model has been developed to study the transient behaviour of water transport in the cathode gas diffusion layer of PEM fuel cells. The effects of four parameters, namely the liquid water saturation at the interface of the gas diffusion layer and flow channels, the proportion of liquid water to all of the water at the interface of the cathode catalyst layer and the gas diffusion layer, the current density, and the contact or wetting angle, on the transient distribution of liquid water saturation in the cathode gas diffusion layer are investigated. Especially, the time needed for liquid water saturation to reach steady state and the liquid water saturation at the interface of the cathode catalyst layer and gas diffusion layer are plotted as functions of the above four parameters. The ranges of water vapour condensation and liquid water evaporation are identified across the thickness of the gas diffusion layer. In addition, the effects of the above four parameters on the steady state distributions of gas phase pressure, water vapour concentration, oxygen concentration and temperature are also presented. It is found that increasing any one of the first three parameters will increase the water saturation at the interface of the catalyst layer and gas diffusion layer, but decrease the time needed for the liquid water saturation to reach steady state. When the liquid water saturation at the interface of the gas diffusion layer and flow channels is high enough (>= 0.1), the liquid water saturation at steady state is almost uniformly distributed across the thickness of the gas diffusion layer. 
It is also found that, under the given initial and boundary conditions in this paper, evaporation takes place within the gas diffusion layer close to the channel side and is the major process for water phase change at low current density (< 2000 A m(-2)); condensation occurs close to the catalyst layer side within the gas diffusion layer and dominates the phase change at high current density (> 5000 A m(-2)). For hydrophilic gas diffusion layers, both the time needed for liquid water saturation to reach steady state and the water saturation at the interface of the catalyst layer and gas diffusion layer will increase when the contact angle increases; but for hydrophobic gas diffusion layers, both of them decrease when the contact angle increases. (c) 2005 Elsevier B.V. All rights reserved.,, BEHAVIOR, LIQUID WATER TRANSPORT, 2-PHASE FLOW, POROUS-MEDIA, MODEL, MEMBRANE, MULTIPHASE, PEM fuel cell, transient analysis, two-phase transport, gas diffusion layer
Example publication
Analyzed and stored text
transient, cathod, diffus, layer, pem, fuel, cell, dimension, non, isotherm, phase, transient, model, been, develop, transient, behaviour, water, transport, cathod, diffus, layer, pem, fuel, cell, effect, four, paramet, name, liquid, water, satur, interfac, diffus, layer, flow, channel, proport, liquid, water, all, water, interfac, cathod, catalyst, layer, diffus, layer, current, densiti, contact, wet, angl, transient, distribut, liquid, water, satur, cathod, diffus, layer, investig, especi, time, need, liquid, water, satur, reach, steadi, state, liquid, water, satur, interfac, cathod, catalyst, layer, diffus, layer, plot, function, abov, four, paramet, rang, water, vapour, condens, liquid, water, evapor, identifi, across, thick, diffus, layer, addit, effect, abov, four, paramet, steadi, state, distribut, phase, pressur, water, vapour, concentr, oxygen, concentr, temperatur, present, found, increas, ani, first, three, paramet, increas, water, satur, interfac, catalyst, layer, diffus, layer, decreas, time, need, liquid, water, satur, reach, steadi, state, liquid, water, satur, interfac, diffus, layer, flow, channel, enough, 0.1, liquid, water, satur, steadi, state, almost, uniformli, distribut, across, thick, diffus, layer, found, given, initi, boundari, condit, paper, evapor, take, place, within, diffus, layer, close, channel, side, major, process, water, phase, chang, low, current, densiti, 2000, condens, occur, close, catalyst, layer, side, within, diffus, layer, domin, phase, chang, current, densiti, 5000, null_4, hydrophil, diffus, layer, time, need, liquid, water, satur, reach, steadi, state, water, satur, interfac, catalyst, layer, diffus, layer, increas, contact, angl, increas, hydrophob, diffus, layer, them, decreas, contact, angl, increas, 2005, b.v, all, behavior, liquid, water, transport, phase, flow, porou, media, model, membran, multiphas, pem, fuel, cell, transient, phase, transport, diffus, layer
Example publication
TFiDF of most frequent terms

Term       Frequency  TFiDF
layer      21         30.52
water      18         17.06
diffus     16         24.51
liquid     11         17.18
satur      10         19.67
interfac   6          10.41
phase      6          7.01
steadi     5          9.58
increas    5          2.86
catalyst   5          8.16
cathod     5          9.65
state      5          4.95
transient  5          9.78
paramet    4          4.24
flow       3          3.36
transport  3          4.13
distribut  3          3.22
fuel       3          3.80
densiti    3          3.89
reach      3          4.47
contact    3          4.67
pem        3          6.92
time       3          2.07
need       3          2.98
current    3          3.11
four       3          3.36
angl       3          5.53
channel    3          5.41
cell       3          2.47

$$\mathrm{TFiDF}_t = \mathrm{TF}_t \cdot \mathrm{idf}_t, \qquad \mathrm{idf}_t = \log \frac{N}{n_t}$$

with
  $\mathrm{TF}_t$: term frequency of term $t$ in the document
  $\mathrm{idf}_t$: inverse document frequency of term $t$
  $n_t$: number of documents containing term $t$
  $N$: total number of documents in the set
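The weighting defined above can be sketched in a few lines. This is an illustration under assumptions: the slide does not state the base of the logarithm, so the natural logarithm is used here, and the four-document corpus is hypothetical, so the resulting numbers are not meant to reproduce the values in the table. The function names are likewise introduced only for this example.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """n_t: the number of documents that contain each term at least once."""
    frequencies = Counter()
    for tokens in docs:
        frequencies.update(set(tokens))  # count each term once per document
    return frequencies


def tfidf_weights(tokens, frequencies, total_docs):
    """TF * log(N / n_t) for every term of one document (natural log assumed)."""
    term_counts = Counter(tokens)
    return {
        term: count * math.log(total_docs / frequencies[term])
        for term, count in term_counts.items()
    }


# Hypothetical corpus of already stemmed documents.
corpus = [
    ["layer", "water", "layer"],
    ["water", "model"],
    ["water", "fuel"],
    ["cell", "fuel"],
]

frequencies = document_frequencies(corpus)
weights = tfidf_weights(corpus[0], frequencies, total_docs=len(corpus))
```

In this toy corpus "water" occurs in three of the four documents and therefore receives a much lower weight than "layer", which occurs in only one; a term present in every document would receive a weight of zero.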
Matching of Applications
Similarity between papers and applications
A cosine similarity is calculated between each application and each paper in the selected topics if they share at least one term. This results in a paper-by-application similarity matrix. The matrix can be aggregated to the level of topics or Subject Categories by applying statistical functions to the similarities. An application is considered relevant to a topic if its average similarity to the papers in that topic exceeds a given threshold.
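The matching step described above can be sketched as follows, using sparse term-weight dictionaries instead of full matrices. Several points are assumptions rather than statements from the slides: pairs without a shared term are given similarity 0 and are included in the average, the aggregation function is the arithmetic mean, and the weight vectors and topic names in the example are hypothetical. The threshold of 0.025 is the value quoted in the worked examples of this presentation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0  # no common term: similarity is not computed, treated as 0
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # guard against all-zero weight vectors
    return sum(u[t] * v[t] for t in shared) / (norm_u * norm_v)


def topic_scores(application, papers_by_topic):
    """Average similarity of one application to the papers of each topic."""
    return {
        topic: sum(cosine(application, paper) for paper in papers) / len(papers)
        for topic, papers in papers_by_topic.items()
    }


def relevant_topics(scores, threshold):
    """Topics whose average similarity exceeds the threshold."""
    return sorted(topic for topic, score in scores.items() if score > threshold)


# Hypothetical weighted vectors for one application and two small topics.
application = {"fuel": 1.0, "cell": 1.0}
papers_by_topic = {
    "Fuel Cells": [{"fuel": 1.0, "cell": 1.0}, {"fuel": 1.0}],
    "Waste Water": [{"water": 2.0}],
}

scores = topic_scores(application, papers_by_topic)
matched = relevant_topics(scores, threshold=0.025)
```

Because the average runs over all papers in a topic, large topics with many unrelated papers pull the score down, which may explain why a low threshold is workable in practice.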
Matching of Applications: Example
Application [240181] Non-Destructive (in situ) Micro-Structural Characterization of Solid Oxide Fuel Cells:

UT-code          Similarity  Title
000248849000002  0.453       Mathematical modelling of proton-conducting solid oxide fuel cells and comparison with oxygen-ion-conducting counterpart
000252513100042  0.422       On the single chamber solid oxide fuel cells
000259716600036  0.418       Thermodynamic analysis of ammonia fed solid oxide fuel cells: Comparison between proton-conducting electrolyte and oxygen ion-conducting electrolyte
000247093200001  0.410       A review of numerical modeling of solid oxide fuel cells
000243675300032  0.403       A high fuel utilizing solid oxide fuel cell cycle with regard to the formation of nickel oxide and power density
000238964200007  0.401       Cycle analysis of planar SOFC power generation with serial connection of low and high temperature SOFCs
000250992500011  0.394       A physically based dynamic model for solid oxide fuel cells
000259659300020  0.393       Modeling of methane fed solid oxide fuel cells: Comparison between proton conducting electrolyte and oxygen ion conducting electrolyte
000261009000058  0.380       Mathematical modeling of ammonia-fed solid oxide fuel cells with different electrolytes
000250932300003  0.374       Functional materials for the IT-SOFC
Matching of Applications: Example
Mapping with topics (threshold set at 0.025)
Energy & Fuels
  Fuel Cells (0.0812)
  Kinetics (0.0265)
  Biodiesel (0.0261)
Environmental Sciences
  Waste Water (0.0253)
Conclusion for this application: it is relevant to at least one emerging topic.
Matching of Applications: Example
Application [241161] Intergenerational correlations of schooling, income and health: an investigation of the underlying mechanisms:

UT-code          Similarity  Title
000229960400004  0.281       Spatial variations in intergenerational transmission of self-employment
000257973900004  0.277       Estimating intergenerational distribution preferences
000230053300009  0.276       An intergenerational and lifecourse study of health and mortality risk in parents of the 1958 birth cohort: (II) mortality rates and study representativeness
000220476700016  0.274       An intergenerational study of birthweight: investigating the birth order effect
000254008700004  0.272       The dynamics of intergenerational sexual relationships: the experience of schoolgirls in Botswana
000230053300008  0.267       An intergenerational and lifecourse study of health and mortality risk in parents of the 1958 birth cohort: (I) methods and tracing
000188603400003  0.263       The impact of income: assessing the relationship between income and health in Sweden
000236067600009  0.260       Intergenerational effects of preterm birth and reduced intrauterine growth: a population-based study of Swedish mother-offspring pairs
000220654500011  0.257       What is good parental education? Interviews with parents who have attended parental education sessions
000226627100008  0.246       How children with special health care needs affect the employment decisions of low-income parents
Matching of Applications: Example
Mapping with topics (threshold set at 0.025)
Public Health
  Quality of Life (0.029)
  Health Policy (0.039)
  Tobacco (0.0261)
  Gender and Family (0.0253)
Conclusion for this application: it is not relevant to any topic that has been identified as emerging.
Matching of Applications: Example
Application [240544] Construction of a Molecular Crane Based on the Flavoprotein Dodecin:

UT-code          Similarity  Title
000256109000003  0.163       From iron oxides to infections
000256392700011  0.152       Fuel cells based on multifunctional carbon nanotube networks
000252020500011  0.144       High energy lithium batteries by molecular wiring and targeting approaches
000253976200011  0.121       Single walled carbon nanotubes (SWCNT) affect cell physiology and cell architecture
000252096100003  0.117       Parameters for carbamate pesticide QSAR and PBPK/PD models for human risk assessment
000250900700051  0.114       Fabrication and electrochemical activity of Ni-attached carbon nanotube electrodes for hydrogen storage in alkali electrolyte
000226443100008  0.111       Quantitative assessment of tension in wires of fine-wire external fixators
000241412000007  0.110       Multi-walled carbon nanotubes based Pt electrodes prepared with in situ ion exchange method for oxygen reduction
000250654700005  0.108       Evaluation of electrochemical performance for surface-modified carbons as catalyst support in polymer electrolyte membrane (PEM) fuel cells
000226011300013  0.108       Evaluation of the effect of fluoride-containing acetic acid on NiTi wires
Matching of Applications: Example
Mapping with topics (threshold set at 0.025)
Not matched with any cluster or emerging topic.
Based on the journals referred to in the application, it is relevant to Materials Science and Biophysics.
However, the terms "biofuel" and "fuel cell"/"biofuel cell" do appear in the text. The summary states: "The results of these studies will be of general interest for the construction of molecular switches, devices, and transport systems, and for the development of amperometric biosensors and biofuel cells."