Paradigm Changes Affecting the Practice of Scientific Communication in the Life Sciences
Prof. Dr. Martin Hofmann-Apitius
Head of the Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI)
Professor for Applied Life Science Informatics, Bonn-Aachen International Center for Information Technology (B-IT)
Rapid Paradigm Changes: Addressing Increasing Complexity
[Figure: timeline from 1900 to 2020. Labels: Systems Biology, Simulation of Life, Isolated molecules, Genomics, Molecular Biology, Organic Chemistry, Biochemistry. Annotations: "From genes to proteins to biological function"; "Multiscale integration of biological, medical and chemical information".]
Page 2
Acquisition of BioMedical Knowledge Before the Internet
  Study textbooks
  Learn how to use a library
  Focus on a subject
  Read scientific publications
  Become an expert in the field
Page 3
Acquisition of BioMedical Knowledge Since 1997
  Study textbooks
  Learn how to use MEDLINE
  Get an overview on a subject
  Read scientific publications
  Become an expert in the field
Page 4
Growth Rates of Scientific Publications in BioMedicine
  Growth of PubMed: 1,500-3,500 new records per day
  Currently more than 16 million entries
Page 5
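To put the daily rate in perspective, the following is a rough annualization. It assumes, purely for illustration, that the quoted daily rate holds on all 365 days of a year; the slide itself gives no yearly figure.

```python
# Rough annualization of the PubMed growth rate quoted on the slide.
# Assumption (not from the slide): the daily rate is constant all year.
DAYS_PER_YEAR = 365


def entries_per_year(per_day: int) -> int:
    """Scale a constant daily intake to one year."""
    return per_day * DAYS_PER_YEAR


low = entries_per_year(1_500)
high = entries_per_year(3_500)
print(f"{low:,} to {high:,} new entries per year")
```

Under that assumption the range is roughly half a million to 1.3 million new entries per year.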
Biomedical Databases as Sources of Knowledge?
  BioMedical databases store data, not knowledge.
  How information is represented in a database depends on the database model.
  The expressiveness of database models is not sufficient to represent complex biomedical information.
Page 6
An Over-Simplification...?
  The more complex a subject is, the more likely you will find it adequately described only in unstructured text and not in databases.
Page 7
Breaking the Silos: Linking Named Entities in Text to Database Entries
Page 8
Mapping of Text Objects to Database Entries
[Figure labels: Proprietary knowledge; Textual information; Experimental data; Pathway/Interaction Databases]
Page 9
Protein Name Recognition
Challenges:
  Multiple names for one gene
  Ambiguous names in databases
  Ambiguous acronyms
  Common word names
  Multi-word terms
  Spelling variants
  Permutations
  Nested protein names
Example names shown on the slide:
  F12A
  COL1A1
  Neuronectin, GMEM, tenascin, HXB, cytotactin, hexabrachion
  p21, EPO, large T antigen
  WAS, STEP, ice, StAR
  Interleukin 1 alpha
  Tumor necrosis factor beta
  Collagen, type I, alpha 1
  Collagen alpha 1(I) chain
  Alpha 1 collagen
  Alpha-1 type I collagen
  TNF receptor 1
  collagen, type I, alpha receptor
Page 10
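Spelling variants and permutations of a multi-word name can often be collapsed by comparing names as unordered token sets. The sketch below is one simple way to do that, not the method used in the talk; the function name `name_key` and the filler-word list are assumptions made for illustration.

```python
import re

# Hypothetical filler words to ignore; a real system would tune this list.
FILLER = {"type", "chain"}


def name_key(name: str) -> frozenset:
    """Order-insensitive key: lowercase, split into letter/digit runs, drop filler.

    Note: using a set also discards repeated tokens.
    """
    tokens = re.findall(r"[a-z]+|\d+", name.lower())
    return frozenset(t for t in tokens if t not in FILLER)


variants = [
    "Collagen, type I, alpha 1",
    "Collagen alpha 1(I) chain",
    "Alpha-1 type I collagen",
]
keys = {name_key(v) for v in variants}
print(len(keys))  # all three variants share one key
```

The approach has clear limits. "Alpha 1 collagen" omits the type and therefore gets a different key, and the nested name "collagen, type I, alpha receptor" is correctly kept apart only because the token "receptor" survives. Ambiguous acronyms and common-word names need context, which token normalization cannot supply.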
Functional Network of Interacting Molecules Extracted from Text
Page 11
Awareness of Synonyms by a Computer Programme
Page 12
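A minimal sketch of how a program can be made aware of synonyms: a dictionary maps every known name to one database entry, and a regular expression tags mentions in text. The synonym list comes from the protein name slide; the identifier "DB:0001" and the helper names `build_tagger` and `tag` are placeholders invented for this example.

```python
import re

# Toy synonym dictionary built from names shown in the talk.
# "DB:0001" is a placeholder identifier, not a real database accession.
SYNONYMS = {
    "DB:0001": ["Neuronectin", "GMEM", "tenascin", "HXB", "cytotactin", "hexabrachion"],
}


def build_tagger(synonyms):
    """Return (compiled pattern, lowercase-name -> entry id lookup)."""
    lookup = {name.lower(): entry_id
              for entry_id, names in synonyms.items()
              for name in names}
    # Longest names first, so a long name wins over a shorter name it contains.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def tag(text, pattern, lookup):
    """List (mention, entry id) pairs in order of appearance."""
    return [(m.group(0), lookup[m.group(0).lower()]) for m in pattern.finditer(text)]


pattern, lookup = build_tagger(SYNONYMS)
mentions = tag("Cytotactin and hexabrachion are two names for one protein.", pattern, lookup)
print(mentions)
```

Both mentions resolve to the same entry. Case-insensitive matching is a deliberate simplification here: it would mis-tag common-word names such as WAS or ice, which is why practical systems add context-based disambiguation.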
Available Chemical Information
  Textbooks
  Reports
  Patents
  Databases
  Scientific journals and publications
  Websites
Page 13
Representing a Chemical Compound
How much information do you want to include?
  Atoms present
  Connections between atoms
    bond types
  Isotopes
  Charges
  Stereochemical configuration
[Figure: example structure drawing with atom labels including OH, N+ and O-]
Page 14
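The levels of detail listed above map naturally onto optional fields of a data structure. The following is a sketch only; the class and field names are assumptions, and the example molecule (the heavy atoms of a glycine zwitterion with a hypothetical carbon-14 label) was chosen for illustration and is not the structure drawn on the slide.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Atom:
    element: str                   # atoms present
    isotope: Optional[int] = None  # mass number, e.g. 14 for carbon-14
    charge: int = 0                # formal charge
    stereo: Optional[str] = None   # stereo descriptor such as "R" or "S"


@dataclass
class Bond:
    begin: int      # index of the first atom
    end: int        # index of the second atom
    order: int = 1  # bond type: 1 single, 2 double, 3 triple


@dataclass
class Molecule:
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)

    def net_charge(self) -> int:
        """Sum of the formal charges of all atoms."""
        return sum(atom.charge for atom in self.atoms)


# Heavy atoms only (hydrogens left implicit).
mol = Molecule(
    atoms=[Atom("N", charge=1), Atom("C", isotope=14), Atom("C"),
           Atom("O"), Atom("O", charge=-1)],
    bonds=[Bond(0, 1), Bond(1, 2), Bond(2, 3, order=2), Bond(2, 4)],
)
print(mol.net_charge())
```

Every optional field that is left empty is information the representation does not carry, which is the trade-off the slide's question points at.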
Chemical Structure Recognition: An Overview
  1 Document
  2 Depiction
  3 Reconstruction
  4 SDF file
  5 in silico Chemistry
SDF file excerpt shown on the slide:
  created from /home/marc/workspace/csr/results/csr/examples/us2005182053/US2005182053_result.pnm
  MZCSRv0.5010050621162D 0.00000 0.00000 0
  26 28 0 1 0 0 0 0 0999 V2000
  204.0000 102.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
  275.0000 61.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
  201.0000 59.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
  422.0000 178.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
  311.0000 164.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
  384.0000 165.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
  447.0000 144.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
  383.0000 123.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
  131.0000 60.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
  239.0000 123.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
  349.0000 218.0000 0.0000 R# 0 0 0 0 0 0 0 0 0 0 0 0
  447.0000 207.0000 0.0000 R# 0 0 0 0 0 0 0 0 0 0 0 0
Page 15
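The atom records in the SDF excerpt can be read with a few lines of code. This sketch splits on whitespace, which works for the records as they appear on the slide; the V2000 format itself is defined by fixed column widths, so a robust reader should slice by column instead. The names `AtomRecord` and `parse_atom_line` are assumptions made for this example.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    x: float
    y: float
    z: float
    symbol: str


def parse_atom_line(line: str) -> AtomRecord:
    """Parse one atom record: x, y, z coordinates followed by the atom symbol.

    Simplification: whitespace splitting instead of fixed-width columns.
    """
    fields = line.split()
    x, y, z = (float(value) for value in fields[:3])
    return AtomRecord(x, y, z, fields[3])


# Two of the atom records from the slide.
ATOM_LINES = """\
204.0000 102.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
349.0000 218.0000 0.0000 R# 0 0 0 0 0 0 0 0 0 0 0 0
"""

atoms = [parse_atom_line(line) for line in ATOM_LINES.splitlines()]
# "R#" is the symbol the CTfile format uses for R-group placeholders.
placeholders = [atom for atom in atoms if atom.symbol == "R#"]
print(atoms[0], len(placeholders))
```

The coordinates in this file appear to be pixel positions taken from the source image, and the R# atoms mark substituents that the reconstruction left generic.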
Look and Feel of chemoCR
[Screenshot panels: reconstructed molecule; input image]
Page 16
An Automatically Generated Knowledge Layer
Page 17
Summary
  Complex biomedical interrelationships are described in text, not in databases.
  However, databases harbor relevant information on biomedical objects.
  Automated recognition of biomedical entities in text and analysis of chemical depictions allow entities in text to be connected with entities in databases and with experimental platforms.
  In the future, text will become largely interoperable with databases.
  Moreover, we might be able to use text mining and image mining technologies to automatically generate knowledge layers that will boost the ability to find relevant knowledge.
Page 18
Consequences for the Scientific Communication Process
  Publishers should enable and support the automated recognition of biomedical entities in text.
  As an alternative to keyword-based document retrieval (such as Google), I propose to establish a system that enables scientists to navigate through an abstract knowledge layer and to identify and purchase only relevant publications, based on the factual statements made in these documents.
  The knowledge layer must not be publisher-specific; consequently, it should be generated in a joint effort of public and private stakeholders (publishers; national and international organizations).
Page 19
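One way to picture such a knowledge layer is as an index from factual statements to the documents that make them, queried by entity rather than by keyword. The sketch below is an illustration under that assumption, not a description of an existing system; all class names, entities, relations and document identifiers are invented placeholders.

```python
from collections import defaultdict
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    relation: str
    obj: str


class KnowledgeLayer:
    """Maps factual statements to documents, independent of any publisher."""

    def __init__(self):
        self._docs_by_statement = defaultdict(set)

    def add(self, statement: Statement, doc_id: str) -> None:
        self._docs_by_statement[statement].add(doc_id)

    def documents_about(self, entity: str) -> list:
        """Documents containing at least one statement that involves the entity."""
        hits = set()
        for statement, docs in self._docs_by_statement.items():
            if entity in (statement.subject, statement.obj):
                hits |= docs
        return sorted(hits)


layer = KnowledgeLayer()
# Placeholder statements and document ids from two hypothetical publishers.
layer.add(Statement("protein A", "interacts with", "protein B"), "publisher-x/doc-1")
layer.add(Statement("protein B", "inhibits", "protein C"), "publisher-y/doc-2")
print(layer.documents_about("protein B"))
```

Because the index is keyed on statements rather than on publishers, a single query returns relevant documents from every contributing source, which is the property the proposal asks for.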
Thank you for your attention
Page 20