Extraction and Visualization of Protein-Protein Interactions from PubMed

1 Extraction and Visualization of Protein-Protein Interactions from PubMed
Ulf Leser
Knowledge Management in Bioinformatics, Humboldt-Universität Berlin

2 Finding Relevant Knowledge
Find information about
Much knowledge is in text (and only text)
Find articles with information about
- PubMed/Medline
- Which diseases is RAB5 associated with?
Find information about inside each article
- Reading many abstracts is tedious
- What about a "summarize results" button?
Ulf Leser: Visualizing PPI from text, SCAI Text Mining Symposium, 10/2006

3 Question
What is the risk of treating malaria patients who have a G6PD (glucose-6-phosphate dehydrogenase) deficiency with primaquine?

4 Use PubMed

5 Use AliBaba

6 Use AliBaba

7 Question
Which proteins are associated with RAB5?

11 Overview
Why text mining for biomedical research
Extraction of protein-protein interactions from text
- Learning language patterns
- Pattern generalization
- Evaluation
AliBaba: Summarizing PubMed results
Vision

12 Possible Approaches to PPI
Co-occurrence
- Two proteins in one sentence -> PPI
- Tendency: Low precision, very good recall
Full sentence parsing
- Recognizes syntactic relationship between entities
- Extraction uses rules navigating syntax tree
- Only ~30% of all sentences can be parsed unambiguously
    But recent developments (e.g. INFO-PUBMED, Rinaldi et al.)
- Tendency: Good precision, low recall
Pattern matching
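
The co-occurrence baseline can be sketched in a few lines. This is only an illustration: the protein lexicon, the naive sentence splitter, and the example text are assumptions made for the sketch, not part of the talk.

```python
import re
from itertools import combinations

# Hypothetical toy lexicon; a real system would use a curated dictionary
# or a named-entity recognizer.
PROTEINS = {"FADD", "procaspase-8", "RAB5", "TRADD"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cooccurrence_ppi(text, proteins=PROTEINS):
    """Return every unordered protein pair that shares a sentence."""
    pairs = set()
    for sentence in split_sentences(text):
        found = sorted(p for p in proteins if p in sentence)
        pairs.update(combinations(found, 2))
    return pairs

# Made-up example text (assumption).
text = ("We show that FADD immediately activates procaspase-8. "
        "RAB5 was not studied together with TRADD here.")
print(cooccurrence_ppi(text))
```

The second sentence yields the pair (RAB5, TRADD) although no interaction is stated, which illustrates why co-occurrence tends towards low precision.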

13 Relationship Mining
Language pattern
- Sentence
    GENE regulates expression of GENE
    GENE is strongly suppressed by GENE
- Adding part-of-speech
    GENE VRB NOM PRP GENE
    GENE is ADJ VRB PRP GENE
Different levels of generality
- GENE.* VRB.* GENE
    Simple rules, high recall, low precision
- GENE [is] ADJ? {regulat suppres} NOM? PRP GENE
    Complex rules, lower recall, higher precision
Balanced precision/recall requires many rules
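
A minimal sketch of how the two rule types could be expressed as regular expressions over tagged tokens. The "stem/TAG" rendering, the stems, and the reading of {regulat suppres} as a set of alternatives are assumptions made for illustration.

```python
import re

def render(tokens):
    """Render (stem, tag) pairs as 'stem/TAG' strings separated by spaces."""
    return " ".join(f"{stem}/{tag}" for stem, tag in tokens)

# Simple rule: GENE.* VRB.* GENE
SIMPLE = re.compile(r"\S+/GENE .*\S+/VRB .*\S+/GENE")

# Complex rule: GENE [is] ADJ? {regulat suppres} NOM? PRP GENE
COMPLEX = re.compile(
    r"\S+/GENE (?:is/VRB )?(?:\S+/ADJ )?(?:regulat|suppres)/VRB "
    r"(?:\S+/NOM )?\S+/PRP \S+/GENE"
)

# "GENE regulates expression of GENE"
s1 = [("GENE", "GENE"), ("regulat", "VRB"), ("express", "NOM"),
      ("of", "PRP"), ("GENE", "GENE")]
# "GENE is strongly suppressed by GENE" (tagged as on the slide)
s2 = [("GENE", "GENE"), ("is", "VRB"), ("strong", "ADJ"),
      ("suppres", "VRB"), ("by", "PRP"), ("GENE", "GENE")]
# "GENE binds GENE" (made-up third sentence)
s3 = [("GENE", "GENE"), ("bind", "VRB"), ("GENE", "GENE")]

for s in (s1, s2, s3):
    text = render(s)
    print(bool(SIMPLE.search(text)), bool(COMPLEX.search(text)), text)
```

The simple rule accepts all three sentences, while the complex rule accepts only the first two. This is the recall versus precision trade-off in miniature.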

14 State-of-the-Art
Most systems work on hand-crafted sets of patterns
- Hundreds of patterns
- Enormous effort
- Need to be created for any type of relationship
    Protein-protein, gene-disease, disease-drug, ...
Our idea
- Learn patterns automatically

15 Recall Bioinformatics
Protein families are often defined by patterns
How to find protein families?
- [Very simple method]
- Compute distances between protein sequences
    Alignment
- Find clusters of similar sequences
    E.g. using hierarchical clustering
- Build multiple sequence alignment for each cluster
    E.g. using ClustalW, DAlign, ...
- Compute profile for each MSA
From sequences (of AA) to sentences (of tokens)

16 AliBaba Workflow
[Workflow diagram: PubMed, IntAct, Protein pairs, Search sentences, Linguistic annotation, Initial patterns, Clustering, Alignment, Consensus pattern, Extracted PPI]

17 Initial Pattern
Extract all pairs of proteins from IntAct
- Only the names, not the evidence / links
- Gold standard: These interactions are assumed to be real
Find all sentences in PubMed
- Pair of proteins and interaction word
- FADD immediately activates procaspase-8
Extract core phrases
- Width: Parameter
- show that FADD immediately activates procaspase-8 during
Annotate with linguistic information
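
Core-phrase extraction might look like the following sketch. Treating the width as a symmetric number of context tokens on each side is an assumption, and the full example sentence is made up around the phrase on the slide.

```python
def core_phrase(tokens, protein_a, protein_b, width=1):
    """Return the span covering both proteins plus `width` context tokens
    on each side, or None if either protein is missing."""
    if protein_a not in tokens or protein_b not in tokens:
        return None
    i, j = sorted((tokens.index(protein_a), tokens.index(protein_b)))
    start = max(0, i - width)
    end = min(len(tokens), j + 1 + width)
    return tokens[start:end]

# Made-up sentence (assumption) containing the slide's example phrase.
sentence = ("Here we show that FADD immediately activates "
            "procaspase-8 during apoptosis").split()

print(core_phrase(sentence, "FADD", "procaspase-8", width=1))
```

A larger width keeps more context at the cost of more specific initial patterns.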

18 Linguistic Annotation
Multi-layered pattern

Original     FADD  immediately  activates  procaspase-8
Class / POS  PTN   ADV          VRB        PTN
Stem         PTN   immediat     activat    PTN
Token        PTN   immediately  activates  PTN

Initial pattern set
- Highly specific
- Can be used immediately, but results in very low recall
Needs to be generalized
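
The layers can be held in a simple data structure, as in this sketch. The protein lexicon, the lookup-table tagger, and the suffix-stripping stemmer are toy stand-ins (assumptions), chosen only so that the example reproduces the layers shown on the slide.

```python
PROTEINS = {"FADD", "procaspase-8"}                 # toy lexicon (assumption)
POS = {"immediately": "ADV", "activates": "VRB"}    # toy tagger (assumption)
SUFFIXES = ("ely", "es", "ing", "ed", "s")          # toy stemmer (assumption)

def stem(word):
    """Strip the first matching suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def annotate(tokens):
    """Build the POS, stem, and token layers for a core phrase."""
    layers = {"pos": [], "stem": [], "token": []}
    for tok in tokens:
        if tok in PROTEINS:
            # Protein names are masked by the class label on every layer.
            for name in layers:
                layers[name].append("PTN")
        else:
            layers["pos"].append(POS.get(tok, "UNK"))
            layers["stem"].append(stem(tok))
            layers["token"].append(tok)
    return layers

pattern = annotate("FADD immediately activates procaspase-8".split())
for name, row in pattern.items():
    print(name, row)
```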

19 Workflow
[Workflow diagram: PubMed, IntAct, Protein pairs, Search sentences, Linguistic annotation, Initial patterns, Clustering, Alignment, Consensus pattern, Extracted PPI]

20 Pattern Generalization
Initial patterns
- Too many (performance is an issue)
- Too specific
- Miss many small linguistic variations
Find clusters of similar patterns
- Requires a distance measure for language patterns
For each cluster, generate consensus pattern
- Compute commonality of each set
- Generate a new, generalized pattern

21 Distances of Initial Patterns
Sentence alignment
One layer: Standard dynamic programming
End-free alignment of patterns (core phrases) against sentences
Cost for insertion, deletion, match, replacement
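
For a single layer, end-free alignment can be sketched with standard dynamic programming. The sketch uses scores (higher is better) instead of costs, and the values +1 / -1 / -1 for match, replacement, and insertion or deletion are illustrative assumptions.

```python
def end_free_score(pattern, sentence, match=1, mismatch=-1, gap=-1):
    """Best score for aligning the whole pattern against any stretch of the
    sentence; unaligned sentence tokens before and after are not penalized."""
    n = len(sentence)
    prev = [0] * (n + 1)              # row 0: skipping a sentence prefix is free
    for i in range(1, len(pattern) + 1):
        curr = [prev[0] + gap] + [0] * n
        for j in range(1, n + 1):
            sub = match if pattern[i - 1] == sentence[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # match or replacement
                          prev[j] + gap,       # pattern token left unaligned
                          curr[j - 1] + gap)   # extra sentence token inside
        prev = curr
    return max(prev)                  # skipping a sentence suffix is free

pattern = ["PTN", "ADV", "VRB", "PTN"]
sentence = ["PRP", "VRB", "PRP", "PTN", "ADV", "VRB", "PTN", "PRP", "NOM"]
print(end_free_score(pattern, sentence))
print(end_free_score(["PTN", "VRB", "PTN"], ["PTN", "ADV", "VRB", "PTN"]))
```

In the first call the pattern occurs verbatim inside the sentence, so every pattern token scores a match. In the second call one extra sentence token has to be bridged with a gap.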

22 Substitution Matrices
One substitution matrix per layer
Layers can be weighted
Score is aggregated over all layers
$$c(i, j) = \sum_{l \in \mathit{layers}} w_l \cdot \mathit{score}_l(i[l], j[l])$$
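
The aggregation over layers can be read as a weighted sum of per-layer scores. In this sketch the layer weights and the identity-based stand-in for the substitution matrices are assumptions, not values from the talk.

```python
WEIGHTS = {"pos": 1.0, "stem": 2.0, "token": 3.0}   # hypothetical weights

def layer_score(a, b):
    """Stand-in for a per-layer substitution matrix: +1 if equal, else -1."""
    return 1.0 if a == b else -1.0

def combined_score(i, j, weights=WEIGHTS):
    """Weighted sum of the per-layer scores of two multi-layer positions."""
    return sum(w * layer_score(i[layer], j[layer])
               for layer, w in weights.items())

x = {"pos": "VRB", "stem": "activat", "token": "activates"}
y = {"pos": "VRB", "stem": "activat", "token": "activated"}

print(combined_score(x, x))   # all layers agree
print(combined_score(x, y))   # only the token layer differs
```

With real substitution matrices, `layer_score` would become a table lookup per layer, for example scoring two different verb tags higher than a verb against a noun.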

23 Clustering and Generalization
Distance matrix for all pairs of initial patterns
Hierarchical clustering
Consensus pattern using multiple sentence alignment
- Generates a profile per layer
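
Once the patterns of a cluster are aligned, a per-layer profile is a table of column-wise frequencies. The sketch assumes the multiple alignment has already been computed and that "-" marks a gap. The aligned rows are made up.

```python
from collections import Counter

def layer_profile(aligned_rows):
    """Relative frequency of each symbol in each alignment column."""
    n = len(aligned_rows)
    return [
        {symbol: count / n for symbol, count in Counter(column).items()}
        for column in zip(*aligned_rows)
    ]

# POS layer of four aligned patterns from one hypothetical cluster.
pos_rows = [
    ["PTN", "ADV", "VRB", "PTN"],
    ["PTN", "-",   "VRB", "PTN"],
    ["PTN", "ADV", "VRB", "PTN"],
    ["PTN", "-",   "VRB", "PTN"],
]

for column in layer_profile(pos_rows):
    print(column)
```

The second column shows the adverb as optional: it is present in half of the patterns and absent in the other half.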

24 Workflow
[Workflow diagram: PubMed, IntAct, Protein pairs, Search sentences, NER and POS tagging, Initial patterns, Clustering, Alignment, Consensus pattern, Extracted PPI]

25 Search Phase
Given a text: All sentences
- are searched for at least two protein names
- matched against all consensus patterns
- Complication: Matching a sentence (i.e. a multi-layered pattern) against a pattern profile
$$c(i, j) = \sum_{l \in \mathit{layers}} w_l \cdot \mathit{score}_l(i[l], j[l]) \cdot (1 - \mathit{freq}(i[l]))$$
Highest scoring pattern wins
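
The control flow of the search phase can be sketched independently of the exact scoring function. Here the scorer is passed in as a parameter. The `overlap` scorer below is a deliberately crude stand-in (an assumption), not the profile score of the talk, and the two patterns are made up.

```python
def best_pattern(sentence_tags, patterns, score, min_proteins=2):
    """Return (name, score) of the highest-scoring pattern, or None if the
    sentence mentions fewer than `min_proteins` proteins."""
    if sentence_tags.count("PTN") < min_proteins:
        return None
    scored = ((name, score(p, sentence_tags)) for name, p in patterns.items())
    return max(scored, key=lambda item: item[1])

def overlap(pattern, sentence):
    """Toy scorer: how many pattern tags are found, in order, in the sentence."""
    remaining = iter(sentence)
    return sum(1 for tag in pattern if tag in remaining)

patterns = {
    "A": ["PTN", "VRB", "PTN"],
    "B": ["PTN", "VRB", "NOM", "PRP", "PTN"],
}

print(best_pattern(["PTN", "ADV", "VRB", "PTN"], patterns, overlap))
print(best_pattern(["PTN", "VRB", "NOM"], patterns, overlap))
```

The second sentence contains only one protein mention, so it is skipped before any pattern is tried.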

26 Evaluation
~ IntAct pairs
~ sentences containing an IntAct pair and an interaction word
~ unique initial patterns
- Difference between abstracts and full text
Evaluation using SPIES corpus
- Hao et al. 2004, ~900 sentences, ~1500 annotated PPI
- Not the best corpus one can think of
    Only sentences with 2 proteins, taken from very few papers
    But strongest competitor

27 Results
Using initial patterns directly
- As expected: Precision ~85%, recall ~15%
Generalization: ~9,500 consensus patterns
- Some very large, most very small
- Can be tuned towards precision or recall (cluster threshold)
Result: 79% precision at 52% recall
- F-measure: 63
- Most important type of error: Enumerations
    CUL-1 interacts with SKR-1, SKR-2, SKR-3, and SKR-10
- Tweaking towards higher recall yields 74% precision / 57% recall
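
The reported F-measure can be checked with a few lines, assuming it is the balanced F1 score (the harmonic mean of precision and recall) and that the pair 74 / 57 is precision / recall.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(100 * f_measure(0.79, 0.52)))   # main operating point
print(round(100 * f_measure(0.74, 0.57)))   # tuned towards recall
```

Under this assumption the main operating point rounds to 63, in line with the slide, and the recall-oriented setting comes out at about 64.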

28 Comparison
Hao et al. report F-measure of 68
- Semi-automatic system
- Patterns are learned from annotated corpus
- Self-made corpus
- [AliBaba on home-made corpus: F-measure 66]
AliBaba
- Needs no learning corpus at all
- Semi-supervised method: examples are almost correct
- Highly adaptable to different tasks
    Examples readily available in many databases

29 Overview
Why text mining for biomedical research
Extraction of protein-protein interactions from text
AliBaba: Summarizing PubMed results
Vision

30 Workflow
[Architecture diagram: Client, 1. PubMed Query, Server, 2. Query PMIDs, Internet, Annotated Texts (XML), Local Document Index, Annotation Pipeline]

31 AliBaba
Analyses results of a PubMed query
- Full PubMed query syntax
- Scope of analysis is defined by user
Extracting and visualizing information
- Entities: dictionary matches [Kirsch et al. 05]
    Genes, proteins, diseases, cells, tissues, species, drugs
- Detects PPI using extraction pipeline
- Detects further relationships using co-occurrence
- Confidence scores

32
Query
Extracted information
Visualization of extracted relationships
Links to databases
Links to textual evidence

33 Walk-through
Which proteins are associated with the TNFalpha associated death domain (TRADD)?

34 Many!

35 Filter by Object Type and Confidence

36 Show only Connected Objects

37 Show Type of Interaction

38 Location of Interaction

39 View Annotated Abstracts

40 Overview
Why text mining for biomedical research
Extraction of protein-protein interactions from text
AliBaba: Summarizing PubMed results
Vision

41 Annotated Relationships
Relationships have many parameters
Example: Modeling in Systems Biology
The apparent K(m) value was calculated for adenosine and found to be 3.63 x 10(-3) M, which indicates high affinity of adenosine deaminase for its substrate adenosine.
Constant: K(m)
Value: 3.63 x 10(-3)
Unit: M
Enzyme: Adenosine deaminase
Compound: adenosine
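
A hand-written regular expression is enough to pull the constant, value, and unit out of this one example sentence. The pattern below is an illustrative assumption tailored to that sentence, not the extraction method of the talk. The enzyme and compound slots would need entity recognition and are left out.

```python
import re

# Illustrative pattern for sentences of the form
# "... K(m) value ... found to be <mantissa> x 10(<exponent>) <unit> ..."
KM_PATTERN = re.compile(
    r"(?P<constant>K\(m\)) value .*? found to be "
    r"(?P<value>\d+(?:\.\d+)? x 10\(-?\d+\)) (?P<unit>[mun]?M)\b"
)

def extract_constant(sentence):
    """Return a dict with constant, value, and unit, or None if no match."""
    m = KM_PATTERN.search(sentence)
    return m.groupdict() if m else None

sentence = ("The apparent K(m) value was calculated for adenosine and found "
            "to be 3.63 x 10(-3) M, which indicates high affinity of "
            "adenosine deaminase for its substrate adenosine.")

print(extract_constant(sentence))
```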

42 KMedDB

43 More
Overlaying extracted networks with established pathways (KEGG)
Application to other types of relationships
- Protein-disease, disease-target-drug
- Annotated corpora for evaluation welcome
Improving text mining performance
Disambiguation
Advanced NER methods (links are lost)
Larger learning sample (Reactome, BIND, DIP, ...)
Scalability

44 Conclusion
Learning patterns is possible
- Quickly adaptable to different tasks
Corpus creation is a bottleneck
- Even if available, might not be suitable for task at hand
- Use semi-supervised methods
- The more data, the more promising (full text, web)
What is an interaction?
- Probably hardest problem for higher perceived precision
- Solve more specific problems
- [AliBaba: task-specific lists of interaction words]

45 Acknowledgements
Humboldt-Universität, Informatics
- Jörg Hakenberg, Torsten Schiemann
- Conrad Plake, Markus Pankalla
- Lukas Faulstich, Long Nguyen
Max-Planck-Institute for Molecular Genetics
- Edda Klipp, Sebastian Schmeier, Axel Kowald
European Bioinformatics Institute
- Harald Kirsch, Dietrich Rebholz-Schuhmann