Understanding Text with Knowledge-Bases and Random Walks

1 Understanding Text with Knowledge-Bases and Random Walks Eneko Agirre ixa2.si.ehu.es/eneko IXA NLP Group University of the Basque Country MAVIR, 2011 Agirre (UBC) Knowledge-Bases and Random Walks MAVIR / 54

2 Random Walks on Large Graphs WWW, PageRank and Google

4 Random Walks on Large Graphs Linked Data

5 Random Walks on Large Graphs Wikipedia (DBpedia)

6 Random Walks on Large Graphs WordNet

7 Random Walks on Large Graphs Unified Medical Language System

8 Random Walks on Large Graphs

12 Text Understanding
Understanding of broad language: what's behind the surface strings
  Barcelona boss says that Jose Mourinho is the best coach in the world
End systems that we would like to build:
  natural dialogue, speech recognition, machine translation
  improving parsing, semantic role labeling, information retrieval, question answering

13 Text Understanding
From string to semantic representation (First Order Logic)
  Barcelona coach praises Jose Mourinho.
  There exist e1, x1, x2, x3 such that FC Barcelona(x1) and coach:n:1(x2) and praise:v:2(e1,x2,x3) and José Mourinho(x3)
Disambiguation: Concepts, Entities and Semantic Roles
Quantifiers, modality, negation, etc.
Inference and Reasoning
  Barcelona coach praises Mourinho
  Guardiola honors Mourinho
  ... with respect to some Knowledge Base

15 Text Understanding: Knowledge Bases and Random Walks
Focus on the following tasks on inference and disambiguation:
  Map words in context to KB concepts (Word Sense Disambiguation)
  Similarity between concepts and words
  Similarity to improve ad-hoc information retrieval
Applied to WordNet(s), UMLS, Wikipedia
Excellent results
Open source software and data:

17 Outline
1 WordNet, PageRank and Personalized PageRank
2 Random walks for WSD
3 Adapting WSD to domains
4 WSD on the biomedical domain
5 Random walks for similarity
6 Similarity and Information Retrieval
7 Conclusions

19 WordNet, PageRank and Personalized PageRank (with Aitor Soroa)
WordNet is the most widely used hierarchically organized lexical database for English (Fellbaum, 1998)
Broad coverage of nouns, verbs, adjectives, adverbs
Main unit: synset (concept)
  coach#1, manager#3, handler#2: someone in charge of training an athlete or a team.
Relations between concepts: synonymy (built-in), hyperonymy, antonymy, meronymy, entailment, derivation, gloss
Closely linked versions in several languages

20 WordNet
Example of hypernym relations: coach#1 → trainer → leader → person → organism → ... → entity
synonyms: manager, handler
gloss words (and synsets): charge, train (verb), athlete, team
hyponyms: baseball coach, basketball coach, conditioner, football coach
instance: John McGraw
domain: sport, athletics
derivation: coach (verb), managership, manage (verb), handle (verb)

21 WordNet
Representing WordNet as a graph:
  Nodes represent concepts
  Edges represent relations (undirected)
  In addition, directed edges from words to corresponding concepts (senses)

22 WordNet
[Figure: fragment of the WordNet graph around the word "coach". Concept nodes: handle#v6, managership#n3, trainer#n1, sport#n1, coach#n1, teacher#n1, coach#n2, coach#n5, public_transport#n1, fleet#n2, tutorial#n1, seat#n1. Edge labels: derivation, domain, hyperonym, holonym.]

23 Random Walks: PageRank
Given a graph, ranks nodes according to their relative structural importance
If an edge from $n_i$ to $n_j$ exists, a vote from $n_i$ to $n_j$ is produced
  Strength depends on the rank of $n_i$
  The more important $n_i$ is, the more strength its votes will have.
PageRank is more commonly viewed as the result of a random walk process
  Rank of $n_i$ represents the probability of a random walk over the graph ending on $n_i$, at a sufficiently large time.

24 Random Walks: PageRank
G: graph with N nodes $n_1, \ldots, n_N$
$d_i$: outdegree of node i
M: $N \times N$ matrix, with $M_{ji} = 1/d_i$ if an edge from i to j exists, and $M_{ji} = 0$ otherwise
PageRank equation: $\mathbf{Pr} = c\,M\,\mathbf{Pr} + (1 - c)\,\mathbf{v}$
  first term, $c\,M\,\mathbf{Pr}$: surfer follows edges
  second term, $(1 - c)\,\mathbf{v}$: surfer randomly jumps to any node (teleport)
c: damping factor, the way in which these two terms are combined
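
The PageRank equation on this slide can be solved by repeated substitution (power iteration). The block below is a minimal Python sketch of that idea, not the implementation used in the talk. The function name `pagerank`, the edge-list input and the treatment of nodes without outgoing edges (their mass is redistributed according to v) are assumptions made for illustration. The defaults c = 0.85 and 30 iterations are the settings quoted in the experiment slides.

```python
def pagerank(nodes, edges, c=0.85, iterations=30, v=None):
    """Iterate Pr = c*M*Pr + (1 - c)*v, where M[j][i] = 1/d_i for each edge i -> j."""
    if v is None:
        # Plain PageRank: uniform teleport vector with elements 1/N.
        v = {n: 1.0 / len(nodes) for n in nodes}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    pr = dict(v)  # start from the teleport distribution
    for _ in range(iterations):
        nxt = {n: (1.0 - c) * v[n] for n in nodes}
        for src in nodes:
            if out[src]:
                # The surfer follows one of the d_i outgoing edges.
                share = c * pr[src] / len(out[src])
                for dst in out[src]:
                    nxt[dst] += share
            else:
                # Assumption: a node with no outgoing edges teleports instead.
                for n in nodes:
                    nxt[n] += c * pr[src] * v[n]
        pr = nxt
    return pr


# Toy graph (hypothetical): b and c both point to a, and a points back to both.
ranks = pagerank(["a", "b", "c"], [("b", "a"), ("c", "a"), ("a", "b"), ("a", "c")])
```

In this toy graph node a receives votes from both other nodes, so it ends with the highest rank, and the ranks sum to 1.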

29 Random Walks: Personalized PageRank
$\mathbf{Pr} = c\,M\,\mathbf{Pr} + (1 - c)\,\mathbf{v}$
PageRank: v is a stochastic normalized vector, with elements $1/N$
  Equal probabilities to all nodes in case of random jumps
Personalized PageRank: non-uniform v (Haveliwala 2002)
  Assign stronger probabilities to certain kinds of nodes
  Bias PageRank to prefer these nodes
For example, if we concentrate all mass on node i:
  All random jumps return to $n_i$
  Rank of i will be high
  High rank of i will make all the nodes in its vicinity also receive a high rank
  Importance of node i given by the initial v spreads along the graph
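
The effect of a non-uniform v can be seen on a very small graph. The following Python sketch compares a uniform teleport vector with one that puts all mass on a single node. The chain graph, the node names and the function name `personalized_pagerank` are hypothetical, chosen only to illustrate the slide.

```python
def personalized_pagerank(neighbors, v, c=0.85, iterations=30):
    """Iterate Pr = c*M*Pr + (1 - c)*v on an undirected graph given as adjacency lists."""
    pr = dict(v)
    for _ in range(iterations):
        nxt = {n: (1.0 - c) * v[n] for n in neighbors}
        for src, nbrs in neighbors.items():
            for dst in nbrs:
                nxt[dst] += c * pr[src] / len(nbrs)
        pr = nxt
    return pr


# Hypothetical chain graph a - b - c - d - e (every node has at least one neighbor).
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}

uniform = {n: 1.0 / len(chain) for n in chain}           # plain PageRank
only_a = {n: (1.0 if n == "a" else 0.0) for n in chain}  # all teleport mass on a

plain = personalized_pagerank(chain, uniform)
biased = personalized_pagerank(chain, only_a)
```

With all teleport mass on a, the rank of a rises compared with the uniform case, the far end of the chain (e) loses rank, and the ranks of b, c, d and e decrease with their distance from a.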

32 Outline: 2 Random walks for WSD

33 Word Sense Disambiguation (WSD)
Goal: determine senses of the open-class words in a text.
  Nadal is sharing a house with his uncle and coach, Toni.
  Our fleet comprises coaches from 35 to 58 seats.
Knowledge Base (e.g. WordNet):
  coach#1: someone in charge of training an athlete or a team.
  coach#2: a person who gives private instruction (as in singing, acting, etc.).
  coach#3: a railcar where passengers ride.
  coach#4: a carriage pulled by four horses with one driver.
  coach#5: a vehicle carrying many passengers; used for public transport.

35 Word Sense Disambiguation (WSD)
Supervised corpus-based WSD performs best
  Train classifiers on hand-tagged data (typically SemCor)
  Data sparseness, e.g. coach 20 examples (20,0,0,0,0,0), bank 48 examples (25,20,2,1,0...)
  Results decrease when train and test data come from different sources (even Brown, BNC)
  They decrease even more when train and test data come from different domains
Knowledge-based WSD
  Uses information in a KB (WordNet)
  Performs close to but lower than Most Frequent Sense (MFS, supervised)
  Vocabulary coverage
  Relation coverage

37 Domain adaptation
Deploying NLP techniques in real applications is challenging, especially for WSD:
  Sense distributions change across domains
  Data sparseness hurts more
  Context overlap is reduced
  New senses, new terms
But... some words get fewer interpretations in domains: bank in finance, coach in sports

39 Knowledge-based WSD using random walks (with Aitor Soroa, Oier Lopez de Lacalle)
Use information in WordNet for disambiguation:
  Our fleet comprises coaches from 35 to 58 seats.
Traditional approach (Patwardhan et al. 2007):
  Compare each target sense of coach with those of the words in the context (fleet, comprise, blue, seat)
  Using semantic relatedness between pairs of senses (6x4x3x8x9=5184)
  Alternative to combinatorial explosion: each word disambiguated individually (6x(4+3+8+9)=144)
    sim(coach#1,fleet#1) + sim(coach#1,fleet#2) + ... + sim(coach#1,seat#1) ...
    sim(coach#2,fleet#1) + sim(coach#2,fleet#2) + ... + sim(coach#2,seat#1)
Graph-based methods
  Exploit the structural properties of the graph underlying WordNet
  Find globally optimal solutions
  Disambiguate large portions of text in one go
  Principled solution to combinatorial explosion
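
The two counts on this slide can be checked with a few lines of Python. This is only an arithmetic sketch: the sense counts (6 for the target word, then 4, 3, 8 and 9 for the context words) are taken from the slide, and the second count is consistent with 6 x (4 + 3 + 8 + 9). The variable names are illustrative.

```python
import math

target_senses = 6              # senses of the target word "coach", per the slide
context_senses = [4, 3, 8, 9]  # sense counts of the four context words, per the slide

# Joint disambiguation: one sense is chosen for every word at once,
# so the number of sense combinations is the product of all sense counts.
joint = target_senses * math.prod(context_senses)

# Word-by-word disambiguation: each target sense is compared with
# each sense of each context word, so the counts are summed instead.
individual = target_senses * sum(context_senses)

print(joint, individual)  # 5184 144
```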

41 Using PageRank for WSD
Given a graph representation of the LKB (lexical knowledge base)
  PageRank over the whole WordNet would get a context-independent ranking of word senses
We would like: given an input text, disambiguate all open-class words in the input, taking the rest as context
Two alternatives:
  1 Create a context-sensitive subgraph and apply PageRank over it (Navigli and Lapata, 2007; Agirre et al. 2008)
  2 Use Personalized PageRank over the complete graph, initializing v with the context words

43 Using Personalized PageRank (PPR)
For each word $W_i$, $i = 1 \ldots m$, in the context:
  Initialize v with uniform probabilities over words $W_i$
  Context words act as source nodes injecting probability mass into the concept graph
Run Personalized PageRank
Choose highest ranking sense for target word
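
The procedure on this slide can be sketched on a toy graph. The Python block below is an illustration under stated assumptions, not the authors' released software. The concept relations and sense inventories are made up for the example (they only loosely follow the coach example of the talk), the function name `personalized_pagerank` is hypothetical, and isolated concepts are assumed to return their mass through the teleport vector.

```python
def personalized_pagerank(out_edges, v, c=0.85, iterations=30):
    """Iterate Pr = c*M*Pr + (1 - c)*v over directed adjacency lists."""
    pr = dict(v)
    for _ in range(iterations):
        nxt = {n: (1.0 - c) * v[n] for n in out_edges}
        for src, targets in out_edges.items():
            if targets:
                for dst in targets:
                    nxt[dst] += c * pr[src] / len(targets)
            else:
                # Assumption: an isolated concept returns its mass to the context words.
                for n in out_edges:
                    nxt[n] += c * pr[src] * v[n]
        pr = nxt
    return pr


# Toy fragment; relations and sense inventories are illustrative, not real WordNet data.
concept_relations = [
    ("coach#n1", "trainer#n1"), ("coach#n1", "sport#n1"),
    ("coach#n2", "teacher#n1"), ("coach#n2", "tutorial#n1"),
    ("coach#n5", "public_transport#n1"), ("coach#n5", "fleet#n2"),
    ("coach#n5", "seat#n1"),
]
word_senses = {
    "coach": ["coach#n1", "coach#n2", "coach#n5"],
    "fleet": ["fleet#n2"],
    "comprise": ["comprise#v1"],
    "seat": ["seat#n1"],
}

# Directed edges from words to their senses.
out_edges = {word: list(senses) for word, senses in word_senses.items()}
for senses in word_senses.values():
    for sense in senses:
        out_edges.setdefault(sense, [])
# Relations between concepts are undirected: add both directions.
for a, b in concept_relations:
    out_edges.setdefault(a, []).append(b)
    out_edges.setdefault(b, []).append(a)

# Teleport vector: uniform over the context words, zero on concepts.
v = {n: 0.0 for n in out_edges}
for word in word_senses:
    v[word] = 1.0 / len(word_senses)

ranks = personalized_pagerank(out_edges, v)
best_sense = max(word_senses["coach"], key=ranks.get)
print(best_sense)  # coach#n5
```

In this toy graph the senses of fleet and seat are connected to coach#n5, so the probability mass injected by the context words accumulates on the vehicle sense.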

44 Using Personalized PageRank (PPR)
[Figure: the context words coach, fleet, comprise, ..., seat linked to their senses in the WordNet graph. Concept nodes shown: handle#n8, managership#n3, trainer#n1, sport#n1, coach#n1, teacher#n1, coach#n2, tutorial#n1, coach#n5, comprise#v1, fleet#n2, public_transport#n1, seat#n1.]

45 Experiment setting
Two datasets
  Senseval 2 All Words (S2AW)
  Senseval 3 All Words (S3AW)
Both labelled with WordNet 1.7 tags
Create input contexts of at least 20 words
  Adding sentences immediately before and after if original too short
PageRank settings:
  Damping factor (c): 0.85
  End after 30 iterations

46 Results and comparison to related work (S2AW)
Senseval-2 All Words dataset
[Table: columns System, All, N, V, Adj., Adv.; rows Mih, Sinha, Tsatsa, PPR, MFS; values missing from the transcript]
(Mihalcea, 2005) Pairwise Lesk between senses, then PageRank.
(Sinha & Mihalcea, 2007) Several similarity measures, voting, fine-tuning for each PoS. Development over S3AW.
(Tsatsaronis et al., 2007) Subgraph BFS over WordNet 1.7 and extended WN, then spreading activation.

48 Comparison to related work (S3AW)
[Table: columns System, All, N, V, Adj., Adv.; rows Mih, Sinha, Nav, PPR, MFS; values missing from the transcript]
(Mihalcea, 2005) Pairwise Lesk between senses, then PageRank.
(Sinha & Mihalcea, 2007) Several similarity measures, voting, fine-tuning for each PoS. Development over S3AW.
(Navigli & Lapata, 2010) Subgraph DFS(3) over WordNet 2.0 plus proprietary relations, several centrality algorithms.

50 Outline: 3 Adapting WSD to domains

51 Adapting WSD to domains
How could we improve WSD performance without tagging new data from domain or adapting WordNet manually to the domain?
What would happen if we apply PPR-based WSD to specific domains?
Personalized PageRank over context
  ... has never won a league title as coach but took Parma to success ...
Personalized PageRank over related words
  Get related words from distributional thesaurus
  coach: manager, captain, player, team, striker, ...

54 Experiments
Dataset with examples from BNC and from the Sports and Finance sections of Reuters (Koeling et al. 2005)
  41 nouns: salient in either domain or with senses linked to these domains
  Sense inventory: WordNet v examples for each of the 41 nouns
  Roughly 100 examples from each word and corpus
Experiments
  Supervised: train MFS, SVM, k-nn on SemCor examples
  Personalized PageRank (same damping factors, iterations)
    Use context
    Use 50 related words (Koeling et al. 2005) (BNC, Sports, Finance)

56 Results
[Table: columns Systems, BNC, Sports, Finances; rows Baselines (Random, SemCor MFS), Supervised (SVM, k-nn), Context (PPR), Related words (PPR, (Koeling et al. 2005)), Skyline (Test MFS); values missing from the transcript]
Supervised (MFS, SVM, k-nn) very low (see test MFS)
PPR on context: best for BNC (* for statistical significance)
PPR on related words: best for Sports and Finance, and improves over Koeling et al., who use pairwise WordNet similarity.

57 Outline: 4 WSD on the biomedical domain

58 UMLS and biomedical text (with Aitor Soroa and Mark Stevenson)
Ambiguities believed not to occur on specific domains
  On the Use of Cold Water as a Powerful Remedial Agent in Chronic Disease.
  Intranasal ipratropium bromide for the common cold.
  11.7% of the phrases in abstracts added to MEDLINE in 1998 were ambiguous (Weeber et al. 2011)
Unified Medical Language System (UMLS) Metathesaurus
Concept Unique Identifiers (CUIs)
  C : Cold (Cold Sensation) [Physiologic Function]
  C : Cold (cold temperature) [Natural Phenomenon or Process]
  C : Cold (Common Cold) [Disease or Syndrome]

60 WSD and biomedical text
Thesauri in the Metathesaurus: Alcohol and other drugs, Medical Subject Headings, Crisp Thesaurus, SNOMED Clinical Terms, etc.
Relations in the Metathesaurus between CUIs: parent, can be qualified by, related possibly synonymous, related other
We applied random walks over a graph of CUIs.
Evaluated on NLM-WSD, 50 ambiguous terms (100 instances each)

KB              #CUIs     #relations   Acc.   Terms
AOD             15,901    58,
MSH             278,297   1,098,
CSP             16,703    73,
SNOMEDCT        304,443   1,237,
all above       572,105   2,433,
all relations   -         5,352,
combined with cooc
(Jimeno and Aronson, 2011)

62 Outline: 5 Random walks for similarity

63 Similarity
Given two words or multiword-expressions, estimate how similar they are.
  cord - smile, gem - jewel, magician - oracle
Features shared, belonging to the same class
Relatedness is a more general relationship, including other relations like topical relatedness or meronymy.
  king - cabbage, movie - star, journey - voyage
Typically implemented as calculating a numeric value of similarity/relatedness.

66 Similarity and WSD
  gem - jewel, movie - star
WSD and similarity are closely intertwined:
  Similarity between words is based on similarity between senses (implicitly doing disambiguation)
  WSD uses similarity between senses in context

67 Similarity examples
RG dataset (similarity; pairs, 51 subjects)
  cord - smile             0.02
  rooster - voyage         0.04
  glass - jewel            1.78
  magician - oracle        1.82
  cemetery - graveyard     3.88
  automobile - car         3.92
  midday - noon            3.94
WordSim353 dataset (similarity and relatedness; 353 pairs, 16 subjects)
  king - cabbage           0.23
  professor - cucumber
  investigation - effort   4.59
  movie - star
  journey - voyage         9.29
  midday - noon            9.29
  tiger - tiger

68 Similarity
Two main approaches:
  Knowledge-based (Roget's Thesaurus, WordNet, etc.)
  Corpus-based, also known as distributional similarity (co-occurrences)
Many potential applications:
  Overcome brittleness (word match)
  NLP subtasks (parsing, semantic role labeling)
  Information retrieval
  Question answering
  Summarization
  Machine translation optimization and evaluation
  Inference (textual entailment)

70 Random walks for similarity (with Aitor Soroa, Montse Cuadros, German Rigau)
Based on (Hughes and Ramage, 2007)
Given a pair of words ($w_1$, $w_2$):
  Initialize teleport probability mass on $w_1$
  Run Personalized PageRank, obtaining $\vec{w}_1$
  Initialize $w_2$ and obtain $\vec{w}_2$
  Measure similarity between $\vec{w}_1$ and $\vec{w}_2$ (e.g. cosine)
Experiment settings:
  Damping value c = 0.85
  Calculations finish after 30 iterations
Variations for Knowledge Base:
  WordNet 3.0
  WordNet relations
  Gloss relations
  other relations
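
The word-pair procedure on this slide can be sketched as follows. This is a minimal Python illustration, not the code behind the reported results. The toy concept graph, the synset identifiers and the helper names `personalized_pagerank`, `word_vector` and `cosine` are assumptions, and the cosine is taken over concept nodes only.

```python
import math


def personalized_pagerank(out_edges, v, c=0.85, iterations=30):
    """Iterate Pr = c*M*Pr + (1 - c)*v; every node here has outgoing edges."""
    pr = dict(v)
    for _ in range(iterations):
        nxt = {n: (1.0 - c) * v[n] for n in out_edges}
        for src, targets in out_edges.items():
            for dst in targets:
                nxt[dst] += c * pr[src] / len(targets)
        pr = nxt
    return pr


# Toy graph with made-up synsets; not real WordNet data.
concept_relations = [
    ("gem#n1", "jewel#n1"), ("gem#n1", "jewelry#n1"), ("jewel#n1", "jewelry#n1"),
    ("jewelry#n1", "entity#n1"), ("smile#n1", "expression#n1"),
    ("expression#n1", "entity#n1"),
]
word_senses = {"gem": ["gem#n1"], "jewel": ["jewel#n1"], "smile": ["smile#n1"]}

out_edges = {word: list(senses) for word, senses in word_senses.items()}
for a, b in concept_relations:  # undirected relations between concepts
    out_edges.setdefault(a, []).append(b)
    out_edges.setdefault(b, []).append(a)
concepts = [n for n in out_edges if n not in word_senses]


def word_vector(word):
    """Run Personalized PageRank with all teleport mass on one word."""
    v = {n: (1.0 if n == word else 0.0) for n in out_edges}
    return personalized_pagerank(out_edges, v)


def cosine(p, q):
    dot = sum(p[k] * q[k] for k in concepts)
    norm_p = math.sqrt(sum(p[k] ** 2 for k in concepts))
    norm_q = math.sqrt(sum(q[k] ** 2 for k in concepts))
    return dot / (norm_p * norm_q)


sim_gem_jewel = cosine(word_vector("gem"), word_vector("jewel"))
sim_gem_smile = cosine(word_vector("gem"), word_vector("smile"))
```

Because gem#n1 and jewel#n1 are neighbors in the toy graph, their probability vectors overlap far more than those of gem and smile, so the first similarity comes out higher.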

72 Dataset and results
Method                               Source      Spearman
(Gabrilovich and Markovitch, 2007)   Wikipedia   0.75
WordNet Knownets                     WordNet     0.71
WordNet glosses                      WordNet     0.68
(Agirre et al. 2009)                 Corpora     0.66
(Finkelstein et al. 2007)            LSA         0.56
(Hughes and Ramage, 2007)            WordNet     0.55
(Jarmasz 2003)                       WordNet     0.35

73 Outline: 6 Similarity and Information Retrieval

74 Similarity and Information Retrieval
Similarity and Information Retrieval (with Arantxa Otegi and Xabier Arregi)
Document expansion (aka clustering and smoothing) has been shown to be successful in ad-hoc IR
Use WordNet and similarity to expand documents
Example:
  I can't install DSL because of the antivirus program, any hints?
  You should turn off virus and anti-spy software. And thats done within each of the softwares themselves. Then turn them back on later after setting up any DSL softwares.
Method:
  Initialize random walk with document words
  Retrieve top k synsets
  Introduce words on those k synsets in a secondary index
  When retrieving, use both primary and secondary indexes
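The expansion method can be sketched as follows. Everything here is a hypothetical illustration: the concept graph, the word-to-concept tables, and the function names are invented for the example, and a real system would use WordNet synsets and its own ranking details.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=30):
    """Power iteration with teleport mass restricted to the seed nodes.

    Assumes every node has at least one neighbour and every seed is a node.
    """
    seeds = set(seeds)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in graph}
        for node, neighbours in graph.items():
            share = damping * rank[node] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        rank = nxt
    return rank

def expand_document(doc_words, graph, word_to_concepts, concept_to_words, k=2):
    """Return the words of the top-k concepts that the document lacks."""
    present = set(doc_words)
    # Step 1: seed the walk with the concepts of the document words.
    seeds = {c for w in present for c in word_to_concepts.get(w, [])}
    if not seeds:
        return []
    rank = personalized_pagerank(graph, seeds)
    # Step 2: keep the k highest-ranked concepts (ties broken by name).
    top = sorted(rank, key=lambda c: (-rank[c], c))[:k]
    # Step 3: collect their words; these would go into the secondary index.
    expansion = []
    for concept in top:
        for word in concept_to_words[concept]:
            if word not in present and word not in expansion:
                expansion.append(word)
    return expansion

# Hypothetical toy data standing in for WordNet.
CONCEPT_GRAPH = {
    "virus#n": ["antivirus#n", "software#n"],
    "antivirus#n": ["virus#n", "software#n"],
    "software#n": ["virus#n", "antivirus#n", "dsl#n"],
    "dsl#n": ["software#n"],
}
WORD_TO_CONCEPTS = {
    "virus": ["virus#n"], "antivirus": ["antivirus#n"],
    "software": ["software#n"], "program": ["software#n"], "dsl": ["dsl#n"],
}
CONCEPT_TO_WORDS = {
    "virus#n": ["virus"], "antivirus#n": ["antivirus"],
    "software#n": ["software", "program"], "dsl#n": ["dsl"],
}
DOC = ["turn", "off", "virus", "software", "dsl"]
```

With k set to the number of concepts, expand_document(DOC, CONCEPT_GRAPH, WORD_TO_CONCEPTS, CONCEPT_TO_WORDS, k=4) yields the words "program" and "antivirus", which is the kind of term that lets the document match a query mentioning an antivirus program.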

78 Similarity and Information Retrieval
Experiments
BM25 ranking function
Combine 2 indexes: original words and expansion terms
Parameters:
  k1, b (BM25)
  λ (indices)
  k (concepts in expansion)
Three collections:
  Robust at CLEF 2009
  Yahoo! Answers
  ResPubliQA (IR for QA)
Summary of results:
  Default parameters: 1.43% % improvement in all 3 datasets
  Optimized parameters: 0.98% % improvement in 2 datasets
  Carrying parameters: 5.77% % improvement in 4 out of 6
Robustness
Particularly on short documents
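A rough sketch of scoring with two indexes follows. The slide only names the parameters, so the way the indexes are combined here (a linear interpolation weighted by λ) is an assumption, as are the function names, the default k1 and b values, and the toy documents. The BM25 formula is the common textbook variant.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """BM25 score of each tokenized document in docs for the query terms."""
    n = len(docs)
    avgdl = (sum(len(d) for d in docs) / n) or 1.0  # guard against empty index
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1.0 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

def combined_scores(query, primary, secondary, lam=0.5, k1=1.2, b=0.75):
    """Assumed combination: (1 - lam) * primary score + lam * secondary score."""
    p = bm25_scores(query, primary, k1, b)
    s = bm25_scores(query, secondary, k1, b)
    return [(1.0 - lam) * pi + lam * si for pi, si in zip(p, s)]

# Hypothetical toy collection: original words and expansion terms per document.
PRIMARY = [["turn", "off", "virus", "software"], ["install", "printer", "driver"]]
SECONDARY = [["antivirus", "program"], ["device"]]
QUERY = ["antivirus", "dsl"]
```

In this toy case neither original document contains a query term, so with lam=0.0 both scores are zero; with lam=0.5 the first document is ranked higher because its expansion terms match "antivirus".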

81 Conclusions
Outline
  1 WordNet, PageRank and Personalized PageRank
  2 Random walks for WSD
  3 Adapting WSD to domains
  4 WSD on the biomedical domain
  5 Random walks for similarity
  6 Similarity and Information Retrieval
  7 Conclusions

82 Conclusions
Conclusions
Knowledge-based method for similarity and WSD
  Based on random walks
  Exploits whole structure of underlying KB efficiently
Performance:
  Similarity: best KB algorithm, comparable with 1.6 Tword, slightly below ESA
  WSD: Best KB algorithm S2AW, S3AW, Domains datasets
  WSD and domains:
    Better than supervised WSD when adapting to domains (Sports, Finance)
    Best KB algorithm in Biomedical texts

84 Conclusions
Conclusions
Applications: ad-hoc Information Retrieval
  performance gains and robustness
Easily ported to other languages
  Provides cross-lingual similarity
  Only requirement of having a WordNet
Publicly available at
  Both programs and data (WordNet, UMLS)
  Including program to construct graphs from new KB (e.g. Wikipedia)
  GPL license, open source, free

87 Conclusions
Future work
  Domains and WSD: interrelation of domain classification and WSD
  Named Entity Disambiguation using Wikipedia
  Information Retrieval: other options to combine similarity information
  Moving to sentence similarity: Semeval task on Semantic Text Similarity

89 Conclusions
References I
Agirre, E., Arregi, X. and Otegi, A. (2010). Document Expansion Based on WordNet for Robust IR. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling), pp. 9-17.
Agirre, E., de Lacalle, O. L. and Soroa, A. (2009). Knowledge-Based WSD on Specific Domains: Performing better than Generic Supervised WSD. In Proceedings of IJCAI, Pasadena, USA.
Agirre, E. and Soroa, A. (2009). Personalizing PageRank for Word Sense Disambiguation. In Proceedings of EACL-09, Athens, Greece.

90 Conclusions
References II
Agirre, E., Soroa, A., Alfonseca, E., Hall, K., Kravalova, J. and Pasca, M. (2009). A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Boulder, USA.
Agirre, E., Soroa, A. and Stevenson, M. (2010). Graph-based Word Sense Disambiguation of Biomedical Documents. Bioinformatics 26.
Agirre, E., Cuadros, M., Rigau, G. and Soroa, A. (2010). Exploring Knowledge Bases for Similarity. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), (Calzolari, N., Choukri, K. et al., eds.), European Language Resources Association (ELRA), Valletta, Malta.

91 Conclusions
References III
Otegi, A., Arregi, X. and Agirre, E. (2011). Query Expansion for IR using Knowledge-Based Relatedness. In Proceedings of the International Joint Conference on Natural Language Processing.
Stevenson, M., Agirre, E. and Soroa, A. (2011). Exploiting Domain Information for Word Sense Disambiguation of Medical Documents. Journal of the American Medical Informatics Association, in press, 1-6.
Yeh, E., Ramage, D., Manning, C., Agirre, E. and Soroa, A. (2009). WikiWalk: Random walks on Wikipedia for Semantic Relatedness. In ACL workshop TextGraphs-4: Graph-based Methods for Natural Language Processing.

Knowledge-Based WSD on Specific Domains: Performing Better than Generic Supervised WSD

Knowledge-Based WSD on Specific Domains: Performing Better than Generic Supervised WSD Knowledge-Based WSD on Specific Domains: Performing Better than Generic Supervised WSD Eneko Agirre and Oier Lopez de Lacalle and Aitor Soroa Informatika Fakultatea, University of the Basque Country 20018,

More information

Word Sense Disambiguation as an Integer Linear Programming Problem

Word Sense Disambiguation as an Integer Linear Programming Problem Word Sense Disambiguation as an Integer Linear Programming Problem Vicky Panagiotopoulou 1, Iraklis Varlamis 2, Ion Androutsopoulos 1, and George Tsatsaronis 3 1 Department of Informatics, Athens University

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Identifying free text plagiarism based on semantic similarity

Identifying free text plagiarism based on semantic similarity Identifying free text plagiarism based on semantic similarity George Tsatsaronis Norwegian University of Science and Technology Department of Computer and Information Science Trondheim, Norway [email protected]

More information

Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets

Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells Computer Science Dep., Universidad Autonoma de Madrid, 28049 Madrid, Spain

More information

Natural Language Processing. Part 4: lexical semantics

Natural Language Processing. Part 4: lexical semantics Natural Language Processing Part 4: lexical semantics 2 Lexical semantics A lexicon generally has a highly structured form It stores the meanings and uses of each word It encodes the relations between

More information

Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy)

Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy) Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy) Multilingual Word Sense Disambiguation and Entity Linking on the Web based on BabelNet Roberto Navigli, Tiziano

More information

Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base

Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base Aitor González Agirre, Egoitz Laparra, German Rigau Basque Country University Donostia, Basque Country {aitor.gonzalez,

More information

Comparing Ontology-based and Corpusbased Domain Annotations in WordNet.

Comparing Ontology-based and Corpusbased Domain Annotations in WordNet. Comparing Ontology-based and Corpusbased Domain Annotations in WordNet. A paper by: Bernardo Magnini Carlo Strapparava Giovanni Pezzulo Alfio Glozzo Presented by: rabee ali alshemali Motive. Domain information

More information

Building a Spanish MMTx by using Automatic Translation and Biomedical Ontologies

Building a Spanish MMTx by using Automatic Translation and Biomedical Ontologies Building a Spanish MMTx by using Automatic Translation and Biomedical Ontologies Francisco Carrero 1, José Carlos Cortizo 1,2, José María Gómez 3 1 Universidad Europea de Madrid, C/Tajo s/n, Villaviciosa

More information

Exploiting Comparable Corpora and Bilingual Dictionaries. the Cross Language Text Categorization

Exploiting Comparable Corpora and Bilingual Dictionaries. the Cross Language Text Categorization Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization Alfio Gliozzo and Carlo Strapparava ITC-Irst via Sommarive, I-38050, Trento, ITALY {gliozzo,strappa}@itc.it

More information

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS Gürkan Şahin 1, Banu Diri 1 and Tuğba Yıldız 2 1 Faculty of Electrical-Electronic, Department of Computer Engineering

More information

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

More information

An Efficient Database Design for IndoWordNet Development Using Hybrid Approach

An Efficient Database Design for IndoWordNet Development Using Hybrid Approach An Efficient Database Design for IndoWordNet Development Using Hybrid Approach Venkatesh P rabhu 2 Shilpa Desai 1 Hanumant Redkar 1 N eha P rabhugaonkar 1 Apur va N agvenkar 1 Ramdas Karmali 1 (1) GOA

More information

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde Statistical Verb-Clustering Model soft clustering: Verbs may belong to several clusters trained on verb-argument tuples clusters together verbs with similar subcategorization and selectional restriction

More information

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision

Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision Viktor PEKAR Bashkir State University Ufa, Russia, 450000 [email protected] Steffen STAAB Institute AIFB,

More information

3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work

3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work Unsupervised Paraphrase Acquisition via Relation Discovery Takaaki Hasegawa Cyberspace Laboratories Nippon Telegraph and Telephone Corporation 1-1 Hikarinooka, Yokosuka, Kanagawa 239-0847, Japan [email protected]

More information

Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation

Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation Denis Turdakov, Pavel Velikhov ISP RAS [email protected], [email protected]

More information

Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu

Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu Domain Adaptive Relation Extraction for Big Text Data Analytics Feiyu Xu Outline! Introduction to relation extraction and its applications! Motivation of domain adaptation in big text data analytics! Solutions!

More information

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Exam on the 5th of February, 216, 14. to 16. If you wish to attend, please

More information

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 [email protected] Christopher D. Manning Department of

More information

Enterprise Search Solutions Based on Target Corpus Analysis and External Knowledge Repositories

Enterprise Search Solutions Based on Target Corpus Analysis and External Knowledge Repositories Enterprise Search Solutions Based on Target Corpus Analysis and External Knowledge Repositories Thesis submitted in partial fulfillment of the requirements for the degree of M.S. by Research in Computer

More information

Clustering Connectionist and Statistical Language Processing

Clustering Connectionist and Statistical Language Processing Clustering Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised

More information

A stream computing approach towards scalable NLP

A stream computing approach towards scalable NLP A stream computing approach towards scalable NLP Xabier Artola, Zuhaitz Beloki, Aitor Soroa IXA group. University of the Basque Country. LREC, Reykjavík 2014 Table of contents 1

More information

Social Media Mining. Network Measures

Social Media Mining. Network Measures Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the like-minded users

More information

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction. Günter Neumann, DFKI, 2012 Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

More information

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University

More information

Analyzing survey text: a brief overview

Analyzing survey text: a brief overview IBM SPSS Text Analytics for Surveys Analyzing survey text: a brief overview Learn how gives you greater insight Contents 1 Introduction 2 The role of text in survey research 2 Approaches to text mining

More information

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg March 1, 2007 The catalogue is organized into sections of (1) obligatory modules ( Basismodule ) that

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

More information

LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task

LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task Jacinto Mata, Mariano Crespo, Manuel J. Maña Dpto. de Tecnologías de la Información. Universidad de Huelva Ctra. Huelva - Palos de la Frontera s/n.

More information

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014 Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)

More information

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS

More information

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR Arati K. Deshpande 1 and Prakash. R. Devale 2 1 Student and 2 Professor & Head, Department of Information Technology, Bharati

More information

Interactive Dynamic Information Extraction

Interactive Dynamic Information Extraction Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken

More information

Improving Question Retrieval in Community Question Answering Using World Knowledge

Improving Question Retrieval in Community Question Answering Using World Knowledge Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Improving Question Retrieval in Community Question Answering Using World Knowledge Guangyou Zhou, Yang Liu, Fang

More information

Social Network Mining

Social Network Mining Social Network Mining Data Mining November 11, 2013 Frank Takes ([email protected]) LIACS, Universiteit Leiden Overview Social Network Analysis Graph Mining Online Social Networks Friendship Graph Semantics

More information

Dynamical Clustering of Personalized Web Search Results

Dynamical Clustering of Personalized Web Search Results Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC [email protected] Hong Cheng CS Dept, UIUC [email protected] Abstract Most current search engines present the user a ranked

More information

A Survey on Product Aspect Ranking Techniques

A Survey on Product Aspect Ranking Techniques A Survey on Product Aspect Ranking Techniques Ancy. J. S, Nisha. J.R P.G. Scholar, Dept. of C.S.E., Marian Engineering College, Kerala University, Trivandrum, India. Asst. Professor, Dept. of C.S.E., Marian

More information

A Sentiment Analysis Model Integrating Multiple Algorithms and Diverse. Features. Thesis

A Sentiment Analysis Model Integrating Multiple Algorithms and Diverse. Features. Thesis A Sentiment Analysis Model Integrating Multiple Algorithms and Diverse Features Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The

More information

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D. Data Mining on Social Networks Dionysios Sotiropoulos Ph.D. 1 Contents What are Social Media? Mathematical Representation of Social Networks Fundamental Data Mining Concepts Data Mining Tasks on Digital

More information

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari [email protected]

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it Web Mining Margherita Berardi LACAM Dipartimento di Informatica Università degli Studi di Bari [email protected] Bari, 24 Aprile 2003 Overview Introduction Knowledge discovery from text (Web Content

More information

Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure

Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure Fuminori Kimura Faculty of Culture and Information Science, Doshisha University 1 3 Miyakodani Tatara, Kyoutanabe-shi,

More information

Customer Intentions Analysis of Twitter Based on Semantic Patterns

Customer Intentions Analysis of Twitter Based on Semantic Patterns Customer Intentions Analysis of Twitter Based on Semantic Patterns Mohamed Hamroun [email protected] Mohamed Salah Gouider [email protected] Lamjed Ben Said [email protected] ABSTRACT

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

An Overview of Computational Advertising

An Overview of Computational Advertising An Overview of Computational Advertising Evgeniy Gabrilovich in collaboration with many colleagues throughout the company 1 What is Computational Advertising? New scientific sub-discipline that provides

More information

Kybots, knowledge yielding robots German Rigau IXA group, UPV/EHU http://ixa.si.ehu.es

Kybots, knowledge yielding robots German Rigau IXA group, UPV/EHU http://ixa.si.ehu.es KYOTO () Intelligent Content and Semantics Knowledge Yielding Ontologies for Transition-Based Organization http://www.kyoto-project.eu/ Kybots, knowledge yielding robots German Rigau IXA group, UPV/EHU

More information

Recognizing and Encoding Disorder Concepts in Clinical Text using Machine Learning and Vector Space Model *

Recognizing and Encoding Disorder Concepts in Clinical Text using Machine Learning and Vector Space Model * Recognizing and Encoding Disorder Concepts in Clinical Text using Machine Learning and Vector Space Model * Buzhou Tang 1,2, Yonghui Wu 1, Min Jiang 1, Joshua C. Denny 3, and Hua Xu 1,* 1 School of Biomedical

More information

Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang

Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang Sense-Tagging Verbs in English and Chinese Hoa Trang Dang Department of Computer and Information Sciences University of Pennsylvania [email protected] October 30, 2003 Outline English sense-tagging

More information

UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems

UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems Lushan Han, Abhay Kashyap, Tim Finin Computer Science and Electrical Engineering University of Maryland, Baltimore County Baltimore MD 21250 {lushan1,abhay1,finin}@umbc.edu

More information

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,

More information

ONTOLOGIES A short tutorial with references to YAGO Cosmina CROITORU

ONTOLOGIES A short tutorial with references to YAGO Cosmina CROITORU ONTOLOGIES p. 1/40 ONTOLOGIES A short tutorial with references to YAGO Cosmina CROITORU Unlocking the Secrets of the Past: Text Mining for Historical Documents Blockseminar, 21.2.-11.3.2011 ONTOLOGIES

More information

Keyphrase Extraction for Scholarly Big Data

Keyphrase Extraction for Scholarly Big Data Keyphrase Extraction for Scholarly Big Data Cornelia Caragea Computer Science and Engineering University of North Texas July 10, 2015 Scholarly Big Data Large number of scholarly documents on the Web PubMed

More information

Word Completion and Prediction in Hebrew

Word Completion and Prediction in Hebrew Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

More information

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach -

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Philipp Sorg and Philipp Cimiano Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany {sorg,cimiano}@aifb.uni-karlsruhe.de

More information

A Software Tool for Thesauri Management, Browsing and Supporting Advanced Searches

A Software Tool for Thesauri Management, Browsing and Supporting Advanced Searches J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria 105 A Software Tool for Thesauri Management, Browsing and Supporting Advanced Searches J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria

More information

TextRank: Bringing Order into Texts

TextRank: Bringing Order into Texts TextRank: Bringing Order into Texts Rada Mihalcea and Paul Tarau Department of Computer Science University of North Texas rada,tarau @cs.unt.edu Abstract In this paper, we introduce TextRank a graph-based

More information

Stock Market Prediction Using Data Mining

Stock Market Prediction Using Data Mining Stock Market Prediction Using Data Mining 1 Ruchi Desai, 2 Prof.Snehal Gandhi 1 M.E., 2 M.Tech. 1 Computer Department 1 Sarvajanik College of Engineering and Technology, Surat, Gujarat, India Abstract

More information

What Is This, Anyway: Automatic Hypernym Discovery

What Is This, Anyway: Automatic Hypernym Discovery What Is This, Anyway: Automatic Hypernym Discovery Alan Ritter and Stephen Soderland and Oren Etzioni Turing Center Department of Computer Science and Engineering University of Washington Box 352350 Seattle,

More information

An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System

An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System Asanee Kawtrakul ABSTRACT In information-age society, advanced retrieval technique and the automatic

More information

Text Mining for Health Care and Medicine. Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk

Text Mining for Health Care and Medicine. Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk Text Mining for Health Care and Medicine Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk The Need for Text Mining MEDLINE 2005: ~14M 2009: ~18M Overwhelming information in textual,

More information

PoS-tagging Italian texts with CORISTagger

PoS-tagging Italian texts with CORISTagger PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance

More information

AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS

Alok Ranjan Pal 1, 3, Anirban Kundu 2, 3, Abhay Singh 1, Raj Shekhar 1, Kunal Sinha 1 1 College of Engineering and Management,

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

Firoj Alam 1, Anna Corazza 2, Alberto Lavelli 3, and Roberto Zanoli 3 1 Dept. of Information Eng. and Computer Science, University of Trento,

ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS

Divyanshu Chandola 1, Aditya Garg 2, Ankit Maurya 3, Amit Kushwaha 4 1 Student, Department of Information Technology, ABES Engineering College, Uttar Pradesh,

UEdin: Translating L1 Phrases in L2 Context using Context-Sensitive SMT

Eva Hasler ILCC, School of Informatics University of Edinburgh Abstract We describe our systems for the SemEval

Travis Goodwin & Sanda Harabagiu

Automatic Generation of a Qualified Medical Knowledge Graph and its Usage for Retrieving Patient Cohorts from Electronic Medical Records Travis Goodwin & Sanda Harabagiu Human Language Technology Research

Evaluation of a layered approach to question answering over linked data

Sebastian Walter 1, Christina Unger 1, Philipp Cimiano 1, and Daniel Bär 2 1 CITEC, Bielefeld University, Germany http://www.sc.cit-ec.uni-bielefeld.de/

Taxonomies for Auto-Tagging Unstructured Content. Heather Hedden Hedden Information Management Text Analytics World, Boston, MA October 1, 2013

About Heather Hedden Independent taxonomy consultant, Hedden

Bridging CAQDAS with text mining: Text analyst's toolbox for Big Data: Science in the Media Project

Ahmet Suerdem Istanbul Bilgi University; LSE Methodology Dept. Science in the media project is funded

Search Engine Based Intelligent Help Desk System: iassist

Sahil K. Shah, Prof. Sheetal A. Takale Information Technology Department VPCOE, Baramati, Maharashtra, India

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,