Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

1 Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov

2 Introduction
Text mining is a term generally used to denote any system that analyzes large quantities of natural language text and detects lexical or linguistic usage patterns in an attempt to extract probably useful (although only probably correct) information.
Term disambiguation:
- Text mining
- Data mining
- Natural language processing
3/16/2014 Text Mining 2

3 Outline
- Introduction
- Mining plain text
  - Extracting information for human consumption
  - Assessing document similarity
  - Extracting structured information
- Techniques
  - Collocations
  - Statistical inference
  - Word sense disambiguation
  - Part-of-speech tagging
- Tools

4 Reference
Christopher D. Manning, Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press.

5 Mining Plain Text
- Extracting information for human consumption
  - Text summarization
  - Document retrieval
  - Information retrieval
- Assessing document similarity
  - Text categorization
  - Document clustering
  - Language identification
  - Ascribing authorship
  - Identifying key-phrases
- Extracting structured information
  - Entity extraction
  - Information extraction

6 Entity Extraction
- Named entities
  - Names of people, places, organizations, products
  - Addresses, URLs
  - Dates, numbers, sums of money
- Acronyms and their definitions
- Multiword terms
- Dictionary-based approach
- Capitalization and punctuation patterns
- Regular expressions
- Explicit grammars
- Heuristics
- Machine learning
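
The regular-expression and capitalization-pattern approaches listed above can be illustrated with a minimal sketch. The patterns, the `extract_entities` helper, and the sample sentence are all assumptions made for illustration; they are far too crude for real text and are not taken from the slides.

```python
import re

# Hypothetical patterns for illustration only; real systems need many more cases.
PATTERNS = {
    "url": re.compile(r"https?://[^\s,]+"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    # Capitalization pattern: runs of two or more capitalized words,
    # a crude proxy for personal names.
    "name": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b"),
}

def extract_entities(text):
    """Return a dict mapping each entity type to the list of matched strings."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

sample = "Boris Novikov paid $1,250.50 on 3/16/2014; see http://example.org/talk for details."
entities = extract_entities(sample)
for label, matches in entities.items():
    print(label, matches)
```

A dictionary-based or machine-learned extractor would typically replace the `name` pattern, since capitalization alone also fires on sentence-initial words and titles.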

7 Information Extraction
- Events with attributes
- Entity extraction
- Relationship extraction
- Co-reference ambiguity
- Syntactic parsing of the text
- Small finite-state grammars
- Machine learning

8 Looking at Text
- Low-level formatting issues
  - Junk formatting/content
  - Uppercase and lowercase
- Tokenization: what is a word?
  - Graphic word
  - Whitespace
  - Problems: periods, single apostrophes, hyphenation, homographs
- Morphology
  - Stemming
  - Lemmatization
- Sentences
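
To make the tokenization problems above concrete, here is a small sketch of a regular-expression tokenizer. The rule set (keep internal apostrophes and hyphens, keep decimal numbers whole, split off all other punctuation, lowercase everything) is one assumed policy among many, and the `tokenize` name is hypothetical.

```python
import re

# Assumed rule set: decimal numbers stay whole, apostrophes and hyphens
# inside a word are kept, and every other punctuation mark is its own token.
TOKEN = re.compile(r"\d+(?:\.\d+)?|\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text):
    """Lowercase the text and split it into tokens using the TOKEN pattern."""
    return TOKEN.findall(text.lower())

tokens = tokenize("Mr. O'Neil's state-of-the-art parser costs 3.50 dollars.")
print(tokens)
```

Note that the period after "Mr" is split off exactly like the sentence-final period, which is the abbreviation problem the slide lists under "periods"; lowercasing likewise discards the capitalization cue that entity extraction relies on.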

9 Collocations
- Most frequently occurring bi-grams
  - Part-of-speech filter
- Word collocation window
  - Based on mean and variance of the offsets
  - Filter out flat peaks
- Hypothesis testing
  - Null hypothesis of independence: $P(w_1 w_2) = P(w_1) P(w_2)$
  - t test
  - Pearson's chi-squared test
  - Likelihood ratios
- Mutual information
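
As a rough illustration of the hypothesis-testing and mutual-information bullets, the sketch below scores a bigram with a t statistic against the independence hypothesis and with pointwise mutual information. The function names and all counts are hypothetical; the variance uses the Bernoulli form $\bar{x}(1 - \bar{x})$, which is an assumption about how the test is set up.

```python
import math

def t_score(c1, c2, c12, n):
    """t statistic for H0: P(w1 w2) = P(w1) P(w2), given n bigram tokens."""
    x_bar = c12 / n                  # observed bigram probability
    mu = (c1 / n) * (c2 / n)         # expected probability under independence
    variance = x_bar * (1 - x_bar)   # Bernoulli variance of the bigram indicator
    return (x_bar - mu) / math.sqrt(variance / n)

def pmi(c1, c2, c12, n):
    """Pointwise mutual information (in bits) between w1 and w2."""
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))

# Hypothetical counts: w1 occurs 1000 times, w2 500 times,
# the bigram "w1 w2" 100 times, in a corpus of 1,000,000 bigram tokens.
t = t_score(1000, 500, 100, 1_000_000)
score = pmi(1000, 500, 100, 1_000_000)
print(round(t, 2), round(score, 2))
```

A large t value argues against independence, while mutual information tends to overrate rare pairs, so the two rankings often disagree on low-frequency bigrams.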

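10 Statistical Inference: n-gram models
- Task: estimate $P(w_n \mid w_1, \ldots, w_{n-1})$
- Markov assumption: only the prior local context affects the next word
- Statistical estimators: $P(w_n \mid w_1, \ldots, w_{n-1}) = \frac{P(w_1, \ldots, w_n)}{P(w_1, \ldots, w_{n-1})}$
  - Maximum likelihood estimate: $P(w_1, \ldots, w_n) = \frac{C(w_1, \ldots, w_n)}{N}$ and $P(w_n \mid w_1, \ldots, w_{n-1}) = \frac{C(w_1, \ldots, w_n)}{C(w_1, \ldots, w_{n-1})}$
  - Laplace law: $P(w_1, \ldots, w_n) = \frac{C(w_1, \ldots, w_n) + 1}{N + B}$
  - Lidstone law: $P(w_1, \ldots, w_n) = \frac{C(w_1, \ldots, w_n) + \lambda}{N + B\lambda}$
  - Held out estimation: $T_r = \sum_{\{w_1, \ldots, w_n \,:\, C_1(w_1, \ldots, w_n) = r\}} C_2(w_1, \ldots, w_n)$ and $P(w_1, \ldots, w_n) = \frac{T_r}{T N_r}$, where $C(w_1, \ldots, w_n) = r$
  - Deleted estimation: $P(w_1, \ldots, w_n) = \frac{T_r^{01}}{N N_r^0}$ and $P(w_1, \ldots, w_n) = \frac{T_r^{01} + T_r^{10}}{N (N_r^0 + N_r^1)}$, where $C(w_1, \ldots, w_n) = r$
- Combining estimators
  - Simple linear interpolation
  - Katz back-off
  - General linear interpolation
  - Witten-Bell smoothing
Here $N$ is the number of training n-grams and $B$ is the number of bins, that is, of possible n-grams.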
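
The maximum likelihood, Laplace, and Lidstone estimators above differ only in the value of $\lambda$, which the following sketch makes explicit for bigrams. The toy corpus, the helper names, and the choice of $B$ as the squared vocabulary size are assumptions for illustration.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))

def lidstone(ngram, counts, n_total, bins, lam=1.0):
    """P(ngram) = (C + lam) / (N + B * lam); lam=1 is Laplace, lam=0 is MLE."""
    return (counts[ngram] + lam) / (n_total + bins * lam)

tokens = "the cat sat on the mat the cat ran".split()
counts = bigram_counts(tokens)
n_total = sum(counts.values())   # N: number of bigram tokens in the training data
vocab = sorted(set(tokens))
bins = len(vocab) ** 2           # B: number of possible bigrams (assumed V squared)

seen, unseen = ("the", "cat"), ("cat", "mat")
for lam in (0.0, 0.5, 1.0):
    print(lam, lidstone(seen, counts, n_total, bins, lam),
          lidstone(unseen, counts, n_total, bins, lam))
```

With $\lambda = 0$ the unseen bigram gets probability zero; any positive $\lambda$ moves some probability mass from seen to unseen bigrams while keeping the distribution normalized over all $B$ bins.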

11 Word Sense Disambiguation
- Supervised disambiguation
  - Bayesian classification
  - Information-theoretic approach
- Unsupervised disambiguation
  - EM algorithm for learning a word sense clustering
- Constraint-based
- Resource-based
  - Dictionary-based
  - Thesaurus-based

12 Word Sense Disambiguation
- Bayesian classification: use information from words in the context window to help in the disambiguation decision
  - Bayes decision rule: $P(s_k \mid c) = \frac{P(c \mid s_k)}{P(c)} P(s_k)$
  - Naïve Bayes assumption: $P(c \mid s_k) = \prod_{v_j \in c} P(v_j \mid s_k)$
  - Maximum-likelihood estimation: $P(v_j \mid s_k) = \frac{C(v_j, s_k)}{\sum_t C(v_t, s_k)}$ and $P(s_k) = \frac{C(s_k)}{C(w)}$
- Information-theoretic approach: find a single contextual feature that reliably indicates which sense of the ambiguous word is being used
  - Mutual information
  - Flip-flop algorithm
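
The Bayesian classification bullets can be sketched as a tiny naïve Bayes sense classifier. The training examples for the ambiguous word "bank", the function names, and the add-$\lambda$ smoothing are assumptions added for illustration; the slide itself states plain maximum-likelihood estimates, which would assign zero probability to any unseen context word.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (sense, context_words). Returns the count tables."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    for sense, context in examples:
        sense_counts[sense] += 1             # C(s_k)
        word_counts[sense].update(context)   # C(v_j, s_k)
    return sense_counts, word_counts

def disambiguate(context, sense_counts, word_counts, lam=1.0):
    """Pick the sense maximizing log P(s_k) + sum_j log P(v_j | s_k)."""
    vocab = {w for counter in word_counts.values() for w in counter}
    total = sum(sense_counts.values())
    best, best_score = None, -math.inf
    for sense, c_s in sense_counts.items():
        score = math.log(c_s / total)
        # Smoothed denominator (assumption: add-lambda over the known vocabulary).
        denom = sum(word_counts[sense].values()) + lam * len(vocab)
        for v in context:
            score += math.log((word_counts[sense][v] + lam) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

# Hypothetical labeled contexts for the ambiguous word "bank".
examples = [
    ("finance", ["money", "loan", "deposit"]),
    ("finance", ["account", "money", "interest"]),
    ("river", ["water", "shore", "fishing"]),
    ("river", ["river", "water", "mud"]),
]
sense_counts, word_counts = train(examples)
print(disambiguate(["loan", "money"], sense_counts, word_counts))
print(disambiguate(["water", "fishing"], sense_counts, word_counts))
```

The term $P(c)$ from the Bayes rule is dropped because it is the same for every sense and does not change the arg max.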

13 Word Sense Disambiguation
- Disambiguation based on sense definitions
- Thesaurus-based disambiguation
  - Walker approach
  - Yarowsky approach

14 Word Sense Disambiguation
- One sense per discourse: the sense of a target word is highly consistent within any given document.
- One sense per collocation: nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order, and syntactic relationship.

15 Word Sense Disambiguation
- EM algorithm for learning a word sense clustering
  - Parameters of the model
  - Log-likelihood of the corpus C
  - Initialize parameters of the model randomly
  - E-step
  - M-step
- Naïve Bayes assumption

16 Part-of-Speech Tagging
- Markov model taggers
  - Probabilistic model
  - Viterbi algorithm
- Transformation-based learning of tags

17 Part-of-Speech Tagging
- Markov model taggers: probabilistic model
  - Limited horizon: $P(X_{i+1} = t^j \mid X_1, \ldots, X_i) = P(X_{i+1} = t^j \mid X_i)$
  - Time invariant: $P(X_{i+1} = t^j \mid X_i) = P(X_2 = t^j \mid X_1)$
- Optimal tags for a sentence
  - Bayes rule: $\arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n}) P(t_{1,n})$
  - Words are independent of each other
  - A word depends only on its tag
  - Therefore: $\arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n}) P(t_{1,n}) = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i) P(t_i \mid t_{i-1})$
- Probability estimates: $P(t^k \mid t^j) = \frac{C(t^j, t^k)}{C(t^j)}$ and $P(w^l \mid t^j) = \frac{C(w^l : t^j)}{C(t^j)}$
- Viterbi algorithm
  - Functions
    - $\delta_i(j)$: probability of being in state $j$ (tag $j$) at word $i$
    - $\varphi_{i+1}(j)$: the most likely state (tag) at word $i$ given that we are in state $j$ at word $i+1$
  - Initialization
  - Induction: $\delta_{i+1}(t^j) = \max_k [\delta_i(t^k) P(t^j \mid t^k) P(w_{i+1} \mid t^j)]$ and $\varphi_{i+1}(t^j) = \arg\max_k [\delta_i(t^k) P(t^j \mid t^k) P(w_{i+1} \mid t^j)]$
  - Termination and path read-out: $X_{n+1} = \arg\max_j \delta_{n+1}(t^j)$ and $X_i = \varphi_{i+1}(X_{i+1})$
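
The Viterbi recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' implementation: it works in log space, uses a start symbol `<s>` instead of the slide's indexing, treats missing probabilities as a tiny floor value, and all tags and probabilities in the toy model are hypothetical.

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Most likely tag sequence under a bigram HMM, computed in log space.

    trans[(prev_tag, tag)] and emit[(tag, word)] hold probabilities;
    missing entries fall back to a small assumed floor value.
    """
    floor = 1e-12

    def lp(table, key):
        return math.log(table.get(key, floor))

    # delta[t]: log-probability of the best path ending in tag t at the current word
    delta = {t: lp(trans, (start, t)) + lp(emit, (t, words[0])) for t in tags}
    backpointers = []
    for word in words[1:]:
        new_delta, pointers = {}, {}
        for t in tags:
            # Induction step: best previous tag for reaching t at this word
            prev = max(tags, key=lambda k: delta[k] + lp(trans, (k, t)))
            new_delta[t] = delta[prev] + lp(trans, (prev, t)) + lp(emit, (t, word))
            pointers[t] = prev
        delta = new_delta
        backpointers.append(pointers)
    # Termination and path read-out: follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: delta[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy model with three tags.
tags = ["DT", "NN", "VB"]
trans = {("<s>", "DT"): 0.8, ("<s>", "NN"): 0.1, ("<s>", "VB"): 0.1,
         ("DT", "NN"): 0.9, ("DT", "VB"): 0.05, ("DT", "DT"): 0.05,
         ("NN", "VB"): 0.6, ("NN", "NN"): 0.3, ("NN", "DT"): 0.1,
         ("VB", "DT"): 0.5, ("VB", "NN"): 0.4, ("VB", "VB"): 0.1}
emit = {("DT", "the"): 0.9, ("NN", "dog"): 0.4, ("NN", "barks"): 0.05,
        ("VB", "barks"): 0.3, ("VB", "dog"): 0.01}
best_path = viterbi(["the", "dog", "barks"], tags, trans, emit)
print(best_path)
```

Working with log-probabilities avoids numerical underflow on long sentences; the arg max is unchanged because the logarithm is monotonic.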

18 Part-of-Speech Tagging
- Transformation-based learning of tags
  - Replace tag $t^1$ with tag $t^2$
  - Extract rules while tagging error decreases

19 Tools
- OpenNLP
  - Sentence detector
  - Tokenizer
  - Name finder
  - Document categorizer
  - Part-of-speech tagger
  - Chunker
  - Parser
  - Co-reference resolution
- GATE
  - Tokenizer
  - Gazetteer
  - Sentence splitter
  - Part-of-speech tagger
  - Named entities transducer
  - Co-reference tagger
- cTAKES
  - Sentence boundary detector
  - Rule-based and context-dependent tokenizer
  - Normalizer
  - Part-of-speech tagger
  - Phrasal chunker
  - Dictionary lookup annotator
  - Context annotator
  - Negation detector
  - Dependency parser
- UIMA
  - Component interfaces in an analytics pipeline
  - Set of design patterns
  - Data representations
    - In-memory representation of annotations for high-performance analytics
    - XML representation of annotations for integration with remote web services
  - Development roles allowing tools to be used by users with diverse skills
