Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov



Similar documents
Clustering Connectionist and Statistical Language Processing

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction

Course: Model, Learning, and Inference: Lecture 5

Natural Language Processing

Natural Language Processing in the EHR Lifecycle

POS Tagging 1. POS Tagging. Rule-based taggers Statistical taggers Hybrid approaches

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Speech and Language Processing

Introduction to IE with GATE

Brill s rule-based PoS tagger

Simple Language Models for Spam Detection

A Mixed Trigrams Approach for Context Sensitive Spell Checking

A Method for Automatic De-identification of Medical Records

Natural Language to Relational Query by Using Parsing Compiler

Schema documentation for types1.2.xsd

Tagging with Hidden Markov Models

ARABIC PERSON NAMES RECOGNITION BY USING A RULE BASED APPROACH

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Applications of Named Entity Recognition in Customer Relationship Management Systems

Principles of Data Mining by Hand&Mannila&Smyth

Terminology Extraction from Log Files

Machine Learning for natural language processing

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Language and Computation

Applying Co-Training Methods to Statistical Parsing. Anoop Sarkar anoop/

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

Semantic annotation of requirements for automatic UML class diagram generation

PoS-tagging Italian texts with CORISTagger

Special Topics in Computer Science

Context Grammar and POS Tagging

Survey Results: Requirements and Use Cases for Linguistic Linked Data

Shallow Parsing with Apache UIMA

CS 6740 / INFO Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage

RRSS - Rating Reviews Support System purpose built for movies recommendation

Reliable and Cost-Effective PoS-Tagging

ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS

English Grammar Checker

Natural Language Processing. What s this story about?

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

Terminology Extraction from Log Files

How to make Ontologies self-building from Wiki-Texts

Chapter 7. Language models. Statistical Machine Translation

Word Completion and Prediction in Hebrew

Spam Filtering with Naive Bayesian Classification

Big Data, Machine Learning, Causal Models

Question Answering and Multilingual CLEF 2008

SENTIMENT ANALYSIS: A STUDY ON PRODUCT FEATURES

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

The Prolog Interface to the Unstructured Information Management Architecture

Improving Word-Based Predictive Text Entry with Transformation-Based Learning

How To Rank Term And Collocation In A Newspaper

A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow

The Seven Practice Areas of Text Analytics

Activities. but I will require that groups present research papers

Why language is hard. And what Linguistics has to say about it. Natalia Silveira Participation code: eagles

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

An Adaptive Approach to Named Entity Extraction for Meeting Applications

Big Data Analytics and Optimization

Effective Self-Training for Parsing

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE

CS 533: Natural Language. Word Prediction

Statistics Graduate Courses

Learning is a very general term denoting the way in which agents:

Learning Morphological Disambiguation Rules for Turkish

Using News Articles to Predict Stock Price Movements

Model-Based Cluster Analysis for Web Users Sessions

Facilitating Business Process Discovery using Analysis

Machine Learning and Statistics: What s the Connection?

Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages

Atigeo at TREC 2012 Medical Records Track: ICD-9 Code Description Injection to Enhance Electronic Medical Record Search Accuracy

BANNER: AN EXECUTABLE SURVEY OF ADVANCES IN BIOMEDICAL NAMED ENTITY RECOGNITION

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing

Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems

31 Case Studies: Java Natural Language Tools Available on the Web

A Comparison of Word- and Term-based Methods for Automatic Web Site Summarization

Using Knowledge Extraction and Maintenance Techniques To Enhance Analytical Performance

The PALAVRAS parser and its Linguateca applications - a mutually productive relationship

Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

SOCIS: Scene of Crime Information System - IGR Review Report

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

CENG 734 Advanced Topics in Bioinformatics

TOOL OF THE INTELLIGENCE ECONOMIC: RECOGNITION FUNCTION OF REVIEWS CRITICS. Extraction and linguistic analysis of sentiments

Bayesian networks - Time-series models - Apache Spark & Scala

A De-identifier For Electronic Medical Records Based On A Heterogeneous Feature Set. Arya Tafvizi

Projektgruppe. Categorization of text documents via classification

Building a Question Classifier for a TREC-Style Question Answering System

Term extraction for user profiling: evaluation by the user

Probabilistic Latent Semantic Analysis (plsa)

Learning Translations of Named-Entity Phrases from Parallel Corpora

Transition-Based Dependency Parsing with Long Distance Collocations

Research Portfolio. Beáta B. Megyesi January 8, 2007

Learning outcomes. Knowledge and understanding. Competence and skills

Introduction to Information Extraction Technology

Chapter ML:XI (continued)

Language Modeling. Chapter Introduction

Social Media Text and Predictive Analytics

Natural Language Processing: A Model to Predict a Sequence of Words

From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files

Transcription:

Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov

Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or linguistic usage patterns in an attempt to extract probably useful (although only probably correct) information Terms disambiguation Text mining Data mining Natural language processing 3/16/2014 Text Mining 2

Outline Introduction Mining plain text Extracting information for human consumption Assessing document similarity Extracting structured information Techniques Collocations Statistical inference Word sense disambiguation Part-of-speech tagging Tools 3/16/2014 Text Mining 3

Reference Cristopher D. Manning, Hinrich Schutze. Foundations of statistical natural language processing. The MIT Press, 1999 3/16/2014 Text Mining 4

Miming Plain Text Extracting information for human consumption Text summarization Document retrieval Information retrieval Assessing document similarity Text categorization Document clustering Language identification Ascribing authorship Identifying key-phrases Extracting structured information Entity extraction Information extraction 3/16/2014 Text Mining 5

Entity Extraction Named entities Names of people, places, organizations, products E-mail addresses, URLs Dates, numbers, sums of money Acronyms and their definition Multiword terms Dictionary-based approach Capitalization and punctuation pattern Regular expression Explicit grammars Heuristics Machine learning 3/16/2014 Text Mining 6

Information Extraction Events with attributes Entity extraction Relationship extraction Co-reference ambiguity Syntactic parsing of the text Small finite-state grammars Machine learning 3/16/2014 Text Mining 7

Looking at Text Low-level formatting issues Junk formatting/content Uppercase and lowercase Tokenization: what is a word? Graphic word Whitespace Problems Periods Single apostrophes Hyphenation Homographs Morphology Stemming Lemmatization Sentences 3/16/2014 Text Mining 8

Collocations Most frequently occurring bi-grams Part-of-speech filter Word collocation window Based on mean and variance of the offsets Filter out flat peaks Hypothesis testing P w 1 w 2 = P w 1 P w 2 t test Pearson s chi-squared test Likelihood ratios Mutual information 3/16/2014 Text Mining 9

Statistical Inference: n-gram models P(w n w 1, w n-1 ) Markov assumption: Only the prior local context affects the word Statistical estimators P(w n w 1, w n-1 )= P(w 1, w n )/P(w 1, w n-1 ) Combining estimators Simple linear interpolation Katz back-off Maximum likelihood estimate P w 1,, w n = C w 1, w n N P w 1,, w n = C w 1, w n C w 1, w n 1 Laplace law P w 1,, w n = C w 1, w n +1 N+B Lidstone law P w 1,, w n = C w 1, w n +λ N+Bλ Held out estimation T r = w1,,w n :C 1 w 1,,wn =r C 2 w 1,, w n General linear interpolation Witten-Bell smoothing P w 1,, w n = T r TN r, where C w 1, w n P w 1,, w n = T r 01 Deleted estimation NN r 0 P w 1,, w n = T r 10 +T r 01 N N r 0 +N r 0 = r 3/16/2014 Text Mining 10

Word Sense Disambiguation Supervised disambiguation Bayesian classification Information-theoretic approach Unsupervised disambiguation EM algorithm for learning a word sense clustering Constraint-based Resource-based Dictionary-based Thesaurus-based 3/16/2014 Text Mining 11

Word Sense Disambiguation Bayesian classification use information from words in the context window to help in the disambiguation decision Bayes decision rule P s k c = P c s k P c P s k Naïve Bayes assumption P c s k = v j in c Maximum-likelihood estimation P v j s k = C v j, s k P s k = C s k C w t C v t, s k P v j s k Information-theoretic approach find a single contextual feature that reliably indicates which sense of the ambiguous word is being used Mutual information Flip-flop algorithm 3/16/2014 Text Mining 12

Word Sense Disambiguation Disambiguation based on sense definitions Thesaurus-based disambiguation Walker approach Yarowsky approach 3/16/2014 Text Mining 13

Word Sense Disambiguation One sense per discourse: The sense of a target word is highly consistent within any given document. One sense per collocation: Nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order, and syntactic relationship. 3/16/2014 Text Mining 14

Word Sense Disambiguation EM algorithm for learning a word sense clustering Parameters of the model Log-likelihood of the corpus C Initialize parameters of the model randomly E-step M-step Naïve Bayes assumption 3/16/2014 Text Mining 15

Part-of-Speech Tagging Markov model taggers Probabilistic model Viterbi algorithm Transformation-based learning of tags 3/16/2014 Text Mining 16

Part-of-Speech Tagging Markov model taggers Probabilistic model Limited horizon P X i+1 = t j X 1,, X i = P(X i+1 = t j X i ) Time invariant P X i+1 = t j X i = P(X 2 = t j X 2 ) Optimal tags for a sentence Bayes rule arg max P t 1,n w 1,n = arg max P w 1,n t 1,n P t 1,n Words are independent on each other Word only depends on its tag arg max P w 1,n t 1,n P t 1,n Probability estimations = arg max P t k t j = C(tj, t k ) C(t j ) P w l t j = C(wl : t j ) C(t j ) n i=1 P w i t i P(t i t i 1 ) Viterbi algorithm Functions δ i j probability of being in state j (tag j) at word i φ i+1 j the most likely state (tag) at word i given that we are in state j at word i+1 Initialization Induction δ i+1 t j = max[δ i t k P t j t k P(w i+1 t j )] φ i+1 t j = arg max[δ i t k P t j t k P(w i+1 t j )] Termination and path-readout X n+1 = arg max δ n+1 (t j ) X i+1 = φ i+1 (X i+1 ) 3/16/2014 Text Mining 17

Part-of-Speech Tagging Transformation-based learning of tags Replace tag t 1 with t 2 Extract rules while tagging error decreases 3/16/2014 Text Mining 18

Tools OpenNLP Sentence detector Tokenizer Name finder Document categorizer Part of speech tagger Chunker Parser Co-reference resolution GATE Tokenizer Gazetteer Sentence splitter Part of speech tagger Named entities transducer Co-reference tagger ctakes Sentence boundary detector Rule-based and context dependent tokenizer Normalizer Part-of-speech tagger Phrasal chunker Dictionary lookup annotator Context annotator Negation detector Dependency parser UIMA Component interfaces in an analytics pipeline Set of design patterns Data representations in-memory representation of annotations for highperformance analytics XML representation of annotations for integration with remote web services Development roles allowing tools to be used by users with diverse skills 3/16/2014 Text Mining 19