Introduction to Natural Language Processing. Hongning Wang

Similar documents
Data Mining Boot Camp 2. Text Mining

CS 6740 / INFO Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage

Natural Language to Relational Query by Using Parsing Compiler

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Timeline (1) Text Mining Master TKI. Timeline (2) Timeline (3) Overview. What is Text Mining?

Semantic analysis of text and speech

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Special Topics in Computer Science

Introduction. Philipp Koehn. 28 January 2016

Natural Language Database Interface for the Community Based Monitoring System *

CS4025: Pragmatics. Resolving referring Expressions Interpreting intention in dialogue Conversational Implicature

Overview of the TACITUS Project

Learning Translation Rules from Bilingual English Filipino Corpus

Comprendium Translator System Overview

English Grammar Checker

Automatic Knowledge Base Construction Systems. Dr. Daisy Zhe Wang CISE Department University of Florida September 3th 2014

Natural Language Processing. What s this story about?

Word Completion and Prediction in Hebrew

Introduction to formal semantics -

Introduction. BM1 Advanced Natural Language Processing. Alexander Koller. 17 October 2014

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

Ling 201 Syntax 1. Jirka Hana April 10, 2006

An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery

Overview of MT techniques. Malek Boualem (FT)

NATURAL LANGUAGE QUERY PROCESSING USING SEMANTIC GRAMMAR

How To Complete The Danish Masters Program In Lct

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

Machine Learning for natural language processing

Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR

Processing: current projects and research at the IXA Group

Marie Dupuch, Frédérique Segond, André Bittar, Luca Dini, Lina Soualmia, Stefan Darmoni, Quentin Gicquel, Marie-Hélène Metzger

Empirical Machine Translation and its Evaluation

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

CIKM 2015 Melbourne Australia Oct. 22, 2015 Building a Better Connected World with Data Mining and Artificial Intelligence Technologies

Build Vs. Buy For Text Mining

NATURAL LANGUAGE TO SQL CONVERSION SYSTEM

31 Case Studies: Java Natural Language Tools Available on the Web

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

POS Tagsets and POS Tagging. Definition. Tokenization. Tagset Design. Automatic POS Tagging Bigram tagging. Maximum Likelihood Estimation 1 / 23

Building a Question Classifier for a TREC-Style Question Answering System

Customizing an English-Korean Machine Translation System for Patent Translation *

Outline of today s lecture

Information extraction from texts. Technical and business challenges

European Masters Program in Language and Communication Technologies (LCT) Module Handbook for Prospective Students

AN ARCHITECTURE OF AN INTELLIGENT TUTORING SYSTEM TO SUPPORT DISTANCE LEARNING

Semantic Analysis of Natural Language Queries Using Domain Ontology for Information Access from Database

Morphology. Morphology is the study of word formation, of the structure of words. 1. some words can be divided into parts which still have meaning

A chart generator for the Dutch Alpino grammar

Semantic annotation of requirements for automatic UML class diagram generation

Using NLP and Ontologies for Notary Document Management Systems

Tagging with Hidden Markov Models

Specialty Answering Service. All rights reserved.

SOCIS: Scene of Crime Information System - IGR Review Report

Activities. but I will require that groups present research papers

Survey Results: Requirements and Use Cases for Linguistic Linked Data

LESSON THIRTEEN STRUCTURAL AMBIGUITY. Structural ambiguity is also referred to as syntactic ambiguity or grammatical ambiguity.

Chunk Parsing. Steven Bird Ewan Klein Edward Loper. University of Melbourne, AUSTRALIA. University of Edinburgh, UK. University of Pennsylvania, USA

Multi language e Discovery Three Critical Steps for Litigating in a Global Economy

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Parsing Software Requirements with an Ontology-based Semantic Role Labeler

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

INF5820 Natural Language Processing - NLP. H2009 Jan Tore Lønning jtl@ifi.uio.no

Basic Parsing Algorithms Chart Parsing

A Natural Language Database Interface For SQL-Tutor

Presented to The Federal Big Data Working Group Meetup On 07 June 2014 By Chuck Rehberg, CTO Semantic Insights a Division of Trigent Software

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

Statistical Machine Translation

ARABIC PERSON NAMES RECOGNITION BY USING A RULE BASED APPROACH

Paraphrasing controlled English texts

A Knowledge-based System for Translating FOL Formulas into NL Sentences

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Domain Knowledge Extracting in a Chinese Natural Language Interface to Databases: NChiql

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)

Skills for Effective Business Communication: Efficiency, Collaboration, and Success

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Taxonomies in Practice Welcome to the second decade of online taxonomy construction

Language Arts Literacy Areas of Focus: Grade 6

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

Role of Text Mining in Business Intelligence

Any Town Public Schools Specific School Address, City State ZIP

A Method for Automatic De-identification of Medical Records

A Chart Parsing implementation in Answer Set Programming

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization

A Mixed Trigrams Approach for Context Sensitive Spell Checking

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

Information extraction from online XML-encoded documents

Automated Extraction of Security Policies from Natural-Language Software Documents

Natural Language Query Processing for Relational Database using EFFCN Algorithm

ISA OR NOT ISA: THE INTERLINGUAL DILEMMA FOR MACHINE TRANSLATION

IAI : Knowledge Representation

Cross-Lingual Concern Analysis from Multilingual Weblog Articles

Language Modeling. Chapter Introduction

Hybrid Strategies. for better products and shorter time-to-market

Transcription:

Introduction to Natural Language Processing Hongning Wang CS@UVa

What is NLP? كلب هو مطاردة صبي في الملعب. Arabic text How can a computer make sense out of this string? Morphology Syntax Semantics Pragmatics Discourse Inference - What are the basic units of meaning (words)? - What is the meaning of each word? - How are words related with each other? - What is the combined meaning of words? - What is the meta-meaning? (speech act) - Handling a large chunk of text - Making sense of everything CS@UVa CS6501: Text Mining 2

An example of NLP Semantic analysis Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1). + A dog is chasing a boy on the playground. Det Noun Aux Verb Det Noun Prep Det Noun Noun Phrase Complex Verb Noun Phrase Scared(x) if Chasing(_,x,_). Scared(b1) Inference Verb Phrase Sentence Verb Phrase Noun Phrase Prep Phrase Lexical analysis (part-ofspeech tagging) Syntactic analysis (Parsing) A person saying this may be reminding another person to get the dog back Pragmatic analysis (speech act) CS@UVa CS6501: Text Mining 3

If we can do this for all the sentences in BAD NEWS: Automatically answer our emails Translate languages accurately Help us manage, summarize, and aggregate information Use speech as a UI (when needed) Talk to us / listen to us all languages, then Unfortunately, we cannot right now. General NLP = Complete AI CS@UVa CS6501: Text Mining 4

NLP is difficult!!!!!!! Natural language is designed to make human communication efficient. Therefore, We omit a lot of common sense knowledge, which we assume the hearer/reader possesses We keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve This makes EVERY step in NLP hard Ambiguity is a killer! Common sense reasoning is pre-required CS@UVa CS6501: Text Mining 5

An example of ambiguity Get the cat with the gloves. CS@UVa CS6501: Text Mining 6

Examples of challenges Word-level ambiguity design can be a noun or a verb (Ambiguous POS) root has multiple meanings (Ambiguous sense) Syntactic ambiguity natural language processing (Modification) A man saw a boy with a telescope. (PP Attachment) Anaphora resolution John persuaded Bill to buy a TV for himself. (himself = John or Bill?) Presupposition He has quit smoking. implies that he smoked before. CS@UVa CS6501: Text Mining 7

Despite all the challenges, research in NLP has also made a lot of progress CS@UVa CS6501: Text Mining 8

A brief history of NLP Early enthusiasm (1950 s): Machine Translation Too ambitious Bar-Hillel report (1960) concluded that fully-automatic high-quality translation could not be accomplished without knowledge (Dictionary + Encyclopedia) Less ambitious applications (late 1960 s & early 1970 s): Limited success, failed to scale up Deep understanding in Speech recognition Dialogue (Eliza) Shallow understanding limited domain Inference and domain knowledge (SHRDLU= block world ) Real world evaluation (late 1970 s now) Story understanding (late 1970 s & early 1980 s) Knowledge representation Large scale evaluation of speech recognition, text retrieval, information extraction (1980 now) Robust component techniques Statistical approaches enjoy more success (first in speech recognition & Statistical language models retrieval, later others) Current trend: Boundary between statistical and symbolic approaches is disappearing. We need to use all the available knowledge Applications Application-driven NLP research (bioinformatics, Web, Question answering ) CS@UVa CS6501: Text Mining 9

The state of the art A dog is chasing a boy on the playground Det Noun Aux Verb Det Noun Prep Det Noun Noun Phrase Noun Phrase Complex Verb Noun Phrase POS Tagging: 97% Semantics: some aspects - Entity/relation extraction - Word sense disambiguation - Anaphora resolution Verb Phrase Sentence Verb Phrase Prep Phrase Parsing: partial >90% Inference:??? Speech act analysis:??? CS@UVa CS6501: Text Mining 10

Machine translation CS@UVa CS6501: Text Mining 11

Dialog systems Apple s siri system Google search CS@UVa CS6501: Text Mining 12

Information extraction Google Knowledge Graph Wiki Info Box CS@UVa CS6501: Text Mining 13

Information extraction CMU Never-Ending Language Learning YAGO Knowledge Base CS@UVa CS6501: Text Mining 14

Building a computer that understands text: The NLP pipeline CS@UVa CS6501: Text Mining 15

Tokenization/Segmentation Split text into words and sentences Task: what is the most likely segmentation /tokenization? There was an earthquake near D.C. I ve even felt it in Philadelphia, New York, etc. There + was + an + earthquake + near + D.C. I + ve + even + felt + it + in + Philadelphia, + New + York, + etc. CS@UVa CS6501: Text Mining 16

Part-of-Speech tagging Marking up a word in a text (corpus) as corresponding to a particular part of speech Task: what is the most likely tag sequence A + dog + is + chasing + a + boy + on + the + playground A + dog + is + chasing + a + boy + on + the + playground Det Noun Aux Verb Det Noun Prep Det Noun CS@UVa CS6501: Text Mining 17

Named entity recognition Determine text mapping to proper names Task: what is the most likely mapping Its initial Board of Visitors included U.S. Presidents Thomas Jefferson, James Madison, and James Monroe. Its initial Board of Visitors included U.S. Presidents Thomas Jefferson, James Madison, and James Monroe. Organization, Location, Person CS@UVa CS6501: Text Mining 18

Recap: perplexity The inverse of the likelihood of the test set as assigned by the language model, normalized by the number of words PP w 1,, w N = N 1 N p(w i w i 1,, w i n+1 ) i=1 N-gram language model CS@UVa CS 6501: Text Mining 19

Recap: an example of NLP Semantic analysis Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1). + A dog is chasing a boy on the playground. Det Noun Aux Verb Det Noun Prep Det Noun Noun Phrase Complex Verb Noun Phrase Scared(x) if Chasing(_,x,_). Scared(b1) Inference Verb Phrase Sentence Verb Phrase Noun Phrase Prep Phrase Lexical analysis (part-ofspeech tagging) Syntactic analysis (Parsing) A person saying this may be reminding another person to get the dog back Pragmatic analysis (speech act) CS@UVa CS6501: Text Mining 20

Recap: NLP is difficult!!!!!!! Natural language is designed to make human communication efficient. Therefore, We omit a lot of common sense knowledge, which we assume the hearer/reader possesses We keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve This makes EVERY step in NLP hard Ambiguity is a killer! Common sense reasoning is pre-required CS@UVa CS6501: Text Mining 21

Syntactic parsing Grammatical analysis of a given sentence, conforming to the rules of a formal grammar Task: what is the most likely grammatical structure A + dog + is + chasing + a + boy + on + the + playground Det Noun Aux Verb Det Noun Prep Det Noun Noun Phrase Complex Verb Noun Phrase Verb Phrase Noun Phrase Prep Phrase Verb Phrase Sentence CS@UVa CS6501: Text Mining 22

Relation extraction Identify the relationships among named entities Shallow semantic analysis Its initial Board of Visitors included U.S. Presidents Thomas Jefferson, James Madison, and James Monroe. 1. Thomas Jefferson Is_Member_Of Board of Visitors 2. Thomas Jefferson Is_President_Of U.S. CS@UVa CS6501: Text Mining 23

Logic inference Convert chunks of text into more formal representations Deep semantic analysis: e.g., first-order logic structures Its initial Board of Visitors included U.S. Presidents Thomas Jefferson, James Madison, and James Monroe. x (Is_Person(x) & Is_President_Of(x, U.S. ) & Is_Member_Of(x, Board of Visitors )) CS@UVa CS6501: Text Mining 24

Towards understanding of text Who is Carl Lewis? Did Carl Lewis break any records? CS@UVa CS6501: Text Mining 25

Major NLP applications Speech recognition: e.g., auto telephone call routing Text mining Text clustering Text classification Text summarization Our focus Topic modeling Question answering Language tutoring Spelling/grammar correction Machine translation Cross-language retrieval Restricted natural language Natural language user interface CS@UVa CS6501: Text Mining 26

NLP & text mining Better NLP => Better text mining Bad NLP => Bad text mining? Robust, shallow NLP tends to be more useful than deep, but fragile NLP. Errors in NLP can hurt text mining performance CS@UVa CS6501: Text Mining 27

How much NLP is really needed? Tasks Dependency on NLP Scalability Classification Clustering Summarization Extraction Topic modeling Translation Dialogue Question Answering Inference Speech Act CS@UVa CS6501: Text Mining 28

So, what NLP techniques are the most useful for text mining? Statistical NLP in general. The need for high robustness and efficiency implies the dominant use of simple models CS@UVa CS6501: Text Mining 29

What you should know Different levels of NLP Challenges in NLP NLP pipeline CS@UVa CS6501: Text Mining 30