Definition of CL 1. Introduction to Advanced Natural Language Processing (NLP) Definition of CL 2. Definition of CL 3. Definition of CL 4

Similar documents
Activities. but I will require that groups present research papers

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Identifying Focus, Techniques and Domain of Scientific Papers

Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde

Ling 201 Syntax 1. Jirka Hana April 10, 2006

SYNTAX: THE ANALYSIS OF SENTENCE STRUCTURE

NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR

Syntax: Phrases. 1. The phrase

Clustering Connectionist and Statistical Language Processing

LESSON THIRTEEN STRUCTURAL AMBIGUITY. Structural ambiguity is also referred to as syntactic ambiguity or grammatical ambiguity.

Language and Computation

Simple maths for keywords

CS 6740 / INFO Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage

Natural Language Database Interface for the Community Based Monitoring System *

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper

Outline of today s lecture

Special Topics in Computer Science

Learning Translation Rules from Bilingual English Filipino Corpus

CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Building a Question Classifier for a TREC-Style Question Answering System

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

31 Case Studies: Java Natural Language Tools Available on the Web

Teaching Vocabulary to Young Learners (Linse, 2005, pp )

Comprendium Translator System Overview

Study Plan for Master of Arts in Applied Linguistics

Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013

ISA OR NOT ISA: THE INTERLINGUAL DILEMMA FOR MACHINE TRANSLATION

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

A Mixed Trigrams Approach for Context Sensitive Spell Checking

A chart generator for the Dutch Alpino grammar

Natural Language to Relational Query by Using Parsing Compiler

Context Grammar and POS Tagging

Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

D2.4: Two trained semantic decoders for the Appointment Scheduling task

Overview of MT techniques. Malek Boualem (FT)

Role of NLP in Production and Comprehension to Communicate and Understand Human Language

Introduction. BM1 Advanced Natural Language Processing. Alexander Koller. 17 October 2014

NATURAL LANGUAGE QUERY PROCESSING USING SEMANTIC GRAMMAR

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Parts of Speech. Skills Team, University of Hull

Statistical Machine Translation

Hybrid Strategies. for better products and shorter time-to-market

What s in a Lexicon. The Lexicon. Lexicon vs. Dictionary. What kind of Information should a Lexicon contain?

Master of Arts in Linguistics Syllabus

Sentence Structure/Sentence Types HANDOUT

A. Schedule: Reading, problem set #2, midterm. B. Problem set #1: Aim to have this for you by Thursday (but it could be Tuesday)

How To Rank Term And Collocation In A Newspaper

Chunk Parsing. Steven Bird Ewan Klein Edward Loper. University of Melbourne, AUSTRALIA. University of Edinburgh, UK. University of Pennsylvania, USA

Differences in linguistic and discourse features of narrative writing performance. Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu 3

A Swedish Grammar for Word Prediction

TOP 10 RESOURCES FOR TEACHERS OF ENGLISH LANGUAGE LEARNERS. Melissa McGavock Director of Bilingual Education

CS 533: Natural Language. Word Prediction

Build Vs. Buy For Text Mining

The Oxford Learner s Dictionary of Academic English

209 THE STRUCTURE AND USE OF ENGLISH.

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

Empirical Machine Translation and its Evaluation

Semantic annotation of requirements for automatic UML class diagram generation

Information extraction from online XML-encoded documents

An Overview of Applied Linguistics

Terminology Extraction from Log Files

Paraphrasing controlled English texts

Annotation Guidelines for Dutch-English Word Alignment

English Grammar Checker

Morphology. Morphology is the study of word formation, of the structure of words. 1. some words can be divided into parts which still have meaning

I m sorry, Dave, I m afraid I can t do that :

Semantic analysis of text and speech

Sentence Blocks. Sentence Focus Activity. Contents

Syntactic Theory. Background and Transformational Grammar. Dr. Dan Flickinger & PD Dr. Valia Kordoni

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

Problem of the Month: Fair Games

Parsing Software Requirements with an Ontology-based Semantic Role Labeler

Artificial Intelligence Exam DT2001 / DT2006 Ordinarie tentamen

Tagging with Hidden Markov Models

Brill s rule-based PoS tagger

Machine Learning for natural language processing

Word Completion and Prediction in Hebrew

Beyond single words: the most frequent collocations in spoken English

An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System

Frames and Commonsense. Winston, Chapter 10

Lecture 1: OT An Introduction

USABILITY OF A FILIPINO LANGUAGE TOOLS WEBSITE

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Summarizing and Paraphrasing

Lecture 9. Phrases: Subject/Predicate. English 3318: Studies in English Grammar. Dr. Svetlana Nuernberg

Why language is hard. And what Linguistics has to say about it. Natalia Silveira Participation code: eagles

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE

NATURAL LANGUAGE TO SQL CONVERSION SYSTEM

A Method for Automatic De-identification of Medical Records

What Is Linguistics? December 1992 Center for Applied Linguistics

A System for Labeling Self-Repairs in Speech 1

Presented to The Federal Big Data Working Group Meetup On 07 June 2014 By Chuck Rehberg, CTO Semantic Insights a Division of Trigent Software

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS

Comparing Ontology-based and Corpusbased Domain Annotations in WordNet.

Overview of the TACITUS Project

Language Modeling. Chapter Introduction

Get Ready for IELTS Writing. About Get Ready for IELTS Writing. Part 1: Language development. Part 2: Skills development. Part 3: Exam practice

PP-Attachment. Chunk/Shallow Parsing. Chunk Parsing. PP-Attachment. Recall the PP-Attachment Problem (demonstrated with XLE):

Transcription:

Definition of CL 1 Introduction to Advanced Natural Language Processing (NLP) L645 Advanced NLP Fall 2009 Adapted from Manning and Schütze (1999) and Abney (1996) Computational linguistics is the scientific study of language from a computational perspective. Computational linguists are interested in providing computational models of various kinds of linguistic phenomena. These models may be knowledge-based ( handcrafted ) or data- driven ( statistical or empirical ). Work in computational linguistics is in some cases motivated from a scientific perspective in that one is trying to provide a computational explanation for a particular linguistic or psycholinguistic phenomenon; and in other cases the motivation may be more purely technological in that one wants to provide a working component of a speech or natural language system. http://www.aclweb.org/archive/what.html 2 Definition of CL 2 Definition of CL 3 Computational linguistics is the application of linguistic theories and computational techniques to problems of natural language processing. http://www.ba.umist.ac.uk/public/departments/ registrars/academicoffice/uga/lang.htm Computational linguistics is the science of language with particular attention given to the processing complexity constraints dictated by the human cognitive architecture. Like most sciences, computational linguistics also has engineering applications. http://www.cs.tcd.ie/courses/csll/csllcourse.html 3 4 Definition of CL 4 Short History Computational linguistics is the study of computer systems for understanding and generating natural language. Ralph Grishman, Computational Linguistics: An Introduction, Cambridge University Press 1986. 1950s: Machine Translation 1960s: Chomsky 1964: ALPAC Report 1966: ELIZA 1973: SHRDLU 1980s: knowledge-based CL 1990s: statistical / machine learning approaches in CL 5 6

Early 1950s ALPAC Report Machine Translation (MT): one of the earliest applications of computers Major players: US and USSR Russian to English and reverse Georgetown University, Washington system: translated sample texts in 1954 euphoria - a lot of funding, many groups in US, USSR BUT: the system could not scale up assessed research results of groups working on MTs conclusions: MT not possible in near future funding should cease for MT! basic research should be supported word by word translation does not work, linguistic knowledge is needed 7 8 ELIZA SHRDLU the first chatterbot a computer program that mimics human conversation author: Joseph Weizenbaum Massachusetts Institute of Technology simulation of a (Rogerian) therapist user types in some statement or set of statements in natural language ELIZA then analyzes the user s statement and generates some response basic technology: pattern matching USER: ELIZA: You don t argue with me. WHY DO YOU THINK I DON T ARGUE WITH YOU interaction with a robot in a block world. author: Terry Winograd MIT the user can: ask the robot to manipulate the blocks ask it about the blocks configurations ask it about its reasoning update facts understands language in a limited domain by using syntactic parsing and semantic reasoning large scale grammar of English + parser procedural semantics for words and phrases 9 10 Knowledge-Based CL Outline proof of concept & manually-written rules linguistic/logic paradigm extensively pursued later: development of linguistic formalisms (Lexical Functional Grammar, Head- Driven Phrase Structure Grammar, Tree Adjoining Grammar etc.) Limitations: not robust enough few applications not scalable How to approach NLP Motivating Probability The Ambiguity of Language Data-driven NLP Corpora: using corpora for simple analysis Addressing this limitations led to the more recent statistical approaches 11 12

Approaching NLP Competence and Performance There are two main approaches to doing work in NLP: Theory-driven ( rationalist): working from a theoretical framework, come up with a scheme for an NLP task e.g., parse a sentence using a handwritten HPSG grammar Data-driven ( empiricist): working from some data (and some framework), derive a scheme for an NLP task e.g., parse a sentence using a grammar derived from a corpus The difference is often really a matter of degree; this course is more data-driven Competence = knowledge of language found in the mind Performance = the actual use of language in the world Data-driven NLP focuses more on performance because we deal with naturallyoccurring data That is not to say that competence is not an issue: linguistic competence issues can inform processing of performance data Generally, we will not deal with the contexts surrounding performance, e.g., sociolinguistic & pragmatic factors 13 14 Data-driven NLP Is language categorical? Data-driven NLP often learns patterns from naturally-occurring data Instead of writing rules, have computer learn rules / regularities This type of learning is generally probabilistic Notice that linguistics is generally not a probabilistic field Aside from engineering considerations, are we justified in treating langauge propbabilistically? Abney (1996) gives several arguments for this... Are the language phenomena we observe categorical, e.g., are sentences completely grammatical or completely ungrammatical? Are the following sentences equally (un)grammatical? (1) a. John I believe Sally said Bill believed Sue saw. b. What did Sally whisper that she had secretly read? c. Who did Jo think said John saw him? d. The boys read Mary s stories about each other. e. I considered John (as) a good candidate. Probabilistic modeling could give a degree of grammaticality 15 16 Is language non-categorical? Language acquisition, change, and variation Consider the following clear part-of-speech distinction: (2) We will review that decision in the near future. (adjective) (3) He lives near the station. (preposition) But what about the following: (4) We live nearer the water than you thought. A comparative (-er) form, like an adjective Takes an object noun phrase, like a preposition Probabilistic modeling could assign likelihood of category assignment Language acquisition: child uses grammatical constructions with varying frequency trying out rule possibilities with different probabilities Language change: gradual changes a certain proportion of the population is using new constuctions Language variation: Dialect continua and typological generalizations e.g., postpositions in verb-initial languages are more common than prepositions in verb-final languages Ripe for probabilistic modeling 17 18

Cognitive fuzziness Rarity of usage Consider the following (Abney 1996): What does it mean to be tall? For people, it is typically around six feet, although judgments vary. But a tall glass of milk is much less than six feet. The meaning shifts depending on the context. But even then, it is not clear that it has a fixed meaning. Meanings can be associated with certain probabilities (5) The a are of I. (6) John saw Mary. The a are of I is an acceptable noun phrase (): a and I are labels on a map, and are is measure of area John saw Mary is ambiguous between a sentence (S) and an : a type of saw (a John saw) which picks out the Mary we are talking about (cf. Typhoid Mary) We don t get these readings right away because they re rare usages of these words Rarity needs to be defined probabilistically 19 20 Wide-coverage of rules The Ambiguity of Language Grammar rules have a funny way of working sometimes and not others Typically, if a noun is premodified by both an adjective and another noun, the adjective must precede the modifying noun (7) tall shoe rack (8) *shoe tall rack But not always: (9) a Kleene-star (N) transitive (A) closure (10) highland (N) igneous (A) formations If language is categorical and you have a rule which allows N A N, then you have to do something to prevent shoe tall rack. 21 Language is ambiguous in a variety of ways: Word senses: e.g., bank Word categories: e.g., can Semantic scope: e.g., All cats hate a dog. Syntactic structure: e.g., I shot the elephants in my pajamas. and your grammar could be ambiguous in ways you may not even realize Often, however, of all the ambiguous choices, one is the best 22 Syntactic Ambiguity Less intuitive analyses You may not even recognize every sentence as being ambiguous (11) Our company is training workers S S S Our company Aux is Our company V is Adj Our company Aux V training workers is V training workers training workers 23 24

How statistics helps Data-driven NLP To summarize so far, probabilities can help with the following: Disambiguation: weight syntactic rules to come up with the most likely parse Can also come up with structural parsing preferences, e.g., longest match for constituents Degrees of grammaticality: some sentences are better than others, without a hard-and-fast grammatical judgment Naturalness: verbs have certain selectional preferences; collocational knowledge is a part of language (English thick accent vs. German starker Akzent); etc. Learning on the fly: people gradually adapt to new structures and meanings To pick the best parse, we need probabilistic information, but where do we get such information? We can induce this information from language data, potentially data annotated with linguistic information. i.e., linguistic performance, not competence Exercise: Determine the different possible analyses of this naturally-occurring phrase: some white trash version of Shania karaoke Skim the internet and find an ambiguous phrase with only one sensible interpretation. 25 26 Corpora We will becoming familiar with processing large texts, i.e., corpora There are several different ways of looking at a corpus: how balanced it is, how large it is, if it s freely available Corpora are often annotated with lingusitic mark-up, such as part-of-speech labels or syntactic annotation These corpora will serve as our data from which to learn probabilities Corpora are not the only lexical resources out there; dictionaries (e.g., WordNet) are also important, but these are often derived from corpora Using corpora for simple analysis Word counts We can use corpora to give us some basic information about word occurrences Count word types = number of distinct words there are in the corpus Count word tokens = number of actual word occurrences in the corpus; multiple occurrences of the same word type are counted each time If we compare word types and tokens, we see that there are: a few word types which occur a large number of times (often function words) a large number of word types which occur only a few times or only once 27 28 Zipf s Law Using corpora for simple analysis Collocations This idea is formulated in Zipf s Law = the frequency (f) of a word is inversely proportional to its rank (r) (12) a. fr = k, where k is some constant, or f = k r (Zipf) b. f = P (r + ρ) B, where P, ρ, and B are parameters which measure a text s richness (Mandelbrot) Mandelbrot adjusted Zipf s Law to better handle high and low ranking words; with B = 1 and ρ = 0, it is identical to Zipf s Law (where P = k). Important insight: most words are rare! Collocations are phrases (often two-word units) which are more than the sum of their parts often should be their own dictionary entry e.g., disk drive or make up Firth: You shall know a word by the company it keeps These are important for machine translation (MT) and information retrieval (IR), among other applications They can be obtained automatically for corpora by seeing which two-word sequences (bigrams) occur most often, after normalizing for overall word frequency 29 30

Using corpora for simple analysis Keyword in Context (KWIC) Planning a KWIC program We can also use corpora to build a concordance, which shows how a key word is being used in a context. The librarian showed off - gentleman teachers showed off with lip and showed the vacancy Exercise: Without using a particular programming language, how would you write a KWIC program? Assume: you re given a large corpus in text format What are the steps involved to show the words in context? By analyzing the key words in context, we can categorize the contexts and find, e.g., the syntactic frames for a word agent showed off (PP[with/in] manner ) agent showed ( recipient ) ( content ) 31 32 References Abney, Steven (1996). Statistical Methods and Linguistics. In Judith Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches, Cambridge, MA: The MIT Press. Manning, Christopher D. and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press. 33