Statistical Natural Language Processing: an introduction


Statistical Natural Language Processing: an introduction and contents of the 2016 course

Modeling language
- Language is a complex, adaptive system
- Storing and processing text and speech
- Large datasets
- We want to make systems that "understand"
- Take into account language-related phenomena
- Building models of natural language using large data sets

Statistical Natural Language Processing
- Methodological basis: machine learning, pattern recognition, probability theory, statistics and signal processing
- Related fields: computational linguistics, corpus linguistics, phonetics, speech processing, discourse analysis, cognitive science, artificial intelligence

What's in a language?
- Phonetics and phonology: the physical sounds and the patterns of sounds
- Morphology: the different building blocks of words
- Syntax: the grammatical structure
- Semantics: the meaning of words
- Pragmatics, discourse, spoken interaction...

Application areas
- Information retrieval
- Text clustering and classification
- Automatic speech recognition
- Statistical machine translation
- Natural language interfaces
- Word sense disambiguation
- Syntactic parsing...

Information retrieval

PageRank algorithm
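
The slide only names the algorithm, so the following is a minimal sketch rather than the lecture's own material. It assumes the common power-iteration formulation with a damping factor of 0.85 (a widely quoted default, not a value from the slides), and the four-page graph `toy_web` is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy graph: page "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
```

In this toy graph `best` is the page with the most incoming links, and the page nobody links to keeps only the baseline share.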

Text clustering

Speech recognition

Natural language interfaces

Machine translation

Machine translation: large probabilistic models

Discussion
Discuss for 10 minutes in groups of three or four:
- What kind of natural language processing applications would be useful in your daily life?
- Are there applications you already use?
- How do they work? What does not work?

Complexity of languages
- A large proportion of modern human activity in its different forms is based on the use of language
- Large variation: morphology and syntactic structures
- Complexity of natural language(s)
- More than 6000 languages, many more dialects
- Each language has a large number of different word forms
- Each word is understood differently by each speaker of a language, at least to some degree

Languages on the internet
www.internetworldstats.com

EU languages

Challenges of segmentation
Modeling morphology: segmenting words
- istua "to sit"
- istuutua "to sit down"
- istun "I sit"
- istahdan "I sit down for a while"
- istahtaisin "I would sit down for a while"
- istahtaisinko? "should I sit down for a while?"
- istahtaisinkohan? "I wonder if I should sit down for a while?"
Where are the word boundaries?
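
To make the segmentation problem concrete, here is a rough sketch that applies two purely character-based heuristics to the word forms on the slide. It is not a morphological analyzer and not a method from the course: the function name `extensions` is hypothetical, and the forms are lower-cased with punctuation dropped.

```python
import os

forms = ["istua", "istuutua", "istun", "istahdan",
         "istahtaisin", "istahtaisinko", "istahtaisinkohan"]

# Longest prefix shared by every form (character-wise, not linguistically informed).
shared = os.path.commonprefix(forms)

def extensions(words):
    """Return (shorter_word, added_ending) for every pair where one word
    is another word plus extra characters at the end."""
    found = []
    for shorter in words:
        for longer in words:
            if longer != shorter and longer.startswith(shorter):
                found.append((shorter, longer[len(shorter):]))
    return found

candidates = extensions(forms)
```

The shared prefix comes out as "ist", which is shorter than any stem a linguist would propose, while the ending comparison only finds "ko", "kohan" and "han" at the tail of the longest forms. Both heuristics miss the internal structure of words such as istahtaisin, which is why statistical segmentation models are needed.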

Challenge of modeling syntax

Challenge: semantics
- How to model the meaning of words?
- Semantic similarity: vector space models
- Understanding the meaning of words?
- Subjectivity: people learn language through individual life paths and thus end up with different ways of understanding and producing language
-> How is successful communication possible?
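
As an illustration of the vector space idea, the sketch below represents each word by counts of the words that occur near it and compares words with cosine similarity. This is one simple variant chosen for illustration; the window size, the function names and the four-sentence `toy_corpus` are assumptions, not material from the lecture.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus, far too small for real use.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "stocks fell on the market",
]
vecs = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_market = cosine(vecs["cat"], vecs["market"])
```

Here "cat" ends up closer to "dog" than to "market" because the first two share more context words. Even so, the very frequent word "the" inflates both similarities, which hints at why practical systems weight or filter the raw counts.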

Challenge of ambiguity
- break, cut, run, play, make, light, set, hold, clear, give, draw, take, fall, pass, head, etc. (http://muse.dillfrog.com/ambiguous_words.php)
- "haku" N ELA PL (of search); "hauki" N ELA PL (of pike); "hauis" N PTV SG (part of biceps) (www.lingsoft.fi)
- Big children and adults saw a man with a telescope

Example: color naming

Complex concepts, e.g. the concept of computation

Different cultural contexts

Challenge of encoding world knowledge
- For good performance, world knowledge is needed
- Quantitatively this is challenging
- Qualitatively there are also many problems (the mapping between language and the world is complex, cf. the examples above)
- Note: the world is essentially dynamic, continuous and multimodal; symbolic systems are not

Corpus-based methods
- Corpora are large collections of text
  - Annotated: add knowledge about words or structure into the corpus
  - Or just plain text
- Statistical information on
  - Distribution of words and parts of words
  - Structure
  - Word similarity
- Allow us to build models and test hypotheses
- Allow us to explore
- Choose the best models based on statistics
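
The most basic statistic mentioned above, the distribution of words, can be sketched in a few lines. Whitespace tokenization and the one-sentence `toy_text` are simplifying assumptions; a real corpus would need proper tokenization and would be far larger.

```python
from collections import Counter

def word_distribution(text):
    """Return (relative frequencies, raw counts) for whitespace-separated tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = {word: count / total for word, count in counts.items()}
    return probs, counts

# Hypothetical stand-in for a plain-text corpus.
toy_text = "the cat sat on the mat and the dog sat on the rug"
probs, counts = word_distribution(toy_text)
most_frequent = counts.most_common(1)[0]
```

Even in this tiny sample a single function word accounts for a large share of the tokens while most words occur only once, a pattern that becomes much more pronounced in real corpora.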

Read more
- Manning & Schütze: Foundations of Statistical Natural Language Processing
- Chapter 1:
- Chapter 2: Probability and Information Theory basics
- Exercises will be on these topics this week
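
As a small taste of the information theory basics referred to above, the following sketch estimates the Shannon entropy, in bits, of a token sequence from its relative frequencies. The function name and the two toy sequences are hypothetical and are not taken from the textbook or the course exercises.

```python
import math
from collections import Counter

def entropy_bits(tokens):
    """Shannon entropy H = -sum p(x) * log2 p(x), with p estimated from counts."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

uniform = "a b c d".split()   # four equally likely symbols
skewed = "a a a b".split()    # one symbol dominates

h_uniform = entropy_bits(uniform)
h_skewed = entropy_bits(skewed)
```

The uniform sequence reaches the maximum of 2 bits for four symbols, while the skewed one has lower entropy because its next symbol is easier to predict.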

T-61.5020 Statistical natural language processing
- 5 cr, graded 1-5 based on exam + project work
- 10 lectures: Wed 12:15-14:00 in T3, Jan 20 - Mar 30
- 10 exercise sessions: Thu 14:15-16:00 in T5
- Textbook: C. Manning, H. Schütze, 1999. Foundations of Statistical Natural Language Processing. MIT Press. http://nlp.stanford.edu/fsnlp/
- Home page: https://mycourses.aalto.fi/course/view.php?id=8832

Course personnel
- Responsible professor & lecturer:
- Assistant & exercises: Stig-Arne Grönroos
- Project work: Krista Lagus
- Expert lecturers: Oskar Kohonen, Kalle Palomäki, Mari-Sanna Paukkeri, Matti Varjokallio, Teemu Ruokolainen, Sami Virpioja, Juho Rousu

Goals
- To learn how statistical and adaptive methods are used in information retrieval, machine translation, text mining, speech processing and related areas to process natural language data
- To learn how to apply the basic methods and techniques for clustering, classification, hidden Markov models and Bayesian models to model natural language

Lectures in the course
20 Jan 2016   1: /
27 Jan 2016   2: Sentence level processing / Oskar Kohonen
03 Feb 2016   3: Speech recognition / Kalle Palomäki
10 Feb 2016   4: Term project and Sentiment analysis / Krista Lagus
24 Feb 2016   5: Vector spaces & Information retrieval / Mari-Sanna Paukkeri
02 Mar 2016   6: Statistical language models /
09 Mar 2016   7: Morpheme-level processing / Matti Varjokallio
16 Mar 2016   8: Tagging / Teemu Ruokolainen
23 Mar 2016   9: Statistical machine translation / Sami Virpioja
30 Mar 2016  10: Text classification using kernel methods / Juho Rousu

How to pass the course?
- Participate actively in each lecture, read the corresponding material and ask questions to learn the basics
- Participate actively in the exercise session after each lecture to learn how to solve the problems in practice
- Participate actively in project work to learn to apply your knowledge
- Participate in the examination to show how well you have learned the topics of the course

Questions?
- Responsible professor & lecturer:
- Assistant & exercises: Stig-Arne Grönroos
- Project work: Krista Lagus
- Emails: firstname.lastname@aalto.fi
- Home page: https://mycourses.aalto.fi/course/view.php?id=8832