Statistical Natural Language Processing: an introduction and the contents of the 2016 course
Modeling language
  Language is a complex, adaptive system
  Storing and processing text and speech
  Large datasets
  We want to make systems that "understand"
  Taking language-related phenomena into account
  Building models of natural language using large data sets
Statistical Natural Language Processing
  Methodological basis: machine learning, pattern recognition, probability theory, statistics and signal processing
  Related fields: computational linguistics, corpus linguistics, phonetics, speech processing, discourse analysis, cognitive science, artificial intelligence
What's in a language?
  Phonetics and phonology: the physical sounds and the patterns of sounds
  Morphology: the different building blocks of words
  Syntax: the grammatical structure
  Semantics: the meaning of words
  Pragmatics, discourse, spoken interaction, ...
Application areas
  Information retrieval
  Text clustering and classification
  Automatic speech recognition
  Statistical machine translation
  Natural language interfaces
  Word sense disambiguation
  Syntactic parsing
  ...
Information retrieval
PageRank algorithm
Text clustering
Speech recognition
Natural language interfaces
Machine translation
Machine translation: large probabilistic models
Discussion
  Discuss for 10 minutes in groups of three or four:
  What kind of natural language processing applications would be useful in your daily life?
  Are there applications you already use?
  How do they work?
  What does not work?
Complexity of languages
  A large proportion of modern human activity in its different forms is based on the use of language
  Large variation: morphology and syntactic structures
  Complexity of natural language(s)
  More than 6000 languages, many more dialects
  Each language has a large number of different word forms
  Each word is understood differently by each speaker of a language, at least to some degree
Languages on the internet
  Source: www.internetworldstats.com
EU languages
Challenges of segmentation
  Modeling morphology: segmenting words
  istua "to sit"
  istuutua "to sit down"
  istun "I sit"
  istahdan "I sit down for a while"
  istahtaisin "I would sit down for a while"
  istahtaisinko? "should I sit down for a while?"
  istahtaisinkohan? "I wonder if I should sit down for a while?"
  Where are the word boundaries?
Challenge of modeling syntax
Challenge: semantics
  How to model the meaning of words?
  Semantic similarity: vector space models
  Understanding the meaning of words?
  Subjectivity: people learn language through individual life paths and thus end up with different ways of understanding and producing language.
  -> How is successful communication possible?
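As a rough illustration of the vector space idea mentioned on this slide, the sketch below represents each word by counts of the words that occur near it and compares words by cosine similarity. It is a minimal toy example, not the method used in the course: the sentences, the window size and the function names `context_vector` and `cosine` are assumptions made only for illustration.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Count the words occurring within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, invented for this example.
toy_sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "the car needs fuel",
]

cat = context_vector("cat", toy_sentences)
dog = context_vector("dog", toy_sentences)
car = context_vector("car", toy_sentences)

print("cat vs dog:", cosine(cat, dog))
print("cat vs car:", cosine(cat, car))
```

In this toy corpus "cat" and "dog" share more context words than "cat" and "car", so the first similarity comes out higher. With real corpora the vectors are much larger and are usually reweighted, but the comparison step has the same shape.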
Challenge of ambiguity
  break, cut, run, play, make, light, set, hold, clear, give, draw, take, fall, pass, head, etc. (http://muse.dillfrog.com/ambiguous_words.php)
  Finnish morphological analyses (www.lingsoft.fi):
    "haku" N ELA PL (of search)
    "hauki" N ELA PL (of pike)
    "hauis" N PTV SG (part of biceps)
  Big children and adults saw a man with a telescope
Example: color naming
Complex concepts: e.g. the concept of computation
Different cultural contexts
Challenge of encoding world knowledge
  For good performance, world knowledge is needed
  Quantitatively this is challenging
  Qualitatively there are also many problems (the mapping between language and the world is complex, cf. the examples above)
  Note: the world is essentially dynamic, continuous and multimodal; symbolic systems are not
Corpus-based methods
  Corpora are large collections of text
    Annotated: knowledge about words or structure is added to the corpus
    Or just plain text
  Statistical information on
    Distribution of words and parts of words
    Structure
    Word similarity
  Allow us to build models and test hypotheses
  Allow us to explore
  Choose the best models based on statistics
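The kind of statistical information listed on this slide can be sketched with a few lines of counting. The example below is only a sketch under assumptions: the one-sentence "corpus" is invented, and the helper names `word_frequencies` and `char_ngrams` are hypothetical. It counts word frequencies in plain text and extracts character n-grams as a simple stand-in for "parts of words".

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lowercase the text, split it into word tokens and count them."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


def char_ngrams(word, n=3):
    """Return all overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Hypothetical miniature corpus; a real corpus would hold millions of words.
toy_corpus = "the cat sat on the mat and the dog sat on the rug"

freqs = word_frequencies(toy_corpus)
total = sum(freqs.values())

# Relative frequency of the most common words.
for word, count in freqs.most_common(3):
    print(word, count, round(count / total, 3))

# Character trigrams of one Finnish word form from the segmentation slide.
print(char_ngrams("istun"))
```

Counts like these are the raw material for the models discussed later in the course; on a real corpus the same code shows a few very frequent words and a long tail of rare ones.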
Read more
  Manning & Schütze: Foundations of Statistical Natural Language Processing
    Chapter 1:
    Chapter 2: Probability and Information Theory basics
  Exercises will be on these topics this week
T-61.5020 Statistical natural language processing
  5 cr, graded 1-5 based on exam + project work
  10 lectures: Wed 12:15-14:00 in T3, Jan 20 - Mar 30
  10 exercise sessions: Thu 14:15-16:00 in T5
  Text book: C. Manning, H. Schütze, 1999. Foundations of Statistical Natural Language Processing. MIT Press. http://nlp.stanford.edu/fsnlp/
  Home page: https://mycourses.aalto.fi/course/view.php?id=8832
Course personnel
  Responsible professor & lecturer:
  Assistant & exercises: Stig-Arne Grönroos
  Project work: Krista Lagus
  Expert lecturers: Oskar Kohonen, Kalle Palomäki, Mari-Sanna Paukkeri, Matti Varjokallio, Teemu Ruokolainen, Sami Virpioja, Juho Rousu
Goals
  To learn how statistical and adaptive methods are used in information retrieval, machine translation, text mining, speech processing and related areas to process natural language data
  To learn how to apply the basic methods and techniques for clustering, classification, hidden Markov models and Bayesian models to model natural language
Lectures in the course
  20 Jan 2016   1: /
  27 Jan 2016   2: Sentence level processing / Oskar Kohonen
  03 Feb 2016   3: Speech recognition / Kalle Palomäki
  10 Feb 2016   4: Term project and Sentiment analysis / Krista Lagus
  24 Feb 2016   5: Vector spaces & Information retrieval / Mari-Sanna Paukkeri
  02 Mar 2016   6: Statistical language models /
  09 Mar 2016   7: Morpheme-level processing / Matti Varjokallio
  16 Mar 2016   8: Tagging / Teemu Ruokolainen
  23 Mar 2016   9: Statistical machine translation / Sami Virpioja
  30 Mar 2016  10: Text classification using kernel methods / Juho Rousu
How to pass the course?
  Participate actively in each lecture, read the corresponding material and ask questions to learn the basics
  Participate actively in the exercise session after each lecture to learn how to solve the problems in practice
  Participate actively in the project work to learn to apply your knowledge
  Participate in the examination to show how well you have learned the topics of the course
Questions?
  Responsible professor & lecturer:
  Assistant & exercises: Stig-Arne Grönroos
  Project work: Krista Lagus
  Emails: firstname.lastname@aalto.fi
  Home page: https://mycourses.aalto.fi/course/view.php?id=8832