INF4820, Algorithms for AI and NLP: More Common Lisp; Vector Spaces
Erik Velldal, University of Oslo, Sept. 4, 2012
Topics for today
More Common Lisp: more data types (arrays, sequences, hash tables, and structures); iteration with loop.
Vector space models: spatial models for representing data; the distributional hypothesis; semantic spaces.
Arrays
Integer-indexed container (elements count from zero):
? (setf *foo* (make-array 5))
#(NIL NIL NIL NIL NIL)
? (setf (aref *foo* 0) 42)
42
? *foo*
#(42 NIL NIL NIL NIL)
Can be fixed-size or adjustable. Can also represent grids of multiple dimensions:
? (setf *foo* (make-array '(2 5) :initial-element 0))
#2A((0 0 0 0 0) (0 0 0 0 0))
? (incf (aref *foo* 1 2))
1
[Figure: a 2x5 grid with rows indexed 0-1 and columns 0-4; after the incf, cell (1, 2) holds 1 and all other cells hold 0.]
Arrays: Specializations and generalizations
Specialized array types: strings and bit-vectors.
Arrays and lists are subtypes of the abstract data type sequence. CL provides a large library of sequence functions, e.g.:
? (length "foobarbaz")
9
? (elt "foobarbaz" 8)
#\z
? (subseq "foobarbaz" 3 6)
"bar"
? (substitute #\x #\b "foobarbaz")
"fooxarxaz"
? (find 42 '(1 2 1 3 1 0))
NIL
? (position 1 #(1 2 1 2 3 1 2 3 4))
0
? (count 1 #(1 2 1 3 1 0))
3
? (remove 1 '(1 2 1 3 1 0))
(2 3 0)
? (count-if #'(lambda (x) (equalp (elt x 0) #\f)) '("foo" "bar" (b a z) "foom"))
2
? (remove-if #'evenp #(1 2 3 4 5))
#(1 3 5)
? (every #'evenp #(1 2 3 4 5))
NIL
? (some #'evenp #(1 2 3 4 5))
T
? (sort '(1 2 1 3 1 0) #'<)
(0 1 1 1 2 3)
Hash tables
While lists are inefficient for indexing large data sets, and arrays are restricted to numeric keys, hash tables efficiently handle a large number of keys of (almost) arbitrary type.
Any of the four (built-in) equality tests can be used for key comparison (a more restricted test = more efficient access).
? (defparameter *table* (make-hash-table :test #'equal))
*TABLE*
? (gethash "foo" *table*)
NIL
? (setf (gethash "foo" *table*) 42)
42
Useful trick for testing, inserting and updating in one go (specifying 0 as the default value):
? (incf (gethash "bar" *table* 0))
1
? (gethash "bar" *table*)
1
Hash table iteration: use maphash or specialized loop directives.
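Not from the original slides, but as a minimal sketch of the maphash option just mentioned, continuing with the *table* built above (note that hash table traversal order is unspecified):
? (maphash #'(lambda (key value)
               (format t "~a -> ~a~%" key value))
           *table*)
foo -> 42
bar -> 1
NIL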
Structures / structs
defstruct creates a new abstract data type with named slots. Encapsulates a group of related data (i.e. an object). Each structure type is a new type distinct from all existing Lisp types. Defines a new constructor, slot accessors, and a type predicate.
? (defstruct cd artist title)
CD
? (setf *foo* (make-cd :artist "Elvis" :title "Blue Hawaii"))
#S(CD :ARTIST "Elvis" :TITLE "Blue Hawaii")
? (listp *foo*)
NIL
? (cd-p *foo*)
T
? (setf (cd-title *foo*) "G.I. Blues")
"G.I. Blues"
? *foo*
#S(CD :ARTIST "Elvis" :TITLE "G.I. Blues")
If you can't see the forest for the trees...
... or can't even see the trees for the parentheses.
A Lisp specialty: Uniformity
Lisp beginners can sometimes find the syntax overwhelming. What's with all the parentheses? For seasoned Lispers the beauty lies in the fact that there's hardly any syntax at all (beyond the abstract data type of lists). Lisp code is a Lisp data structure. Lisp programs are trees of sexps (sometimes compared to the abstract syntax trees created internally by the parser/compiler for other languages). Makes it easier to write code that generates code: macros.
Macros
Pitch: programs that generate programs. Macros provide a way for our code to manipulate itself (before it's passed to the compiler). They can implement transformations that allow us to extend the syntax of the language, and they allow us to control (or even prevent) the evaluation of arguments. We've already used some built-in Common Lisp macros: and, or, if, cond, defun, setf, etc. We might get back to writing macros ourselves later in the course, but for now let's just look at perhaps the best example of how macros can redefine the syntax of the language, for good or for worse depending on who you ask: loop.
Iteration with loop
We've talked about recursion as a powerful control structure... but at times iteration comes more naturally. While there is always dolist and dotimes for simple iteration, the loop macro is much more versatile.
(defun odd-numbers (list)
  (loop for number in list
        when (oddp number)
        collect number))
Illustrates the power of macros: loop is basically a mini-language for iteration. Goodbye uniformity; different syntax based on special keywords. Lisp guru Paul Graham on loop: "one of the worst flaws in CL". But non-lispy as it may be, loop is extremely general and powerful!
loop: a few more examples
? (loop for i from 10 to 50 by 10 collect i)
(10 20 30 40 50)
? (loop for i below 10 when (oddp i) sum i)
25
? (loop for x across "foo" collect x)
(#\f #\o #\o)
? (loop with foo = '(a b c d)
        for i in foo
        for j from 0
        until (eq i 'c)
        do (format t "~a = ~a~%" j i))
0 = A
1 = B
loop: a few more examples
? (loop for i below 10
        if (evenp i) collect i into evens
        else collect i into odds
        finally (return (list evens odds)))
((0 2 4 6 8) (1 3 5 7 9))
? (loop for value being the hash-values of *my-hash*
        using (hash-key key)
        do (format t "~&~a -> ~a" key value))
Input and Output
Reading and writing is mediated through streams. The symbol t indicates the default stream, the terminal.
? (format t "~a is the ~a.~%" 42 "answer")
42 is the answer.
NIL
(read-line stream nil) reads one line of text from stream, returning it as a string. (read stream nil) reads one well-formed s-expression. The second argument asks the reader to return nil on end-of-file.
(with-open-file (stream "sample.txt" :direction :input)
  (loop for line = (read-line stream nil)
        while line
        do (format t "~a~%" line)))
Good Lisp style
Bottom-up design: "In Lisp, you don't just write your program down toward the language, you also build the language up toward your program." (Paul Graham; http://www.paulgraham.com/progbot.html)
Extend the language to fit your problem! Instead of trying to solve everything with one large function, build your program by layers of smaller functions. Eliminate repetition and patterns.
Related: Define abstraction barriers. Separate the code that uses a given data abstraction from the code that implements that data abstraction. Promotes code re-use: makes the code shorter and easier to read, debug and maintain.
Somewhat more mundane: Adhere to the time-honored 80-column rule. Close multiple parens on the same line. Use Emacs auto-indentation (TAB).
And now...
VECTOR SPACE MODELS
Vector space model
A model for representing data based on a spatial metaphor. Each object is represented as a vector (or point) positioned in a coordinate system. Each coordinate (or dimension or axis) of the space corresponds to some descriptive and measurable property (feature) of the objects. When we want to measure the similarity of two objects, we can measure their geometrical distance/closeness in the model. Vector representations are foundational to a wide range of ML methods.
Semantic spaces
A semantic space is a vector space model where the points represent words, where the dimensions represent contexts of use, and where we'd like the distance between points to reflect the semantic similarity of the words they represent. AKA distributional semantic models (DSMs) and word space models.
Some choices and issues: Usage = meaning? How do we define context? How do we define the vector values/weights? How do we measure similarity?
The Distributional Hypothesis
AKA the Contextual Theory of Meaning.
"Meaning is use." (Wittgenstein, 1953)
"You shall know a word by the company it keeps." (Firth, 1968)
"The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities." (Harris, 1968)
He was feeling seriously hung over after drinking too many shots of retawerif at the party last night.
Distributional methods
Distributional view on lexical semantics. The idea: Record contexts across large collections of texts (corpora) to characterize word meaning. Motivation: We can compare the meaning of words by comparing their contexts. No need for prior knowledge!
Each word $o_i$ is represented by a tuple (vector) of features $f_1, \ldots, f_n$, where each $f_j$ records some property of the observed contexts of $o_i$.
But before we start looking at how to compare the feature vectors, we first need to define context and word.
Defining context
Let's say we want to extract features for the target bread in: "I bake bread for breakfast."
Context windows: Context = neighborhood of ±n words left/right of the focus word. Bag-of-Words (BoW); ignoring the linear ordering of the words. Features: {I, bake, for, breakfast}
Grammatical context: Context = the grammatical relations to other words. Intuition: When words combine in a construction they often impose semantic constraints on each other. Requires deeper linguistic analysis than simple BoW approaches. Features: {dir_obj(bake), prep_for(breakfast)}
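As an illustration of the context-window idea (a sketch of our own, not from the slides; the function name bow-context and the list-of-strings input are assumptions), a ±n BoW window could be extracted like this:
(defun bow-context (tokens position n)
  ;; Collect the tokens at most N positions to the left or right of
  ;; the token at POSITION, excluding the focus word itself.
  (loop for token in tokens
        for i from 0
        when (and (<= (abs (- i position)) n)
                  (/= i position))
        collect token))

? (bow-context '("I" "bake" "bread" "for" "breakfast") 2 2)
("I" "bake" "for" "breakfast")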
Defining context (cont'd)
What is a word? Tokenization: breaking text up into words or other meaningful units. Different levels of abstraction and morphological normalization: stop-words; what to do with case, numbers, punctuation, compounds, ...? Full-form words vs. stemming vs. lemmatization...
It's a common strategy to filter out closed-class words or function words by using a so-called stop-list. The idea is that only content words are relevant.
Example: The programmer's programs had been programmed.
Full forms: the programmer's programs had been programmed.
Lemmas: the programmer's program have be program.
W/ stop-list: programmer program program
Stems: program program program
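A small sketch of stop-list filtering with the sequence function remove-if; the *stop-words* list below is only an illustrative fragment of a real stop-list, and the function name is our own:
(defparameter *stop-words* '("the" "a" "have" "be" "had" "been"))

(defun remove-stop-words (tokens)
  ;; Drop every token that appears in *stop-words* (case-insensitive).
  (remove-if #'(lambda (token)
                 (member token *stop-words* :test #'string-equal))
             tokens))

? (remove-stop-words '("the" "programmer" "programs" "had" "been" "programmed"))
("programmer" "programs" "programmed")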
Different contexts, different similarities
What do we mean by similar? The type of context dictates the type of semantic similarity. Relatedness vs. sameness, or domain vs. content.
Similarity in domain: {car, road, gas, service, traffic, driver, license}
Similarity in content: {car, train, bicycle, truck, vehicle, airplane, bus}
While broader definitions of context (e.g. sentence-level BoW) tend to give clues for domain-based relatedness, more fine-grained grammatical contexts give clues for content-based similarity.
Feature vectors
A vector space model is defined by a system of $n$ dimensions; objects are represented as real-valued vectors in the space $\mathbb{R}^n$. Our observations of words in context must be encoded numerically: Each context feature is mapped to a dimension $j \in [1, n]$. For a given word, the value of a given feature is its number of co-occurrences for the corresponding context across our corpus.
Let the set of $n$ features describing the lexical contexts of a word $o_i$ be represented as a feature vector $F(o_i) = \vec{f}_i = \langle f_{i1}, \ldots, f_{in} \rangle$.
For example, assume that the $i$th word is cake and the $j$th feature is OBJ_OF(bake); then $f_{ij}$ = f(cake, OBJ_OF(bake)) = 4 would mean that we have observed cake as the object of the verb bake in our corpus 4 times.
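One possible way to accumulate such co-occurrence counts, reusing the incf/gethash trick from the hash table slide; the nested-hash-table representation and the names *counts* and count-co-occurrence are our own choices, not something the slides prescribe:
(defparameter *counts* (make-hash-table :test #'equal)
  "Maps each word to a hash table of context-feature counts.")

(defun count-co-occurrence (word feature)
  ;; Increment the count of WORD co-occurring with FEATURE,
  ;; creating the inner hash table on first use.
  (let ((features (or (gethash word *counts*)
                      (setf (gethash word *counts*)
                            (make-hash-table :test #'equal)))))
    (incf (gethash feature features 0))))

? (count-co-occurrence "cake" "OBJ_OF(bake)")
1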
Word-context association
We want our feature vectors to reflect which contexts are the most salient or relevant for each word. Problem: Raw co-occurrence frequencies alone are not good indicators of relevance.
Consider the noun wine as a direct object of the verbs buy and pour:
f(wine, OBJ_OF(buy)) = 14
f(wine, OBJ_OF(pour)) = 8
... but the feature OBJ_OF(pour) seems more indicative of the semantics of wine than OBJ_OF(buy).
Solution: Weight the counts by an association function, normalizing our observed frequencies for chance co-occurrence. There is a range of different association measures in use, and most take the form of a statistical test of dependence, e.g. pointwise mutual information, the log odds ratio, the t-test, log likelihood, ...
Pointwise Mutual Information
Defines the association between a feature f and an observation o as the log ratio of their joint probability to the product of their marginal probabilities:
$$I(f, o) = \log_2 \frac{P(f, o)}{P(f)\,P(o)} = \log_2 \frac{P(f)\,P(o \mid f)}{P(f)\,P(o)} = \log_2 \frac{P(o \mid f)}{P(o)}$$
Perfect independence: $P(f, o) = P(f)P(o)$ and $I(f, o) = 0$.
Perfect dependence: If f and o always occur together then $P(o \mid f) = 1$ and $I(f, o) = \log_2 (1 / P(o))$.
A smaller marginal probability $P(o)$ leads to a larger association score $I(f, o)$: overestimates the correlation of rare events.
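A sketch of how the PMI score could be computed from raw counts (assuming we are handed the pair count, the two marginal counts, and the total number of observations; the numbers in the example call are invented for illustration):
(defun pmi (pair-count feature-count object-count total)
  ;; log2 of P(f,o) / (P(f) * P(o)), with probabilities estimated
  ;; as relative frequencies over TOTAL observations.
  (let ((p-joint   (/ pair-count total))
        (p-feature (/ feature-count total))
        (p-object  (/ object-count total)))
    (log (/ p-joint (* p-feature p-object)) 2)))

? (pmi 4 10 20 1000)
4.321928   ;; P(f,o) = 0.004, P(f) = 0.01, P(o) = 0.02, so the ratio is 20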
The Log Odds Ratio
Measures the magnitude of association between an observed object o and a feature f independently of their marginal probabilities:
$$\log \theta(f, o) = \log \frac{P(f, o)\,/\,P(f, \neg o)}{P(\neg f, o)\,/\,P(\neg f, \neg o)}$$
$\theta(f, o)$ expresses how much the chance of observing o increases when the feature f is present. $\log \theta(f, o) > 0$ means the probability of seeing o increases when f is present; $\log \theta = 0$ indicates distributional independence.
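And a corresponding sketch for the log odds ratio, computed from the four cells of a 2x2 contingency table over (f, o); the counts in the example call are again invented:
(defun log-odds-ratio (f-and-o f-not-o not-f-o not-f-not-o)
  ;; log of (P(f,o)/P(f,not-o)) / (P(not-f,o)/P(not-f,not-o));
  ;; the shared denominators cancel, leaving a ratio of raw counts.
  (log (/ (/ f-and-o f-not-o)
          (/ not-f-o not-f-not-o))))

? (log-odds-ratio 8 2 20 970)
5.267858   ;; theta = (8/2) / (20/970) = 194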
Negative Correlations
Negatively correlated pairs (f, o) are usually ignored when measuring word-context associations (e.g. if $\log \theta(f, o) < 0$). Estimates of negative correlations are unreliable in sparse data. Both unobserved and negatively correlated co-occurrence pairs are assumed to have zero association.
We will use $X = \{\vec{x}_1, \ldots, \vec{x}_k\}$ to denote the set of association vectors that results from applying the association weighting. That is, $\vec{x}_i = \langle A(f_{i1}), \ldots, A(f_{in}) \rangle$, where for example $A = \log \theta$ (i.e. the log odds ratio).
Euclidean distance
Vector space models let us compute the semantic similarity of words in terms of spatial proximity. So how do we do that, then? One standard metric is the Euclidean distance:
$$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Computes the length (or norm) of the difference of the vectors. The Euclidean norm of a vector is:
$$\|\vec{x}\| = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{\vec{x} \cdot \vec{x}}$$
Intuitive interpretation: The distance between two points corresponds to the length of a straight line connecting them.
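A direct transcription of the two formulas into Lisp, as a sketch assuming vectors are represented as ordinary Lisp vectors (or other sequences) of numbers:
(defun euclidean-distance (x y)
  ;; Square root of the sum of squared element-wise differences.
  (sqrt (reduce #'+ (map 'list
                         #'(lambda (xi yi) (expt (- xi yi) 2))
                         x y))))

(defun euclidean-norm (x)
  ;; The Euclidean length of the vector X.
  (sqrt (reduce #'+ (map 'list #'(lambda (xi) (* xi xi)) x))))

? (euclidean-distance #(1.0 2.0 3.0) #(4.0 6.0 3.0))
5.0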
Euclidean distance and length bias
However, a potential problem with the Euclidean distance is that it is very sensitive to extreme values and to the length of the vectors. As the vectors of words with different frequencies will tend to have different lengths, frequency will also affect the similarity judgment.
Overcoming length bias by normalization
Note that, although our association weighting already normalizes the differences in frequency to some degree, words with initially long frequency vectors will also tend to have longer association vectors. One way to reduce the effect of frequency/length is to first normalize all our vectors to have unit length, i.e. $\|\vec{x}\| = 1$. (This can be achieved by simply dividing each element by the length.)
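A one-function sketch of that normalization step, reusing euclidean-norm from the sketch above:
(defun normalize (x)
  ;; Scale the vector X to unit length by dividing each element
  ;; by the Euclidean norm.
  (let ((norm (euclidean-norm x)))
    (map 'vector #'(lambda (xi) (/ xi norm)) x)))

? (normalize #(3.0 4.0))
#(0.6 0.8)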
Cosine similarity
Another way to deal with length bias: use the cosine measure. Computes similarity as a function of the angle between the vectors:
$$\cos(\vec{x}, \vec{y}) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}$$
Constant range between 0 and 1. Avoids the arbitrary scaling caused by dimensionality, frequency or the range of the association measure A. As the angle between the vectors shortens, the cosine approaches 1.
When applied to normalized vectors, the cosine can be simplified to the dot product alone:
$$\cos(\vec{x}, \vec{y}) = \vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i y_i$$
The same relative rank order as the Euclidean distance for unit vectors!
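Finally, a sketch of the dot product and the cosine measure in the same style; for vectors already normalized to unit length, dot-product alone gives the cosine:
(defun dot-product (x y)
  ;; Sum of the element-wise products of X and Y.
  (reduce #'+ (map 'list #'* x y)))

(defun cosine-similarity (x y)
  ;; Dot product divided by the product of the vector lengths.
  (/ (dot-product x y)
     (* (euclidean-norm x) (euclidean-norm y))))

? (cosine-similarity #(1.0 1.0 0.0) #(1.0 0.0 0.0))
0.70710677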
Next Week
More on vector space models: dealing with sparse vectors; computing neighbor relations in the semantic space. Representing classes; representing class membership. Classification algorithms: kNN classification, c-means, etc.
Reading: the chapter "Vector Space Classification" (Sections 14–14.4) in Manning, Raghavan & Schütze (2008); http://informationretrieval.org/.
Firth, J. R. (1968). A synopsis of linguistic theory. In F. R. Palmer (Ed.), Selected papers of J. R. Firth: 1952–1959. Longman.
Harris, Z. S. (1968). Mathematical structures of language. New York: Wiley.
Wittgenstein, L. (1953). Philosophical investigations. Oxford: Blackwell.