INF4820, Algorithms for AI and NLP: More Common Lisp; Vector Spaces
Erik Velldal, University of Oslo, Sept. 4, 2012
Topics for today
More Common Lisp: more data types (arrays, sequences, hash tables, and structures); iteration with loop.
Vector space models: spatial models for representing data; the distributional hypothesis; semantic spaces.
Arrays
Integer-indexed container (elements count from zero):
? (setf *foo* (make-array 5))
#(NIL NIL NIL NIL NIL)
? (setf (aref *foo* 0) 42)
42
? *foo*
#(42 NIL NIL NIL NIL)
Can be fixed-size or adjustable. Can also represent grids of multiple dimensions:
? (setf *foo* (make-array '(2 5) :initial-element 0))
#2A((0 0 0 0 0) (0 0 0 0 0))
? (incf (aref *foo* 1 2))
1
[Figure: a 2x5 grid with rows indexed 0-1 and columns 0-4; after the incf, cell (1, 2) holds 1 and all other cells hold 0.]
Arrays: Specializations and generalizations
Specialized array types: strings and bit-vectors.
Arrays and lists are subtypes of the abstract data type sequence. CL provides a large library of sequence functions, e.g.:
? (length "foobarbaz")
9
? (elt "foobarbaz" 8)
#\z
? (subseq "foobarbaz" 3 6)
"bar"
? (substitute #\x #\b "foobarbaz")
"fooxarxaz"
? (find 42 '(1 2 1 3 1 0))
NIL
? (position 1 #(1 2 1 2 3 1 2 3 4))
0
? (count 1 #(1 2 1 3 1 0))
3
? (remove 1 '(1 2 1 3 1 0))
(2 3 0)
? (count-if #'(lambda (x) (equalp (elt x 0) #\f)) '("foo" "bar" (b a z) "foom"))
2
? (remove-if #'evenp #(1 2 3 4 5))
#(1 3 5)
? (every #'evenp #(1 2 3 4 5))
NIL
? (some #'evenp #(1 2 3 4 5))
T
? (sort '(1 2 1 3 1 0) #'<)
(0 1 1 1 2 3)
Hash tables
While lists are inefficient for indexing large data sets, and arrays are restricted to numeric keys, hash tables efficiently handle a large number of keys of (almost) arbitrary type.
Any of the four (built-in) equality tests can be used for key comparison (a more restricted test = more efficient access).
? (defparameter *table* (make-hash-table :test #'equal))
*TABLE*
? (gethash "foo" *table*)
NIL
? (setf (gethash "foo" *table*) 42)
42
Useful trick for testing, inserting and updating in one go (specifying 0 as the default value):
? (incf (gethash "bar" *table* 0))
1
? (gethash "bar" *table*)
1
Hash table iteration: use maphash or specialized loop directives.
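Not from the original slides, but as a minimal sketch of the maphash option just mentioned, continuing with the *table* built above (note that hash table traversal order is unspecified):
? (maphash #'(lambda (key value)
               (format t "~a -> ~a~%" key value))
           *table*)
foo -> 42
bar -> 1
NIL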
Structures / structs
defstruct creates a new abstract data type with named slots. Encapsulates a group of related data (i.e. an object). Each structure type is a new type distinct from all existing Lisp types. Defines a new constructor, slot accessors, and a type predicate.
? (defstruct cd artist title)
CD
? (setf *foo* (make-cd :artist "Elvis" :title "Blue Hawaii"))
#S(CD :ARTIST "Elvis" :TITLE "Blue Hawaii")
? (listp *foo*)
NIL
? (cd-p *foo*)
T
? (setf (cd-title *foo*) "G.I. Blues")
"G.I. Blues"
? *foo*
#S(CD :ARTIST "Elvis" :TITLE "G.I. Blues")
If you can't see the forest for the trees...
... or can't even see the trees for the parentheses.
A Lisp specialty: Uniformity
Lisp beginners can sometimes find the syntax overwhelming. What's with all the parentheses? For seasoned Lispers the beauty lies in the fact that there's hardly any syntax at all (beyond the abstract data type of lists). Lisp code is a Lisp data structure. Lisp programs are trees of sexps (sometimes compared to the abstract syntax trees created internally by the parser/compiler for other languages). Makes it easier to write code that generates code: macros.
Macros
Pitch: programs that generate programs. Macros provide a way for our code to manipulate itself (before it's passed to the compiler). They can implement transformations that allow us to extend the syntax of the language, and they allow us to control (or even prevent) the evaluation of arguments. We've already used some built-in Common Lisp macros: and, or, if, cond, defun, setf, etc. We might get back to writing macros ourselves later in the course, but for now let's just look at perhaps the best example of how macros can redefine the syntax of the language, for good or for worse depending on who you ask: loop.
Iteration with loop
We've talked about recursion as a powerful control structure... but at times iteration comes more naturally. While there is always dolist and dotimes for simple iteration, the loop macro is much more versatile.
(defun odd-numbers (list)
  (loop for number in list
        when (oddp number)
        collect number))
Illustrates the power of macros: loop is basically a mini-language for iteration. Goodbye uniformity; different syntax based on special keywords. Lisp guru Paul Graham on loop: "one of the worst flaws in CL". But non-lispy as it may be, loop is extremely general and powerful!
loop: a few more examples
? (loop for i from 10 to 50 by 10 collect i)
(10 20 30 40 50)
? (loop for i below 10 when (oddp i) sum i)
25
? (loop for x across "foo" collect x)
(#\f #\o #\o)
? (loop with foo = '(a b c d)
        for i in foo
        for j from 0
        until (eq i 'c)
        do (format t "~a = ~a~%" j i))
0 = A
1 = B
loop: a few more examples
? (loop for i below 10
        if (evenp i) collect i into evens
        else collect i into odds
        finally (return (list evens odds)))
((0 2 4 6 8) (1 3 5 7 9))
? (loop for value being the hash-values of *my-hash*
        using (hash-key key)
        do (format t "~&~a -> ~a" key value))
Input and Output
Reading and writing is mediated through streams. The symbol t indicates the default stream, the terminal.
? (format t "~a is the ~a.~%" 42 "answer")
42 is the answer.
NIL
(read-line stream nil) reads one line of text from stream, returning it as a string. (read stream nil) reads one well-formed s-expression. The second argument asks the reader to return nil on end-of-file.
(with-open-file (stream "sample.txt" :direction :input)
  (loop for line = (read-line stream nil)
        while line
        do (format t "~a~%" line)))
Good Lisp style
Bottom-up design: "In Lisp, you don't just write your program down toward the language, you also build the language up toward your program." (Paul Graham; http://www.paulgraham.com/progbot.html)
Extend the language to fit your problem! Instead of trying to solve everything with one large function, build your program by layers of smaller functions. Eliminate repetition and patterns.
Related: Define abstraction barriers. Separate the code that uses a given data abstraction from the code that implements that data abstraction. Promotes code re-use: makes the code shorter and easier to read, debug and maintain.
Somewhat more mundane: Adhere to the time-honored 80-column rule. Close multiple parens on the same line. Use Emacs auto-indentation (TAB).
And now...
VECTOR SPACE MODELS
Vector space model
A model for representing data based on a spatial metaphor. Each object is represented as a vector (or point) positioned in a coordinate system. Each coordinate (or dimension or axis) of the space corresponds to some descriptive and measurable property (feature) of the objects. When we want to measure the similarity of two objects, we can measure their geometrical distance/closeness in the model. Vector representations are foundational to a wide range of ML methods.
Semantic spaces
A semantic space is a vector space model where the points represent words, where the dimensions represent contexts of use, and where we'd like the distance between points to reflect the semantic similarity of the words they represent. AKA distributional semantic models (DSMs) and word space models.
Some choices and issues: Usage = meaning? How do we define context? How do we define the vector values/weights? How do we measure similarity?
The Distributional Hypothesis
AKA the Contextual Theory of Meaning.
"Meaning is use." (Wittgenstein, 1953)
"You shall know a word by the company it keeps." (Firth, 1968)
"The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities." (Harris, 1968)
He was feeling seriously hung over after drinking too many shots of retawerif at the party last night.
Distributional methods
Distributional view on lexical semantics. The idea: Record contexts across large collections of texts (corpora) to characterize word meaning. Motivation: We can compare the meaning of words by comparing their contexts. No need for prior knowledge!
Each word $o_i$ is represented by a tuple (vector) of features $f_1, \ldots, f_n$, where each $f_j$ records some property of the observed contexts of $o_i$.
But before we start looking at how to compare the feature vectors, we first need to define context and word.
Defining context
Let's say we want to extract features for the target bread in: "I bake bread for breakfast."
Context windows: Context = neighborhood of ±n words left/right of the focus word. Bag-of-Words (BoW); ignoring the linear ordering of the words. Features: {I, bake, for, breakfast}
Grammatical context: Context = the grammatical relations to other words. Intuition: When words combine in a construction they often impose semantic constraints on each other. Requires deeper linguistic analysis than simple BoW approaches. Features: {dir_obj(bake), prep_for(breakfast)}
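As an illustration of the context-window idea (a sketch of our own, not from the slides; the function name bow-context and the list-of-strings input are assumptions), a ±n BoW window could be extracted like this:
(defun bow-context (tokens position n)
  ;; Collect the tokens at most N positions to the left or right of
  ;; the token at POSITION, excluding the focus word itself.
  (loop for token in tokens
        for i from 0
        when (and (<= (abs (- i position)) n)
                  (/= i position))
        collect token))

? (bow-context '("I" "bake" "bread" "for" "breakfast") 2 2)
("I" "bake" "for" "breakfast")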
Defining context (cont'd)
What is a word? Tokenization: breaking text up into words or other meaningful units. Different levels of abstraction and morphological normalization: stop-words; what to do with case, numbers, punctuation, compounds, ...? Full-form words vs. stemming vs. lemmatization...
It's a common strategy to filter out closed-class words or function words by using a so-called stop-list. The idea is that only content words are relevant.
Example: The programmer's programs had been programmed.
Full forms: the programmer's programs had been programmed.
Lemmas: the programmer's program have be program.
W/ stop-list: programmer program program
Stems: program program program
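A small sketch of stop-list filtering with the sequence function remove-if; the *stop-words* list below is only an illustrative fragment of a real stop-list, and the function name is our own:
(defparameter *stop-words* '("the" "a" "have" "be" "had" "been"))

(defun remove-stop-words (tokens)
  ;; Drop every token that appears in *stop-words* (case-insensitive).
  (remove-if #'(lambda (token)
                 (member token *stop-words* :test #'string-equal))
             tokens))

? (remove-stop-words '("the" "programmer" "programs" "had" "been" "programmed"))
("programmer" "programs" "programmed")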
Different contexts, different similarities
What do we mean by similar? The type of context dictates the type of semantic similarity. Relatedness vs. sameness, or domain vs. content.
Similarity in domain: {car, road, gas, service, traffic, driver, license}
Similarity in content: {car, train, bicycle, truck, vehicle, airplane, bus}
While broader definitions of context (e.g. sentence-level BoW) tend to give clues for domain-based relatedness, more fine-grained grammatical contexts give clues for content-based similarity.
Feature vectors
A vector space model is defined by a system of $n$ dimensions; objects are represented as real-valued vectors in the space $\mathbb{R}^n$. Our observations of words in context must be encoded numerically: Each context feature is mapped to a dimension $j \in [1, n]$. For a given word, the value of a given feature is its number of co-occurrences for the corresponding context across our corpus.
Let the set of $n$ features describing the lexical contexts of a word $o_i$ be represented as a feature vector $F(o_i) = \vec{f}_i = \langle f_{i1}, \ldots, f_{in} \rangle$.
For example, assume that the $i$th word is cake and the $j$th feature is OBJ_OF(bake); then $f_{ij}$ = f(cake, OBJ_OF(bake)) = 4 would mean that we have observed cake as the object of the verb bake in our corpus 4 times.
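One possible way to accumulate such co-occurrence counts, reusing the incf/gethash trick from the hash table slide; the nested-hash-table representation and the names *counts* and count-co-occurrence are our own choices, not something the slides prescribe:
(defparameter *counts* (make-hash-table :test #'equal)
  "Maps each word to a hash table of context-feature counts.")

(defun count-co-occurrence (word feature)
  ;; Increment the count of WORD co-occurring with FEATURE,
  ;; creating the inner hash table on first use.
  (let ((features (or (gethash word *counts*)
                      (setf (gethash word *counts*)
                            (make-hash-table :test #'equal)))))
    (incf (gethash feature features 0))))

? (count-co-occurrence "cake" "OBJ_OF(bake)")
1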
Word-context association
We want our feature vectors to reflect which contexts are the most salient or relevant for each word. Problem: Raw co-occurrence frequencies alone are not good indicators of relevance.
Consider the noun wine as a direct object of the verbs buy and pour:
f(wine, OBJ_OF(buy)) = 14
f(wine, OBJ_OF(pour)) = 8
... but the feature OBJ_OF(pour) seems more indicative of the semantics of wine than OBJ_OF(buy).
Solution: Weight the counts by an association function, normalizing our observed frequencies for chance co-occurrence. There is a range of different association measures in use, and most take the form of a statistical test of dependence, e.g. pointwise mutual information, the log odds ratio, the t-test, log likelihood, ...
Pointwise Mutual Information
Defines the association between a feature f and an observation o as the log ratio of their joint probability to the product of their marginal probabilities:
$$I(f, o) = \log_2 \frac{P(f, o)}{P(f)\,P(o)} = \log_2 \frac{P(f)\,P(o \mid f)}{P(f)\,P(o)} = \log_2 \frac{P(o \mid f)}{P(o)}$$
Perfect independence: $P(f, o) = P(f)P(o)$ and $I(f, o) = 0$.
Perfect dependence: If f and o always occur together then $P(o \mid f) = 1$ and $I(f, o) = \log_2 (1 / P(o))$.
A smaller marginal probability $P(o)$ leads to a larger association score $I(f, o)$: overestimates the correlation of rare events.
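A sketch of how the PMI score could be computed from raw counts (assuming we are handed the pair count, the two marginal counts, and the total number of observations; the numbers in the example call are invented for illustration):
(defun pmi (pair-count feature-count object-count total)
  ;; log2 of P(f,o) / (P(f) * P(o)), with probabilities estimated
  ;; as relative frequencies over TOTAL observations.
  (let ((p-joint   (/ pair-count total))
        (p-feature (/ feature-count total))
        (p-object  (/ object-count total)))
    (log (/ p-joint (* p-feature p-object)) 2)))

? (pmi 4 10 20 1000)
4.321928   ;; P(f,o) = 0.004, P(f) = 0.01, P(o) = 0.02, so the ratio is 20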
The Log Odds Ratio
Measures the magnitude of association between an observed object o and a feature f independently of their marginal probabilities:
$$\log \theta(f, o) = \log \frac{P(f, o)\,/\,P(f, \neg o)}{P(\neg f, o)\,/\,P(\neg f, \neg o)}$$
$\theta(f, o)$ expresses how much the chance of observing o increases when the feature f is present. $\log \theta(f, o) > 0$ means the probability of seeing o increases when f is present; $\log \theta = 0$ indicates distributional independence.
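And a corresponding sketch for the log odds ratio, computed from the four cells of a 2x2 contingency table over (f, o); the counts in the example call are again invented:
(defun log-odds-ratio (f-and-o f-not-o not-f-o not-f-not-o)
  ;; log of (P(f,o)/P(f,not-o)) / (P(not-f,o)/P(not-f,not-o));
  ;; the shared denominators cancel, leaving a ratio of raw counts.
  (log (/ (/ f-and-o f-not-o)
          (/ not-f-o not-f-not-o))))

? (log-odds-ratio 8 2 20 970)
5.267858   ;; theta = (8/2) / (20/970) = 194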
Negative Correlations
Negatively correlated pairs (f, o) are usually ignored when measuring word-context associations (e.g. if $\log \theta(f, o) < 0$). Estimates of negative correlations are unreliable in sparse data. Both unobserved and negatively correlated co-occurrence pairs are assumed to have zero association.
We will use $X = \{\vec{x}_1, \ldots, \vec{x}_k\}$ to denote the set of association vectors that results from applying the association weighting. That is, $\vec{x}_i = \langle A(f_{i1}), \ldots, A(f_{in}) \rangle$, where for example $A = \log \theta$ (i.e. the log odds ratio).
Euclidean distance
Vector space models let us compute the semantic similarity of words in terms of spatial proximity. So how do we do that, then? One standard metric is the Euclidean distance:
$$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Computes the length (or norm) of the difference of the vectors. The Euclidean norm of a vector is:
$$\|\vec{x}\| = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{\vec{x} \cdot \vec{x}}$$
Intuitive interpretation: The distance between two points corresponds to the length of a straight line connecting them.
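A direct transcription of the two formulas into Lisp, as a sketch assuming vectors are represented as ordinary Lisp vectors (or other sequences) of numbers:
(defun euclidean-distance (x y)
  ;; Square root of the sum of squared element-wise differences.
  (sqrt (reduce #'+ (map 'list
                         #'(lambda (xi yi) (expt (- xi yi) 2))
                         x y))))

(defun euclidean-norm (x)
  ;; The Euclidean length of the vector X.
  (sqrt (reduce #'+ (map 'list #'(lambda (xi) (* xi xi)) x))))

? (euclidean-distance #(1.0 2.0 3.0) #(4.0 6.0 3.0))
5.0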
Euclidean distance and length bias
However, a potential problem with the Euclidean distance is that it is very sensitive to extreme values and to the length of the vectors. As the vectors of words with different frequencies will tend to have different lengths, frequency will also affect the similarity judgment.
Overcoming length bias by normalization
Note that, although our association weighting already normalizes the differences in frequency to some degree, words with initially long frequency vectors will also tend to have longer association vectors. One way to reduce the effect of frequency/length is to first normalize all our vectors to have unit length, i.e. $\|\vec{x}\| = 1$. (This can be achieved by simply dividing each element by the length.)
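A one-function sketch of that normalization step, reusing euclidean-norm from the sketch above:
(defun normalize (x)
  ;; Scale the vector X to unit length by dividing each element
  ;; by the Euclidean norm.
  (let ((norm (euclidean-norm x)))
    (map 'vector #'(lambda (xi) (/ xi norm)) x)))

? (normalize #(3.0 4.0))
#(0.6 0.8)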
Cosine similarity
Another way to deal with length bias: use the cosine measure. Computes similarity as a function of the angle between the vectors:
$$\cos(\vec{x}, \vec{y}) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}$$
Constant range between 0 and 1. Avoids the arbitrary scaling caused by dimensionality, frequency or the range of the association measure A. As the angle between the vectors shortens, the cosine approaches 1.
When applied to normalized vectors, the cosine can be simplified to the dot product alone:
$$\cos(\vec{x}, \vec{y}) = \vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i y_i$$
The same relative rank order as the Euclidean distance for unit vectors!
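Finally, a sketch of the dot product and the cosine measure in the same style; for vectors already normalized to unit length, dot-product alone gives the cosine:
(defun dot-product (x y)
  ;; Sum of the element-wise products of X and Y.
  (reduce #'+ (map 'list #'* x y)))

(defun cosine-similarity (x y)
  ;; Dot product divided by the product of the vector lengths.
  (/ (dot-product x y)
     (* (euclidean-norm x) (euclidean-norm y))))

? (cosine-similarity #(1.0 1.0 0.0) #(1.0 0.0 0.0))
0.70710677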
Next Week
More on vector space models: dealing with sparse vectors; computing neighbor relations in the semantic space. Representing classes; representing class membership. Classification algorithms: kNN classification, c-means, etc.
Reading: the chapter "Vector Space Classification" (Sections 14–14.4) in Manning, Raghavan & Schütze (2008); http://informationretrieval.org/.
Firth, J. R. (1968). A synopsis of linguistic theory. In F. R. Palmer (Ed.), Selected papers of J. R. Firth: 1952–1959. Longman.
Harris, Z. S. (1968). Mathematical structures of language. New York: Wiley.
Wittgenstein, L. (1953). Philosophical investigations. Oxford: Blackwell.