Non-Markovian Logic-Probabilistic Modeling and Inference
Eduardo Menezes de Morais, Glauber De Bona, Marcelo Finger
Department of Computer Science, University of São Paulo, São Paulo, Brazil
WL4AI 2015
What we are trying to solve: Part-of-speech Tagging
PoS tagging associates each word in a sentence with a part-of-speech tag.
Example: The/D horse/N raced/VB-P past/P the/D barn/N
Tags: Determiner (D), Noun (N), Past tense verb (VB-P), Past participle verb (VB-PP), Preposition (P), Adjective (ADJ)
To resolve ambiguity, current algorithms inspect adjacent tags; only a local context (usually 2 or 3 words) is used, to avoid an exponential blowup.
What we are trying to solve: Part-of-speech Tagging
Some ambiguities can only be resolved through unbounded-distance dependencies.
Example:
The horse raced/VB-P past the barn
The horse raced/VB-PP past the barn fell/VB-P
The horse raced/VB-PP past the old red barn fell/VB-P
Tags: Determiner (D), Noun (N), Past tense verb (VB-P), Past participle verb (VB-PP), Preposition (P), Adjective (ADJ)
We hope to find a way to treat these dependencies without exponential blowup, using Probabilistic Logic.
Talk outline
Motivation
Probabilistic Logic
Hidden Markov Models
Encoding Hidden Markov Models with PSAT
Non-Markovian extension
Conclusion
Next Topic: Probabilistic Logic (Definition; Properties)
Definition: The language
Logical variables or atoms: P = {x_1, ..., x_n}
Connectives: ¬, ∧, ∨
Formulas (L_P) are inductively composed from atoms using the connectives.
A probabilistic knowledge base is a finite set of probabilistic assignments Γ = {P(ϕ_i | ψ_i) = p_i | 1 ≤ i ≤ m}, where ϕ_i, ψ_i ∈ L_P and p_i ∈ [0, 1].
Definition: Semantics
V is the set of all 2^n propositional valuations (possible worlds) v : P → {0, 1}, generalized to any propositional formula, v : L_P → {0, 1}.
A probability distribution over propositional valuations is π : V → [0, 1] with Σ_{i=1}^{2^n} π(v_i) = 1.
Probability of a formula α according to π: P_π(α) = Σ {π(v_i) | v_i(α) = 1}.
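To make the semantics concrete, here is a minimal Python sketch (the atoms, the formula x1 → x2 and the uniform distribution are arbitrary illustrative choices, not from the slides) that computes P_π(α) by summing π over the satisfying valuations:

```python
from itertools import product

# Illustrative atoms, formula and distribution (arbitrary choices for the sketch)
atoms = ["x1", "x2"]

def alpha(v):
    # example formula: x1 -> x2, written as a predicate over a valuation v
    return (not v["x1"]) or bool(v["x2"])

# the 2^n valuations (possible worlds) and a uniform distribution pi over them
valuations = [dict(zip(atoms, bits)) for bits in product([0, 1], repeat=len(atoms))]
pi = [1.0 / len(valuations)] * len(valuations)

def prob(formula, valuations, pi):
    """P_pi(formula): sum of pi(v_i) over the valuations v_i with v_i(formula) = 1."""
    return sum(p for v, p in zip(valuations, pi) if formula(v))

print(prob(alpha, valuations, pi))  # 0.75: x1 -> x2 is false only in the world (1, 0)
```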
Definition: The PSAT Problem
Probabilistic Satisfiability: are these constraints consistent? Is there a probability distribution over the possible worlds that makes all the probabilistic assignments true?
Given Γ = {P(ϕ_i | ψ_i) = p_i | 1 ≤ i ≤ m}, is there a π such that P_π(ϕ_i ∧ ψ_i) = p_i · P_π(ψ_i) for all 1 ≤ i ≤ m?
Definition: The OPSAT Problem
A similar problem: OPSAT.
A formula µ ∉ Γ has a range of probabilities, one for each satisfying π.
Given Γ = {P(ϕ_i | ψ_i) = p_i | 1 ≤ i ≤ m}, which π satisfies P_π(ϕ_i ∧ ψ_i) = p_i · P_π(ψ_i) for all 1 ≤ i ≤ m and maximizes/minimizes the probability of some formula µ ∉ Γ?
Note the lack of a priori statistical independence assumptions.
Properties: An Algebraic Formalization
Vector of probabilities p of dimension k × 1 (given).
A large matrix A of dimension k × 2^n, A = [a_{ij}] with a_{ij} = v_j(α_i) ∈ {0, 1} (computed).
π: a probability distribution of exponential size, a vector of dimension 2^n × 1.
PSAT: decide whether there is a π such that Aπ = p, Σ_i π_i = 1, π ≥ 0.
OPSAT: maximize/minimize P_π(µ) over all π such that Aπ = p, Σ_i π_i = 1, π ≥ 0.
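A small sketch of this algebraic view, assuming n is tiny so that the 2^n columns of A can be enumerated explicitly (practical PSAT solvers avoid materializing A); it checks PSAT feasibility and solves a toy OPSAT with an off-the-shelf LP solver. The instance with P(x1) = 0.7 and P(x1 ∧ x2) = 0.4 is an illustrative assumption:

```python
import numpy as np
from scipy.optimize import linprog

# Tiny illustrative instance: atoms x1, x2; assignments P(x1) = 0.7, P(x1 & x2) = 0.4
A = np.array([
    # columns are the valuations (x1, x2) = 00, 01, 10, 11
    [0, 0, 1, 1],   # row for x1
    [0, 0, 0, 1],   # row for x1 & x2
])
p = np.array([0.7, 0.4])

# PSAT: is there pi >= 0 with A pi = p and sum(pi) = 1?  (LP feasibility)
A_eq = np.vstack([A, np.ones(A.shape[1])])
b_eq = np.append(p, 1.0)
psat = linprog(np.zeros(A.shape[1]), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("PSAT satisfiable:", psat.status == 0)

# OPSAT: among the satisfying pi, maximize P(mu) for mu = x2
# (minimize the negated indicator row of mu over the same columns)
mu = np.array([0, 1, 0, 1])
opsat = linprog(-mu, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("max P(x2) =", -opsat.fun)
```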
Properties: PSAT is NP-complete
PSAT is NP-complete [Georgakopoulos, Kavvadias & Papadimitriou 1988].
If a PSAT problem has a solution, then by Carathéodory's Lemma there is a solution π with at most k + 1 elements π_i > 0.
So PSAT has a polynomial-size witness: the system Aπ = p, Σ_i π_i = 1 restricted to at most k + 1 columns of 0/1 entries.
Next Topic: Hidden Markov Models (An algorithm for POS-Tagging)
An algorithm for POS-Tagging: Hidden Markov Models
HMMs have been used successfully in language processing tasks since as early as the 1970s.
A stochastic process whose states cannot be observed directly; instead, a sequence of symbols produced by another stochastic process is observed.
The symbol produced depends only on the current hidden state.
An algorithm for POS-Tagging: Markov Chains
[Figure: a Markov chain over the weather states Sunny, Rainy and Cloudy, with transition probabilities labelling the edges]
An algorithm for POS-Tagging: Hidden Markov Model
[Figure: an HMM with hidden weather states Sunny, Rainy and Cloudy, transition probabilities between them, and emission probabilities to the observations Walk, Shop and Play]
An algorithm for POS-Tagging: Inference on HMMs
In a POS-tagging context, the words are the observations and the tags are the states.
We want to discover the sequence of states that most likely produced the sequence of observations (decoding): max_T P(W | H, T).
Efficiently solved using the Viterbi algorithm.
Good performance, but it uses a very limited context.
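For reference, a compact sketch of the standard Viterbi decoder (the textbook formulation, not the authors' implementation), using the HMM matrices A (transitions), B (emissions) and q (initial probabilities):

```python
import numpy as np

# Standard Viterbi decoding: the most likely state sequence for observations obs
# under an HMM (A, B, q). Assumes strictly positive probabilities (smooth before
# taking logs otherwise).
def viterbi(obs, A, B, q):
    """obs: list of observation indices; A[i, j] = P(state j | state i);
    B[i, h] = P(observation h | state i); q[i] = P(first state i)."""
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))            # best log-probability ending in each state
    psi = np.zeros((T, n_states), dtype=int)   # backpointers to the best predecessor
    delta[0] = np.log(q) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # indexed (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # backtrack the most likely sequence of states (tags)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```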
Next Topic: Encoding Hidden Markov Models with PSAT (The constraints)
Encoding Hidden Markov Models with PSAT
To remove some of the restrictions of HMMs, let us first model an HMM as a PSAT problem.
Given a sentence o_1, ..., o_k, define propositional variables:
t_{i,j}: true iff word i has tag j
w_i: true iff word i is o_i
The constraints: Attributes of an HMM
Given an HMM H = (A, B, q):
1. Probability of transition between states: A = [a_{ij}], a_{ij} = P(t_k = s_j | t_{k-1} = s_i)
2. Probability of the first state: q = [q_i], q_i = P(t_1 = s_i)
3. Probability of producing observations: B = [b_{ih}], b_{ih} = P(w_k = o_h | t_k = s_i)
The constraints: Probabilistic constraints
0. Only one state per word, m states total: P(t_{i,j} ∧ t_{i,k}) = 0 for j ≠ k, and P(⋁_{j=1}^{m} t_{i,j}) = 1
1. Probability of transition between states: P(t_{i,j} | t_{i-1,k}) = a_{k,j}
2. Probability of the first state: P(t_{1,j}) = q_j
3. Probability of producing observations: P(w_i | t_{i,j}) = b_{j,w(i)}
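A hedged sketch of how constraints (0)-(3) might be enumerated in code; the atom names t{i}_{j} and w{i}, the (ϕ, ψ, p) triple standing for P(ϕ | ψ) = p (with ψ = None for unconditional assignments), and the function name hmm_to_psat are illustrative assumptions, not the authors' implementation:

```python
from itertools import combinations

def hmm_to_psat(n, m, A, q, B, word_of):
    """Enumerate assignments (0)-(3) for n words and m tags.
    A[k][j] = P(tag j | previous tag k), q[j] = P(first tag j),
    B[j][h] = P(word o_h | tag j); word_of[i] = index of the i-th observed word."""
    gamma = []
    for i in range(1, n + 1):
        # (0) exactly one tag per word
        for j, k in combinations(range(1, m + 1), 2):
            gamma.append((f"t{i}_{j} & t{i}_{k}", None, 0.0))
        gamma.append((" | ".join(f"t{i}_{j}" for j in range(1, m + 1)), None, 1.0))
        # (3) emission probabilities P(w_i | t_ij) = b_{j, w(i)}
        for j in range(1, m + 1):
            gamma.append((f"w{i}", f"t{i}_{j}", B[j - 1][word_of[i - 1]]))
    # (2) initial tag probabilities
    for j in range(1, m + 1):
        gamma.append((f"t1_{j}", None, q[j - 1]))
    # (1) transition probabilities P(t_ij | t_{i-1,k}) = a_{k,j}
    for i in range(2, n + 1):
        for k in range(1, m + 1):
            for j in range(1, m + 1):
                gamma.append((f"t{i}_{j}", f"t{i - 1}_{k}", A[k - 1][j - 1]))
    return gamma
```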
The constraints: Properties of an HMM
An HMM H = (A, B, q) has the following properties:
4. Markov property ("lack of memory"): P(t_k | t_{k-1}) = P(t_k | t_{k-1}, ..., t_1)
5. Independence property (the observations depend only on the current state): P(w_k | t_k) = P(w_k | t_M, ..., t_1, w_{k-1}, ..., w_1)
The constraints: Probabilistic constraints
4. Markov property ("lack of memory"): P(t_{i,h_i} | t_{i-1,h_{i-1}} ∧ ... ∧ t_{1,h_1}) = P(t_{i,h_i} | t_{i-1,h_{i-1}}), for any tag sequence T.
5. Independence property (the observations depend only on the current state): P(W | t_{1,h_1} ∧ ... ∧ t_{M,h_M}) = ∏_{i=1}^{M} b_{h_i,w(i)}, for any T.
Expressing these two properties requires an exponential number of formulas.
The constraints
Using constraints 0-5 we create a PSAT instance Γ_H such that, for any π that satisfies Γ_H, P_π(T | W) has a single, well-defined value, the same for every such π.
This value coincides with the probability entailed by the HMM.
Next Topic: Non-Markovian extension (Abandoning the Markov property; Extra Constraints)
Abandoning the Markov property: Extending Markov Models
The objective of representing HMMs as a PSAT instance is to modify HMMs so as to remove some of their limitations.
Since our objective is to treat long-distance dependencies, we can remove the Markov property.
Removing the Independence Property leads to bad performance (the other probabilities become irrelevant).
To avoid an exponential number of probabilistic assignments, we also restrict the Independence Property.
Abandoning the Markov property: Restricting the Independence Property
We limit the independence property to an l-sized window:
6. P(w_i ∧ ... ∧ w_{i+l-1} | t_{i,h_i} ∧ ... ∧ t_{i+l-1,h_{i+l-1}}) = ∏_{j=i}^{i+l-1} P(w_j | t_{j,h_j}) = ∏_{j=i}^{i+l-1} b_{h_j,w(j)}, for any choice of tags and 1 ≤ i ≤ n - l + 1.
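A companion sketch for the window constraints (6), under the same assumed (ϕ, ψ, p) encoding and atom names as the sketch for constraints (0)-(3) above; it also makes explicit why the number of assignments stays polynomial for a fixed window size l:

```python
from itertools import product
from math import prod

def window_constraints(n, m, l, B, word_of):
    """One unconditional assignment per window position i and per choice of the
    l tags inside the window: P(conjunction of w's and t's) = product of emissions."""
    gamma = []
    for i in range(1, n - l + 2):                      # 1 <= i <= n - l + 1
        for tags in product(range(1, m + 1), repeat=l):
            phi = " & ".join(f"w{i + d} & t{i + d}_{tags[d]}" for d in range(l))
            p = prod(B[tags[d] - 1][word_of[i + d - 1]] for d in range(l))
            gamma.append((phi, None, p))
    return gamma

# For a fixed l this adds (n - l + 1) * m**l assignments: polynomial in the sentence
# length n, instead of the exponential number needed for the full property (5).
```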
Abandoning the Markov property: Choosing the right model
P_π(T | W) is now not a single value, but an interval.
To choose the best π, we maximize P_π(W): an OPSAT problem.
The OPSAT solution is a distribution over possible worlds; each world encodes a tagging.
Tagging: choose the world with the maximum probability.
Algorithm 5.1: Performing inference in a relaxed HMM
Input: an HMM H and a list of observed words (w_1, ..., w_m)
Output: a tag sequence T with maximum a posteriori probability
1: Γ_H ← { probability assignments in (0)-(3) }
2: Γ_l ← Γ_H ∪ { probability assignments in (6) }
3: (π, P_W) ← OPSAT(Γ_l, P_π(W))
4: T ← argmax_T { P_π(T | W) }
5: return T
Extra Constraints: Adding more information
We can quantify the influence of a tag on all subsequent tags, independent of distance:
7. P(t_{i,h_i} | t_{j,h_j}) = r_{h_i,h_j,i-j}
With these new constraints the problem may become unsatisfiable.
Extra Constraints: Adding more information
Alternatively, conditioning on any earlier occurrence of a tag:
7. P(t_{i,h_i} | ⋁_{j<i} t_{j,k}) = r_{h_i,k}
With these new constraints the problem may become unsatisfiable.
Extra Constraints: Measuring Inconsistency
Since the knowledge base may be inconsistent, we can try to minimize the inconsistency.
Error variables: ε_i = P_π(ϕ_i ∧ ψ_i) − p_i · P_π(ψ_i)
Minimize the p-norm of the errors: ‖ε‖_p = (Σ_{i=1}^{m} |ε_i|^p)^{1/p}
Tractable (a linear program) if p = 1 or p = ∞.
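A sketch of the ∞-norm case as a linear program, assuming unconditional assignments (so ε = Aπ − p) and an explicitly enumerated matrix A for illustration; the conditional errors above are also linear in π, so the same construction applies:

```python
import numpy as np
from scipy.optimize import linprog

# Minimize the infinity-norm of eps = A @ pi - p over probability distributions pi,
# as an LP with one extra variable t that bounds every |eps_i|.
def min_inf_norm_error(A, p):
    k, N = A.shape
    c = np.zeros(N + 1)
    c[-1] = 1.0                                        # objective: minimize t
    # encode -t <= (A pi - p)_i <= t for every assignment i
    A_ub = np.block([[A, -np.ones((k, 1))],
                     [-A, -np.ones((k, 1))]])
    b_ub = np.concatenate([p, -p])
    A_eq = np.zeros((1, N + 1))
    A_eq[0, :N] = 1.0                                  # sum(pi) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (N + 1))
    return res.x[:N], res.x[-1]                        # distribution and ||eps||_inf
```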
Algorithm 5.2: Inconsistency-tolerant inference
Input: an HMM H, a list of observed words (w_1, ..., w_m) and a set of extra constraints Ψ
Output: a tag sequence T with maximum a posteriori probability
1: Γ_H ← { probability assignments in (0)-(3) }
2: Γ_l ← Γ_H ∪ { probability assignments in (6) }
3: Γ⁺_l ← Γ_l ∪ Ψ   // Ψ has the form of (7)
4: (π_ε, E) ← OPSAT_ε(Γ⁺_l, ‖ε‖_p)
5: Γ⁺⁺_l ← Γ⁺_l ∪ { ‖ε‖_p = E }
6: (π, P_W) ← OPSAT_ε(Γ⁺⁺_l, P_π(W))
7: T ← argmax_T { P_π(T | W) }
8: return T
Next Topic: Conclusion
Conclusion
Non-local phenomena can be modeled by Probabilistic Propositional Logic theories, avoiding exponential explosion.
With the extra flexibility come inconsistent theories and worse complexity.
An implementation is under way.
Future work will deal with combining local and non-local dependencies.