Non-Markovian Logic-Probabilistic Modeling and Inference


Non-Markovian Logic-Probabilistic Modeling and Inference
Eduardo Menezes de Morais, Glauber De Bona, Marcelo Finger
Department of Computer Science, University of São Paulo, São Paulo, Brazil
WL4AI 2015

What we are trying to solve: Part-of-speech Tagging
PoS tagging associates each word in a sentence with a part-of-speech tag.
Example: The/D horse/N raced/VB-P past/P the/D barn/N
Tagset: Determiner (D), Noun (N), Past tense verb (VB-P), Past participle verb (VB-PP), Preposition (P), Adjective (ADJ)
To resolve ambiguity, current algorithms inspect adjacent tags; only a local context (usually 2 or 3 words) is used to avoid an exponential blowup.

What we are trying to solve: Part-of-speech Tagging
Some ambiguities can only be resolved through dependencies at unbounded distance.
Example:
The horse raced/VB-P past the barn
The horse raced/VB-PP past the barn fell/VB-P
The horse raced/VB-PP past the old red barn fell/VB-P
Tagset: Determiner (D), Noun (N), Past tense verb (VB-P), Past participle verb (VB-PP), Preposition (P), Adjective (ADJ)
We hope to find a way to treat these dependencies without an exponential blowup, using Probabilistic Logic.

Talk outline
Motivation
Probabilistic Logic
Hidden Markov Models
Encoding Hidden Markov Models with PSAT
Non-Markovian extension
Conclusion

Next Topic: Probabilistic Logic (Definition, Properties)

Definition: The language
Logical variables or atoms: P = {x_1, ..., x_n}
Connectives: ¬, ∧, ∨
Formulas (L_P) are inductively composed from atoms using the connectives.
A probabilistic knowledge base is a finite set of probabilistic assignments Γ = {P(ϕ_i | ψ_i) = p_i : 1 ≤ i ≤ m}, where ϕ_i, ψ_i ∈ L_P and p_i ∈ [0, 1].

Definition: Semantics
V is the set of all 2^n propositional valuations (possible worlds) v : P → {0, 1}, generalized to any propositional formula, v : L_P → {0, 1}.
A probability distribution over propositional valuations is π : V → [0, 1] with Σ_{i=1}^{2^n} π(v_i) = 1.
Probability of a formula α according to π: P_π(α) = Σ {π(v_i) | v_i(α) = 1}.
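
To make these definitions concrete, here is a minimal Python sketch (not from the paper) that enumerates the possible worlds of a two-atom language, represents a distribution π explicitly, and computes P_π(α); the atoms, the distribution and the checked assignment are illustrative only.

```python
# Minimal sketch of the semantics above (illustrative, not the authors' code).
# A valuation is a dict atom -> bool; a formula is any predicate over valuations;
# pi is an explicit list of (valuation, probability) pairs.
from itertools import product

atoms = ["x1", "x2"]

def all_valuations(atoms):
    """Enumerate the 2^n possible worlds as dicts atom -> bool."""
    for bits in product([False, True], repeat=len(atoms)):
        yield dict(zip(atoms, bits))

def prob(pi, formula):
    """P_pi(alpha): sum of pi over the worlds that satisfy the formula."""
    return sum(p for v, p in pi if formula(v))

# An illustrative uniform distribution over the four worlds of {x1, x2}.
pi = [(v, 0.25) for v in all_valuations(atoms)]

# The conditional assignment P(x1 | x2) = 0.5 holds iff P(x1 & x2) = 0.5 * P(x2).
lhs = prob(pi, lambda v: v["x1"] and v["x2"])
rhs = 0.5 * prob(pi, lambda v: v["x2"])
print(abs(lhs - rhs) < 1e-9)   # True for the uniform distribution
```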

Definition: The PSAT Problem
Probabilistic Satisfiability: are these restrictions consistent? Is there a probability distribution over the possible worlds that makes all the probabilistic assignments true?
Given Γ = {P(ϕ_i | ψ_i) = p_i : 1 ≤ i ≤ m}, is there a π such that P_π(ϕ_i ∧ ψ_i) = p_i · P_π(ψ_i) for all 1 ≤ i ≤ m?

Definition: The OPSAT Problem
A similar problem: OPSAT. A formula µ ∉ Γ has a range of probabilities, one for each π.
Given Γ = {P(ϕ_i | ψ_i) = p_i : 1 ≤ i ≤ m}, which π satisfies P_π(ϕ_i ∧ ψ_i) = p_i · P_π(ψ_i) for all 1 ≤ i ≤ m and maximizes/minimizes the probability of some formula µ ∉ Γ?
Note the lack of any a priori statistical independence assumptions.

Properties: The PSAT Problem, an Algebraic Formalization
Vector of probabilities p of dimension k × 1 (given).
Consider a large matrix A_{k × 2^n} = [a_ij] (computed), with a_ij = v_j(α_i) ∈ {0, 1}.
π: a probability distribution of exponential size.
PSAT: decide whether there is a vector π of dimension 2^n × 1 such that Aπ = p, Σ_i π_i = 1, π ≥ 0.

Properties: The OPSAT Problem, an Algebraic Formalization
Vector of probabilities p of dimension k × 1 (given).
Consider a large matrix A_{k × 2^n} = [a_ij] (computed), with a_ij = v_j(α_i) ∈ {0, 1}.
π: a probability distribution of exponential size.
OPSAT: find a vector π of dimension 2^n × 1 that maximizes/minimizes P_π(µ) subject to Aπ = p, Σ_i π_i = 1, π ≥ 0.
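
Since all the constraints are linear in π, a small instance can be solved directly as a linear program once A is built by enumerating every valuation. The sketch below is illustrative only (enumerating all 2^n worlds does not scale, so in practice column generation is used); the toy knowledge base and the use of scipy.optimize.linprog are assumptions, not the authors' implementation.

```python
# Illustrative brute-force PSAT/OPSAT over all 2^n worlds, solved as a linear program.
import numpy as np
from itertools import product
from scipy.optimize import linprog

atoms = ["x1", "x2"]
worlds = list(product([0, 1], repeat=len(atoms)))        # all 2^n valuations

def row(formula):
    """One row of A: the 0/1 truth value of `formula` in each world."""
    return np.array([1.0 if formula(dict(zip(atoms, w))) else 0.0 for w in worlds])

# Toy unconditional assignments.  A conditional assignment P(phi | psi) = p is also
# linear in pi: (row(phi & psi) - p * row(psi)) . pi = 0.
A_eq = np.vstack([
    row(lambda v: v["x1"]),                  # P(x1)      = 0.6
    row(lambda v: v["x1"] and v["x2"]),      # P(x1 & x2) = 0.3
    np.ones(len(worlds)),                    # probabilities sum to 1
])
b_eq = np.array([0.6, 0.3, 1.0])

# PSAT is the feasibility question; OPSAT additionally optimizes P(mu), here mu = x2,
# by minimizing -row(mu) . pi over the same feasible set.
mu = row(lambda v: v["x2"])
res = linprog(c=-mu, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(worlds))
print("satisfiable:", res.status == 0,
      " max P(x2):", mu @ res.x if res.status == 0 else None)
```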

Properties: PSAT is NP-complete
PSAT is NP-complete [Georgakopoulos, Kavvadias & Papadimitriou 1988].
If a PSAT problem has a solution, then there is a solution π with at most k + 1 elements π_i > 0 (Carathéodory's Lemma), so PSAT has a polynomial-size witness:
[ 1    1    ...  1   ] [ π_1     ]   [ 1   ]
[ 0/1  0/1  ...  0/1 ] [ π_2     ] = [ p_1 ]
[ ...  ...  ...  ... ] [ ...     ]   [ ... ]
[ 0/1  0/1  ...  0/1 ] [ π_{k+1} ]   [ p_k ]

Next Topic: Hidden Markov Models (An algorithm for POS-Tagging)

An algorithm for POS-Tagging: Hidden Markov Models
HMMs have been successfully used in language processing tasks since as early as the 1970s.
A stochastic process whose states cannot be observed directly; instead, a sequence of symbols produced by another stochastic process is observed.
The symbol produced at each step depends only on the current hidden state.

An algorithm for POS-Tagging: Markov Chains
[Figure: a three-state Markov chain over the weather states Sunny, Rainy and Cloudy, with transition probabilities labelling the edges]

An algorithm for POS-Tagging: Hidden Markov Model
[Figure: the same weather Markov chain, where each hidden state Sunny, Rainy and Cloudy emits the observable symbols Walk, Shop and Play with given emission probabilities]

An algorithm for POS-Tagging: Inference on HMMs
In a POS-tagging context, the words are the observations and the tags are the states.
We want to discover the sequence of states that most likely produced a given sequence of observations (decoding): max_T P(T | W, H).
Efficiently solved using the Viterbi algorithm.
Good performance, but it uses a very limited context.
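
For reference, a compact Viterbi decoder in Python/numpy. The weather HMM parameters below are illustrative placeholders (the exact values from the figure are not recoverable), not data from the paper.

```python
# Minimal Viterbi decoder for an HMM (illustrative parameter values).
import numpy as np

states = ["Sunny", "Rainy", "Cloudy"]
symbols = ["Walk", "Shop", "Play"]

q = np.array([0.5, 0.2, 0.3])                 # q[i]    = P(t_1 = s_i)
A = np.array([[0.6, 0.1, 0.3],                # A[i, j] = P(t_k = s_j | t_{k-1} = s_i)
              [0.3, 0.5, 0.2],
              [0.3, 0.3, 0.4]])
B = np.array([[0.6, 0.3, 0.1],                # B[i, h] = P(w_k = o_h | t_k = s_i)
              [0.1, 0.5, 0.4],
              [0.3, 0.4, 0.3]])

def viterbi(obs, q, A, B):
    """Most likely hidden state sequence for a list of observation indices."""
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))           # best log-probability ending in each state
    back = np.zeros((T, n_states), dtype=int)
    delta[0] = np.log(q) + np.log(B[:, obs[0]])
    for k in range(1, T):
        scores = delta[k - 1][:, None] + np.log(A)    # scores[i, j]: from state i to j
        back[k] = scores.argmax(axis=0)
        delta[k] = scores.max(axis=0) + np.log(B[:, obs[k]])
    path = [int(delta[-1].argmax())]
    for k in range(T - 1, 0, -1):
        path.append(int(back[k][path[-1]]))
    return list(reversed(path))

obs = [symbols.index(o) for o in ["Walk", "Shop", "Play"]]
print([states[i] for i in viterbi(obs, q, A, B)])
```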

Next Topic: Encoding Hidden Markov Models with PSAT (The constraints)

Encoding Hidden Markov Models with PSAT
To remove some of the restrictions of HMMs, let us first model an HMM as a PSAT problem.
Given a sentence o_1, ..., o_k, define propositional variables:
t_{i,j}: true iff word i has tag j
w_i: true iff word i is o_i

The constraints: Attributes of an HMM
Given an HMM H = (A, B, q):
1. Probability of transition between states: A = [a_{ij}], a_{ij} = P(t_k = s_j | t_{k-1} = s_i)
2. Probability of the first state: q = [q_i], q_i = P(t_1 = s_i)
3. Probability of producing observations: B = [b_{ih}], b_{ih} = P(w_k = o_h | t_k = s_i)

The constraints: Probabilistic constraints
0. Only one state per word, m states total: P(t_{i,j} ∧ t_{i,k}) = 0 for j ≠ k, and P(∨_{j=1}^{m} t_{i,j}) = 1
1. Probability of transition between states: P(t_{i,j} | t_{i-1,k}) = a_{k,j}
2. Probability of the first state: P(t_{1,j}) = q_j
3. Probability of producing observations: P(w_i | t_{i,j}) = b_{j,w(i)}
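
A sketch of how assignments (0)-(3) could be generated for a concrete sentence. The (phi, psi, p) encoding of a conditional assignment P(phi | psi) = p, and the dictionaries A, q, B holding the HMM probabilities, are assumptions for illustration, not the authors' code.

```python
# Illustrative generator of the PSAT assignments (0)-(3) for one sentence.
# Each assignment is a triple (phi, psi, p) meaning P(phi | psi) = p; psi = "TRUE"
# marks an unconditional assignment.  A, q, B are plain dicts of probabilities
# (in practice they would be estimated from a tagged corpus).

def hmm_to_psat(words, tags, A, q, B):
    gamma = []
    n, m = len(words), len(tags)
    for i in range(1, n + 1):
        # (0) exactly one tag per word position
        for j in range(m):
            for k in range(j + 1, m):
                gamma.append((f"t_{i}_{tags[j]} & t_{i}_{tags[k]}", "TRUE", 0.0))
        gamma.append((" | ".join(f"t_{i}_{t}" for t in tags), "TRUE", 1.0))
        # (3) emissions: P(w_i | t_{i,j}) = b_{j, w(i)}
        for t in tags:
            gamma.append((f"w_{i}", f"t_{i}_{t}", B[t][words[i - 1]]))
    # (1) transitions: P(t_{i,j} | t_{i-1,k}) = a_{k,j}
    for i in range(2, n + 1):
        for k in tags:
            for j in tags:
                gamma.append((f"t_{i}_{j}", f"t_{i - 1}_{k}", A[k][j]))
    # (2) first state: P(t_{1,j}) = q_j
    for j in tags:
        gamma.append((f"t_1_{j}", "TRUE", q[j]))
    return gamma
```

The resulting Γ_H is polynomial in the sentence length and the tagset size, unlike properties (4) and (5) below.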

The constraints: Properties of an HMM
An HMM H = (A, B, q) has the following properties:
4. Markov property ("lack of memory"): P(t_k | t_{k-1}) = P(t_k | t_{k-1}, ..., t_1)
5. Independence property (the observations depend only on the current state): P(w_k | t_k) = P(w_k | t_M, ..., t_1, w_{k-1}, ..., w_1)

The constraints: Probabilistic constraints
4. Markov property ("lack of memory"): P(t_{i,h_i} | t_{i-1,h_{i-1}} ∧ ... ∧ t_{1,h_1}) = P(t_{i,h_i} | t_{i-1,h_{i-1}}), for any tag sequence T.
5. Independence property (the observations depend only on the current state): P(W | t_{1,h_1} ∧ ... ∧ t_{M,h_M}) = ∏_{i=1}^{M} b_{h_i,w(i)}, for any tag sequence T.
Expressing those two properties requires an exponential number of formulas.
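
To see the blowup concretely, the following rough count (an illustration, with made-up numbers) enumerates the Markov-property constraints (4): one assignment per position i and per choice of tag history h_1, ..., h_i, i.e. on the order of m^i formulas.

```python
# Rough, illustrative count of the constraints needed for the Markov property (4):
# one conditional assignment for every position i and every tag history of length i.
from itertools import product

tags = ["D", "N", "VB-P", "VB-PP", "P"]       # m = 5 tags (toy tagset)
sentence_length = 8

total = 0
for i in range(2, sentence_length + 1):
    total += sum(1 for _ in product(tags, repeat=i))   # m^i histories at position i
print(total)   # 488,275 constraints already for m = 5 and a sentence of 8 words
```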

The constraints
Using constraints 0-5 we create a PSAT instance Γ_H such that, for every π that satisfies Γ_H, P_π(T | W) takes the same unique value.
This value is exactly the probability the HMM entails.

Next Topic: Non-Markovian extension (Abandoning the Markov property, Extra Constraints)

Abandoning the Markov property: Extending Markov Models
The objective of representing HMMs as a PSAT instance is to modify HMMs, removing some of their limitations.
Since our objective is to treat long-distance dependencies, we can remove the Markov property.
Removing the Independence Property leads to bad performance (the other probabilities become irrelevant).
To avoid an exponential number of probabilistic assignments, we also restrict the Independence Property.

Abandoning the Markov property: Restricting the Independence Property
We limit the Independence Property to an l-sized window:
6. P(w_i ∧ ... ∧ w_{i+l-1} | t_{i,h_i} ∧ ... ∧ t_{i+l-1,h_{i+l-1}}) = ∏_{j=i}^{i+l-1} P(w_j | t_{j,h_j}) = ∏_{j=i}^{i+l-1} b_{h_j,w(j)}, for any tag sequence T and 1 ≤ i ≤ n − l + 1.
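
Constraint (6) replaces the exponential family (5) with roughly (n − l + 1) · m^l assignments. A hedged sketch of how these windowed assignments could be enumerated, using the same illustrative (phi, psi, p) conventions as the generator above:

```python
# Illustrative generator for the windowed independence assignments (6).
from itertools import product
from math import prod

def windowed_independence(words, tags, B, l):
    gamma = []
    n = len(words)
    for i in range(1, n - l + 2):                        # window start positions
        for history in product(tags, repeat=l):          # tags for positions i..i+l-1
            phi = " & ".join(f"w_{i + d}" for d in range(l))
            psi = " & ".join(f"t_{i + d}_{history[d]}" for d in range(l))
            p = prod(B[history[d]][words[i + d - 1]] for d in range(l))
            gamma.append((phi, psi, p))
    return gamma
```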

Abandoning the Markov property: Choosing the right model
P_π(T | W) is now not a single value but an interval.
To choose the best π, we maximize P_π(W), which is an OPSAT problem.
The OPSAT solution is a distribution over possible worlds, and each world encodes a tagging.
Tagging: choose the world with the maximum probability.

Abandoning the Markov property: Algorithm 5.1, performing inference in a relaxed HMM
Input: An HMM H and a list of observed words (w_1, ..., w_m)
Output: A tag sequence T with maximum a posteriori probability.
1: Γ_H ← { probability assignments in (0)-(3) }
2: Γ_l ← Γ_H ∪ { probability assignments in (6) }
3: (π, P_W) ← OPSAT(Γ_l, P_π(W))
4: T ← argmax_T {P_π(T | W)}
5: return T
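
A hedged end-to-end sketch of Algorithm 5.1 that reuses the illustrative generators above. The opsat solver, its interface and the world representation (a set of true variable names) are assumptions made for the sake of the example, not the authors' implementation.

```python
# Illustrative wiring of Algorithm 5.1.  `opsat` stands for an external OPSAT solver:
# given a list of (phi, psi, p) assignments and a formula to maximize, it returns
# (pi, value), where pi maps each possible world (a frozenset of true variables)
# to its probability.
def relaxed_hmm_inference(words, tags, A, q, B, l, opsat):
    gamma_h = hmm_to_psat(words, tags, A, q, B)                       # assignments (0)-(3)
    gamma_l = gamma_h + windowed_independence(words, tags, B, l)      # plus (6)
    observed = " & ".join(f"w_{i}" for i in range(1, len(words) + 1))
    pi, p_w = opsat(gamma_l, maximize=observed)                       # step 3: max P_pi(W)
    # Steps 4-5: by constraint (0) every world fixes one tag per position, so the
    # tagging is read off the most probable world consistent with the observed words.
    best = max((w for w in pi if all(f"w_{i}" in w for i in range(1, len(words) + 1))),
               key=lambda w: pi[w])
    return [next(t for t in tags if f"t_{i}_{t}" in best)
            for i in range(1, len(words) + 1)]
```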

Extra Constraints: Adding more information
We can quantify the influence of a tag on all subsequent tags, independent of distance:
7. P(t_{i,h_i} | t_{j,h_j}) = r_{h_i,h_j}, for j < i
With these new constraints the problem may become unsatisfiable.

Extra Constraints: Adding more information
We can quantify the influence of a tag on all subsequent tags, independent of distance:
7. P(t_{i,h_i} | ∨_{j<i} t_{j,k}) = r_{h_i,k}
With these new constraints the problem may become unsatisfiable.

Extra Constraints: Measuring Inconsistency
Since the knowledge base may be inconsistent, we can try to minimize the inconsistency.
Error variables: ε_i = P_π(ϕ_i ∧ ψ_i) − p_i · P_π(ψ_i)
Minimize the p-norm of the errors: ||ε||_p = (Σ_{i=1}^{m} |ε_i|^p)^{1/p}
Tractable (a linear program) for p = 1 or p = ∞.
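
A hedged sketch of the p = 1 case: splitting each error into nonnegative slack variables turns the 1-norm minimization into a plain linear program over the world probabilities. The two-assignment knowledge base is made up and deliberately inconsistent; scipy.optimize.linprog is used only for illustration.

```python
# Illustrative 1-norm inconsistency minimization for the toy knowledge base
# {P(x1) = 0.9, P(~x1) = 0.3} over a single atom (two possible worlds).
# Each error e_i is split as e_i = u_i - v_i with u_i, v_i >= 0, so |e_i| <= u_i + v_i.
import numpy as np
from scipy.optimize import linprog

# Variables: [pi_0, pi_1, u_1, v_1, u_2, v_2], where pi_1 is the world with x1 true.
A_eq = np.array([[0, 1, -1, 1,  0, 0],    # pi_1 - e_1 = 0.9   (P(x1)  up to error e_1)
                 [1, 0,  0, 0, -1, 1],    # pi_0 - e_2 = 0.3   (P(~x1) up to error e_2)
                 [1, 1,  0, 0,  0, 0]],   # pi_0 + pi_1 = 1
                dtype=float)
b_eq = np.array([0.9, 0.3, 1.0])
c = np.array([0, 0, 1, 1, 1, 1], dtype=float)   # minimize sum of u_i + v_i = ||e||_1

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 6)
print(res.fun)   # 0.2: the assignments overshoot by 0.9 + 0.3 - 1.0
```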

Extra Constraints: Algorithm 5.2, inconsistency-tolerant inference
Input: An HMM H, a list of observed words (w_1, ..., w_m) and a set of extra constraints Ψ.
Output: A tag sequence T with maximum a posteriori probability.
1: Γ_H ← { probability assignments in (0)-(3) }
2: Γ_l ← Γ_H ∪ { probability assignments in (6) }
3: Γ⁺_l ← Γ_l ∪ Ψ   // Ψ has the form of (7)
4: (π_ε, E) ← OPSAT_ε(Γ⁺_l, ||ε||_p)
5: Γ⁺⁺_l ← Γ⁺_l ∪ { ||ε||_p = E }
6: (π, P_W) ← OPSAT_ε(Γ⁺⁺_l, P_π(W))
7: T ← argmax_T {P_π(T | W)}
8: return T

Next Topic: Conclusion

Conclusion
Non-local phenomena can be modeled by Probabilistic Propositional Logic theories while avoiding an exponential explosion.
With the extra flexibility come possibly inconsistent theories and worse complexity.
Implementation is under way.
Future work will deal with combining local and non-local dependencies.