Basic Parsing Algorithms: Chart Parsing
Seminar: Recent Advances in Parsing Technology, WS 2011/2012
Anna Schmidt
Talk Outline
- Chart Parsing Basics
- Chart Parsing Algorithms
  - Earley Algorithm
  - CKY Algorithm
    - Basics
    - BitPar: Efficient Implementation of CKY
Chart Parsing Basics
- First proposed by Martin Kay
- Dynamic programming approach
  - Partial results of the computation are stored and reused later if needed
  - The same subproblem is never solved more than once
- Operates on a context-free grammar (CFG)
- Can function as a recogniser or as a parser; this talk focuses on the recogniser
Main Components
- Chart
- Edges
- Agenda
Component: Chart
- Is a well-formed substring table (WFST)
- Stores partial and complete analyses of substrings
- Information is stored in one triangular half of a two-dimensional array of size (n+1)*(n+1) (Earley) or n*n (CKY), where n is the number of input words
- Can also be understood as a directed graph
  - Vertices: positions between input words, e.g. 0 Mary 1 feeds 2 the 3 otter 4
  - Edges connect vertices
- Allows no duplicate entries
Component: Edge
- Data structure storing information about a particular step in the parsing process
- Edges inhabit the cells of the chart
- An edge contains
  - Start and end position in the input string
  - A dotted rule
- An edge can also carry a probability
Component: Edge
- A dotted rule consists of
  - Left-hand side (LHS): a non-terminal symbol
  - Right-hand side (RHS): a sequence of non-terminal or terminal symbols
  - A dot between RHS symbols, indicating which constituents have already been found
- Edges can be
  - Active / incomplete: the dot is not the last element of the RHS
  - Inactive / complete: the dot is the last element of the RHS
- Example: S → NP • VP (0,1) is an active edge; an NP has been found between positions 0 and 1, and a VP is still needed
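The edge description above can be made concrete with a small data structure. This is only a sketch: the class name `Edge` and its field names are assumptions made for illustration, not part of the talk.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen, so edges are hashable and duplicates are easy to detect
class Edge:
    lhs: str               # left-hand side of the dotted rule
    rhs: Tuple[str, ...]   # right-hand side symbols
    dot: int               # number of RHS symbols already found
    start: int             # start position in the input
    end: int               # end position in the input

    def is_complete(self) -> bool:
        """Inactive / complete: the dot is the last element of the RHS."""
        return self.dot == len(self.rhs)

    def next_symbol(self) -> Optional[str]:
        """Symbol directly after the dot, or None for a complete edge."""
        return None if self.is_complete() else self.rhs[self.dot]

    def __str__(self) -> str:
        symbols = list(self.rhs)
        symbols.insert(self.dot, "•")
        return f"{self.lhs} → {' '.join(symbols)} ({self.start},{self.end})"


active = Edge("S", ("NP", "VP"), dot=1, start=0, end=1)
inactive = Edge("NP", ("N",), dot=1, start=0, end=1)
print(active)    # S → NP • VP (0,1)
print(inactive)  # NP → N • (0,1)
```

Because the class is frozen, two edges with identical fields compare equal and hash alike, which fits the chart's "no duplicate entries" property.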
Component: Agenda
- Organises the order in which tasks are executed
- Here, all tasks (edges) are collected on the agenda before being put on the chart
- The ordering of the agenda determines what is processed first, and therefore also which parse is found first
- Possible orderings: queue, stack, ordering by probability, ...
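To illustrate how the agenda's ordering changes what is processed first, here is a minimal sketch with three agenda disciplines. The task labels and their probabilities are made-up values, and the function names are hypothetical.

```python
from collections import deque
import heapq

# Hypothetical tasks: (edge label, made-up probability)
tasks = [("NP → N •", 0.7), ("NP → DET • N", 0.2), ("S → NP • VP", 0.9)]


def run_queue(tasks):
    """First in, first out."""
    agenda = deque(tasks)
    order = []
    while agenda:
        order.append(agenda.popleft()[0])
    return order


def run_stack(tasks):
    """Last in, first out."""
    agenda = list(tasks)
    order = []
    while agenda:
        order.append(agenda.pop()[0])
    return order


def run_best_first(tasks):
    """Most probable task first (heapq is a min-heap, so negate the probability)."""
    agenda = [(-prob, label) for label, prob in tasks]
    heapq.heapify(agenda)
    order = []
    while agenda:
        order.append(heapq.heappop(agenda)[1])
    return order


print(run_queue(tasks))
print(run_stack(tasks))
print(run_best_first(tasks))
```

The same set of tasks is processed in three different orders, which is why the agenda discipline influences which parse is found first.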
Parsing Strategies
- Kay differentiates parsing strategies along two dimensions:
  - Bottom-up versus top-down
  - Directed versus undirected
- Directed bottom-up: only build edges for phrases that can actually be incorporated into a higher-level structure
  - Example: left-corner parser
- Directed top-down: only build a new (active) edge if the next word of the input can be used to extend such an edge
  - Example: Earley
- Undirected varieties: no such restrictions
  - Undirected bottom-up example: CKY
Parsing Strategies: Ways of Achieving Directedness
- Reachability table
  - Contains, for each non-terminal N, the set of all symbols that can be the first element of a string dominated by N
  - For example: an NP can start with DET, N or ADJ, but not with V
- Rule selection table
  - An M*N table, where the rows M are the non-terminals excluding pre-terminals and the columns N are all non-terminals
  - Each cell contains all grammar rules applicable in a situation where M is the 'upper' and N is the 'lower' symbol
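A reachability table of the kind described above can be computed as a transitive closure over the first RHS symbol of each rule. The sketch below assumes an ε-free grammar given as (LHS, RHS) pairs; the function name `reachability` and the rule format are illustrative choices. It uses the phrase-structure rules of the talk's example grammar, with pre-terminals as the lowest symbols.

```python
def reachability(rules):
    """Map each non-terminal to the set of symbols that can start a string it dominates.

    rules: list of (lhs, rhs) pairs, rhs being a non-empty tuple (ε-free grammar assumed).
    """
    table = {}
    for lhs, rhs in rules:
        table.setdefault(lhs, set()).add(rhs[0])
    changed = True
    while changed:  # repeat until no set grows any more
        changed = False
        for firsts in table.values():
            for sym in list(firsts):
                new = table.get(sym, set()) - firsts
                if new:
                    firsts |= new
                    changed = True
    return table


RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("N",)),
    ("NP", ("DET", "N")),
    ("VP", ("V", "NP")),
]
table = reachability(RULES)
print(sorted(table["NP"]))  # ['DET', 'N']
print(sorted(table["S"]))   # ['DET', 'N', 'NP']
print("V" in table["NP"])   # False
```

A directed parser could consult such a table before building an edge, for instance to skip an NP prediction when the next word is tagged V.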
Chart Parsing: Advantages
- No repeated computation of the same subproblem
- Deals well with left-recursive grammars
- Deals well with ambiguity
- No backtracking necessary
Earley Algorithm
- Proposed by Jay Earley
- Top-down search
- Can handle all CFGs
- Efficient: O(n^3) in the general case
- Faster for particular types of grammar, e.g. O(n^2) for unambiguous grammars
Terminology
- In his paper, Earley does not use the notion of a 'chart'
- He represents the parsing process as sets of states
- Index of each state set = end position of all states in the set
- A state largely corresponds to an edge
  - Contains a dotted rule
  - Contains a pointer to the start position
  - The end position can be derived from the state set
Terminology
- The two formalisms are very similar
- Examples are easier to follow when represented in charts
- So we will stick with 'chart' representations
Algorithm Components
- Initialization
- Predictor
- Scanner
- Completer
- The algorithm operates on one half of an array of size (n+1)*(n+1)
Earley Example: 0 Mary 1 feeds 2 the 3 otter 4 eos 5
Each step lists the edges it adds and the chart cell [start,end] they go into; edges from earlier steps stay on the chart.
 1. Initialise  [0,0]  X → • S eos
 2. Predict     [0,0]  S → • NP VP;  NP → • N;  NP → • DET N;  N → • Mary;  DET → • the
 3. Scan        [0,1]  N → Mary •
 4. Complete    [0,1]  NP → N •;  S → NP • VP
 5. Predict     [1,1]  VP → • V NP;  V → • feeds
 6. Scan        [1,2]  V → feeds •
 7. Complete    [1,2]  VP → V • NP
 8. Predict     [2,2]  NP → • N;  NP → • DET N;  N → • Mary;  DET → • the
 9. Scan        [2,3]  DET → the •
10. Complete    [2,3]  NP → DET • N
11. Predict     [3,3]  N → • Mary
12. Scan        [3,4]  N → otter •
13. Complete    [2,4]  NP → DET N •
14. Complete    [1,4]  VP → V NP •
15. Complete    [0,4]  S → NP VP •
16. Complete    [0,4]  X → S • eos
17. Predict     [4,4]  eos → • eos
18. Scan        [4,5]  eos → eos •
19. Complete    [0,5]  X → S eos •
Note: the charts show only N → • Mary among the predicted N edges; the scan in step 12 relies on N → • otter, which is predicted alongside it.
Lookahead Component
- In the original paper, Earley proposes a lookahead string for each state, which represents the allowed successor of the LHS
- It prevents the completer from processing a state if the lookahead string and the next word of the input do not match
- Remember Kay's directed top-down strategy?
CKY: Basics
- Proposed independently by John Cocke, Daniel H. Younger and Tadao Kasami
- Bottom-up search
- Incremental
- Grammar must be in Chomsky normal form (CNF)
- Complexity: O(n^3)
- Chart: (upper triangle of) an array of size n*n
CKY Algorithm: Idea
- Initialise the upper triangle of a chart of size n*n
- From the upper left to the lower right corner of the chart:
  - Go to the next cell on the diagonal
  - Fill in the POS tag of the next word in the input string
  - Each time a POS tag has been filled in, go up cell by cell and build larger constituents that end at the current end position
CKY Example: 0 Mary 1 feeds 2 the 3 otter 4
Grammar: S → NP VP, NP → N, NP → DET N, VP → V NP, N → Mary | otter, V → feeds, DET → the
Cells are written [row][column]; each step lists what is added to the initially empty 4*4 chart.
1. [1][1]: N, NP (Mary)
2. [2][2]: V (feeds); nothing can be built in [1][2]
3. [3][3]: DET (the); nothing can be built in [2][3] or [1][3]
4. [4][4]: N, NP (otter)
5. [3][4]: NP (from DET and N)
6. [2][4]: VP (from V and NP)
7. [1][4]: S (from NP and VP)
Final chart:
       1       2       3       4
1    N, NP     -       -       S
2              V       -       VP
3                     DET      NP
4                             N, NP
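The fill order of the CKY idea (next diagonal cell, then upwards cell by cell) can be sketched as a small recogniser. This is an illustrative sketch, not the talk's implementation: the function name, the split of the grammar into `binary`, `unary` and `lexicon`, and the handling of the chain rule NP → N by a closure step are all assumptions made to fit the example grammar.

```python
def cky_recognise(words, binary, unary, lexicon, root="S"):
    """Return (accepted, chart); chart[b, e] is the set of labels spanning words b..e (1-based)."""
    n = len(words)
    chart = {}
    if n == 0:
        return False, chart

    def close(cell):
        # Apply chain rules such as NP → N until nothing new is added.
        changed = True
        while changed:
            changed = False
            for lhs, child in unary:
                if child in cell and lhs not in cell:
                    cell.add(lhs)
                    changed = True
        return cell

    for e in range(1, n + 1):                  # next cell on the diagonal
        chart[e, e] = close(set(lexicon.get(words[e - 1], ())))
        for b in range(e - 1, 0, -1):          # go up cell by cell
            cell = set()
            for m in range(b, e):              # split into [b..m] and [m+1..e]
                for lhs, left, right in binary:
                    if left in chart[b, m] and right in chart[m + 1, e]:
                        cell.add(lhs)
            chart[b, e] = close(cell)
    return root in chart[1, n], chart


BINARY = [("S", "NP", "VP"), ("NP", "DET", "N"), ("VP", "V", "NP")]
UNARY = [("NP", "N")]
LEXICON = {"Mary": ["N"], "otter": ["N"], "feeds": ["V"], "the": ["DET"]}

accepted, chart = cky_recognise("Mary feeds the otter".split(), BINARY, UNARY, LEXICON)
print(accepted)                                # True
print(chart[3, 4], chart[2, 4], chart[1, 4])   # {'NP'} {'VP'} {'S'}
```

Filling column by column means every cell needed for a split has already been computed when it is looked up.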
CKY: BitPar
BitPar: Basics
- Proposed by Helmut Schmid
- Bit-vector-based parser
- Efficiently implements a CKY-style algorithm
- Uses bit vector operations to parallelise parsing operations
- Idea: do not try to decrease the number of edges that are built; instead, minimise the cost of building edges
- Especially useful if all analyses are needed
BitPar: Requirements
- Restrictions on the context-free grammar
  - Must be in CNF
  - Must be ε-free
  - Chain rules are allowed
- Precomputed for each non-terminal N: the set of non-terminals that are derivable from N via chain rules (applied bottom-up, e.g. NP for N given NP → N)
  - The set is stored in the bit vector chainvec[N]
  - The set includes N itself
Background: Bitwise AND and OR
- AND: 0101 & 0011 = 0001 (both corresponding bits must equal 1)
- OR: 0101 | 0011 = 0111 (at least one of the corresponding bits must equal 1)
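The two operations can be checked directly in Python, where integers serve as bit vectors:

```python
a, b = 0b0101, 0b0011

and_result = a & b  # a bit is 1 only if both corresponding bits are 1
or_result = a | b   # a bit is 1 if at least one corresponding bit is 1

print(format(and_result, "04b"))  # 0001
print(format(or_result, "04b"))   # 0111
```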
BitPar: Chart
- Chart = three-dimensional bit array
- chart[start position b][end position e] = [011000...]
- chart[b][e] contains a bit vector with one bit for each non-terminal
- A bit is set to 1 if the non-terminal was inserted, 0 otherwise
- The chart is initialised with all bits = 0
Filling the Chart: POS Tags
- Inserting POS tags into a cell on the diagonal:
  - For each non-terminal N that can be rewritten as the word at the current position, do a bitwise OR of
    - the bits inhabiting the chart cell
    - chainvec[N]
- N and all its chain derivations are inserted in just one operation
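The POS-tag step can be sketched with Python integers as bit vectors. The bit order (S, NP, N, VP, V, DET, leftmost bit first) follows the example charts of the talk; the names `BIT`, `build_chainvec` and `insert_tags` and the lexicon format are assumptions for illustration only.

```python
SYMBOLS = ["S", "NP", "N", "VP", "V", "DET"]
# One bit per non-terminal; S is the leftmost bit, DET the rightmost.
BIT = {sym: 1 << (len(SYMBOLS) - 1 - i) for i, sym in enumerate(SYMBOLS)}


def build_chainvec(chain_rules):
    """chain_rules: list of (parent, child) pairs, e.g. ("NP", "N") for NP → N."""
    chainvec = dict(BIT)  # every set includes the symbol itself
    changed = True
    while changed:
        changed = False
        for parent, child in chain_rules:
            for sym in SYMBOLS:
                has_child = chainvec[sym] & BIT[child]
                has_parent = chainvec[sym] & BIT[parent]
                if has_child and not has_parent:
                    chainvec[sym] |= BIT[parent]
                    changed = True
    return chainvec


def insert_tags(cell, word, lexicon, chainvec):
    """OR chainvec[tag] into the cell for every tag the word can have."""
    for tag in lexicon.get(word, ()):
        cell |= chainvec[tag]
    return cell


chainvec = build_chainvec([("NP", "N")])
LEXICON = {"Mary": ["N"], "otter": ["N"], "feeds": ["V"], "the": ["DET"]}

mary_cell = insert_tags(0, "Mary", LEXICON, chainvec)
feeds_cell = insert_tags(0, "feeds", LEXICON, chainvec)
print(format(mary_cell, "06b"))   # 011000
print(format(feeds_cell, "06b"))  # 000010
```

A single OR with chainvec["N"] sets both the N bit and the NP bit, which is the "one operation" the slide refers to.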
Example chart after inserting the POS tags for "Mary" and "feeds"
Bit order within each vector: S, NP, N, VP, V, DET
       1       2       3       4
1   011000  000000  000000  000000
2   000000  000010  000000  000000
3   000000  000000  000000  000000
4   000000  000000  000000  000000
Next question: which bits should be set in cell [1][2]?
Filling the Chart: Larger Constituents
- Conceptually: determine whether several cells can be combined to form a higher-level constituent labelled N
- For this:
  - Loop over the grammar rules with LHS = N and extract the RHS (consisting of RHS1 and RHS2)
  - Loop over all possible combinations of cells that together could contain the substructure of N, and determine whether they contain RHS1 and RHS2 respectively
Filling the Chart: Larger Constituents
- This has to be done
  - for each super-diagonal cell
  - for each non-terminal
  - for all corresponding grammar rules
  - for all possible cell combinations that could constitute a substructure of N
- This is a time-consuming process
- BUT: the check over all cell combinations can be done with a single AND operation on two bit vectors
Internally: Filling the Chart: Larger Constituents
- Can a given non-terminal LHS be inserted into a given chart cell [b][e]?
- Get RHS1 and RHS2 from the grammar
- Vector 1 contains the RHS1 bits of the cells chart[b][b], chart[b][b+1], ..., chart[b][e-1]
- Vector 2 contains the RHS2 bits of the cells chart[b+1][e], chart[b+2][e], ..., chart[e][e]
- Bits at the same position in the two vectors belong to the same split point
Filling the Chart: Larger Constituents
- If a bitwise AND of the two vectors produces at least one bit = 1:
  - A valid substructure for LHS has been found
  - LHS can be inserted into the chart cell
- Let's look at an example
Example: let's determine whether NP should go into cell [3][4]
Bit order within each vector: S, NP, N, VP, V, DET
       1       2       3       4
1   011000  000000  000000  000000
2   000000  000010  000000  000000
3   000000  000000  000001  000000 (?)
4   000000  000000  000000  011000
Should NP go into [3][4]?
- First, we consult the grammar
- We find the rule NP → DET N, so the allowed right-hand side for NP is RHS1 = DET, RHS2 = N
- Reminder:
  - v1 = RHS1 bits of chart[b][b], chart[b][b+1], ..., chart[b][e-1]
  - v2 = RHS2 bits of chart[b+1][e], chart[b+2][e], ..., chart[e][e]
- Vector 1: does chart[3][3] contain RHS1 = DET? Yes, so insert 1; Vector 1 = 1
- Vector 2: does chart[4][4] contain RHS2 = N? Yes, so insert 1; Vector 2 = 1
- Vector 1 AND Vector 2 = 1, so insert NP
Example: should NP go into cell [3][4]? Yes!
Bit order within each vector: S, NP, N, VP, V, DET
       1       2       3       4
1   011000  000000  000000  000000
2   000000  000010  000000  000000
3   000000  000000  000001  010000
4   000000  000000  000000  011000
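The check from the example can be sketched as follows. The chart is a dictionary from (b, e) to an integer bit vector in the order S, NP, N, VP, V, DET; the function names are hypothetical. Note that this sketch assembles the two vectors with a Python loop, so it illustrates the logic of the AND test only, not the efficiency of a real bit-vector implementation.

```python
SYMBOLS = ["S", "NP", "N", "VP", "V", "DET"]
BIT = {sym: 1 << (len(SYMBOLS) - 1 - i) for i, sym in enumerate(SYMBOLS)}


def split_vectors(chart, b, e, rhs1, rhs2):
    """Build one bit per split point m = b..e-1 for RHS1 and RHS2."""
    v1 = v2 = 0
    for k, m in enumerate(range(b, e)):
        if chart.get((b, m), 0) & BIT[rhs1]:      # RHS1 spans b..m?
            v1 |= 1 << k
        if chart.get((m + 1, e), 0) & BIT[rhs2]:  # RHS2 spans m+1..e?
            v2 |= 1 << k
    return v1, v2


def try_insert(chart, b, e, lhs, rhs1, rhs2):
    """Insert lhs into chart[b, e] if some split point has both RHS1 and RHS2."""
    v1, v2 = split_vectors(chart, b, e, rhs1, rhs2)
    if v1 & v2:
        chart[b, e] = chart.get((b, e), 0) | BIT[lhs]
        return True
    return False


# Diagonal cells as in the example chart above.
chart = {(1, 1): 0b011000, (2, 2): 0b000010, (3, 3): 0b000001, (4, 4): 0b011000}

s_in_1_2 = try_insert(chart, 1, 2, "S", "NP", "VP")    # no VP in [2][2]
np_in_3_4 = try_insert(chart, 3, 4, "NP", "DET", "N")
vp_in_2_4 = try_insert(chart, 2, 4, "VP", "V", "NP")
s_in_1_4 = try_insert(chart, 1, 4, "S", "NP", "VP")

print(s_in_1_2, np_in_3_4, vp_in_2_4, s_in_1_4)  # False True True True
print(format(chart[3, 4], "06b"))                # 010000
```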
Thank you for your attention!
References
- Earley, Jay: An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102, 1970.
- Jurafsky, Daniel and Martin, James H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition. Prentice-Hall, 2009.
- Kay, Martin: Algorithm schemata and data structures in syntactic processing. In Readings in Natural Language Processing, pages 35–70. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1986.
- Kay, Martin: Lecture slides of the course 'Basic Algorithms for Computational Linguistics'. http://www.coli.uni-saarland.de/courses/algorithms-11/
- Schmid, Helmut: Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proceedings of Coling 2004, pages 162–168, Geneva, Switzerland, 2004.
- Wirén, Mats: A Comparison of Rule-Invocation Strategies in Context-Free Chart Parsing
Initialization
- Introduces a new non-terminal start symbol X and a new end symbol EOS
- Adds EOS to the end of the input string
- For each root symbol R of the grammar: adds to chart[0,0] an edge of the form X → • R EOS
Predictor
- For all non-terminals N directly following a dot (in the current state set), and for each grammar rule with N as LHS, add a new edge with
  - LHS = N
  - RHS according to the grammar, with the dot before the first element of the RHS
  - start and end = end position of the original state
Scanner
- For all terminal symbols immediately following a dot: compare the terminal symbol with the input word starting at the end position of the current edge
- If they match, add a new edge to the chart with
  - the dot moved over the terminal symbol
  - the end position incremented by 1
Completer
- If the dot is the last element of a production with LHS of type T:
  - Find edges that are still waiting for a constituent of type T and that end where the complete edge starts
  - For each of them, add to the chart an edge with
    - the dot moved over T
    - end position = end position of the completed edge
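The four components above can be combined into a compact recogniser. The sketch below is one possible reading, under stated assumptions: the grammar is ε-free and given as (LHS, RHS) pairs with words as terminals, states are plain tuples (lhs, rhs, dot, start) grouped by end position as in Earley's state sets, the end symbol is scanned directly rather than predicted, and no lookahead is used. The function name `earley_recognise` is hypothetical.

```python
def earley_recognise(words, rules, roots=("S",)):
    """Return True if the words are derivable from one of the root symbols."""
    nonterminals = {lhs for lhs, _ in rules}
    words = list(words) + ["eos"]          # Initialization: append the end symbol
    n = len(words)
    chart = [[] for _ in range(n + 1)]     # chart[j]: states ending at position j

    def add(j, state):
        if state not in chart[j]:          # the chart allows no duplicate entries
            chart[j].append(state)

    for root in roots:                     # Initialization: X → • R eos in [0,0]
        add(0, ("X", (root, "eos"), 0, 0))

    for j in range(n + 1):
        i = 0
        while i < len(chart[j]):           # chart[j] may grow while it is processed
            lhs, rhs, dot, start = chart[j][i]
            i += 1
            if dot == len(rhs):            # Completer
                for l2, r2, d2, s2 in list(chart[start]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        add(j, (l2, r2, d2 + 1, s2))
            elif rhs[dot] in nonterminals:  # Predictor
                for l2, r2 in rules:
                    if l2 == rhs[dot]:
                        add(j, (l2, r2, 0, j))
            elif j < n and words[j] == rhs[dot]:  # Scanner
                add(j + 1, (lhs, rhs, dot + 1, start))

    return any(("X", (root, "eos"), 2, 0) in chart[n] for root in roots)


RULES = [
    ("S", ("NP", "VP")), ("NP", ("N",)), ("NP", ("DET", "N")), ("VP", ("V", "NP")),
    ("N", ("Mary",)), ("N", ("otter",)), ("V", ("feeds",)), ("DET", ("the",)),
]
print(earley_recognise("Mary feeds the otter".split(), RULES))  # True
print(earley_recognise("Mary feeds the".split(), RULES))        # False
```

Because the grammar is assumed to be ε-free, a complete state always starts at an earlier position than it ends, so the completer only needs to look at state sets that are already finished.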