ANNIC: Annotations in Context
Niraj Aswani, Valentin Tablan, Thomas Heitz
University of Sheffield

ANNIC Motivation
- There is a need for a corpus analysis tool
- Such a tool is useful for authoring IE patterns for rules
- ANNIC is an IR engine that can search over:
  - document content
  - meta-data (annotation types, features and values), for example: Person.gender == male

ANNIC
- is based on Apache Lucene technology
- can index any document supported by GATE
- is integrated in GATE as a Searchable Serial DataStore (SSD)
- has an advanced GUI that provides:
  - a view of annotation mark-up over the matched patterns
  - an interactive way of developing new patterns, e.g. a title followed by a noun that is always in upper case?
  - annotation statistics

How does it work?
- Integrated in GATE as a Searchable Serial DataStore (SSD)
- Initialization specifies:
  - where to store the index
  - what to index and what to exclude
  - the context boundary (e.g. restricted to sentence or paragraph boundaries)
- Index actions are linked with datastore actions:
  - when a document is saved, it is indexed, or re-indexed if already indexed
  - when a document is deleted, it is deleted from the index

Query Language
- JAPE pattern syntax
- String, within quotes or without quotes
    e.g. ubuntu
- {AnnotationType}
    e.g. {Person}
- {AnnotationType == string}
    e.g. {Organization == University of Sheffield}
- {AT.featureName == value}
    e.g. {Person.gender == male}
- {AT.feature == value, AT.feature == value}
    e.g. {Token.orth == upperinitial, Token.length == 3}
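To make the term forms above concrete, here is a minimal sketch of how a single braced term could be split into an annotation type plus feature constraints and checked against one annotation. It is not ANNIC's parser: the function names, the dictionary layout of an annotation, and the `__text__` key used for the `{AnnotationType == string}` form are all assumptions made for illustration.

```python
def parse_term(term):
    """Split one braced query term into (annotation_type, constraints).

    Hypothetical helper: '__text__' is an illustrative key for the
    {AnnotationType == string} form, not an ANNIC name.
    """
    body = term.strip()[1:-1]              # drop the surrounding braces
    ann_type, constraints = None, {}
    for clause in body.split(","):
        left, _, value = clause.partition("==")
        left, value = left.strip(), value.strip()
        type_name, _, feature = left.partition(".")
        ann_type = ann_type or type_name
        if feature:
            constraints[feature] = value       # {AT.feature == value}
        elif value:
            constraints["__text__"] = value    # {AT == string}
    return ann_type, constraints


def term_matches(term, annotation):
    """Check one annotation (a plain dict, assumed layout) against one term."""
    ann_type, constraints = parse_term(term)
    if annotation["type"] != ann_type:
        return False
    for key, wanted in constraints.items():
        if key == "__text__":
            actual = annotation["text"]
        else:
            actual = annotation["features"].get(key)
        if str(actual) != wanted:
            return False
    return True


# Illustrative annotations, not taken from a real corpus.
person = {"type": "Person", "text": "Mr Symond", "features": {"gender": "male"}}
token = {"type": "Token", "text": "Mrs",
         "features": {"orth": "upperinitial", "length": 3}}

print(term_matches("{Person}", person))
print(term_matches("{Person.gender == male}", person))
print(term_matches("{Token.orth == upperinitial, Token.length == 3}", token))
```

The sketch splits clauses on commas, so a string value that itself contains a comma would need proper quoting support.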

Query Language
- Kleene operators + and *, but they need to be quantified
    {Person}{Token}*3{Organization}
    finds all Person and Organization annotations within up to 3 tokens of each other
- Logical (OR) operator |
    {A}({B} | {C}) is equivalent to ({A}{B}) | ({A}{C})
- Order and presence of query terms is very important
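Because the Kleene operators must carry an upper bound, a quantified term can be read as shorthand for a finite set of fixed-length patterns. The sketch below expands such a query into those patterns. It assumes that `*n` allows 0 to n repetitions and `+n` allows 1 to n; the helper names and the tuple encoding of a quantified term are hypothetical.

```python
from itertools import product


def expand_bounded(term, op, n):
    """Expand a quantified term into its explicit repetitions.

    Assumption: '*' allows 0..n repetitions, '+' allows 1..n.
    """
    lowest = 0 if op == "*" else 1
    return [[term] * k for k in range(lowest, n + 1)]


def expand_query(parts):
    """parts holds plain terms (str) or (term, op, n) tuples."""
    choices = []
    for part in parts:
        if isinstance(part, tuple):
            choices.append(expand_bounded(*part))
        else:
            choices.append([[part]])
    # One fixed-length pattern per combination of repetition counts.
    return [sum(combo, []) for combo in product(*choices)]


# {Person}{Token}*3{Organization}
patterns = expand_query(["{Person}", ("{Token}", "*", 3), "{Organization}"])
for pattern in patterns:
    print(" ".join(pattern))
```

Under these assumptions the query expands to four patterns, with zero, one, two and three {Token} terms between {Person} and {Organization}.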

DEMO!

Hands-on exercise
- Populate a corpus with documents
- Process it with ANNIE, setting the output of all PRs to be the ANNIC annotation set
- Create a Searchable datastore, supplying the needed parameters
- Store the corpus there
- Go to the search tab on the datastore
- Enter some sample queries:
  - {Person}
  - Check what annotations are around (e.g. {Organization})
  - Expand the pattern to find people near Organizations

Index Generation - Approach I
Based on Start Offsets

Text:    Mr   Symond   works   for   Creative   Arts   in   LA
Tokens:  T1   T2       T3      T4    T5         T6     T7   T8

Annotations: Title, LastName, Person, Organization, Location

Token stream (each annotation is placed at the token where it starts):
T1 Person Title | T2 LastName | T3 | T4 | T5 Organization | T6 | T7 | T8 Location

Query results (T = match, F = no match):

Query                                                             Index result   Correct result
{Title} {LastName} works for {Organization}                       T              T
{Person} {LastName} works for {Organization}                      T              F
{Title} {LastName} works for ({Token})+3 {Location}               T              T
{Title} {LastName} works for {Organization} {Token} {Location}    F              T
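The behaviour of Approach I can be reproduced with a small simulation. This is a sketch under assumptions, not ANNIC's indexing code: annotation spans are taken from the example (Person covers T1 to T2, Organization covers T5 to T6), a query matches when its terms sit at consecutive token positions, and `({Token})+3` is written out as its three-token expansion. All names are hypothetical.

```python
WORDS = ["Mr", "Symond", "works", "for", "Creative", "Arts", "in", "LA"]
# (annotation type, first token, last token), 1-based as in the example
ANNOTATIONS = [
    ("Title", 1, 1), ("LastName", 2, 2), ("Person", 1, 2),
    ("Organization", 5, 6), ("Location", 8, 8),
]


def build_start_index(words, annotations):
    """Position i holds the word, {Token}, and annotations STARTING at i."""
    index = [{word, "{Token}"} for word in words]
    for name, first, _last in annotations:
        index[first - 1].add("{" + name + "}")
    return index


def matches(index, query):
    """True if the query terms occur at consecutive positions."""
    span = len(query)
    return any(
        all(term in index[pos + k] for k, term in enumerate(query))
        for pos in range(len(index) - span + 1)
    )


QUERIES = [
    ["{Title}", "{LastName}", "works", "for", "{Organization}"],
    ["{Person}", "{LastName}", "works", "for", "{Organization}"],
    ["{Title}", "{LastName}", "works", "for",
     "{Token}", "{Token}", "{Token}", "{Location}"],
    ["{Title}", "{LastName}", "works", "for",
     "{Organization}", "{Token}", "{Location}"],
]

index = build_start_index(WORDS, ANNOTATIONS)
results = [matches(index, query) for query in QUERIES]
for query, result in zip(QUERIES, results):
    print(" ".join(query), "->", "T" if result else "F")
```

The second query matches although Person already covers the LastName token, and the fourth fails because the index does not know that Organization extends over T6.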

Index Generation - Approach II
Based on End Offsets

Text:    Mr   Symond   works   for   Creative   Arts   in   LA
Tokens:  T1   T2       T3      T4    T5         T6     T7   T8

Annotations: Title, LastName, Person, Organization, Location

Token stream (each annotation is placed at the token where it ends):
T1 Title | T2 LastName Person | T3 | T4 | T5 | T6 Organization | T7 | T8 Location

Query results (T = match, F = no match):

Query                                                             Index result   Correct result
{Title} {LastName} works for {Organization}                       F              T
{Person} {LastName} works for {Organization}                      F              F
{Title} {LastName} works for ({Token})+3 {Location}               T              T
{Title} {LastName} works for {Organization} {Token} {Location}    F              T
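The same kind of simulation illustrates Approach II. Again this is only a sketch under assumptions: annotation spans come from the example (Person covers T1 to T2, Organization covers T5 to T6), a query matches when its terms sit at consecutive token positions, and `({Token})+3` is written out as three {Token} terms. The names are hypothetical.

```python
WORDS = ["Mr", "Symond", "works", "for", "Creative", "Arts", "in", "LA"]
# (annotation type, first token, last token), 1-based as in the example
ANNOTATIONS = [
    ("Title", 1, 1), ("LastName", 2, 2), ("Person", 1, 2),
    ("Organization", 5, 6), ("Location", 8, 8),
]


def build_end_index(words, annotations):
    """Position i holds the word, {Token}, and annotations ENDING at i."""
    index = [{word, "{Token}"} for word in words]
    for name, _first, last in annotations:
        index[last - 1].add("{" + name + "}")
    return index


def matches(index, query):
    """True if the query terms occur at consecutive positions."""
    span = len(query)
    return any(
        all(term in index[pos + k] for k, term in enumerate(query))
        for pos in range(len(index) - span + 1)
    )


QUERIES = [
    ["{Title}", "{LastName}", "works", "for", "{Organization}"],
    ["{Person}", "{LastName}", "works", "for", "{Organization}"],
    ["{Title}", "{LastName}", "works", "for",
     "{Token}", "{Token}", "{Token}", "{Location}"],
    ["{Title}", "{LastName}", "works", "for",
     "{Organization}", "{Token}", "{Location}"],
]

index = build_end_index(WORDS, ANNOTATIONS)
results = [matches(index, query) for query in QUERIES]
for query, result in zip(QUERIES, results):
    print(" ".join(query), "->", "T" if result else "F")
```

Placing annotations at their end offsets removes the false match on {Person} {LastName}, but any query that needs a multi-token annotation to begin right after the preceding term now fails.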

Index Generation - Approach III
Based on Start + End Offsets

Annotations over "Mr Symonds":
  Token    string = Mr,      orth = upperinitial, root = mr,      pos = NNP
  Token    string = Symonds, orth = upperinitial, root = symonds, pos = NNP
  Person   gender = male

Annotation        Terms                        Start Offset   End Offset
Token (Mr)        Token                        1              1
                  Token.string == Mr
                  Token.orth == upperinitial
                  Token.root == mr
                  Token.pos == NNP
Person            Person                       1              2
                  Person.gender == male
Token (Symonds)   Token                        2              2
                  Token.string == Symonds
                  Token.orth == upperinitial
                  Token.root == symonds
                  Token.pos == NNP

Index Generation - Approach III
Based on Start + End Offsets

Text:    Mr   Symond   works   for   Creative   Arts   in   LA
Tokens:  T1   T2       T3      T4    T5         T6     T7   T8

Annotations: Title, LastName, Person, Organization, Location

Token stream (each annotation is placed at its start token and carries its end offset, eo):
T1 Person.eo=T2 Title.eo=T1 | T2 LastName.eo=T2 | T3 | T4 | T5 Organization.eo=T6 | T6 | T7 | T8 Location.eo=T8
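With both offsets available, matching can require each term to start exactly one token after the previous term ended. The sketch below shows that idea on the example sentence. It is an illustration under assumptions, not ANNIC's Lucene-based implementation: spans are counted in token positions, `({Token})+3` is written out as three {Token} terms, and all names are hypothetical.

```python
WORDS = ["Mr", "Symond", "works", "for", "Creative", "Arts", "in", "LA"]
# (annotation type, start token, end token), 1-based as in the example
ANNOTATIONS = [
    ("Title", 1, 1), ("LastName", 2, 2), ("Person", 1, 2),
    ("Organization", 5, 6), ("Location", 8, 8),
]


def build_terms(words, annotations):
    """Map every term to the list of (start, end) spans where it occurs."""
    terms = {}
    for pos, word in enumerate(words, start=1):
        terms.setdefault(word, []).append((pos, pos))
        terms.setdefault("{Token}", []).append((pos, pos))
    for name, start, end in annotations:
        terms.setdefault("{" + name + "}", []).append((start, end))
    return terms


def matches(terms, query):
    """Each term must start right after the previous term's end offset."""
    starts = None                      # None: the first term may start anywhere
    for term in query:
        spans = terms.get(term, [])
        if starts is not None:
            spans = [span for span in spans if span[0] in starts]
        if not spans:
            return False
        starts = {end + 1 for _start, end in spans}
    return True


QUERIES = [
    ["{Title}", "{LastName}", "works", "for", "{Organization}"],
    ["{Person}", "{LastName}", "works", "for", "{Organization}"],
    ["{Title}", "{LastName}", "works", "for",
     "{Token}", "{Token}", "{Token}", "{Location}"],
    ["{Title}", "{LastName}", "works", "for",
     "{Organization}", "{Token}", "{Location}"],
]

terms = build_terms(WORDS, ANNOTATIONS)
results = [matches(terms, query) for query in QUERIES]
for query, result in zip(QUERIES, results):
    print(" ".join(query), "->", "T" if result else "F")
```

On this example the sketch rejects {Person} {LastName} and accepts the other three queries, including the one that needs {Token} to follow a two-token Organization.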

Search Optimization
Query: {Title} {LastName} works for {Organization} {Token} {Location}
(so = start offset, eo = end offset)
- Parse the query into N sub-queries such that every sub-query matches the expression ({Token})* {Non-Token}
    Q1 = {Title}, Q2 = {LastName}, Q3 = works for {Organization}, Q4 = {Token} {Location}
- Q2 is searched only within the result set of Q1
  - If Q1 returns 3 hits H1, H2 and H3, three queries are formed for Q2
      Q2.so = H1.eo + 1, H2.eo + 1, H3.eo + 1
- Q3 is searched only within the result set of Q2
  - If Q2 says only H1 and H3 are correct
      Q3.so = H1.eo + 1, H3.eo + 1
- Q4 is searched only within the result set of Q3
  - If Q3 says only H1 is valid
      Q4.so = H1.eo + 1
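The staged search can be sketched as follows. The code splits a query into sub-queries that each end at a non-Token term, then searches every sub-query only at start offsets that follow a surviving hit of the previous one. It assumes plain strings and {Token} count as token-level terms (as the example split suggests), uses token positions as offsets, and tracks only the end offsets of surviving hits. The function names and the example data layout are hypothetical.

```python
WORDS = ["Mr", "Symond", "works", "for", "Creative", "Arts", "in", "LA"]
ANNOTATIONS = [
    ("Title", 1, 1), ("LastName", 2, 2), ("Person", 1, 2),
    ("Organization", 5, 6), ("Location", 8, 8),
]


def build_terms(words, annotations):
    """Map every term to the list of (start, end) spans where it occurs."""
    terms = {}
    for pos, word in enumerate(words, start=1):
        terms.setdefault(word, []).append((pos, pos))
        terms.setdefault("{Token}", []).append((pos, pos))
    for name, start, end in annotations:
        terms.setdefault("{" + name + "}", []).append((start, end))
    return terms


def is_token_term(term):
    # Assumption: plain strings and {Token} are token-level terms.
    return not term.startswith("{") or term == "{Token}"


def split_query(query):
    """Cut the query after every non-Token term: ({Token})* {Non-Token}."""
    subs, current = [], []
    for term in query:
        current.append(term)
        if not is_token_term(term):
            subs.append(current)
            current = []
    if current:
        subs.append(current)
    return subs


def search_from(terms, sub_query, starts):
    """End offsets of sub-query hits beginning at one of `starts`."""
    ends = set()
    for term in sub_query:
        spans = [(s, e) for s, e in terms.get(term, [])
                 if starts is None or s in starts]
        ends = {e for _s, e in spans}
        starts = {e + 1 for e in ends}
    return ends


def staged_search(terms, query):
    starts = None                      # Q1 may start anywhere
    ends = set()
    for sub_query in split_query(query):
        ends = search_from(terms, sub_query, starts)
        if not ends:
            return set()               # no hit survives this stage
        starts = {e + 1 for e in ends}  # Qn+1.so = H.eo + 1
    return ends


QUERY = ["{Title}", "{LastName}", "works", "for",
         "{Organization}", "{Token}", "{Location}"]
terms = build_terms(WORDS, ANNOTATIONS)
print(split_query(QUERY))
print(staged_search(terms, QUERY))
```

Each stage narrows the candidate start offsets, so later sub-queries are evaluated only where earlier ones already succeeded.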